{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "# DNIKit Familiarity: Dataset Distribution\n", "\n", "Compare the distributions of two datasets, e.g. train/test datasets, synthetic/real datasets, etc.\n", "\n", "Please see the [doc page](https://apple.github.io/dnikit/introspectors/data_introspection/familiarity.html#use-case-comparing-dataset-distributions) for a discussion on applying [Familiarity](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.introspectors.Familiarity) to dataset distribution analysis, including what actions can be taken to improve the dataset.\n", "\n", "For a more detailed guide on using all of these DNIKit components, try the [Familiarity for Rare Data Discovery](familiarity_for_rare_data_discovery.ipynb) Notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Don't run this cell if stochasticity is desired\n", "import numpy as np\n", "np.random.seed(42)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Optional: Download MobileNet and CIFAR-10\n", "This example uses [MobileNet](https://keras.io/api/applications/mobilenet/) (trained on ImageNet) and [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html), but feel free to use any other model and dataset. This notebook uses [TFModelExamples](https://apple.github.io/dnikit/api/dnikit_tensorflow/index.html#dnikit_tensorflow.TFModelExamples) and [TFDatasetExamples](https://apple.github.io/dnikit/api/dnikit_tensorflow/index.html#dnikit_tensorflow.TFDatasetExamples) to load in MobileNet and CIFAR-10. Please see the DNIKit docs for information about how to [load a model](https://apple.github.io/dnikit/how_to/connect_model.html) or [dataset](https://apple.github.io/dnikit/how_to/connect_data.html). 
[This page](https://apple.github.io/dnikit/how_to/connect_model.html) also describes how responses can be collected outside of DNIKit, and passed into Familiarity via a [Producer](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.base.Producer)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "##########################\n", "# User-Defined Variables #\n", "##########################\n", "\n", "# Change the following labels to see which labels are more familiar.\n", "# The example illustrates a comparison between the distributions of the train\n", "# and test sets for automobiles, for 100 images.\n", "TRAIN_CLASS_LABEL = 'automobile'\n", "TEST_CLASS_LABEL = 'automobile'\n", "N_SAMPLES = 100" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from dnikit.processors import ImageResizer, SnapshotSaver\n", "from dnikit.base import Batch, PixelFormat, pipeline, ImageFormat\n", "from dnikit_tensorflow import TFDatasetExamples, TFModelExamples\n", "\n", "# Load CIFAR10 dataset and feed into MobileNet,\n", "# observing responses from layer conv_pw_13\n", "mobilenet = TFModelExamples.MobileNet()\n", "mobilenet_preprocessor = mobilenet.preprocessing\n", "assert mobilenet_preprocessor is not None\n", "\n", "# Load CIFAR-10 with train and test datasets, and\n", "# attach metadata (labels, dataset origins, image filepaths) to each batch\n", "cifar10 = TFDatasetExamples.CIFAR10(attach_metadata=True)\n", "\n", "# Create pre-processing pipeline\n", "preprocessing_stages = (\n", "    # Save a snapshot of the raw image data to refer back to later\n", "    SnapshotSaver(),\n", "\n", "    # Preprocess the image batches in the manner expected by MobileNet\n", "    mobilenet_preprocessor,\n", "\n", "    # Resize images to fit the input of MobileNet, (224, 224) using an ImageResizer\n", "    ImageResizer(pixel_format=ImageFormat.HWC, 
size=(224, 224)),\n", ")\n", "\n", "# Create producers for subsets of the dataset for comparing train / test distribution\n", "# :: Note: The subset method will filter the batch LABELS metadata matching the provided dict\n", "data_producers = {\n", "    'train': cifar10.subset(labels=[TRAIN_CLASS_LABEL], datasets=[\"train\"], max_samples=N_SAMPLES),\n", "    'test': cifar10.subset(labels=[TEST_CLASS_LABEL], datasets=[\"test\"], max_samples=N_SAMPLES),\n", "}" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Put it all together to produce familiarity scores\n", "For a more detailed breakdown of these steps, see the [Familiarity for Rare Data Discovery](familiarity_for_rare_data_discovery.ipynb) Notebook.\n", "\n", "### A. Define user variables\n", "The user variables (`TRAIN_CLASS_LABEL`, `TEST_CLASS_LABEL`, and `N_SAMPLES`) can be modified to experiment with different classes or different datasets." ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### B. 
Create producers" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from dnikit.processors import Cacher, Pooler\n", "\n", "producers = {\n", "    split: pipeline(\n", "        data_producers[split],\n", "\n", "        # Apply previously-defined preprocessing stages for MobileNet & CIFAR-10\n", "        *preprocessing_stages,\n", "\n", "        # Run inference -- pass a list of requested responses or a single string\n", "        mobilenet.model('conv_pw_13'),\n", "\n", "        # Perform spatial max pooling on the result\n", "        Pooler(dim=(1, 2), method=Pooler.Method.MAX),\n", "\n", "        # Cache results to re-run the pipeline later without recomputing the responses\n", "        Cacher()\n", "    )\n", "    for split in ('train', 'test')\n", "}" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Reduce dimensionality of responses" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from dnikit.introspectors import DimensionReduction\n", "\n", "# Configure the DimensionReduction Introspector\n", "# The dimensionality of the data will be reduced from 1024 to 40\n", "n_dim = 40\n", "\n", "# Trigger the pipeline & fit the PCA model on the train dataset, which will be used as the base\n", "pca = DimensionReduction.introspect(producers[\"train\"], strategies=DimensionReduction.Strategy.PCA(n_dim))\n", "\n", "# Apply the PipelineStage pca object to both train/test pipelines to reduce responses in all batches to a lower dimension\n", "reduced_producers = {\n", "    name: pipeline(producer, pca)\n", "    for name, producer in producers.items()\n", "}" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Fit Familiarity on the train data and score both datasets" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from 
dnikit.introspectors import Familiarity\n", "\n", "# The Familiarity model is first fit on the base dataset, which is \"train\" in this case\n", "# Trigger pipeline & run DNIKit Familiarity, default strategy is Familiarity.Strategy.GMM\n", "familiarity = Familiarity.introspect(reduced_producers['train'])\n", "\n", "# Use dict-comprehension to apply familiarity to the train and test datasets individually\n", "scored_producers = {\n", "    producer_name: pipeline(\n", "        cached_response_producer,\n", "        familiarity\n", "    )\n", "    # reduced_producers maps 'train'/'test' to the split's reduced producer\n", "    for producer_name, cached_response_producer in reduced_producers.items()\n", "}" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Compute familiarity likelihood score\n", "\n", "Produce the final familiarity likelihood score, computed as the test dataset's mean log score minus the train dataset's mean log score.\n", "\n", "- If the likelihood score is close to 0, the two distributions are approximately equivalent.\n", "- Typically, the train dataset's mean log score will be larger than the test dataset's, since familiarity was fit to this first/train dataset, so the likelihood score is negative. The more negative the overall likelihood score is, the larger the distribution gap. One of the datasets likely needs to be re-collected.\n", "- It may still happen that the likelihood score is greater than 0. This is also explained by a distribution gap, and will require analysis and possibly data re-collection.\n", "\n", "Please refer to the [doc page](https://apple.github.io/dnikit/introspectors/data_introspection/familiarity.html#use-case-comparing-dataset-distributions) for more information, and check out the other Familiarity use case, [discovering rare samples](https://apple.github.io/dnikit/introspectors/data_introspection/familiarity.html#use-case-finding-dataset-errors), or the [DatasetReport](https://apple.github.io/dnikit/introspectors/data_introspection/dataset_report.html) to evaluate why there is a distribution gap."
 ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from dnikit.base import Producer\n", "\n", "def compute_score_mean(producer: Producer, response_name: str, meta_key: Batch.DictMetaKey) -> float:\n", "    \"\"\"Compute the mean familiarity score for the given producer, response name, and metadata key.\"\"\"\n", "    scores = [\n", "        batch.metadata[meta_key][response_name][index].score\n", "        for batch in producer(32)\n", "        for index in range(batch.batch_size)\n", "    ]\n", "    # Convert the NumPy scalar to a plain float to match the return annotation\n", "    return float(np.mean(scores))\n", "\n", "# Trigger remaining pipeline, compute mean of familiarity scores for both train and test datasets\n", "stats = {\n", "    producer_name: compute_score_mean(\n", "        producer=producer,\n", "        response_name=\"conv_pw_13\",\n", "        meta_key=familiarity.meta_key\n", "    )\n", "    # scored_producers maps 'train'/'test' to the split's scored producer\n", "    for producer_name, producer in scored_producers.items()\n", "}\n", "\n", "# Difference of mean log scores: negative when the test data is less familiar than the train data\n", "familiarity_ratio = stats['test'] - stats['train']\n", "print(f\"Likelihood ratio [{TRAIN_CLASS_LABEL}]->[{TEST_CLASS_LABEL}] = {familiarity_ratio:0.4f}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.8" }, "pycharm": { "stem_cell": { "cell_type": "raw", "metadata": { "collapsed": false }, "source": [] } } }, "nbformat": 4, "nbformat_minor": 4 }