{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "# DNIKit Familiarity: Dataset Errors and Rare Samples\n", "\n", "Using DNIKit [Familiarity](https://apple.github.io/dnikit/api/dnikit/introspectors.html#dnikit.introspectors.Familiarity), discover the least and most representative data samples in an overall dataset or within a data subgroup in order to understand dataset bias, spot outliers, and find dataset errors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from dnikit.exceptions import enable_deprecation_warnings\n", "\n", "enable_deprecation_warnings(error=True) # treat DNIKit deprecation warnings as errors" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Set-up a DNIKit model that is able to run inference\n", "\n", "Please see the docs for a breakdown of how to [load a model](https://apple.github.io/dnikit/how_to/connect_model.html) using DNIKit. This notebook uses DNIKit's [TFModelExamples](https://apple.github.io/dnikit/api/dnikit_tensorflow/index.html#dnikit_tensorflow.TFModelExamples) to load the MobileNet model directly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dnikit_tensorflow import TFModelExamples\n", "\n", "# Load CIFAR10 dataset and feed into MobileNet,\n", "# observing responses from layer conv_pw_13\n", "mobilenet = TFModelExamples.MobileNet()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Create a DNI Dataset wrapping CIFAR-10\n", "Download the CIFAR-10 dataset and create 2 sub-datasets to illustrate the familiarity concept.\n", "\n", "* **datasets[\"cars\"]:** Will contain 300 images of automobiles from the training set.\n", "* **datasets[\"mixed\"]:** Will contain 300 images randomly sampled across all classes in the test set.\n", "\n", "Wrap the data into a DNIKit [Producer](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.base.Producer) in order to use CIFAR-10 in DNIKit. This example uses DNIKit's [TFDatasetExamples](https://apple.github.io/dnikit/api/dnikit_tensorflow/index.html#dnikit_tensorflow.TFDatasetExamples) to grab a built-in CIFAR-10 Producer. For defining a custom dataset, please see the docs for more information about how to [load data](https://apple.github.io/dnikit/how_to/connect_data.html) into DNIKit. This might include implementing a custom [Producer](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.base.Producer) or using another built-in Producer (e.g. [ImageProducer](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.base.ImageProducer) to load images from disk).\n", "\n", "Moreover, MobileNet accepts images of size `224x224`, so it's necessary to pre-process the CIFAR-10 images by resizing them from `32x32` to `224x224`. DNIKit provides a set of processors in the module `dnikit.processors`. Here, [ImageResizer](https://apple.github.io/dnikit/api/dnikit/processors.html#dnikit.processors.ImageResizer) is used in a [pipeline](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.base.pipeline) that applies such pre-processing on top of CIFAR-10." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dnikit.processors import ImageResizer, SnapshotSaver\n", "from dnikit.base import Batch, PixelFormat, pipeline, ImageFormat\n", "from dnikit_tensorflow import TFDatasetExamples\n", "\n", "# Load CIFAR-10 with train and test datasets, and\n", "# attach metadata (labels, dataset origins, image filepaths) to each batch\n", "cifar10 = TFDatasetExamples.CIFAR10(attach_metadata=True)\n", "\n", "mobilenet_preprocessor = mobilenet.preprocessing\n", "assert mobilenet_preprocessor is not None\n", "\n", "# Create pre-processing pipeline\n", "preprocessing_stages = (\n", " # Save a snapshot of the raw image data to refer back to later\n", " SnapshotSaver(),\n", "\n", " # Preprocess the image batches in the manner expected by MobileNet\n", " mobilenet_preprocessor,\n", " \n", " # Resize images to fit the input of MobileNet, (224, 224) using an ImageResizer\n", " ImageResizer(pixel_format=ImageFormat.HWC, size=(224, 224)),\n", ")\n", "\n", "# Create producers for subsets of the dataset\n", "datasets = dict()\n", "\n", "# Create producer with just 300 samples of automobiles\n", "# :: Note: The subset method will filter the batch LABELS metadata matching the provided dict\n", "datasets[\"cars\"] = cifar10.subset(labels=[\"automobile\"], datasets=[\"train\"], max_samples=300)\n", "\n", "# Create producer with 300 random samples from all other classes in the test set\n", "# :: Note: Pull from the test set to ensure no overlap with the \"cars\" subset\n", "non_car_labels = [name for name in cifar10.str_to_label_idx() if name != \"automobile\"]\n", "datasets[\"mixed\"] = cifar10.subset(labels=non_car_labels, datasets=[\"test\"], max_samples=300)\n", "datasets[\"mixed\"].shuffle() # shuffle data samples" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## 2. Run inference to obtain responses\n", "\n", "This is a very important step in DNIKit. Define how responses are being pulled from the model as data is being fed. Also define how responses are post-processed so they have the dimensionality that suits the task.\n", "\n", "For that, a dictionary of [Producers](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.base.Producer) is created from the dataset plus a processing/inference pipeline. In this example, a [Pooler](https://apple.github.io/dnikit/api/dnikit/processors.html#dnikit.processors.Pooler) is used to post-process the data.\n", "\n", "*For more details about this step please check [the PFA Basic CNN Example Notebook](../model_introspection/principal_filter_analysis.ipynb), which contains a graphical explanation of this step.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from dnikit.processors import Pooler\n", "\n", "# convert the datasets (t.Mapping[str, Producer]) into a new t.Mapping[str, Producer]\n", "# describing the processing pipeline. Remember, the Producer is a promise for data,\n", "# not the result of processing.\n", "producers = {}\n", "for dataset_name, dataset in datasets.items():\n", " producers[dataset_name] = pipeline(\n", " # Source of data -- this is either datasets[\"cars\"] or datasets[\"mixed\"]\n", " dataset,\n", "\n", " # Unwrap preprocessing into the pipeline\n", " *preprocessing_stages,\n", "\n", " # Run inference -- pass a list of requested responses or a single string\n", " mobilenet.model('conv_pw_13'),\n", "\n", " # Perform spatial max pooling on the result\n", " # For max pooling, use `method=Pooler.Method.MAX`,\n", " Pooler(dim=(1, 2), method=Pooler.Method.AVERAGE),\n", " )" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### 2.1 Reduce the dimensionality using Principal Component Analysis (PCA)\n", "\n", "The responses created earlier will be of dimension 1024 per image. This seems too high for CIFAR images. Remember that the MobileNet model used was trained for Imagenet.\n", "\n", "DNIKit contains a [DimensionReduction](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.introspectors.DimensionReduction) Introspector that can help us reduce the dimensionality of the model's responses. In this case, use the [PCA](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.introspectors.DimensionReduction.Strategy.PCA) strategy to fit a PCA model on the automobile data, and then use it in a [pipeline](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.base.pipeline) to project all the responses to a length of 40." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from dnikit.introspectors import DimensionReduction\n", "\n", "# Configure the DimensionReduction Introspector that will be used to reduce the dimensionality of the data from 1024 to 40\n", "n_dim = 40\n", "\n", "# fit the PCA model on the cars\n", "print(f'Fitting PCA to {n_dim} dimensions, might take some seconds or minutes ...')\n", "pca = DimensionReduction.introspect(producers[\"cars\"], strategies=DimensionReduction.Strategy.PCA(n_dim))\n", "print('Done.')\n", "\n", "# the PCA object is another `PipelineStage` that can be used to transform batches to\n", "# lower dimension data. Convert the producers (t.Mapping[str, Producer]) into\n", "# a new dictionary with each of those Producers dimension-reduced via the pca object.\n", "reduced_producers = {\n", " # compose producers[name] with pca\n", " name: pipeline(producer, pca)\n", " for name, producer in producers.items()\n", "}" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## 3. Fit familiarity model\n", "\n", "Fit a gaussian mixture model in high dimensional space (obtained after PCA) on the 🚗 dataset. The model will capture the dense areas of the dataset, called *familiar*.\n", "\n", "Using the fitted model, *score* the images of a dataset as being familiar or not to cars 🚗: call this `fam(data | 🚗)`.\n", "\n", "* `fam(🚗 | 🚗)`: The most common cars in the dataset are expected to have high familiarity. The cars with low familiarity are those that are not well represented in the dataset (typically corner cases, strange viewpoints, etc).\n", "\n", "* `fam(mixed | 🚗)`: The mixed dataset will likely have a few cars, expected to have high familiarity. Also the images of the \"truck\" class will probably have high familiarity with cars. On the other hand, images of forest and animals will probably have lower familiarity.\n", "\n", "Familiarity can be used to spot corner cases and wrong annotations quickly!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from dnikit.introspectors import Familiarity\n", "\n", "# Fit the familiarity model on the dimension reduced automobile 🚗 class of images\n", "print('Fitting Familiarity, might take some seconds or minutes ...')\n", "\n", "# Use the Familiarity.Strategy.GMM strategy by default\n", "familiarity = Familiarity.introspect(reduced_producers[\"cars\"])\n", "\n", "# the familiarity object is something that can score responses (per layer) according to the model that was just\n", "# trained. it puts these scores (density) into metadata attached to the batch. a higher score means\n", "# more familiar and a lower score means less familiar.\n", "\n", "print('Done fitting Familiarity.')\n", "\n", "# Define how to score the 🚗 and mixed datasets according to the 🚗 familiarity model.\n", "#\n", "# now score the cars and mixed (dimension reduced) datasets with respect to\n", "# the familiarity model that was just computed. Use reduced_producers (t.Mapping[str, Producer])\n", "# and compose it with familiarity to produce a new scored_producers.\n", "scored_producers = {\n", " # compose reduced_producers[name] with familiarity\n", " name: pipeline(producer, familiarity)\n", " for name, producer in reduced_producers.items()\n", "}" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## 4. Visualization\n", "\n", "Once the density scores are computed, there are different ways to visualize them. One way would be to use the [DatasetReport](https://apple.github.io/dnikit/introspectors/data_introspection/dataset_report.html)'s recommended way: with [Symphony](https://github.com/apple/ml-symphony).\n", "\n", "For this notebook, define a simple class to hold the image data and the density score, that can bee sorted and displayed using matplotlib." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import dataclasses\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import typing as t\n", "\n", "from dnikit.base import Producer\n", "\n", "@dataclasses.dataclass(frozen=True, order=True)\n", "class FamiliarityResult:\n", " \"\"\"Class to encapsulate the familiarity data and image.\"\"\"\n", " density_score: float\n", " img: np.ndarray = dataclasses.field(compare=False)\n", "\n", " def plot(self, index: int, axarr: plt.Axes, color: str = 'b') -> None:\n", " \"\"\"Add the image and score to the axes of a plot\"\"\"\n", " axarr[0, index].imshow(self.img)\n", " plt.setp(axarr[0, index].spines.values(), color=color)\n", " axarr[0, index].yaxis.set_major_locator(plt.NullLocator())\n", " axarr[0, index].xaxis.set_major_locator(plt.NullLocator())\n", " axarr[0, index].set_title(f'{self.density_score:0.3f}')\n", "\n", " @staticmethod\n", " def build(producer: Producer) -> t.Sequence[\"FamiliarityResult\"]:\n", " \"\"\"Construct a list of FamiliarityResult from a Producer\"\"\"\n", " return [\n", " FamiliarityResult(\n", " # pull the original raster data from the input attached to the snapshot\n", " img=batch.snapshots[\"snapshot\"].fields[\"samples\"][index],\n", "\n", " # and the scores are attached to the Familiarity.DENSITY_KEY metadata on the output field\n", " density_score=batch.metadata[familiarity.meta_key][\"conv_pw_13\"][index].score\n", " )\n", " for batch in producer(10)\n", " for index in range(batch.batch_size)\n", " ]\n", "\n", "# produced sorted results in descending order -- note that this is actually evaluating the producers\n", "# in the call to build()\n", "print('Scoring and sorting data. This may take several seconds...')\n", "sorted_automobile = sorted(FamiliarityResult.build(scored_producers[\"cars\"]), reverse=True)\n", "sorted_mixed = sorted(FamiliarityResult.build(scored_producers[\"mixed\"]), reverse=True)\n", "print('Done.')" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Show the top-10 and bottom-10 familiar images with respect to the 🚗 familiarity model.\n", "\n", "* See how the least familiar cars are \"less canonical\" than the most familiar ones. This may indicate that more cars like these ones should be collected to enforce diversity.\n", "* Check how the most familiar images from the mixed dataset are mostly cars and trucks. On the other hand, the leas familiar ones are mostly animals.\n", "* There is a car in the least familiar mixed images. However, that car has a tilted viewpoint, which makes it less familiar." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "def show_results(results: t.Sequence[FamiliarityResult], title: str = '', color: str = 'b') -> None:\n", " print(title)\n", " _, axarr = plt.subplots(1, len(results), squeeze=False, figsize=(15,6))\n", " for index, result in enumerate(results):\n", " result.plot(index, axarr, color)\n", " plt.show()\n", "\n", "print('==================\\nfam(🚗 | 🚗)\\n==================\\n')\n", "show_results(sorted_automobile[:10], title='Automobile most familiar', color='green')\n", "show_results(sorted_automobile[-10:], title='Automobile least familiar', color='red')\n", "\n", "print('==================\\nfam(mixed | 🚗)\\n==================\\n')\n", "show_results(sorted_mixed[:10], title='Mixed most familiar', color='green')\n", "show_results(sorted_mixed[-10:], title='Mixed least familiar', color='red')\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.8" } }, "nbformat": 4, "nbformat_minor": 2 }