{ "cells": [ { "cell_type": "markdown", "id": "a6f25846", "metadata": {}, "source": [ "# Dataset Report: CIFAR10 Example\n", "\n", "Here is an example of how to run the [DatasetReport](https://apple.github.io/dnikit/api/dnikit/introspectors.html#dnikit.introspectors.DatasetReport) on the dataset. This notebook will demonstrate how to build the data for the report using DNIKit. An example of how to visualize the report output with the Symphony UI framework can be found in their [documentation](https://github.com/apple/ml-symphony).\n", "\n", "For a deeper understanding of the Dataset Report, please see the [doc page](https://apple.github.io/dnikit/introspectors/data_introspection/dataset_report.html).\n", "\n", "Before proceeding, please review the \"How to Use\" section in the docs, starting with the [How to Load A Model](https://apple.github.io/dnikit/how_to/connect_model.html)." ] }, { "cell_type": "markdown", "id": "59bd89e3", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Dataset Report: (1) Setup\n", "\n", "Here, group together required imports and set desired paths." ] }, { "cell_type": "code", "execution_count": null, "id": "42d7245d", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import logging\n", "logging.basicConfig(level=logging.ERROR)" ] }, { "cell_type": "code", "execution_count": null, "id": "03d54847", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from dataclasses import dataclass\n", "import os\n", "from pathlib import Path\n", "import typing as t\n", "\n", "# OpenCV will be used to interact with the CIFAR10 dataset, since it contains images.\n", "import cv2\n", "import numpy as np\n", "\n", "# This notebook pieces together a pipeline to run a dataset through a model, with pre- and post- processing,\n", "# and then feeds it into the Dataset Report for full analysis\n", "from dnikit.introspectors import DatasetReport, ReportConfig\n", "from dnikit.base import Batch, Producer, pipeline, ImageFormat\n", "from dnikit.processors import Cacher, FieldRenamer, ImageResizer, Pooler, Processor\n", "from dnikit_tensorflow import load_tf_model_from_path\n", "\n", "# For future protection, any deprecated DNIKit features will be treated as errors\n", "from dnikit.exceptions import enable_deprecation_warnings\n", "enable_deprecation_warnings()\n", "\n", "# Use a pre-trained Keras MobileNet model to analyze the CIFAR10 dataset\n", "import keras\n", "from keras.applications.mobilenet_v2 import preprocess_input as mobilenet_preprocessing\n", "from keras.datasets import cifar10" ] }, { "cell_type": "code", "execution_count": null, "id": "107a4d1e", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "data_path = \"./cifar/\"" ] }, { "cell_type": "markdown", "id": "6b32ffcd", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Dataset Report: (1) Setup - Download Model\n", "\n", "Download a [MobileNet](https://keras.io/api/applications/mobilenet/) model from keras that has been pre-trained on the ImageNet dataset, which is similar to the [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset used. [TFModelExamples](https://apple.github.io/dnikit/api/dnikit_tensorflow/index.html#dnikit_tensorflow.TFModelExamples) are used to load MobileNet, but any model can be loaded, as described [here](https://apple.github.io/dnikit/how_to/connect_model.html)." 
] }, { "cell_type": "code", "execution_count": null, "id": "b8ad8231", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from dnikit_tensorflow import TFModelExamples\n", "\n", "mobilenet = TFModelExamples.MobileNet()\n", "mobilenet_preprocessor = mobilenet.preprocessing\n", "assert mobilenet_preprocessor is not None" ] }, { "cell_type": "markdown", "id": "e7286abe", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Dataset Report: (2) DNIKit Producer\n", "\n", "This is the chunk of the work that will change for new datasets. This step teaches DNIKit how to load data as batches, by creating a custom [Producer](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.base.Producer) for the CIFAR-10 dataset.\n", "\n", "[TFDatasetExamples.CIFAR10](https://apple.github.io/dnikit/api/dnikit_tensorflow/index.html#dnikit_tensorflow.TFDatasetExamples.CIFAR10) could be used to instantiate the data producer, but for this example, the full extent of creating a custom DNIKit [Producer](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.base.Producer) is shown.\n", "\n", "DNIKit operates on datasets in batches, so that it can handle large-scale datasets without loading everything into memory at once. For each batch, metadata can be attached to provide a more thorough exploration in the report. [Batch.StdKeys](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.base.Batch.StdKeys) metadata keys will be used to attach:\n", "\n", "- Identifier: unique identifier for each data sample (in this case, the path to the file)\n", "- Label:\n", " - Class label: airplane, automobile, etc. class label\n", " - Dataset label: train vs. test\n", "\n", "**Note**: The following `Cifar10Producer` is more complicated than the typical custom [Producer](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.base.Producer). The data is in a couple different numpy arrays, but in order to have the option of exporting the report and sharing the zip of files, the raw data must be turned into image files. If there were already files on disk, it would be simpler.\n", "\n", "DNIKit has a couple built-in Producers: [ImageProducer](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.base.ImageProducer) that load directly from files, and [TorchProducer](https://apple.github.io/dnikit/api/torch/index.html#dnikit_torch.TorchProducer) which is a Producer created directly from a PyTorch dataset.\n", "\n", "Follow the comments in the following `Cifar10Producer` code block to see how to create a custom [Producer](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.base.Producer) for a dataset, and refer to the doc on [loading data](https://apple.github.io/dnikit/how_to/connect_data.html) for more information." 
] }, { "cell_type": "code", "execution_count": null, "id": "7f7f4b1b", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# First, download data\n", "(x_train, y_train), (x_test, y_test) = cifar10.load_data()\n", "class_to_name = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n", "\n", "# Concatenate the train and test into one array, as well as the train/test labels, and the class labels\n", "full_dataset = np.concatenate((x_train, x_test))\n", "dataset_labels = ['train']*len(x_train) + ['test']*len(x_test)\n", "class_labels = np.squeeze(np.concatenate((y_train, y_test)))" ] }, { "cell_type": "code", "execution_count": null, "id": "dfcca56c", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "@dataclass\n", "class Cifar10Producer(Producer):\n", " dataset: np.ndarray\n", " \"\"\"All raw image data to load into the Dataset Report\"\"\"\n", "\n", " dataset_labels: t.Sequence[str]\n", " \"\"\"Labels to distinguish what dataset the data came from (e.g. train / test)\"\"\"\n", "\n", " class_labels: t.Sequence[str]\n", " \"\"\"Labels to distinguish class\"\"\"\n", "\n", " data_path: str\n", " \"\"\"Where data will be written to be packaged up with Dataset Report\"\"\"\n", "\n", " max_data: int = -1\n", " \"\"\"Max data samples to pull from. This is helpful for local debugging.\"\"\"\n", "\n", " def __post_init__(self) -> None:\n", " if self.max_data <=0:\n", " self.max_data = len(self.dataset)\n", "\n", " def _class_path(self, index: int) -> str:\n", " return f\"{self.dataset_labels[index]}/{class_to_name[int(self.class_labels[index])]}\"\n", "\n", " def _write_images_to_disk(self, ii: int, jj: int) -> None:\n", " for idx in range(ii, jj):\n", " base_path = os.path.join(self.data_path, self._class_path(idx))\n", " Path(base_path).mkdir(exist_ok=True, parents=True)\n", " filename = os.path.join(base_path, f\"image{idx}.png\")\n", " # Write to disk after converting to BGR format, used by opencv\n", " cv2.imwrite(filename, cv2.cvtColor(self.dataset[idx, ...], cv2.COLOR_RGB2BGR))\n", "\n", " def __call__(self, batch_size: int) -> t.Iterable[Batch]:\n", " \"\"\"The important function... 
yield a batch of data from the downloaded dataset\"\"\"\n", "\n", "        # Iteratively loop over the data samples and yield them in batches\n", "        for ii in range(0, self.max_data, batch_size):\n", "            jj = min(ii + batch_size, self.max_data)\n", "\n", "            # Optional step: write the data locally, since it was loaded from Keras\n", "            self._write_images_to_disk(ii, jj)\n", "\n", "            # Create a batch from the data already in memory\n", "            builder = Batch.Builder(\n", "                fields={\"images\": self.dataset[ii:jj, ...]}\n", "            )\n", "\n", "            # Use the pathname (excluding the base data directory) as the identifier for each data sample\n", "            builder.metadata[Batch.StdKeys.IDENTIFIER] = [\n", "                os.path.join(self._class_path(idx), f\"image{idx}.png\")\n", "                for idx in range(ii, jj)\n", "            ]\n", "            # Add class and dataset labels\n", "            builder.metadata[Batch.StdKeys.LABELS] = {\n", "                \"class\": [class_to_name[int(lbl_idx)] for lbl_idx in self.class_labels[ii:jj]],\n", "                \"dataset\": self.dataset_labels[ii:jj]\n", "            }\n", "\n", "            yield builder.make_batch()" ] }, { "cell_type": "code", "execution_count": null, "id": "ea0f41ee", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Now instantiate the producer from the loaded CIFAR-10 data\n", "cifar10_producer = Cifar10Producer(\n", "    dataset=full_dataset,\n", "    dataset_labels=dataset_labels,\n", "    class_labels=class_labels,\n", "    data_path=data_path,\n", "\n", "    # This \"max data\" param is purely for running this notebook quickly in the DNIKit docs.\n", "    # Remove this param to run on the whole dataset.\n", "    max_data=1000\n", ")" ] }, { "cell_type": "markdown", "id": "55cb47b0", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Dataset Report: (3) Model Inference w/ Pre + Post Processing\n", "\n", "The MobileNet model and its preprocessor were loaded earlier via [TFModelExamples](https://apple.github.io/dnikit/api/dnikit_tensorflow/index.html#dnikit_tensorflow.TFModelExamples); a saved TF Keras model could instead be loaded from disk with [load_tf_model_from_path](https://apple.github.io/dnikit/api/dnikit_tensorflow/index.html#dnikit_tensorflow.load_tf_model_from_path).\n", "\n", "Now, apply pre- and post-processing steps around model inference. This consists of the following steps:\n", "\n", "- MobileNet preprocessing: Keras provides its own preprocessing function for MobileNet; it is exposed here as a DNIKit [Processor](https://apple.github.io/dnikit/api/dnikit/processors.html#dnikit.processors.Processor) so it can be chained together with the other pre-/post-processing stages\n", "- resize images to (224, 224) to fit the input of MobileNet, using an [ImageResizer](https://apple.github.io/dnikit/api/dnikit/processors.html#dnikit.processors.ImageResizer)\n", "- feed the data, stored under the \"images\" field, to the input layer of MobileNet (the model stage below auto-detects its input layer and connects the \"images\" field to it). To learn how to read input and output layers from a loaded DNIKit model, please read through the [Dataset Errors and Rare Samples example notebook](familiarity_for_rare_data_discovery.ipynb).\n", "- run inference and extract intermediate embeddings (here, just `conv_pw_13`; other layers can be added, e.g. any layer found by inspecting the loaded model). 
Again, please see the [Dataset Errors and Rare Samples example notebook](familiarity_for_rare_data_discovery.ipynb) for more of a guide on this piece.\n", "- max pool the responses, using a DNIKit [Pooler](https://apple.github.io/dnikit/api/dnikit/processors.html#dnikit.processors.Pooler), before they are analyzed by the Dataset Report" ] }, { "cell_type": "code", "execution_count": null, "id": "70e56ac5", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Chain together all operations around running the data through the model\n", "model_stages = (\n", "    mobilenet_preprocessor,\n", "\n", "    ImageResizer(pixel_format=ImageFormat.HWC, size=(224, 224)),\n", "\n", "    # Run inference with MobileNet and extract intermediate embeddings\n", "    # (here, just `conv_pw_13`, but other layers can be added)\n", "    # Note: this stage auto-detects the input layer and connects the 'images' field to it\n", "    mobilenet.model(requested_responses=['conv_pw_13']),\n", "\n", "    Pooler(dim=(1, 2), method=Pooler.Method.MAX)\n", ")" ] }, { "cell_type": "markdown", "id": "83213dc6", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Dataset Report: (4) Chain together pipeline stages\n", "\n", "Create the DNIKit [pipeline](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.base.pipeline) from the base CIFAR-10 Producer written earlier, followed by the unwrapped tuple of model-related [PipelineStages](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.base.PipelineStage) defined in the prior cell.\n", "\n", "No processing is performed at this point; the stages will be executed when the Dataset Report \"introspects\"." ] }, { "cell_type": "code", "execution_count": null, "id": "d93cec27", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Finally, put it all together!\n", "producer = pipeline(\n", "    # Original data producer that will yield batches\n", "    cifar10_producer,\n", "\n", "    # Unwrap the tuple of pipeline stages that performs model inference and pre-/post-processing\n", "    *model_stages,\n", "\n", "    # Cache responses so the data can be reused in later cells\n", "    Cacher()\n", ")" ] }, { "cell_type": "markdown", "id": "dc83dd1a", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Dataset Report: (5) Run Dataset Report. Introspect!\n", "\n", "All compute is performed in this step by pulling batches through the entire pipeline.\n", "\n", "[DatasetReport](https://apple.github.io/dnikit/api/dnikit/introspectors.html#dnikit.introspectors.DatasetReport) [introspect](https://apple.github.io/dnikit/api/dnikit/introspectors.html#dnikit.introspectors.DatasetReport.introspect) takes the data through the pipeline and returns a report object. This object contains `data`, a pandas DataFrame with metadata about each data sample (familiarity, duplicates, overall summary, projection) that can be passed to [Symphony](https://github.com/apple/ml-symphony).\n", "\n", "In the next cell, a [ReportConfig](https://apple.github.io/dnikit/api/dnikit/introspectors.html#dnikit.introspectors.ReportConfig) is passed as input to skip the familiarity component. This is simply for the speed of this example notebook. To run all components, omit the config parameter from the `DatasetReport.introspect` call.\n", "\n", "The next cell also displays the first five rows of the resulting `report.data`, which the Symphony UI can read."
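, "\n", "\n", "Since `report.data` is an ordinary pandas DataFrame, it can also be inspected directly once `introspect` has run. A small sketch (the exact columns depend on which report components were enabled):\n", "\n", "```\n", "# Sketch: explore the report's backing DataFrame with ordinary pandas\n", "print(report.data.shape)      # rows = data samples, columns = report fields\n", "print(report.data.columns)    # column names vary with the enabled report components\n", "print(report.data.sample(5))  # spot-check a few random rows\n", "```"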
] }, { "cell_type": "code", "execution_count": null, "id": "0acfd945", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# The most time consuming, since all compute is done here\n", "# Data passed through DNIKit in batches to produce the backend data table that will be displayed by Symphony\n", "\n", "no_familiarity_config = ReportConfig(\n", " familiarity=None,\n", ")\n", "\n", "report = DatasetReport.introspect(\n", " producer,\n", " config=no_familiarity_config # Comment this out to run the whole Dataset Report\n", ")\n", "\n", "print(report.data.head())" ] }, { "cell_type": "markdown", "id": "4a109302", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Dataset Report: (6) Visualization\n", "\n", "Remember that this example operates on a subset of the dataset in this example. For interesting findings, and for the comprehensive report, omit:\n", "\n", "- `max_data` param from `Cifar10Producer` to run on all 60,000 images\n", "- `config` param for `DatasetReport.introspect` to build all components of the Dataset Report\n", "\n", "To visualize the results, the resulting ``report`` can be fed into the [Symphony UI](https://github.com/apple/ml-symphony). To save ``report`` to disk, run:\n", "\n", "```\n", "report.to_disk('./report_files/')\n", "```\n", "where ``./report_files`` can be any path. This Pandas DataFrame can then be loaded and fed into Symphony." ] } ], "metadata": { "finalized": { "timestamp": 1638994625349, "trusted": true }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.8" } }, "nbformat": 4, "nbformat_minor": 5 }