{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# DNIKit Familiarity: Dataset Distribution\n",
    "\n",
    "Compare the distributions of two datasets, e.g. train/test datasets, synthetic/real datasets, etc.\n",
    "\n",
    "Please see the [doc page](https://apple.github.io/dnikit/introspectors/data_introspection/familiarity.html#use-case-comparing-dataset-distributions) for a discussion on applying [Familiarity](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.introspectors.Familiarity) to dataset distribution analysis, including what actions can be taken to improve the dataset.\n",
    "\n",
    "For a more detailed guide on using all of these DNIKit components, try the [Familiarity for Rare Data Discovery](familiarity_for_rare_data_discovery.ipynb) Notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# Don't run this cell if stochasticity is desired\n",
    "import numpy as np\n",
    "np.random.seed(42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Optional: Download MobileNet and CIFAR-10\n",
    "This example uses [MobileNet](https://keras.io/api/applications/mobilenet/) (trained on ImageNet) and [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html), but feel free to use any other model and dataset. This notebook uses [TFModelExamples](https://apple.github.io/dnikit/api/dnikit_tensorflow/index.html#dnikit_tensorflow.TFModelExamples) and [TFDatasetExamples](https://apple.github.io/dnikit/api/dnikit_tensorflow/index.html#dnikit_tensorflow.TFDatasetExamples) to load in MobileNet and CIFAR-10. Please see the DNIKit docs for information about how to [load a model](https://apple.github.io/dnikit/how_to/connect_model.html) or [dataset](https://apple.github.io/dnikit/how_to/connect_data.html). [This page](https://apple.github.io/dnikit/how_to/connect_model.html) also describes how responses can be collected outside of DNIKit, and passed into Familiarity via a [Producer](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.base.Producer)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "##########################\n",
    "# User-Defined Variables #\n",
    "##########################\n",
    "\n",
    "# Change the following labels to see which labels are more familiar.\n",
    "#   The example illustrates a comparison between the distributions of the train\n",
    "#   and test sets for automobiles, for 100 images.\n",
    "TRAIN_CLASS_LABEL = 'automobile'\n",
    "TEST_CLASS_LABEL = 'automobile'\n",
    "N_SAMPLES = 100"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from dnikit.processors import ImageResizer, SnapshotSaver\n",
    "from dnikit.base import Batch, PixelFormat, pipeline, ImageFormat\n",
    "from dnikit_tensorflow import TFDatasetExamples, TFModelExamples\n",
    "\n",
    "# Load CIFAR10 dataset and feed into MobileNet,\n",
    "# observing responses from layer conv_pw_13\n",
    "mobilenet = TFModelExamples.MobileNet()\n",
    "mobilenet_preprocessor = mobilenet.preprocessing\n",
    "assert mobilenet_preprocessor is not None\n",
    "\n",
    "# Load CIFAR-10 with train and test datasets, and\n",
    "# attach metadata (labels, dataset origins, image filepaths) to each batch\n",
    "cifar10 = TFDatasetExamples.CIFAR10(attach_metadata=True)\n",
    "\n",
    "# Create pre-processing pipeline\n",
    "preprocessing_stages = (\n",
    "    # Save a snapshot of the raw image data to refer back to later\n",
    "    SnapshotSaver(),\n",
    "\n",
    "    # Preprocess the image batches in the manner expected by MobileNet\n",
    "    mobilenet_preprocessor,\n",
    "    \n",
    "    # Resize images to fit the input of MobileNet, (224, 224) using an ImageResizer\n",
    "    ImageResizer(pixel_format=ImageFormat.HWC, size=(224, 224)),\n",
    ")\n",
    "\n",
    "# Create producers for subsets of the dataset for comparing train / test distribution\n",
    "# :: Note: The subset method will filter the batch LABELS metadata matching the provided dict\n",
    "data_producers = {\n",
    "    'train': cifar10.subset(labels=[TRAIN_CLASS_LABEL], datasets=[\"train\"], max_samples=N_SAMPLES),\n",
    "    'test': cifar10.subset(labels=[TRAIN_CLASS_LABEL], datasets=[\"test\"], max_samples=N_SAMPLES),\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Put it all together to produce familiarity scores\n",
    "For a more detailed breakdown of these steps, see the [Familiarity for Rare Data Discovery](familiarity_for_rare_data_discovery.ipynb) Notebook.\n",
    "\n",
    "### A. Define user variables\n",
    "First define some user variables, which can be modified to play around with different classes, or different datasets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### B. Create producers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from dnikit.processors import Cacher, Pooler\n",
    "\n",
    "producers = {\n",
    "    split: pipeline(\n",
    "        data_producers[split],\n",
    "        \n",
    "        # Apply previously-defined preprocessing stages for Mobilenet & CIFAR\n",
    "        *preprocessing_stages,\n",
    "\n",
    "        # run inference -- pass a list of requested responses or a single string\n",
    "        mobilenet.model('conv_pw_13'),\n",
    "\n",
    "        # perform spatial max pooling on the result\n",
    "        Pooler(dim=(1, 2), method=Pooler.Method.MAX),\n",
    "\n",
    "        # Cache results to re-run the pipeline later without recomputing the responses\n",
    "        Cacher()\n",
    "    )\n",
    "    for split in ('train', 'test')\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Reduce dimensionality of responses"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from dnikit.introspectors import DimensionReduction\n",
    "\n",
    "# Configure the DimensionReduction Introspector\n",
    "#    The dimensionality of the data will be reduced from 1024 to 40\n",
    "n_dim = 40\n",
    "\n",
    "# Trigger the pipeline & fit the PCA model on the train dataset, which will used as the base\n",
    "pca = DimensionReduction.introspect(producers[\"train\"], strategies=DimensionReduction.Strategy.PCA(n_dim))\n",
    "\n",
    "# Apply the PipelineStage pca object to both train/test pipelines to reduce responses in all batches to a lower dimension\n",
    "reduced_producers = {\n",
    "    name: pipeline(producer, pca)\n",
    "    for name, producer in producers.items()\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Build Familiarity model on train & test data combined"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from dnikit.introspectors import Familiarity\n",
    "\n",
    "# The Familiarity model is first fit on the base dataset, which is \"train\" in this case\n",
    "#   Trigger pipeline & run DNIKit Familiarity, default strategy is Familiarity.Strategy.GMM\n",
    "familiarity = Familiarity.introspect(reduced_producers['train'])\n",
    "\n",
    "# Use dict-comprehension to apply familiarity to the train and test datasets individually\n",
    "scored_producers = {\n",
    "    producer_name : pipeline(\n",
    "        cached_response_producer,\n",
    "        familiarity\n",
    "    )\n",
    "    # reduced_producers maps 'train'/'test' to the split's reduced producer\n",
    "    for producer_name, cached_response_producer in reduced_producers.items()\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Compute familiarity likelihood score\n",
    "\n",
    "Produce the final familiarity likelihood score. \n",
    "\n",
    "- If the likelihood score is close to 0, both distributions are equivalent.\n",
    "- Typically, the train dataset's mean log score will be smaller than the test dataset's, since familiarity was fit to this first/train dataset. The more negative the overall likelihood score is, the larger the distribution gap. One of the datasets is likely in need of being re-collected.\n",
    "- It may still happen that the likelihood score is greater than 0. This is also explained by a distribution gap, and will require analysis and possibly data re-collection.\n",
    "\n",
    "Please refer to the [doc page](https://apple.github.io/dnikit/introspectors/data_introspection/familiarity.html#use-case-comparing-dataset-distributions) for more information, and check out the other Familiarity use case, [discovering rare samples](https://apple.github.io/dnikit/introspectors/data_introspection/familiarity.html#use-case-finding-dataset-errors), or the [DatasetReport](https://apple.github.io/dnikit/introspectors/data_introspection/dataset_report.html) to evaluate why there is a distribution gap."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from dnikit.base import Producer\n",
    "\n",
    "def compute_score_mean(producer: Producer, response_name: str, meta_key: Batch.DictMetaKey) -> float:\n",
    "    \"\"\" Compute mean of score, for given metadata key, response name, and producer \"\"\"\n",
    "    scores = [\n",
    "        batch.metadata[meta_key][response_name][index].score\n",
    "        for batch in producer(32)\n",
    "        for index in range(batch.batch_size)\n",
    "    ]\n",
    "    return np.mean(scores)\n",
    "\n",
    "# Trigger remaining pipeline, compute mean of familiarity scores for both train and test datasets\n",
    "stats = {\n",
    "    producer_name : compute_score_mean(\n",
    "        producer=producer,\n",
    "        response_name=\"conv_pw_13\",\n",
    "        meta_key=familiarity.meta_key\n",
    "    )\n",
    "    # scored_producers maps 'train'/'test' to the split's scored producer\n",
    "    for producer_name, producer in scored_producers.items()\n",
    "}\n",
    "\n",
    "familiarity_ratio = stats['test'] - stats['train']\n",
    "print(f\"Likelihood ratio [{TRAIN_CLASS_LABEL}]->[{TEST_CLASS_LABEL}] = {familiarity_ratio:0.4f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.8"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "metadata": {
     "collapsed": false
    },
    "source": []
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}