{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Duplicates: Find Near Duplicate Data Samples\n", "\n", "Discover near duplicate data samples, which might indicate that there is not enough variation in the dataset.\n", "\n", "Please see the [doc page](https://apple.github.io/dnikit/introspectors/data_introspection/duplicates.html) on near duplicate discovery and removal for a description of the problem, the\n", "[Duplicates](https://apple.github.io/dnikit/api/dnikit/introspectors.html#dnikit.introspectors.Duplicates) introspector,\n", "and what actions can be taken to improve the dataset.\n", "\n", "For a more detailed guide on using all of these DNIKit components, try the [Familiarity for Rare Data Discovery Notebook](familiarity_for_rare_data_discovery.ipynb).\n", "\n", "## 1. Setup\n", "\n", "Here, the required imports are grouped together, and then the desired paths are set." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import logging\n", "logging.basicConfig(level=logging.INFO)\n", "\n", "from dnikit.base import pipeline, ImageFormat\n", "from dnikit.processors import Cacher, ImageResizer, Pooler\n", "from dnikit_tensorflow import TFDatasetExamples, TFModelExamples\n", "\n", "# For future protection, any deprecated DNIKit features will be treated as errors\n", "from dnikit.exceptions import enable_deprecation_warnings\n", "enable_deprecation_warnings()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Optional: Download MobileNet and CIFAR-10\n", "This example uses [MobileNet](https://keras.io/api/applications/mobilenet/) (trained on ImageNet) and [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html), but feel free to use any other model and dataset. This notebook also uses [TFModelExamples](https://apple.github.io/dnikit/api/dnikit_tensorflow/index.html#dnikit_tensorflow.TFModelExamples) and [TFDatasetExamples](https://apple.github.io/dnikit/api/dnikit_tensorflow/index.html#dnikit_tensorflow.TFDatasetExamples) to load in MobileNet and CIFAR-10. Please see the docs for information about how to [load a model](https://apple.github.io/dnikit/how_to/connect_model.html) or [dataset](https://apple.github.io/dnikit/how_to/connect_data.html). [This doc page](https://apple.github.io/dnikit/how_to/connect_model.html) also describes how responses can be collected outside of DNIKit, and passed into Familiarity via a [Producer](https://apple.github.io/dnikit/api/dnikit/base.html#dnikit.base.Producer).\n", "\n", "**No processing is done at this point!** Batches are only pulled from the Producer and through the pipeline when the Dataset Report \"introspects\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Load CIFAR10 dataset\n", "# :: Note: max_samples makes it only go up to that many data samples. 
{ "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Load the CIFAR10 dataset\n", "cifar10 = TFDatasetExamples.CIFAR10(attach_metadata=True)\n", "\n", "# Create a subset of the CIFAR dataset that produces only automobiles\n", "# :: Note: max_samples caps how many data samples are produced.\n", "# :: Remove it to run on the entire dataset.\n", "cifar10_cars = cifar10.subset(datasets=['train'], labels=['automobile'], max_samples=1000)\n", "\n", "# Load the MobileNet model\n", "# :: Note: If loading a model from tensorflow,\n", "# :: see dnikit_tensorflow's load_tf_model_from_path method.\n", "mobilenet = TFModelExamples.MobileNet()\n", "mobilenet_preprocessor = mobilenet.preprocessing\n", "assert mobilenet_preprocessor is not None\n", "\n", "requested_response = 'conv_pw_13'\n", "\n", "# Create a pipeline, feeding CIFAR data into MobileNet,\n", "# and observing responses from layer conv_pw_13\n", "producer = pipeline(\n", "    # Pull from the CIFAR dataset subset that was created earlier,\n", "    # so that only 1000 samples are analyzed in the \"automobile\" class here\n", "    cifar10_cars,\n", "\n", "    # Preprocess the image batches in the manner expected by MobileNet\n", "    mobilenet_preprocessor,\n", "\n", "    # Resize images to fit the input of MobileNet, (224, 224), using an ImageResizer\n", "    ImageResizer(pixel_format=ImageFormat.HWC, size=(224, 224)),\n", "\n", "    # Run inference with MobileNet and extract intermediate embeddings\n", "    # (this time, just `conv_pw_13`, but other layers can be added)\n", "    # :: Note: This auto-detects the input layer and connects up 'images' to it:\n", "    mobilenet.model(requested_responses=[requested_response]),\n", "\n", "    # Max pool the responses before DNIKit processing using a DNIKit Pooler\n", "    Pooler(dim=(1, 2), method=Pooler.Method.MAX),\n", "\n", "    # Cache responses\n", "    Cacher()\n", ")" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### (Optional) Perform Dimension Reduction\n", "\n", "Although the [Duplicates](https://apple.github.io/dnikit/api/dnikit/introspectors.html#dnikit.introspectors.Duplicates) algorithm can run on data with any number of dimensions, it will perform better, with near identical results, if the number of dimensions is reduced. The DNIKit [DimensionReduction](https://apple.github.io/dnikit/api/dnikit/introspectors.html#dnikit.introspectors.DimensionReduction) introspector can be used for this." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from dnikit.introspectors import DimensionReduction\n", "\n", "dimension_reduction = DimensionReduction.introspect(\n", "    producer, strategies=DimensionReduction.Strategy.PCA(40))\n", "reduced_producer = pipeline(\n", "    producer,\n", "    dimension_reduction,\n", ")" ] },
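{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick, optional sanity check, pull a single small batch through the reduced pipeline and confirm that the responses now have the PCA target of 40 dimensions. This is a sketch that assumes the pipeline yields `Batch` objects whose `fields` map response names to numpy arrays." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Pull one batch of 4 samples through the reduced pipeline\n", "for batch in reduced_producer(batch_size=4):\n", "    # Expect a (4, 40) array: 4 samples, 40 PCA dimensions\n", "    print(batch.fields[requested_response].shape)\n", "    break" ] },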
{ "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## 2. Look for duplication in embedding space using the Duplicates introspector\n", "\n", "In this step, use the Duplicates introspector to build the clusters of duplicates. It can consume\n", "`producer` or `reduced_producer` with no other arguments and give good results.\n", "\n", "In some cases, it might be necessary to refine the results. The first way to do this is by modifying the `threshold` parameter of the `introspect()` function, of type `Duplicates.ThresholdStrategy`, or by using the `percentile` or\n", "`close_sensitivity` parameters of the `introspect()` call. By default, the introspector will use a\n", "dynamic method to find the bend in the sorted distances. The sensitivity can be controlled with\n", "`close_sensitivity` -- the default value is 5. Setting it to 2 will give more duplicates, and a sensitivity of\n", "20 will produce fewer. The `percentile` option uses the given percentile of the distances as the \"close\"\n", "threshold. For example, a `percentile` value of `99.5` would take the 99.5th percentile distance and consider that\n", "\"close\" when building thresholds. A lower value would produce more duplicates.\n", "\n", "The other way to refine the results is to sort or filter the clusters, as sketched after the next cell. They can be sorted by the\n", "mean distance to the centroid of the cluster to see the tightest clusters first, or clusters larger than some size `N`\n", "can be filtered out to avoid seeing any very large clusters." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from dnikit.introspectors import Duplicates\n", "\n", "duplicates = Duplicates.introspect(reduced_producer)\n", "\n", "print(f\"Found {len(duplicates.results[requested_response])} unique clusters of duplicates.\")" ] },
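{ "cell_type": "markdown", "metadata": {}, "source": [ "The cell below sketches the two refinements described above. The sorting and filtering step relies on the same cluster attributes (`mean`, `batch`) used by the plotting code later in this notebook. The commented-out `threshold` variant is an assumption about the exact `ThresholdStrategy` spelling; check the [Duplicates API docs](https://apple.github.io/dnikit/api/dnikit/introspectors.html#dnikit.introspectors.Duplicates) for the available options before using it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dnikit.base import Batch\n", "\n", "# 1) Sort clusters by mean distance to centroid (tightest first), then\n", "#    drop clusters with more than N members\n", "N = 10\n", "all_clusters = duplicates.results[requested_response]\n", "tight_small_clusters = [\n", "    cluster\n", "    for cluster in sorted(all_clusters, key=lambda c: c.mean)\n", "    if len(cluster.batch.metadata[Batch.StdKeys.IDENTIFIER]) <= N\n", "]\n", "print(f\"{len(tight_small_clusters)} clusters with at most {N} members.\")\n", "\n", "# 2) Re-run introspection with an explicit threshold strategy\n", "#    (assumed spelling -- see the Duplicates API docs):\n", "# duplicates = Duplicates.introspect(\n", "#     reduced_producer,\n", "#     threshold=Duplicates.ThresholdStrategy.Percentile(99.5),\n", "# )" ] },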
{ "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## 3. Manual analysis of near duplicates - Images\n", "\n", "Now that sets of the most similar images have been computed, visually inspect them and note which data to consider removing before training. All data displayed are unique.\n", "\n", "As mentioned in the introduction, it's not always bad to have similar samples in the dataset. However, it is bad practice to have duplicates across training and testing datasets, and it's important to consider what the *effective* training/testing dataset size (including class breakdowns) really is. Remember that slight variation of a data sample for training can be achieved with data augmentation, applied uniformly across all training samples instead of just to a few specific samples as part of the stored dataset.\n", "\n", "\n", "In general, it's possible to evaluate the following criteria with the sets of near duplicates:\n", "\n", "1. **Set contains train + test images:** Are there near identical data samples that span the train and test sets for that class?\n", "2. **Set contains only train images:** For near duplicates in the training set, is the network learning anything more by including all near duplicates? Could some be replaced with good data augmentation methods?\n", "3. **Set contains only test images:** For near duplicates in the test set, is anything new learned about the model if it is able to classify all the near-duplicate data samples? Does this skew the reported accuracy?\n", "\n", "Notice that in the following results, some of the images in the same set look very different (e.g. Cluster 8), but in other clusters (e.g. Clusters 1, 2, 3, etc.), the images look nearly identical. This is why it's important to perform some manual inspection of the data before removing samples.\n", "\n", "Even if these images do not appear similar to the human eye, they are closest in the current embedding space, for the chosen layer response name. This can still provide useful information about what the network is using to distinguish between data samples. It's recommended to look for what unites the images and, if desired, collect more data to help the network distinguish between samples that share the uniting feature.\n", "\n", "**Anecdote:** In this notebook, a small dataset was chosen for efficiency purposes. However, duplicate analysis has been run on all of CIFAR-10, across the train and test datasets, and an extremely large number of duplicates was found. CIFAR-10 is a widely used dataset for training and evaluation purposes, but the fact that there are so many duplicates calls into question the validity of the accuracy scores reported on this dataset for different models, especially with respect to model generalization. This is not the first time the issue has been noticed, yet the dataset remains widely used in the ML community. It's a lesson for all of us to always look at the data.\n", "\n", "For now, grab all the images into a data structure in order to index into them and visualize the duplicate clusters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "images = [\n", "    element.fields['samples']\n", "    for b in cifar10_cars(batch_size=64)\n", "    for element in b.elements\n", "]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from dnikit.base import Batch\n", "import matplotlib.pyplot as plt\n", "\n", "clusters = duplicates.results[requested_response]\n", "sorted_clusters = sorted(clusters, key=lambda x: x.mean)\n", "\n", "for cluster_number, cluster in enumerate(sorted_clusters):\n", "    print(f'Cluster {cluster_number + 1}, mean={cluster.mean}')\n", "\n", "    # Indices of the data samples that belong to this cluster\n", "    img_idx_list = cluster.batch.metadata[Batch.StdKeys.IDENTIFIER]\n", "\n", "    f, axarr = plt.subplots(1, len(img_idx_list), squeeze=False, figsize=(15, 6))\n", "    for i, img_idx in enumerate(img_idx_list):\n", "        assert isinstance(img_idx, int), \"For this example plot, the identifier should be an int\"\n", "        axarr[0, i].imshow(images[img_idx])\n", "        plt.setp(axarr[0, i].spines.values(), lw=2)\n", "        axarr[0, i].yaxis.set_major_locator(plt.NullLocator())\n", "        axarr[0, i].xaxis.set_major_locator(plt.NullLocator())\n", "        axarr[0, i].set_title(f'Data Index {img_idx}')\n", "    plt.show()\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.8" } }, "nbformat": 4, "nbformat_minor": 4 }