{ "cells": [ { "cell_type": "markdown", "id": "fc6de30c", "metadata": {}, "source": [ "# DimensionReduction: Visualize Embedding Space" ] }, { "cell_type": "markdown", "id": "fc37b224", "metadata": {}, "source": [ "DNIKit provides a [DimensionReduction](https://apple.github.io/dnikit/api/dnikit/introspectors.html#dnikit.introspectors.DimensionReduction) introspector with a variety of strategies (dimension reduction algorithms). ``DimensionReduction`` has two primary uses:\n", "\n", "- reduce high dimension data to something lower for consumption by a different introspector\n", " - for example [Familiarity](https://apple.github.io/dnikit/api/dnikit/introspectors.html#dnikit.introspectors.Familiarity) works best on low dimension data, e.g. 40 dimensions\n", " - many embedding spaces are much higher dimension, e.g. 1024\n", " - in fact some of the ``DimensionReduction`` algorithms work best on lower dimension data\n", " - this is typically accomplished with using the [PCA](https://apple.github.io/dnikit/api/dnikit/introspectors.html#dnikit.introspectors.DimensionReduction.Strategy.PCA) strategy first\n", "- reduction to 2D or 3D for visualization\n", " - there are several strategies that do a good job\n", "\n", "This notebook shows examples of both of these: reducing high dimension data to lower dimension and producing 2D projections for visualization.\n", "\n", "The examples here use CIFAR-10 (image classification dataset with 10 classes) and `mobilenet_v2` to produce an embedding space." ] }, { "cell_type": "markdown", "id": "92b4467f", "metadata": {}, "source": [ "First, import all the packages needed:" ] }, { "cell_type": "code", "execution_count": null, "id": "2f324033", "metadata": { "scrolled": true }, "outputs": [], "source": [ "import numpy as np\n", "import typing as t\n", "import dataclasses\n", "\n", "# dnikit / dimension reduction\n", "from dnikit.base import (\n", " Batch, Producer, pipeline,\n", " ImageFormat\n", ")\n", "from dnikit.processors import (\n", " Processor,\n", " ImageResizer,\n", " Cacher,\n", " Pooler,\n", ")\n", "from dnikit.introspectors import DimensionReduction\n", "\n", "# for inference\n", "import logging\n", "from dnikit_tensorflow import TFModelExamples, TFDatasetExamples\n", "\n", "# graphing\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "id": "1f6079c4", "metadata": {}, "source": [ "## Dataset + Model to Produce Embeddings\n", "\n", "Next load the CIFAR10 dataset and MobileNet to produce the embedding using [TFModelExamples](https://apple.github.io/dnikit/api/dnikit_tensorflow/index.html#dnikit_tensorflow.TFModelExamples) and [TFDatasetExamples](https://apple.github.io/dnikit/api/dnikit_tensorflow/index.html#dnikit_tensorflow.TFDatasetExamples). See [load a model](https://apple.github.io/dnikit/how_to/connect_model.html) or [load data](https://apple.github.io/dnikit/how_to/connect_data.html) for more information about how to customize this step to specific needs." ] }, { "cell_type": "code", "execution_count": null, "id": "5c7acd2d", "metadata": {}, "outputs": [], "source": [ "logging.basicConfig(level=logging.INFO)\n", "\n", "# Load MobileNet and CIFAR10\n", "cifar10 = TFDatasetExamples.CIFAR10(attach_metadata=True)\n", "cifar10_test = cifar10.subset(datasets=[\"test\"])\n", "\n", "mobilenet = TFModelExamples.MobileNet()" ] }, { "cell_type": "markdown", "id": "b979f80a", "metadata": {}, "source": [ "Define the inference pipeline -- it will produce a field named `'conv_pw_13'` that has the embedding to visualize. This is a 1024-dimension vector to be shown in 2 dimensions." ] }, { "cell_type": "code", "execution_count": null, "id": "03506fc8", "metadata": {}, "outputs": [], "source": [ "mobilenet_preprocessor = mobilenet.preprocessing\n", "assert mobilenet_preprocessor is not None\n", "\n", "preprocessing_stages = (\n", "\n", " # First, grab MobileNet's preprocessing acts as a DNIKit Processor\n", " mobilenet_preprocessor,\n", "\n", " # Next, define the second processor for the data, - ImageResizer to 224x224.\n", " ImageResizer(pixel_format=ImageFormat.HWC, size=(224, 224)),\n", ")\n", "\n", "producer = pipeline(\n", " # Use the cifar10 test dataset\n", " cifar10_test,\n", "\n", " # Apply previously-defined preprocessing stages for Mobilenet & CIFAR\n", " *preprocessing_stages,\n", "\n", " # run inference -- pass a list of requested responses or a single string\n", " mobilenet.model(requested_responses='conv_pw_13'),\n", "\n", " # perform spatial max pooling on the result\n", " Pooler(dim=(1, 2), method=Pooler.Method.MAX),\n", "\n", " # Cache results to re-run the pipeline without recomputing the responses\n", " Cacher()\n", ")\n", "\n", "BATCH_SIZE = 128" ] }, { "cell_type": "markdown", "id": "474ac931", "metadata": {}, "source": [ "## Visualization\n", "\n", "First a helper function that will extract the 2d projection and build a graph with one scatter plot per class (type of image, e.g. car or frog) -- this allows us to assign a different color per class." ] }, { "cell_type": "code", "execution_count": null, "id": "4b574819", "metadata": {}, "outputs": [], "source": [ "from collections import defaultdict\n", "\n", "def show(producer: Producer) -> None:\n", " data: t.Dict[t.Hashable, t.Any] = defaultdict(list)\n", " for batch in producer(1000):\n", " for e in batch.elements:\n", " f = e.fields[\"conv_pw_13\"]\n", " c = e.metadata[Batch.StdKeys.LABELS][\"label\"]\n", " data[c] = np.append(data[c], f)\n", "\n", " fig, ax = plt.subplots(dpi=200)\n", " for c, d in data.items():\n", " ax.scatter(d[::2], d[1::2], c=f'C{c}', s=2)" ] }, { "cell_type": "markdown", "id": "99d7f15e", "metadata": {}, "source": [ "## Dimension Reduction\n", "\n", "Dimension reduction is a two-step process. First, the [DimensionReduction.introspect()](https://apple.github.io/dnikit/api/dnikit/introspectors.html#dnikit.introspectors.DimensionReduction.introspect) call is made to observe the contents of a pipeline and fit a model to the data. This produces a ``reducer`` that can be used to reduce the data:\n", "\n", "```\n", "reducer = DimensionReduction.introspect(producer, strategies=DimensionReduction.Strategy.PCA(40))\n", "```\n", "\n", "This means that all of the algorithms require visiting the data twice: once to fit and again to do the projection. All but PCA require accumulating the entire dataset in memory in order to produce the fit. Because the data is read twice, it's suggested to ensure that the pipeline being used is either very fast or use a [Cacher](https://apple.github.io/dnikit/api/dnikit/processors.html#dnikit.processors.Cacher) to cache the results of the inference.\n", "\n", "In the case of PCA, the reducer can be applied to any other (hopefully related) pipeline. The other algorithms must apply to the same data that they were fit to.\n", "\n", "```\n", "# apply the projection to the original 1024-dimension data -- reduced_producer\n", "# produces batches with 40 dimensions.\n", "reduced_producer = pipeline(producer, reducer)\n", "```" ] }, { "cell_type": "markdown", "id": "721551c4", "metadata": {}, "source": [ "### PCA\n", "\n", "Principal Component Analysis can be used to do dimension reduction to any number of target dimensions:\n", "\n", "> PCA is used to decompose a multivariate dataset in a set of successive orthogonal components that explain a maximum amount of the variance. In scikit-learn, PCA is implemented as a transformer object that learns n components in its fit method, and can be used on new data to project it on these components.\n", "\n", "PCA is fast and convenient to use, but it doesn't always give the best result for reducing data to 2D for visualization. Per the UMAP documentation:\n", "\n", "> PCA is linear transformation that can be applied to mostly any kind of data in an unsupervised fashion. Also it works really fast. For most real world tasks its embeddings are mostly too simplistic / useless.\n", "\n", "- https://scikit-learn.org/stable/modules/decomposition.html#pca\n", "\n", "**Pros**\n", "\n", "- can be used in a streaming fashion (does not require full dataset to be collected in memory to do the fit)\n", "- suitable for reducing e.g. 1024 -> 40 dimensions for use in other algorithms\n", " - many dimension reduction and other algorithms explicitly suggest this technique\n", " - it removes some of the \"noise\" of very high dimensions\n", "- very fast\n", "\n", "**Cons**\n", "\n", "- does not produce a very useful 2D projection" ] }, { "cell_type": "code", "execution_count": null, "id": "78bd494a", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# introspect the producer (see the preceding pipeline) and produce\n", "# a pipeline stage that can do a 2d projection.\n", "pca = DimensionReduction.introspect(\n", " producer, batch_size=BATCH_SIZE,\n", " strategies=DimensionReduction.Strategy.PCA(2)\n", ")\n", "\n", "# apply the 2d projection to the original 1024-dimension data\n", "pca_reduced = pipeline(producer, pca)\n", "\n", "# and display it\n", "show(pca_reduced)" ] }, { "cell_type": "markdown", "id": "98c92cce", "metadata": {}, "source": [ "#### Partial Reduction for Other Algorithms\n", "\n", "As mentioned previously, many algorithms work better with lower dimension inputs, e.g. 40 dimensions, and dimension reduction algorithms (other than PCA) are no exception. This has the added benefit of reducing the memory used when accumulating the data for the fit (first pass).\n", "\n", "- https://umap-learn.readthedocs.io/en/0.5dev/faq.html\n", "\n", "Prepare a PCA reducer that produces 40 dimension output and a pipeline that encapsulates that." ] }, { "cell_type": "code", "execution_count": null, "id": "0f341736", "metadata": {}, "outputs": [], "source": [ "partial_reducer = DimensionReduction.introspect(\n", " producer, batch_size=BATCH_SIZE,\n", " strategies=DimensionReduction.Strategy.PCA(40))\n", "partially_reduced = pipeline(producer, partial_reducer)" ] }, { "cell_type": "markdown", "id": "d5e7fa19", "metadata": {}, "source": [ "### t-SNE\n", "\n", "t-distributed Stochastic Neighbor Embedding (t-SNE) is a more sophisticated dimension reduction technique that tries to maintain local structure.\n", "\n", "> t-SNE (TSNE) converts affinities of data points to probabilities. The affinities in the original space are represented by Gaussian joint probabilities and the affinities in the embedded space are represented by Student’s t-distributions. This allows t-SNE to be particularly sensitive to local structure and has a few other advantages over existing techniques:\n", "> - Revealing the structure at many scales on a single map\n", "> - Revealing data that lie in multiple, different, manifolds or clusters\n", "> - Reducing the tendency to crowd points together at the center\n", "\n", "- [How to Use t-SNE Effectively](https://distill.pub/2016/misread-tsne/)\n", "- [SciKit Learn Description](https://scikit-learn.org/stable/modules/manifold.html#t-sne)\n", "\n", "\n", "It has a number of tuning parameters, the most important being `perplexity`:\n", "\n", "- [Parameter Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html)\n", "- [Exploring Perplexity](https://scikit-learn.org/stable/auto_examples/manifold/plot_t_sne_perplexity.html#sphx-glr-auto-examples-manifold-plot-t-sne-perplexity-py)\n", "\n", "Use the 40 dimensional projection defined earlier and use t-SNE to do the further reduction to 2D. This produces a much clearer separation of classes than PCA was able to obtain.\n", "\n", "**Pros**\n", "\n", "- better 2D projection than PCA\n", "\n", "**Cons**\n", "\n", "- slow\n", "- `UMAP` (shown later) is generally preferred" ] }, { "cell_type": "code", "execution_count": null, "id": "8327772a", "metadata": {}, "outputs": [], "source": [ "tsne = DimensionReduction.introspect(\n", " partially_reduced, batch_size=BATCH_SIZE, \n", " strategies=DimensionReduction.Strategy.TSNE(2))\n", "tsne_reduced = pipeline(partially_reduced, tsne)\n", "\n", "show(tsne_reduced)" ] }, { "cell_type": "markdown", "id": "27bb809c", "metadata": {}, "source": [ "### UMAP\n", "\n", "UMAP is another popular dimension reduction technique:\n", "\n", "> Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data\n", ">\n", "> - The data is uniformly distributed on Riemannian manifold;\n", "> - The Riemannian metric is locally constant (or can be approximated as such);\n", "> - The manifold is locally connected.\n", "\n", "- [Understanding UMAP](https://pair-code.github.io/understanding-umap/)\n", "- [Documentation](https://umap-learn.readthedocs.io/en/0.5dev/index.html) -- include tips on parameters and how to use it\n", "- [Paper](https://arxiv.org/abs/1802.03426)\n", "\n", "Shown in the next cell, UMAP provides an even greater separation of the embedding space with regard to class.\n", "\n", "**Pros**\n", "\n", "- produces very good 2D projections\n", "- local structure is preserved\n", "- faster than t-SNE but slower than PCA\n", "\n", "**Cons**\n", "\n", "- does not preserve global structure, see `PaCMAP` later" ] }, { "cell_type": "code", "execution_count": null, "id": "f108bbdd", "metadata": {}, "outputs": [], "source": [ "umap = DimensionReduction.introspect(\n", " partially_reduced, batch_size=BATCH_SIZE,\n", " strategies=DimensionReduction.Strategy.UMAP(2))\n", "\n", "umap_reduced = pipeline(partially_reduced, umap)\n", "\n", "show(umap_reduced)" ] }, { "cell_type": "markdown", "id": "6e6fc90b", "metadata": {}, "source": [ "### PaCMAP\n", "\n", "The final DimensionReduction algorithm integrated into DNIKit is called PaCMAP:\n", "\n", "> PaCMAP (Pairwise Controlled Manifold Approximation) is a dimensionality reduction method that can be used for visualization, preserving both local and global structure of the data in original space. PaCMAP optimizes the low dimensional embedding using three kinds of pairs of points: neighbor pairs (pair_neighbors), mid-near pair (pair_MN), and further pairs (pair_FP).\n", ">\n", "> Previous dimensionality reduction techniques focus on either local structure (e.g. t-SNE, LargeVis and UMAP) or global structure (e.g. TriMAP), but not both, although with carefully tuning the parameter in their algorithms that controls the balance between global and local structure, which mainly adjusts the number of considered neighbors. Instead of considering more neighbors to attract for preserving local structure, PaCMAP dynamically uses a special group of pairs -- mid-near pairs, to first capture global structure and then refine local structure, which both preserve global and local structure. For a thorough background and discussion on this work, please read [this paper](https://jmlr.org/papers/v22/20-1061.html).\n", "\n", "- https://github.com/YingfanWang/PaCMAP\n", "\n", "Much like UMAP, this gives a very strong separation of points in the embedding space with regard to class. This algorithm has a number of tuning parameters that can be used to control it -- please read their website (listed earlier) for more information.\n", "\n", "**Pros**\n", "\n", "- designed to keep both local and global structure" ] }, { "cell_type": "code", "execution_count": null, "id": "85318aec", "metadata": {}, "outputs": [], "source": [ "pacmap = DimensionReduction.introspect(\n", " partially_reduced, batch_size=BATCH_SIZE, \n", " strategies=DimensionReduction.Strategy.PaCMAP(2))\n", "\n", "pacmap_reduced = pipeline(partially_reduced, pacmap)\n", "\n", "show(pacmap_reduced)" ] }, { "cell_type": "code", "execution_count": null, "id": "9b319d7a", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.8" } }, "nbformat": 4, "nbformat_minor": 5 }