{ "metadata": { "name": "09_validation_and_testing" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Measuring Classification Performance: Validation & Testing" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Checking Performance on the Iris Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Previously, we looked at a simplistic example of how to test the performance\n", "of a classifier. Using the iris data set, it looked something like this:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Get the data\n", "from sklearn.datasets import load_iris\n", "iris = load_iris()\n", "X = iris.data\n", "y = iris.target" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Instantiate and train the classifier\n", "from sklearn.svm import LinearSVC\n", "clf = LinearSVC(loss = 'l2')\n", "clf.fit(X, y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Check input vs. output labels\n", "y_pred = clf.predict(X)\n", "print (y_pred == y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** what might be the problem with this approach?" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "A Better Approach: Cross-Validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Learning the parameters of a prediction function and testing it on the\n", "same data is a methodological mistake: a model that would just repeat\n", "the labels of the samples that it has just seen would have a perfect\n", "score but would fail to predict anything useful on yet-unseen data.\n", "\n", "To avoid over-fitting, we have to define two different sets:\n", "\n", "- a training set X_train, y_train which is used for learning the parameters of a predictive model\n", "- a testing set X_test, y_test which is used for evaluating the fitted predictive model\n", "\n", "In scikit-learn such a random split can be quickly computed with the\n", "`train_test_split` helper function. It can be used this way:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import cross_validation\n", "X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.25, random_state=0)\n", "\n", "print X.shape, X_train.shape, X_test.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we train on the training data, and test on the testing data:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "clf = LinearSVC(loss='l2').fit(X_train, y_train)\n", "y_pred = clf.predict(X_test)\n", "print (y_pred == y_test)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is an issue here, however:\n", "by defining these two sets, we drastically reduce the number\n", "of samples which can be used for learning the model, and the results\n", "can depend on a particular random choice for the pair of (train, test) sets.\n", "\n", "A solution is to split the whole data several consecutive times in different\n", "train set and test set, and to return the averaged value of the prediction\n", "scores obtained with the different sets. 
Such a procedure is called **cross-validation**.\n", "This approach can be computationally expensive, but it does not waste too much data\n", "(as is the case when setting aside an arbitrary test set), which is a major advantage\n", "in problems such as inverse inference where the number of samples is very small.\n", "\n", "We'll explore cross-validation a bit later, but\n", "you can find much more information on cross-validation in scikit-learn here:\n", "http://scikit-learn.org/dev/modules/cross_validation.html\n" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Diving Deeper: Hyperparameters, Over-fitting, and Under-fitting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*The content in this section is adapted from Andrew Ng's excellent\n", "Coursera course, available here:* https://www.coursera.org/course/ml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The issues associated with validation and\n", "cross-validation are some of the most important\n", "aspects of the practice of machine learning. Selecting the optimal model\n", "for your data is vital, and is a part of the problem that is often\n", "under-appreciated by machine learning practitioners.\n", "\n", "Of core importance is the following question:\n", "\n", "**If our estimator is underperforming, how should we move forward?**\n", "\n", "- Use a simpler or a more complicated model?\n", "- Add more features to each observed data point?\n", "- Add more training samples?\n", "\n", "The answer is often counter-intuitive. In particular, **sometimes using a\n", "more complicated model will give _worse_ results.** Also, **sometimes adding\n", "training data will not improve your results.** The ability to determine\n", "which steps will improve your model is what separates successful machine\n", "learning practitioners from the unsuccessful." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "A Simple Regression Problem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this section, we'll work with a simple 1D regression problem. This will help us to\n", "easily visualize the data and the model, and the results generalize easily to higher-dimensional\n", "datasets. We'll explore **polynomial regression**: the fitting of a polynomial to points.\n", "Though this can be accomplished within scikit-learn (the machinery is in `sklearn.linear_model`),\n", "for simplicity we'll use `numpy.polyfit` and `numpy.polyval`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%pylab inline" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "\n", "x = 10 * np.random.random(20)\n", "y = 0.5 * x ** 2 - x + 1\n", "\n", "p = np.polyfit(x, y, deg=2)\n", "print p" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, `np.polyfit` fits a polynomial to one-dimensional data. 
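" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since these points were generated from the exact polynomial 0.5x^2 - x + 1 with no noise added,\n", "the fitted coefficients should recover (0.5, -1, 1) up to floating-point error. A quick\n", "sanity check (this cell is just an illustrative aside):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# With noiseless quadratic data, the degree-2 fit should match the true coefficients closely\n", "print np.allclose(p, [0.5, -1.0, 1.0])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "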
We can\n", "visualize this to see the result:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "x_new = np.linspace(-1, 12, 1000)\n", "y_new = np.polyval(p, x_new)\n", "\n", "plt.scatter(x, y)\n", "plt.plot(x_new, y_new)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've chosen the model to use through the *hyperparameter* `deg`.\n", "\n", "A *hyperparameter* is a parameter that determines the type of\n", "model we use: for example, `deg=1` gives a linear model, `deg=2`\n", "gives a 2nd-order polynomial, etc." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Adding some noise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, what if the data is not a perfect polynomial? Below, we'll take the above\n", "problem and add a small\n", "amount of Gaussian scatter in ``y``. Here we'll take the additional step of computing\n", "the RMS error of the resulting model on the input data." ] }, { "cell_type": "code", "collapsed": false, "input": [ "np.random.seed(42)\n", "x = 10 * np.random.random(20)\n", "y = 0.5 * x ** 2 - x + 1 + np.random.normal(0, 2, x.shape)\n", "\n", "# ---> Change the degree here\n", "p = np.polyfit(x, y, deg=2)\n", "x_new = np.linspace(0, 10, 100)\n", "y_new = np.polyval(p, x_new)\n", "\n", "plt.scatter(x, y)\n", "plt.plot(x_new, y_new)\n", "plt.ylim(-10, 50)\n", "print \"RMS error = %.4g\" % np.sqrt(np.mean((y - np.polyval(p, x)) ** 2))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**What happens to the fit and the RMS error as the degree is increased?**" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Learning Curves and the Bias/Variance Tradeoff" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One way to address this issue is to use what are often called **Learning Curves**.\n", "Given a particular dataset and a model we'd like to fit (e.g. a polynomial), we'd\n", "like to tune our value of the *hyperparameter* `d` (the `deg` argument above) to give us the best fit.\n", "\n", "We'll imagine we have a simple regression problem: given the size of a house, we'd\n", "like to predict how much it's worth. We'll fit it with our polynomial regression\n", "model.\n", "\n", "Run the following code to see an example plot:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from figures import plot_bias_variance\n", "plot_bias_variance(8)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above figure, we see fits for three different values of `d`.\n", "For `d = 1`, the data is under-fit. This means that the model is too\n", "simplistic: no straight line will ever be a good fit to this data. In\n", "this case, we say that the model suffers from high bias. The model\n", "itself is biased, and this will be reflected in the fact that the data\n", "is poorly fit. At the other extreme, for `d = 6` the data is over-fit.\n", "This means that the model has too many free parameters (seven coefficients\n", "for a degree-6 polynomial) which can be adjusted to fit the training data\n", "almost perfectly. If we add a\n", "new point to this plot, though, chances are it will be very far from\n", "the curve representing the degree-6 fit. In this case, we say that the\n", "model suffers from high variance. 
The reason for this label is that if\n", "any of the input points are varied slightly, it could result in an\n", "extremely different model.\n", "\n", "In the middle, for `d = 2`, we have found a good mid-point. It fits\n", "the data fairly well, and does not suffer from the bias and variance\n", "problems seen in the figures on either side. What we would like is a\n", "way to quantitatively identify bias and variance, and optimize the\n", "hyperparameters (in this case, the polynomial degree d) in order to\n", "determine the best algorithm. This can be done through a process\n", "called cross-validation." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Validation Curves" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll create a dataset similar to the example above, and use it to test our\n", "validation scheme. First we'll define some utility routines:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def test_func(x, err=0.5):\n", "    # noisy samples of the function 10 - 1 / (x + 0.1)\n", "    return np.random.normal(10 - 1. / (x + 0.1), err)\n", "\n", "def compute_error(x, y, p):\n", "    # RMS error of the polynomial with coefficients p, evaluated at x\n", "    yfit = np.polyval(p, x)\n", "    return np.sqrt(np.mean((y - yfit) ** 2))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.cross_validation import train_test_split\n", "\n", "N = 200\n", "f_crossval = 0.5\n", "error = 1.0\n", "\n", "# randomly sample the data\n", "np.random.seed(1)\n", "x = np.random.random(N)\n", "y = test_func(x, error)\n", "\n", "# split into training and (cross-)validation sets\n", "xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=f_crossval)\n", "\n", "# show the training and cross-validation sets\n", "plt.scatter(xtrain, ytrain, color='red')\n", "plt.scatter(xtest, ytest, color='blue')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to quantify the effects of bias and variance and construct\n", "the best possible estimator, we will split our training data into\n", "a *training set* and a *validation set*. As a general rule, the\n", "training set should be about 60% of the samples (here, for simplicity, we use half).\n", "\n", "The general idea is as follows. The model parameters (in our case,\n", "the coefficients of the polynomials) are learned using the training\n", "set as above. The error is evaluated on the cross-validation set,\n", "and the hyperparameters (in our case, the degree of the polynomial)\n", "are adjusted so that this cross-validation error is minimized.\n", "Finally, the labels are predicted for the test set (which, for\n", "simplicity, we do not hold out separately in this example). These labels\n", "are used to evaluate how well the algorithm can be expected to\n", "perform on unlabeled data.\n", "\n", "The cross-validation error of our polynomial regression model can be visualized\n", "by plotting the error as a function of the polynomial degree d. 
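" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a single, arbitrarily chosen degree, the two errors can be computed directly with the\n", "helper functions defined above (a quick illustrative check before sweeping over all degrees):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Fit one polynomial on the training set and measure both errors (degree 3 is an arbitrary choice)\n", "p3 = np.polyfit(xtrain, ytrain, 3)\n", "print 'training error:   %.3f' % compute_error(xtrain, ytrain, p3)\n", "print 'validation error: %.3f' % compute_error(xtest, ytest, p3)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "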
Sweeping over a range of degrees, we can do\n", "this as follows:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# suppress warnings from polyfit about poorly-conditioned fits\n", "import warnings\n", "warnings.filterwarnings('ignore', message='Polyfit*')\n", "\n", "degrees = np.arange(21)\n", "train_err = np.zeros(len(degrees))\n", "validation_err = np.zeros(len(degrees))\n", "\n", "for i, d in enumerate(degrees):\n", "    p = np.polyfit(xtrain, ytrain, d)\n", "\n", "    train_err[i] = compute_error(xtrain, ytrain, p)\n", "    validation_err[i] = compute_error(xtest, ytest, p)\n", "\n", "fig, ax = plt.subplots()\n", "\n", "ax.plot(degrees, validation_err, lw=2, label='cross-validation error')\n", "ax.plot(degrees, train_err, lw=2, label='training error')\n", "ax.plot([0, 20], [error, error], '--k', label='intrinsic error')\n", "\n", "ax.legend(loc=0)\n", "ax.set_xlabel('degree of fit')\n", "ax.set_ylabel('rms error')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This figure compactly shows the reason that cross-validation is\n", "important. On the left side of the plot, we have a very low-degree\n", "polynomial, which under-fits the data. This leads to a very high\n", "error for both the training set and the cross-validation set. On\n", "the far right side of the plot, we have a very high degree\n", "polynomial, which over-fits the data. This can be seen in the fact\n", "that the training error is very low, while the cross-validation\n", "error is very high. Plotted for comparison is the intrinsic error\n", "(this is the scatter artificially added to the data in the code\n", "above). For this toy dataset,\n", "error = 1.0 is the best we can hope to attain. Choosing `d=6` in\n", "this case gets us very close to the optimal error.\n", "\n", "The astute reader will realize that something is amiss here: in\n", "the above plot, `d = 6` gives the best results. But in the previous\n", "plot, we found that `d = 6` vastly over-fits the data. What\u2019s going\n", "on here? The difference is the **number of training points** used.\n", "In the previous example, there were only eight training points.\n", "In this example, we have 100 training points. As a general rule of thumb, the more\n", "training points we use, the more complicated a model we can afford to fit.\n", "But how can you determine for a given model whether more training\n", "points will be helpful? A useful diagnostic for this is the learning curve." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Learning Curves" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A learning curve is a plot of the training and cross-validation\n", "error as a function of the number of training points. Note that\n", "when we train on a small subset of the training data, the training\n", "error is computed using this subset, not the full training set.\n", "These plots can give a quantitative view into how beneficial it\n", "will be to add training samples."
] }, { "cell_type": "code", "collapsed": false, "input": [ "# suppress warnings from Polyfit\n", "import warnings\n", "warnings.filterwarnings('ignore', message='Polyfit*')\n", "\n", "def plot_learning_curve(d):\n", " sizes = np.linspace(2, N, 50).astype(int)\n", " train_err = np.zeros(sizes.shape)\n", " crossval_err = np.zeros(sizes.shape)\n", "\n", " for i, size in enumerate(sizes):\n", " # Train on only the first `size` points\n", " p = np.polyfit(xtrain[:size], ytrain[:size], d)\n", " \n", " # Validation error is on the *entire* validation set\n", " crossval_err[i] = compute_error(xtest, ytest, p)\n", " \n", " # Training error is on only the points used for training\n", " train_err[i] = compute_error(xtrain[:size], ytrain[:size], p)\n", "\n", " fig, ax = plt.subplots()\n", " ax.plot(sizes, crossval_err, lw=2, label='validation error')\n", " ax.plot(sizes, train_err, lw=2, label='training error')\n", " ax.plot([0, N], [error, error], '--k', label='intrinsic error')\n", "\n", " ax.set_xlabel('traning set size')\n", " ax.set_ylabel('rms error')\n", " \n", " ax.legend(loc=0)\n", " \n", " ax.set_xlim(0, 99)\n", "\n", " ax.set_title('d = %i' % d)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've defined this function, let's plot an example learning curve:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "plot_learning_curve(d=1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we show the learning curve for `d = 1`. From the above\n", "discussion, we know that `d = 1` is a high-bias estimator which\n", "under-fits the data. This is indicated by the fact that both the\n", "training and validation errors are very high. If this is\n", "the case, adding more training data will not help matters: both\n", "lines have converged to a relatively high error.\n", "\n", "**When the learning curves have converged, we need a more sophisticated\n", "model or more features to improve the error.**\n", "\n", "*(equivalently we can decrease regularization, which we won't discuss in this tutorial)*" ] }, { "cell_type": "code", "collapsed": false, "input": [ "plot_learning_curve(d=20)\n", "plt.ylim(0, 15)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we show the learning curve for `d = 20`. From the above\n", "discussion, we know that `d = 20` is a high-variance estimator\n", "which over-fits the data. This is indicated by the fact that the\n", "training error is much less than the validation error. As\n", "we add more samples to this training set, the training error will\n", "continue to climb, while the cross-validation error will continue\n", "to decrease, until they meet in the middle. In this case, our\n", "intrinsic error was set to 1.0, and we can infer that adding more\n", "data will allow the estimator to very closely match the best\n", "possible cross-validation error.\n", "\n", "**When the learning curves have not converged, it indicates that the\n", "model is too complicated for the amount of data we have. 
We should\n", "either find more training data, or use a simpler model.**\n", "\n", "*(equivalently we can increase __regularization__, which we won't discuss in this tutorial)*" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We\u2019ve seen above that an under-performing algorithm can be due\n", "to two possible situations: high bias (under-fitting) and high\n", "variance (over-fitting). In order to evaluate our algorithm, we\n", "set aside a portion of our training data for cross-validation.\n", "Using the technique of learning curves, we can train on progressively\n", "larger subsets of the data, evaluating the training error and\n", "cross-validation error to determine whether our algorithm has\n", "high variance or high bias. But what do we do with this information?" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "High Bias" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If our algorithm shows high **bias**, the following actions might help:\n", "\n", "- **Add more features**. In our example of predicting home prices,\n", "  it may be helpful to make use of information such as the neighborhood\n", "  the house is in, the year the house was built, the size of the lot, etc.\n", "  Adding these features to the training and test sets can improve\n", "  a high-bias estimator.\n", "- **Use a more sophisticated model**. Adding complexity to the model can\n", "  help improve on bias. For a polynomial fit, this can be accomplished\n", "  by increasing the degree d. Each learning technique has its own\n", "  methods of adding complexity.\n", "- **Use fewer samples**. Though this will not improve the prediction accuracy,\n", "  a high-bias algorithm can attain nearly the same error with a smaller\n", "  training sample. For algorithms which are computationally expensive,\n", "  reducing the training sample size can lead to very large improvements\n", "  in speed.\n", "- **Decrease regularization**. Regularization is a technique used to impose\n", "  simplicity in some machine learning models, by adding a penalty term that\n", "  depends on the characteristics of the parameters. If a model has high bias,\n", "  decreasing the effect of regularization can lead to better results." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "High Variance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If our algorithm shows **high variance**, the following actions might help:\n", "\n", "- **Use fewer features**. Using a feature selection technique may be\n", "  useful, and decrease the over-fitting of the estimator.\n", "- **Use more training samples**. Adding training samples can reduce\n", "  the effect of over-fitting, and lead to improvements in a high\n", "  variance estimator.\n", "- **Increase Regularization**. Regularization is designed to prevent\n", "  over-fitting. In a high-variance model, increasing regularization\n", "  can lead to better results (see the short sketch at the end of this section).\n", "\n", "These choices become very important in real-world situations. For example,\n", "due to limited telescope time, astronomers must seek a balance between\n", "observing a large number of objects, and observing a large number of\n", "features for each object. Determining which is more important for a\n", "particular learning task can inform the observing strategy that the\n", "astronomer employs. In a later exercise, we will explore the use of\n", "learning curves for the photometric redshift problem."
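 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the regularization knob mentioned above slightly more concrete, here is a rough,\n", "optional sketch (not part of the main example) that fits our noisy 1D data with a high-degree\n", "polynomial through `sklearn.linear_model.Ridge`, once with a weak penalty and once with a strong\n", "one. The exact numbers will vary with the data and the scikit-learn version; the point is only\n", "that `alpha` is the regularization strength we would tune:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.linear_model import Ridge\n", "\n", "# degree-20 polynomial features, built by hand as a Vandermonde matrix\n", "Xtrain_poly = np.vander(xtrain, 21)\n", "Xtest_poly = np.vander(xtest, 21)\n", "\n", "for alpha in [1e-6, 1.0]:\n", "    # larger alpha = stronger penalty on the polynomial coefficients = effectively simpler model\n", "    ridge = Ridge(alpha=alpha).fit(Xtrain_poly, ytrain)\n", "    train_rms = np.sqrt(np.mean((ytrain - ridge.predict(Xtrain_poly)) ** 2))\n", "    valid_rms = np.sqrt(np.mean((ytest - ridge.predict(Xtest_poly)) ** 2))\n", "    print 'alpha = %.0e: train rms = %.2f, validation rms = %.2f' % (alpha, train_rms, valid_rms)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comparing the two runs gives a feel for how the regularization strength trades training error\n", "against validation error, in the same spirit as varying the polynomial degree above."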
] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "More Sophisticated Methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a lot more options for performing validation and model testing.\n", "In particular, there are several schemes for cross-validation, in which\n", "the model is fit multiple times with different training and test sets.\n", "The details are different, but the principles are the same as what we've\n", "seen here.\n", "\n", "For more information see the ``sklearn.cross_validation`` module documentation,\n", "and the information on the scikit-learn website." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "One Last Caution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using validation schemes to determine hyper-parameters means that we are\n", "fitting the hyper-parameters to the particular validation set. In the same\n", "way that parameters can be over-fit to the training set, hyperparameters can\n", "be over-fit to the validation set. Because of this, the validation error\n", "tends to under-predict the classification error of new data.\n", "\n", "For this reason, it is recommended to split the data into three sets:\n", "\n", "- The **training set**, used to train the model (usually ~60% of the data)\n", "- The **validation set**, used to validate the model (usually ~20% of the data)\n", "- The **test set**, used to evaluate the expected error of the validated model (usually ~20% of the data)\n", "\n", "This may seem excessive, and many machine learning practitioners ignore the need\n", "for a test set. But if your goal is to predict the error of a model on unknown\n", "data, using a test set is vital." ] } ], "metadata": {} } ] }