{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# A Beginner's Introduction to Pandas"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Welcome!\n",
      "\n",
      "- About me! Who is @marcelcaraciolo ?\n",
      "- Environment + data files check: [http://marcelcaraciolo.github.io/big-data-tutorial](http://marcelcaraciolo.github.io/big-data-tutorial)\n",
      "- Wakari.io  check!"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## About this tutorial\n",
      "We've put this together from our experience and a number of sources, please check the references at the bottom of this document.\n",
      "\n",
      "### What this tutorial is\n",
      "\n",
      "The goal of this tutorial is to provide you with a hands-on overview of two of\n",
      "the main libraries from the scientific and data analysis communities. We're going to\n",
      "use:\n",
      "\n",
      "- ipython -- [ipython.org](http://ipython.org/)\n",
      "- numpy -- [numpy.org](http://www.numpy.org)\n",
      "- pandas -- [pandas.pydata.org](http://pandas.pydata.org/)\n",
      "- scikit-learn -- [scikit-learn.org] (http://scikit-learn.org)\n",
      "- mrJob -- [http://pythonhosted.org/mrjob/] (http://pythonhosted.org/mrjob/)\n",
      "- (bonus) pytables -- [pytables.org](http://www.pytables.org)\n",
      "\n",
      "### What this tutorial is not\n",
      "\n",
      "- A machine learning course\n",
      "- A python course\n",
      "- Advanced Big Data Course \n",
      "- An exhaustive overview of the recommendation literature\n",
      "- A set of recipes that will win you the next Netflix/Kaggle/? challenge."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Roadmap\n",
      "\n",
      "What exactly are we going to do? Here's high-level overview:\n",
      "\n",
      "- learn about NumPy arrays\n",
      "- learn about DataFrames"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## NumPy: Numerical Python (30 min)\n",
      "\n",
      "### What is it?\n",
      "\n",
      "*It is a Python library that provides a multidimensional array object, various\n",
      "derived objects (such as masked arrays and matrices), and an assortment of\n",
      "routines for fast operations on arrays, including mathematical, logical, shape\n",
      "manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear\n",
      "algebra, basic statistical operations, random simulation and much more.*"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "\n",
      "# set some print options\n",
      "np.set_printoptions(precision=4)\n",
      "np.set_printoptions(threshold=5)\n",
      "np.set_printoptions(suppress=True)\n",
      "\n",
      "# init random gen\n",
      "np.random.seed(2)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### NumPy's basic data structure: the ndarray\n",
      "\n",
      "Think of ndarrays as the building blocks for pydata. A multidimensional array\n",
      "object that acts as a container for data to be passed between algorithms. Also,\n",
      "libraries written in a lower-level language, such as C or Fortran, can operate\n",
      "on the data stored in a NumPy array without copying any data.\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "\n",
      "# build an array using the array function\n",
      "arr = np.array([0, 9, 5, 4, 3])\n",
      "arr"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 7,
       "text": [
        "array([0, 9, 5, 4, 3])"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Array creation examples\n",
      "\n",
      "There are several functions that are used to create new arrays:\n",
      "\n",
      "- `np.array`\n",
      "- `np.asarray`\n",
      "- `np.arange`\n",
      "- `np.ones`\n",
      "- `np.ones_like`\n",
      "- `np.zeros`\n",
      "- `np.zeros_like`\n",
      "- `np.empty`\n",
      "- `np.random.randn` and other funcs from the random module"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.zeros(4)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 9,
       "text": [
        "array([ 0.,  0.,  0.,  0.])"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.ones(4)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 10,
       "text": [
        "array([ 1.,  1.,  1.,  1.])"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.empty(4)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 11,
       "text": [
        "array([ 0.,  0.,  0.,  0.])"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.arange(4)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 12,
       "text": [
        "array([0, 1, 2, 3])"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### dtype and shape\n",
      "\n",
      "NumPy's arrays are containers of homogeneous data, which means all elements are\n",
      "of the same type. The 'dtype' propery is an object that specifies the data type\n",
      "of each element. The 'shape' property is a tuple that indicates the size of each\n",
      "dimension."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr = np.random.randn(5)\n",
      "arr"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 13,
       "text": [
        "array([-0.4168, -0.0563, -2.1362,  1.6403, -1.7934])"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr.dtype"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 14,
       "text": [
        "dtype('float64')"
       ]
      }
     ],
     "prompt_number": 14
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 15,
       "text": [
        "(5,)"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# you can be explicit about the data type that you want\n",
      "np.empty(4, dtype=np.int32)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 16,
       "text": [
        "array([     0,      0,      0, 131072])"
       ]
      }
     ],
     "prompt_number": 16
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.array(['numpy','pandas','pytables'], dtype=np.string_)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 17,
       "text": [
        "array(['numpy', 'pandas', 'pytables'], \n",
        "      dtype='|S8')"
       ]
      }
     ],
     "prompt_number": 17
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "float_arr = np.array([4.4, 5.52425, -0.1234, 98.1], dtype=np.float64)\n",
      "# truncate the decimal part\n",
      "float_arr.astype(np.int32)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 18,
       "text": [
        "array([ 4,  5,  0, 98])"
       ]
      }
     ],
     "prompt_number": 18
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Indexing and slicing\n",
      "\n",
      "#### Just what you would expect from Python"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr = np.array([0, 9, 1, 4, 64])\n",
      "arr[3]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 20,
       "text": [
        "4"
       ]
      }
     ],
     "prompt_number": 20
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr[1:3]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 21,
       "text": [
        "array([9, 1])"
       ]
      }
     ],
     "prompt_number": 21
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr[:2]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 22,
       "text": [
        "array([0, 9])"
       ]
      }
     ],
     "prompt_number": 22
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# set the last two elements to 555\n",
      "arr[-2:] = 55\n",
      "arr"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 23,
       "text": [
        "array([ 0,  9,  1, 55, 55])"
       ]
      }
     ],
     "prompt_number": 23
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Indexing behaviour for multidimensional arrays\n",
      "\n",
      "A good way to think about indexing in multidimensional arrays is that you are\n",
      "moving along the values of the shape property. So, a 4d array `arr_4d`, with a\n",
      "shape of `(w,x,y,z)` will result in indexed views such that:\n",
      "\n",
      "- `arr_4d[i].shape == (x,y,z)`\n",
      "- `arr_4d[i,j].shape == (y,z)`\n",
      "- `arr_4d[i,j,k].shape == (z,)`\n",
      "\n",
      "For the case of slices, what you are doing is selecting a range of elements\n",
      "along a particular axis:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr_2d = np.array([[5,3,4],[0,1,2],[1,1,10],[0,0,0.1]])\n",
      "arr_2d"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 24,
       "text": [
        "array([[  5. ,   3. ,   4. ],\n",
        "       [  0. ,   1. ,   2. ],\n",
        "       [  1. ,   1. ,  10. ],\n",
        "       [  0. ,   0. ,   0.1]])"
       ]
      }
     ],
     "prompt_number": 24
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# get the first row\n",
      "arr_2d[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 25,
       "text": [
        "array([ 5.,  3.,  4.])"
       ]
      }
     ],
     "prompt_number": 25
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# get the first column\n",
      "arr_2d[:,0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 26,
       "text": [
        "array([ 5.,  0.,  1.,  0.])"
       ]
      }
     ],
     "prompt_number": 26
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# get the first two rows\n",
      "arr_2d[:2]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 27,
       "text": [
        "array([[ 5.,  3.,  4.],\n",
        "       [ 0.,  1.,  2.]])"
       ]
      }
     ],
     "prompt_number": 27
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Careful, it's a view!\n",
      "\n",
      "A slice does not return a copy, which means that any modifications will be\n",
      "reflected in the source array. This is a design feature of NumPy to avoid memory\n",
      "problems."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr = np.array([0, 3, 1, 4, 64])\n",
      "arr"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 28,
       "text": [
        "array([ 0,  3,  1,  4, 64])"
       ]
      }
     ],
     "prompt_number": 28
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "subarr = arr[2:4]\n",
      "subarr[1] = 99\n",
      "arr"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 29,
       "text": [
        "array([ 0,  3,  1, 99, 64])"
       ]
      }
     ],
     "prompt_number": 29
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### (Fancy) Boolean indexing\n",
      "\n",
      "Boolean indexing allows you to select data subsets of an array that satisfy a\n",
      "given condition."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr = np.array([10, 20])\n",
      "idx = np.array([True, False])\n",
      "arr[idx]\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 31,
       "text": [
        "array([10])"
       ]
      }
     ],
     "prompt_number": 31
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr_2d = np.random.randn(5)\n",
      "arr_2d\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 32,
       "text": [
        "array([-0.8417,  0.5029, -1.2453, -1.058 , -0.909 ])"
       ]
      }
     ],
     "prompt_number": 32
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr_2d < 0"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 33,
       "text": [
        "array([ True, False,  True,  True,  True], dtype=bool)"
       ]
      }
     ],
     "prompt_number": 33
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr_2d[arr_2d < 0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 34,
       "text": [
        "array([-0.8417, -1.2453, -1.058 , -0.909 ])"
       ]
      }
     ],
     "prompt_number": 34
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr_2d[(arr_2d > -0.5) & (arr_2d < 0)]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 35,
       "text": [
        "array([], dtype=float64)"
       ]
      }
     ],
     "prompt_number": 35
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr_2d[arr_2d < 0] = 0\n",
      "arr_2d"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 36,
       "text": [
        "array([ 0.    ,  0.5029,  0.    ,  0.    ,  0.    ])"
       ]
      }
     ],
     "prompt_number": 36
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### (Fancy) list-of-locations indexing\n",
      "\n",
      "Fancy indexing is indexing with integer arrays."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr = np.arange(18).reshape(6,3)\n",
      "arr"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 3,
       "text": [
        "array([[ 0,  1,  2],\n",
        "       [ 3,  4,  5],\n",
        "       [ 6,  7,  8],\n",
        "       [ 9, 10, 11],\n",
        "       [12, 13, 14],\n",
        "       [15, 16, 17]])"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# fancy selection of rows in a particular order\n",
      "arr[[0,4,4]]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 39,
       "text": [
        "array([[ 0,  1,  2],\n",
        "       [12, 13, 14],\n",
        "       [12, 13, 14]])"
       ]
      }
     ],
     "prompt_number": 39
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# index into individual elements and flatten\n",
      "arr[[5,3,1]]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 9,
       "text": [
        "array([[15, 16, 17],\n",
        "       [ 9, 10, 11],\n",
        "       [ 3,  4,  5]])"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr[[5,3,1],[2,1,0]]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 10,
       "text": [
        "array([17, 10,  3])"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# select a submatrix\n",
      "arr[np.ix_([5,3,1],[2,1])]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 41,
       "text": [
        "array([[17, 16],\n",
        "       [11, 10],\n",
        "       [ 5,  4]])"
       ]
      }
     ],
     "prompt_number": 41
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Vectorization\n",
      "Vectorization is at the heart of NumPy and it enables us to express operations\n",
      "without writing any for loops.  Operations between arrays with equal shapes are\n",
      "performed element-wise.\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr = np.array([0, 9, 1.02, 4, 32])\n",
      "arr - arr"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 44,
       "text": [
        "array([ 0.,  0.,  0.,  0.,  0.])"
       ]
      }
     ],
     "prompt_number": 44
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr * arr\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 45,
       "text": [
        "array([    0.    ,    81.    ,     1.0404,    16.    ,  1024.    ])"
       ]
      }
     ],
     "prompt_number": 45
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Broadcasting Rules\n",
      "\n",
      "Vectorized operations between arrays of different sizes and between arrays and\n",
      "scalars are subject to the rules of broadcasting. The idea is quite simple in\n",
      "many cases:\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr = np.array([0, 9, 1.02, 4, 64])\n",
      "5 * arr "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 47,
       "text": [
        "array([   0. ,   45. ,    5.1,   20. ,  320. ])"
       ]
      }
     ],
     "prompt_number": 47
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "10 + arr"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 48,
       "text": [
        "array([ 10.  ,  19.  ,  11.02,  14.  ,  74.  ])"
       ]
      }
     ],
     "prompt_number": 48
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr ** .5"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 49,
       "text": [
        "array([ 0.  ,  3.  ,  1.01,  2.  ,  8.  ])"
       ]
      }
     ],
     "prompt_number": 49
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The case of arrays of different shapes is slightly more complicated. The gist of it is that the shape of the operands need to conform to a certain specification. Don't worry if this does not make sense right away."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "arr = np.random.randn(4,2)\n",
      "arr"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 51,
       "text": [
        "array([[ 0.5515,  2.2922],\n",
        "       [ 0.0415, -1.1179],\n",
        "       [ 0.5391, -0.5962],\n",
        "       [-0.0191,  1.175 ]])"
       ]
      }
     ],
     "prompt_number": 51
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "mean_row = np.mean(arr, axis=0)\n",
      "mean_row"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 52,
       "text": [
        "array([ 0.2782,  0.4383])"
       ]
      }
     ],
     "prompt_number": 52
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "centered_rows = arr - mean_row\n",
      "centered_rows"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 53,
       "text": [
        "array([[ 0.2732,  1.8539],\n",
        "       [-0.2367, -1.5562],\n",
        "       [ 0.2608, -1.0344],\n",
        "       [-0.2974,  0.7367]])"
       ]
      }
     ],
     "prompt_number": 53
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.mean(centered_rows, axis=0)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 54,
       "text": [
        "array([-0.,  0.])"
       ]
      }
     ],
     "prompt_number": 54
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "mean_col = np.mean(arr, axis=1)\n",
      "mean_col"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 55,
       "text": [
        "array([ 1.4218, -0.5382, -0.0286,  0.5779])"
       ]
      }
     ],
     "prompt_number": 55
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "centered_cols = arr - mean_col"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "ename": "ValueError",
       "evalue": "operands could not be broadcast together with shapes (4,2) (4) ",
       "output_type": "pyerr",
       "traceback": [
        "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
        "\u001b[0;32m<ipython-input-56-bd5236897883>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mcentered_cols\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0marr\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mmean_col\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
        "\u001b[0;31mValueError\u001b[0m: operands could not be broadcast together with shapes (4,2) (4) "
       ]
      }
     ],
     "prompt_number": 56
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# make the 1-D array a column vector\n",
      "mean_col.reshape((4,1))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 57,
       "text": [
        "array([[ 1.4218],\n",
        "       [-0.5382],\n",
        "       [-0.0286],\n",
        "       [ 0.5779]])"
       ]
      }
     ],
     "prompt_number": 57
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "centered_cols = arr - mean_col.reshape((4,1))\n",
      "centered_rows"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 58,
       "text": [
        "array([[ 0.2732,  1.8539],\n",
        "       [-0.2367, -1.5562],\n",
        "       [ 0.2608, -1.0344],\n",
        "       [-0.2974,  0.7367]])"
       ]
      }
     ],
     "prompt_number": 58
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "centered_cols.mean(axis=1)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 59,
       "text": [
        "array([-0.,  0.,  0., -0.])"
       ]
      }
     ],
     "prompt_number": 59
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### A note about NANs: \n",
      "\n",
      "Per the floating point standard IEEE 754, NaN is a floating point value that, by definition, is not equal to any other floating point value."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.nan != np.nan"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 60,
       "text": [
        "True"
       ]
      }
     ],
     "prompt_number": 60
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.array([10,5,4,np.nan,1,np.nan]) == np.nan"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 61,
       "text": [
        "array([False, False, False, False, False, False], dtype=bool)"
       ]
      }
     ],
     "prompt_number": 61
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.isnan(np.array([10,5,4,np.nan,1,np.nan]))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 62,
       "text": [
        "array([False, False, False,  True, False,  True], dtype=bool)"
       ]
      }
     ],
     "prompt_number": 62
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## pandas: Python Data Analysis Library (30 min)\n",
      "\n",
      "### What is it?\n",
      "\n",
      "*Python has long been great for data munging and preparation, but less so for\n",
      "data analysis and modeling. pandas helps fill this gap, enabling you to carry\n",
      "out your entire data analysis workflow in Python without having to switch to a\n",
      "more domain specific language like R.*\n",
      "\n",
      "The heart of pandas is the DataFrame object for data manipulation. It features:\n",
      "\n",
      "- a powerful index object\n",
      "- data alignment\n",
      "- handling of missing data\n",
      "- aggregation with groupby\n",
      "- data manipuation via reshape, pivot, slice, merge, join"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "\n",
      "pd.set_printoptions(precision=3, notebook_repr_html=True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 68
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Series: labelled arrays\n",
      "\n",
      "The pandas Series is the simplest datastructure to start with. It is a subclass\n",
      "of ndarray that supports more meaninful indices."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Let's look at some creation examples for Series"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "\n",
      "values = np.array([2.0, 1.0, 5.0, 0.97, 3.0, 10.0, 0.0599, 8.0])\n",
      "ser = pd.Series(values)\n",
      "print ser"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "0     2.0000\n",
        "1     1.0000\n",
        "2     5.0000\n",
        "3     0.9700\n",
        "4     3.0000\n",
        "5    10.0000\n",
        "6     0.0599\n",
        "7     8.0000\n",
        "dtype: float64\n"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "values = np.array([2.0, 1.0, 5.0, 0.97, 3.0, 10.0, 0.0599, 8.0])\n",
      "labels = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']\n",
      "ser = pd.Series(data=values, index=labels)\n",
      "print ser\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "A     2.00\n",
        "B     1.00\n",
        "C     5.00\n",
        "D     0.97\n",
        "E     3.00\n",
        "F    10.00\n",
        "G     0.06\n",
        "H     8.00\n",
        "dtype: float64\n"
       ]
      }
     ],
     "prompt_number": 71
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "movie_rating = {\n",
      "    'age': 1,\n",
      "    'gender': 'F',\n",
      "    'genres': 'Drama',\n",
      "    'movie_id': 1193,\n",
      "    'occupation': 10,\n",
      "    'rating': 5,\n",
      "    'timestamp': 978300760,\n",
      "    'title': \"One Flew Over the Cuckoo's Nest (1975)\",\n",
      "    'user_id': 1,\n",
      "    'zip': '48067'\n",
      "    }\n",
      "ser = pd.Series(movie_rating)\n",
      "print ser\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "age                                                1\n",
        "gender                                             F\n",
        "genres                                         Drama\n",
        "movie_id                                        1193\n",
        "occupation                                        10\n",
        "rating                                             5\n",
        "timestamp                                  978300760\n",
        "title         One Flew Over the Cuckoo's Nest (1975)\n",
        "user_id                                            1\n",
        "zip                                            48067\n",
        "dtype: object\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ser.index"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 6,
       "text": [
        "Index([u'age', u'gender', u'genres', u'movie_id', u'occupation', u'rating', u'timestamp', u'title', u'user_id', u'zip'], dtype=object)"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ser.values"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 7,
       "text": [
        "array([1, F, Drama, 1193, 10, 5, 978300760,\n",
        "       One Flew Over the Cuckoo's Nest (1975), 1, 48067], dtype=object)"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Series indexing"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ser[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 8,
       "text": [
        "1"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ser['gender']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 9,
       "text": [
        "'F'"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ser.get_value('gender')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 10,
       "text": [
        "'F'"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Operations between Series with different index objects"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ser_1 = pd.Series(data=[1,3,4], index=['A', 'B', 'C'])\n",
      "ser_2 = pd.Series(data=[5,5,5], index=['A', 'G', 'C'])\n",
      "print ser_1 + ser_2\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "A     6\n",
        "B   NaN\n",
        "C     9\n",
        "G   NaN\n",
        "dtype: float64\n"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### DataFrame\n",
      "\n",
      "The DataFrame is the 2-dimensional version of a Series.\n",
      "\n",
      "#### Let's look at some creation examples for DataFrame\n",
      "\n",
      "You can think of it as a spreadsheet whose columns are Series objects."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# build from a dict of equal-length lists or ndarrays\n",
      "pd.DataFrame({'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]})"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>col_1</th>\n",
        "      <th>col_2</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td>  0.12</td>\n",
        "      <td>  0.9</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td>  7.00</td>\n",
        "      <td>  9.0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> 45.00</td>\n",
        "      <td> 34.0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> 10.00</td>\n",
        "      <td> 11.0</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 13,
       "text": [
        "   col_1  col_2\n",
        "0   0.12    0.9\n",
        "1   7.00    9.0\n",
        "2  45.00   34.0\n",
        "3  10.00   11.0"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "You can explicitly set the column names and index values as well."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pd.DataFrame(data={'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]},\n",
      "             columns=['col_1', 'col_2', 'col_3'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>col_1</th>\n",
        "      <th>col_2</th>\n",
        "      <th>col_3</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td>  0.12</td>\n",
        "      <td>  0.9</td>\n",
        "      <td> NaN</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td>  7.00</td>\n",
        "      <td>  9.0</td>\n",
        "      <td> NaN</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> 45.00</td>\n",
        "      <td> 34.0</td>\n",
        "      <td> NaN</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> 10.00</td>\n",
        "      <td> 11.0</td>\n",
        "      <td> NaN</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 15,
       "text": [
        "   col_1  col_2 col_3\n",
        "0   0.12    0.9   NaN\n",
        "1   7.00    9.0   NaN\n",
        "2  45.00   34.0   NaN\n",
        "3  10.00   11.0   NaN"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pd.DataFrame(data={'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]},\n",
      "             columns=['col_1', 'col_2', 'col_3'],\n",
      "             index=['obs1', 'obs2', 'obs3', 'obs4'])\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>col_1</th>\n",
        "      <th>col_2</th>\n",
        "      <th>col_3</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>obs1</th>\n",
        "      <td>  0.12</td>\n",
        "      <td>  0.9</td>\n",
        "      <td> NaN</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>obs2</th>\n",
        "      <td>  7.00</td>\n",
        "      <td>  9.0</td>\n",
        "      <td> NaN</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>obs3</th>\n",
        "      <td> 45.00</td>\n",
        "      <td> 34.0</td>\n",
        "      <td> NaN</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>obs4</th>\n",
        "      <td> 10.00</td>\n",
        "      <td> 11.0</td>\n",
        "      <td> NaN</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 16,
       "text": [
        "      col_1  col_2 col_3\n",
        "obs1   0.12    0.9   NaN\n",
        "obs2   7.00    9.0   NaN\n",
        "obs3  45.00   34.0   NaN\n",
        "obs4  10.00   11.0   NaN"
       ]
      }
     ],
     "prompt_number": 16
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "You can also think of it as a dictionary of Series objects."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "movie_rating = {\n",
      "    'gender': 'F',\n",
      "    'genres': 'Drama',\n",
      "    'movie_id': 1193,\n",
      "    'rating': 5,\n",
      "    'timestamp': 978300760,\n",
      "    'user_id': 1,\n",
      "    }\n",
      "ser_1 = pd.Series(movie_rating)\n",
      "ser_2 = pd.Series(movie_rating)\n",
      "df = pd.DataFrame({'r_1': ser_1, 'r_2': ser_2})\n",
      "df.columns.name = 'rating_events'\n",
      "df.index.name = 'rating_data'\n",
      "df"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th>rating_events</th>\n",
        "      <th>r_1</th>\n",
        "      <th>r_2</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>rating_data</th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>gender</th>\n",
        "      <td>         F</td>\n",
        "      <td>         F</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>genres</th>\n",
        "      <td>     Drama</td>\n",
        "      <td>     Drama</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>movie_id</th>\n",
        "      <td>      1193</td>\n",
        "      <td>      1193</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>rating</th>\n",
        "      <td>         5</td>\n",
        "      <td>         5</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>timestamp</th>\n",
        "      <td> 978300760</td>\n",
        "      <td> 978300760</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>user_id</th>\n",
        "      <td>         1</td>\n",
        "      <td>         1</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 17,
       "text": [
        "rating_events        r_1        r_2\n",
        "rating_data                        \n",
        "gender                 F          F\n",
        "genres             Drama      Drama\n",
        "movie_id            1193       1193\n",
        "rating                 5          5\n",
        "timestamp      978300760  978300760\n",
        "user_id                1          1"
       ]
      }
     ],
     "prompt_number": 17
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df = df.T\n",
      "df"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th>rating_data</th>\n",
        "      <th>gender</th>\n",
        "      <th>genres</th>\n",
        "      <th>movie_id</th>\n",
        "      <th>rating</th>\n",
        "      <th>timestamp</th>\n",
        "      <th>user_id</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>rating_events</th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>r_1</th>\n",
        "      <td> F</td>\n",
        "      <td> Drama</td>\n",
        "      <td> 1193</td>\n",
        "      <td> 5</td>\n",
        "      <td> 978300760</td>\n",
        "      <td> 1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>r_2</th>\n",
        "      <td> F</td>\n",
        "      <td> Drama</td>\n",
        "      <td> 1193</td>\n",
        "      <td> 5</td>\n",
        "      <td> 978300760</td>\n",
        "      <td> 1</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 18,
       "text": [
        "rating_data   gender genres movie_id rating  timestamp user_id\n",
        "rating_events                                                 \n",
        "r_1                F  Drama     1193      5  978300760       1\n",
        "r_2                F  Drama     1193      5  978300760       1"
       ]
      }
     ],
     "prompt_number": 18
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.columns "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 19,
       "text": [
        "Index([u'gender', u'genres', u'movie_id', u'rating', u'timestamp', u'user_id'], dtype=object)"
       ]
      }
     ],
     "prompt_number": 19
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.index"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 20,
       "text": [
        "Index([u'r_1', u'r_2'], dtype=object)"
       ]
      }
     ],
     "prompt_number": 20
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.values"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 21,
       "text": [
        "array([[F, Drama, 1193, 5, 978300760, 1],\n",
        "       [F, Drama, 1193, 5, 978300760, 1]], dtype=object)"
       ]
      }
     ],
     "prompt_number": 21
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Adding/Deleting entries"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df = pd.DataFrame({'r_1': ser_1, 'r_2': ser_2})\n",
      "df.drop('genres', axis=0)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>r_1</th>\n",
        "      <th>r_2</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>rating_data</th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>gender</th>\n",
        "      <td>         F</td>\n",
        "      <td>         F</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>movie_id</th>\n",
        "      <td>      1193</td>\n",
        "      <td>      1193</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>rating</th>\n",
        "      <td>         5</td>\n",
        "      <td>         5</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>timestamp</th>\n",
        "      <td> 978300760</td>\n",
        "      <td> 978300760</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>user_id</th>\n",
        "      <td>         1</td>\n",
        "      <td>         1</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 22,
       "text": [
        "                   r_1        r_2\n",
        "rating_data                      \n",
        "gender               F          F\n",
        "movie_id          1193       1193\n",
        "rating               5          5\n",
        "timestamp    978300760  978300760\n",
        "user_id              1          1"
       ]
      }
     ],
     "prompt_number": 22
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.drop('r_1', axis=1)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>r_2</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>rating_data</th>\n",
        "      <th></th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>gender</th>\n",
        "      <td>         F</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>genres</th>\n",
        "      <td>     Drama</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>movie_id</th>\n",
        "      <td>      1193</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>rating</th>\n",
        "      <td>         5</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>timestamp</th>\n",
        "      <td> 978300760</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>user_id</th>\n",
        "      <td>         1</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 23,
       "text": [
        "                   r_2\n",
        "rating_data           \n",
        "gender               F\n",
        "genres           Drama\n",
        "movie_id          1193\n",
        "rating               5\n",
        "timestamp    978300760\n",
        "user_id              1"
       ]
      }
     ],
     "prompt_number": 23
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# careful with the order here\n",
      "df['r_3'] = ['F', 'Drama', 1193, 5, 978300760, 1]\n",
      "df"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>r_1</th>\n",
        "      <th>r_2</th>\n",
        "      <th>r_3</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>rating_data</th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>gender</th>\n",
        "      <td>         F</td>\n",
        "      <td>         F</td>\n",
        "      <td>         F</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>genres</th>\n",
        "      <td>     Drama</td>\n",
        "      <td>     Drama</td>\n",
        "      <td>     Drama</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>movie_id</th>\n",
        "      <td>      1193</td>\n",
        "      <td>      1193</td>\n",
        "      <td>      1193</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>rating</th>\n",
        "      <td>         5</td>\n",
        "      <td>         5</td>\n",
        "      <td>         5</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>timestamp</th>\n",
        "      <td> 978300760</td>\n",
        "      <td> 978300760</td>\n",
        "      <td> 978300760</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>user_id</th>\n",
        "      <td>         1</td>\n",
        "      <td>         1</td>\n",
        "      <td>         1</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 24,
       "text": [
        "                   r_1        r_2        r_3\n",
        "rating_data                                 \n",
        "gender               F          F          F\n",
        "genres           Drama      Drama      Drama\n",
        "movie_id          1193       1193       1193\n",
        "rating               5          5          5\n",
        "timestamp    978300760  978300760  978300760\n",
        "user_id              1          1          1"
       ]
      }
     ],
     "prompt_number": 24
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### DataFrame indexing"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "You can index into a column using it's label, or with dot notation\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df = pd.DataFrame(data={'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]},\n",
      "                  columns=['col_1', 'col_2', 'col_3'],\n",
      "                  index=['obs1', 'obs2', 'obs3', 'obs4'])\n",
      "df['col_1']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 26,
       "text": [
        "obs1     0.12\n",
        "obs2     7.00\n",
        "obs3    45.00\n",
        "obs4    10.00\n",
        "Name: col_1, dtype: float64"
       ]
      }
     ],
     "prompt_number": 26
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.col_1"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 27,
       "text": [
        "obs1     0.12\n",
        "obs2     7.00\n",
        "obs3    45.00\n",
        "obs4    10.00\n",
        "Name: col_1, dtype: float64"
       ]
      }
     ],
     "prompt_number": 27
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "You can also use multiple columns to select a subset of them:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df[['col_2', 'col_1']]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>col_2</th>\n",
        "      <th>col_1</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>obs1</th>\n",
        "      <td>  0.9</td>\n",
        "      <td>  0.12</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>obs2</th>\n",
        "      <td>  9.0</td>\n",
        "      <td>  7.00</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>obs3</th>\n",
        "      <td> 34.0</td>\n",
        "      <td> 45.00</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>obs4</th>\n",
        "      <td> 11.0</td>\n",
        "      <td> 10.00</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 28,
       "text": [
        "      col_2  col_1\n",
        "obs1    0.9   0.12\n",
        "obs2    9.0   7.00\n",
        "obs3   34.0  45.00\n",
        "obs4   11.0  10.00"
       ]
      }
     ],
     "prompt_number": 28
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The .ix method gives you the most flexibility to index into certain rows, or\n",
      "even rows and columns:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.ix['obs3']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 29,
       "text": [
        "col_1     45\n",
        "col_2     34\n",
        "col_3    NaN\n",
        "Name: obs3, dtype: object"
       ]
      }
     ],
     "prompt_number": 29
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.ix[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 30,
       "text": [
        "col_1    0.12\n",
        "col_2     0.9\n",
        "col_3     NaN\n",
        "Name: obs1, dtype: object"
       ]
      }
     ],
     "prompt_number": 30
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.ix[:2]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>col_1</th>\n",
        "      <th>col_2</th>\n",
        "      <th>col_3</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>obs1</th>\n",
        "      <td> 0.12</td>\n",
        "      <td> 0.9</td>\n",
        "      <td> NaN</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>obs2</th>\n",
        "      <td> 7.00</td>\n",
        "      <td> 9.0</td>\n",
        "      <td> NaN</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 31,
       "text": [
        "      col_1  col_2 col_3\n",
        "obs1   0.12    0.9   NaN\n",
        "obs2   7.00    9.0   NaN"
       ]
      }
     ],
     "prompt_number": 31
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.ix[:2, 'col_2']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 32,
       "text": [
        "obs1    0.9\n",
        "obs2    9.0\n",
        "Name: col_2, dtype: float64"
       ]
      }
     ],
     "prompt_number": 32
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.ix[:2, ['col_1', 'col_2']]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>col_1</th>\n",
        "      <th>col_2</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>obs1</th>\n",
        "      <td> 0.12</td>\n",
        "      <td> 0.9</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>obs2</th>\n",
        "      <td> 7.00</td>\n",
        "      <td> 9.0</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 33,
       "text": [
        "      col_1  col_2\n",
        "obs1   0.12    0.9\n",
        "obs2   7.00    9.0"
       ]
      }
     ],
     "prompt_number": 33
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## The MovieLens dataset: loading and first look\n",
      "\n",
      "Loading of the MovieLens dataset here is based on the intro chapter of 'Python\n",
      "for Data Analysis\".\n",
      "\n",
      "The MovieLens data is spread across three files. Using the `pd.read_table`\n",
      "method we load each file:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "\n",
      "unames = ['user_id', 'gender', 'age', 'occupation', 'zip']\n",
      "users = pd.read_table('../data/ml-1m/users.dat',\n",
      "                      sep='::', header=None, names=unames)\n",
      "\n",
      "rnames = ['user_id', 'movie_id', 'rating', 'timestamp']\n",
      "ratings = pd.read_table('../data/ml-1m/ratings.dat',\n",
      "                        sep='::', header=None, names=rnames)\n",
      "\n",
      "mnames = ['movie_id', 'title', 'genres']\n",
      "movies = pd.read_table('../data/ml-1m/movies.dat',\n",
      "                       sep='::', header=None, names=mnames)\n",
      "\n",
      "# show how one of them looks\n",
      "ratings.head(5)\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>user_id</th>\n",
        "      <th>movie_id</th>\n",
        "      <th>rating</th>\n",
        "      <th>timestamp</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> 1</td>\n",
        "      <td> 1193</td>\n",
        "      <td> 5</td>\n",
        "      <td> 978300760</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> 1</td>\n",
        "      <td>  661</td>\n",
        "      <td> 3</td>\n",
        "      <td> 978302109</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> 1</td>\n",
        "      <td>  914</td>\n",
        "      <td> 3</td>\n",
        "      <td> 978301968</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> 1</td>\n",
        "      <td> 3408</td>\n",
        "      <td> 4</td>\n",
        "      <td> 978300275</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> 1</td>\n",
        "      <td> 2355</td>\n",
        "      <td> 5</td>\n",
        "      <td> 978824291</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 37,
       "text": [
        "   user_id  movie_id  rating  timestamp\n",
        "0        1      1193       5  978300760\n",
        "1        1       661       3  978302109\n",
        "2        1       914       3  978301968\n",
        "3        1      3408       4  978300275\n",
        "4        1      2355       5  978824291"
       ]
      }
     ],
     "prompt_number": 37
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## What are we going to do next ?\n",
      "\n",
      "- Playing with Recommender Systems"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## References and further reading\n",
      "\n",
      "- William Wesley McKinney. Python for Data Analysis. O\u2019Reilly, 2012.\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 19
    }
   ],
   "metadata": {}
  }
 ]
}