{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pandas\n",
    "======\n",
    "Pandas is a powerful data management library that produces data structures and associated tools that are ideal for scientific computing tasks related to data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Importing Data\n",
    "--------------\n",
    "Importing data from text files is easy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   x    y  z\n",
      "0  1  2.0  3\n",
      "1  2  2.4  6\n",
      "2  3  1.9  8\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "data = pd.read_csv('../data/examp-data.txt')\n",
    "print(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also choose to use the first row of the data as an the index (the numbers to the left of the table). The index has special properties and is particularly useful for time-series data. Values in index columns should be unique."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     y  z\n",
      "x        \n",
      "1  2.0  3\n",
      "2  2.4  6\n",
      "3  1.9  8\n"
     ]
    }
   ],
   "source": [
    "data_w_index = pd.read_csv('../data/examp-data.txt', index_col=0)\n",
    "print(data_w_index)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exporting Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can export to a number of different formats directly from Pandas including: csv, Excel, json, hdf, and SQL. You can export things to using the `.to_*` methods on a data frame. So, if you want to export to csv:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ",x,y,z\r\n",
      "0,1,2.0,3\r\n",
      "1,2,2.4,6\r\n",
      "2,3,1.9,8\r\n"
     ]
    }
   ],
   "source": [
    "data.to_csv('../data/pandas_output.csv')\n",
    "!cat ./data/pandas_output.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `!cat` command is just a quick way to show what the resulting datafile looks like.\n",
    "\n",
    "If you don't want the index information you can easily get rid of it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "x,y,z\r\n",
      "1,2.0,3\r\n",
      "2,2.4,6\r\n",
      "3,1.9,8\r\n"
     ]
    }
   ],
   "source": [
    "data.to_csv('../data/pandas_output.csv', index=False)\n",
    "!cat ./data/pandas_output.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting columns\n",
    "-----------------\n",
    "Specific columns can be selected using their names. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    2.0\n",
       "1    2.4\n",
       "2    1.9\n",
       "Name: y, dtype: float64"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['y']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Multiple columns can selected using a list of names in the desired order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>z</th>\n",
       "      <th>x</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td> 3</td>\n",
       "      <td> 1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td> 6</td>\n",
       "      <td> 2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td> 8</td>\n",
       "      <td> 3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   z  x\n",
       "0  3  1\n",
       "1  6  2\n",
       "2  8  3"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[['z', 'x']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting rows\n",
    "--------------\n",
    "Rows can be selected by their index (the values to the left-hand side of the table) or by position."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>y</th>\n",
       "      <th>z</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>x</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td> 2.0</td>\n",
       "      <td> 3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td> 2.4</td>\n",
       "      <td> 6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td> 1.9</td>\n",
       "      <td> 8</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     y  z\n",
       "x        \n",
       "1  2.0  3\n",
       "2  2.4  6\n",
       "3  1.9  8"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_w_index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "y    2.4\n",
       "z    6.0\n",
       "Name: 2, dtype: float64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_w_index.loc[2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "y    2\n",
       "z    3\n",
       "Name: 1, dtype: float64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_w_index.iloc[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "x    1\n",
       "y    2\n",
       "z    3\n",
       "Name: 0, dtype: float64"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.iloc[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also grab a slice of rows from the dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>y</th>\n",
       "      <th>z</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>x</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td> 2.0</td>\n",
       "      <td> 3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td> 2.4</td>\n",
       "      <td> 6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     y  z\n",
       "x        \n",
       "1  2.0  3\n",
       "2  2.4  6"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_w_index[0:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Looping over rows\n",
    "\n",
    "Looping over the rows in a dataframe involves looping over the data itself, which is stored in `data.values`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 2.  4.  6.]\n",
      "[  4.    4.8  12. ]\n",
      "[  6.    3.8  16. ]\n"
     ]
    }
   ],
   "source": [
    "for row in data.values:\n",
    "    print row * 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "x    2.0\n",
       "y    2.4\n",
       "z    6.0\n",
       "Name: 1, dtype: float64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.values[0]\n",
    "data.ix[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is because often when working with data frames we won't want to loop over rows but instead use vectorized operations for things like subsetting and math (see below)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Subsetting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can get subsets of the data by giving conditions inside `[]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>x</th>\n",
       "      <th>y</th>\n",
       "      <th>z</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td> 2</td>\n",
       "      <td> 2.4</td>\n",
       "      <td> 6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   x    y  z\n",
       "1  2  2.4  6"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[data['y'] > 2.0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The syntax for specifying multiple conditions for subsetting is to include each condition in `()` and separate them with the `&` symbol."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>x</th>\n",
       "      <th>y</th>\n",
       "      <th>z</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td> 2</td>\n",
       "      <td> 2.4</td>\n",
       "      <td> 6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   x    y  z\n",
       "1  2  2.4  6"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[(data['x'] > 1) & (data['y'] > 2)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Math"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Math is done by treating columsn from the data from just like they are variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     8.0\n",
       "1    16.4\n",
       "2    17.2\n",
       "dtype: float64"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['y'] * data['z'] + 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to use mathematical functions use the ones in `numpy` not the ones in `math`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1.200566\n",
       "1    2.144452\n",
       "2    1.815437\n",
       "dtype: float64"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "np.log(data['y']) * np.sqrt(data['z'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Grouping"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To work on chunks of data in automated ways we can use grouping. First, let's grab some data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>order</th>\n",
       "      <th>family</th>\n",
       "      <th>Genus</th>\n",
       "      <th>species</th>\n",
       "      <th>mass(g)</th>\n",
       "      <th>gestation(mo)</th>\n",
       "      <th>newborn(g)</th>\n",
       "      <th>weaning(mo)</th>\n",
       "      <th>wean mass(g)</th>\n",
       "      <th>AFR(mo)</th>\n",
       "      <th>max. life(mo)</th>\n",
       "      <th>litter size</th>\n",
       "      <th>litters/year</th>\n",
       "      <th>refs</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td> Artiodactyla</td>\n",
       "      <td> Antilocapridae</td>\n",
       "      <td> Antilocapra</td>\n",
       "      <td>     americana</td>\n",
       "      <td>  45375</td>\n",
       "      <td> 8.13</td>\n",
       "      <td>  3246.36</td>\n",
       "      <td>   3.00</td>\n",
       "      <td>  8900</td>\n",
       "      <td>  13.53</td>\n",
       "      <td> 142</td>\n",
       "      <td> 1.85</td>\n",
       "      <td>   1.00</td>\n",
       "      <td> 1,2,6,9,23,26,27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td> Artiodactyla</td>\n",
       "      <td>        Bovidae</td>\n",
       "      <td>       Addax</td>\n",
       "      <td> nasomaculatus</td>\n",
       "      <td> 182375</td>\n",
       "      <td> 9.39</td>\n",
       "      <td>  5480.00</td>\n",
       "      <td>   6.50</td>\n",
       "      <td>  -999</td>\n",
       "      <td>  27.27</td>\n",
       "      <td> 308</td>\n",
       "      <td> 1.00</td>\n",
       "      <td>   0.99</td>\n",
       "      <td>     1,2,17,23,26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td> Artiodactyla</td>\n",
       "      <td>        Bovidae</td>\n",
       "      <td>   Aepyceros</td>\n",
       "      <td>      melampus</td>\n",
       "      <td>  41480</td>\n",
       "      <td> 6.35</td>\n",
       "      <td>  5093.00</td>\n",
       "      <td>   5.63</td>\n",
       "      <td> 15900</td>\n",
       "      <td>  16.66</td>\n",
       "      <td> 213</td>\n",
       "      <td> 1.00</td>\n",
       "      <td>   0.95</td>\n",
       "      <td>    1,2,8,9,23,29</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td> Artiodactyla</td>\n",
       "      <td>        Bovidae</td>\n",
       "      <td>  Alcelaphus</td>\n",
       "      <td>    buselaphus</td>\n",
       "      <td> 150000</td>\n",
       "      <td> 7.90</td>\n",
       "      <td> 10166.67</td>\n",
       "      <td>   6.50</td>\n",
       "      <td>  -999</td>\n",
       "      <td>  23.02</td>\n",
       "      <td> 240</td>\n",
       "      <td> 1.00</td>\n",
       "      <td>-999.00</td>\n",
       "      <td>        1,2,17,23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td> Artiodactyla</td>\n",
       "      <td>        Bovidae</td>\n",
       "      <td>  Ammodorcas</td>\n",
       "      <td>       clarkei</td>\n",
       "      <td>  28500</td>\n",
       "      <td> 6.80</td>\n",
       "      <td>  -999.00</td>\n",
       "      <td>-999.00</td>\n",
       "      <td>  -999</td>\n",
       "      <td>-999.00</td>\n",
       "      <td>-999</td>\n",
       "      <td> 1.00</td>\n",
       "      <td>-999.00</td>\n",
       "      <td>              1,2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          order          family        Genus        species  mass(g)  \\\n",
       "0  Artiodactyla  Antilocapridae  Antilocapra      americana    45375   \n",
       "1  Artiodactyla         Bovidae        Addax  nasomaculatus   182375   \n",
       "2  Artiodactyla         Bovidae    Aepyceros       melampus    41480   \n",
       "3  Artiodactyla         Bovidae   Alcelaphus     buselaphus   150000   \n",
       "4  Artiodactyla         Bovidae   Ammodorcas        clarkei    28500   \n",
       "\n",
       "   gestation(mo)  newborn(g)  weaning(mo)  wean mass(g)  AFR(mo)  \\\n",
       "0           8.13     3246.36         3.00          8900    13.53   \n",
       "1           9.39     5480.00         6.50          -999    27.27   \n",
       "2           6.35     5093.00         5.63         15900    16.66   \n",
       "3           7.90    10166.67         6.50          -999    23.02   \n",
       "4           6.80     -999.00      -999.00          -999  -999.00   \n",
       "\n",
       "   max. life(mo)  litter size  litters/year              refs  \n",
       "0            142         1.85          1.00  1,2,6,9,23,26,27  \n",
       "1            308         1.00          0.99      1,2,17,23,26  \n",
       "2            213         1.00          0.95     1,2,8,9,23,29  \n",
       "3            240         1.00       -999.00         1,2,17,23  \n",
       "4           -999         1.00       -999.00               1,2  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "url = \"http://esapubs.org/archive/ecol/E084/093/Mammal_lifehistories_v2.txt\"\n",
    "data = pd.read_csv(url, delimiter=\"\\t\")\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The command `data.head()` shows us the first few rows of the dataset.\n",
    "\n",
    "The following code groups the data based on the value in the `order` column, calculates the average of the `mass(g)` column for every order, and prints out the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The average mass of Artiodactyla is 112939.74354 grams\n",
      "The average mass of Carnivora is 42705.6617766 grams\n",
      "The average mass of Cetacea is 9115442.46673 grams\n",
      "The average mass of Dermoptera is 0.5 grams\n",
      "The average mass of Hyracoidea is 3030.835 grams\n",
      "The average mass of Insectivora is 58.0276923077 grams\n",
      "The average mass of Lagomorpha is 1316.13261905 grams\n",
      "The average mass of Macroscelidea is 11.522 grams\n",
      "The average mass of Perissodactyla is 694486.666667 grams\n",
      "The average mass of Pholidota is 7980.0 grams\n",
      "The average mass of Primates is 5066.5575641 grams\n",
      "The average mass of Proboscidea is 3342500.0 grams\n",
      "The average mass of Rodentia is 496.813082707 grams\n",
      "The average mass of Scandentia is 190.357142857 grams\n",
      "The average mass of Sirenia is 1169400.0 grams\n",
      "The average mass of Tubulidentata is 60000.0 grams\n",
      "The average mass of Xenarthra is 7238.5 grams\n"
     ]
    }
   ],
   "source": [
    "data_by_order = data.groupby('order')\n",
    "for order, order_data in data_by_order:\n",
    "    avg_mass = np.mean(order_data['mass(g)'])\n",
    "    print \"The average mass of {} is {} grams\".format(order, avg_mass)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {}
 },
 "nbformat": 4,
 "nbformat_minor": 0
}