{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas\n", "======\n", "Pandas is a powerful data management library that produces data structures and associated tools that are ideal for scientific computing tasks related to data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Importing Data\n", "--------------\n", "Importing data from text files is easy." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " x y z\n", "0 1 2.0 3\n", "1 2 2.4 6\n", "2 3 1.9 8\n" ] } ], "source": [ "import pandas as pd\n", "data = pd.read_csv('../data/examp-data.txt')\n", "print(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also choose to use the first row of the data as an the index (the numbers to the left of the table). The index has special properties and is particularly useful for time-series data. Values in index columns should be unique." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " y z\n", "x \n", "1 2.0 3\n", "2 2.4 6\n", "3 1.9 8\n" ] } ], "source": [ "data_w_index = pd.read_csv('../data/examp-data.txt', index_col=0)\n", "print(data_w_index)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exporting Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can export to a number of different formats directly from Pandas including: csv, Excel, json, hdf, and SQL. You can export things to using the `.to_*` methods on a data frame. So, if you want to export to csv:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ",x,y,z\r\n", "0,1,2.0,3\r\n", "1,2,2.4,6\r\n", "2,3,1.9,8\r\n" ] } ], "source": [ "data.to_csv('../data/pandas_output.csv')\n", "!cat ./data/pandas_output.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `!cat` command is just a quick way to show what the resulting datafile looks like.\n", "\n", "If you don't want the index information you can easily get rid of it." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x,y,z\r\n", "1,2.0,3\r\n", "2,2.4,6\r\n", "3,1.9,8\r\n" ] } ], "source": [ "data.to_csv('../data/pandas_output.csv', index=False)\n", "!cat ./data/pandas_output.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting columns\n", "-----------------\n", "Specific columns can be selected using their names. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 2.0\n", "1 2.4\n", "2 1.9\n", "Name: y, dtype: float64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['y']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Multiple columns can selected using a list of names in the desired order." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
zx
0 3 1
1 6 2
2 8 3
\n", "
" ], "text/plain": [ " z x\n", "0 3 1\n", "1 6 2\n", "2 8 3" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[['z', 'x']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting rows\n", "--------------\n", "Rows can be selected by their index (the values to the left-hand side of the table) or by position." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yz
x
1 2.0 3
2 2.4 6
3 1.9 8
\n", "
" ], "text/plain": [ " y z\n", "x \n", "1 2.0 3\n", "2 2.4 6\n", "3 1.9 8" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_w_index" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "y 2.4\n", "z 6.0\n", "Name: 2, dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_w_index.loc[2]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "y 2\n", "z 3\n", "Name: 1, dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_w_index.iloc[0]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "x 1\n", "y 2\n", "z 3\n", "Name: 0, dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.iloc[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also grab a slice of rows from the dataframe." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yz
x
1 2.0 3
2 2.4 6
\n", "
" ], "text/plain": [ " y z\n", "x \n", "1 2.0 3\n", "2 2.4 6" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_w_index[0:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looping over rows\n", "\n", "Looping over the rows in a dataframe involves looping over the data itself, which is stored in `data.values`." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 2. 4. 6.]\n", "[ 4. 4.8 12. ]\n", "[ 6. 3.8 16. ]\n" ] } ], "source": [ "for row in data.values:\n", " print row * 2" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "x 2.0\n", "y 2.4\n", "z 6.0\n", "Name: 1, dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.values[0]\n", "data.ix[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is because often when working with data frames we won't want to loop over rows but instead use vectorized operations for things like subsetting and math (see below)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Subsetting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can get subsets of the data by giving conditions inside `[]`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xyz
1 2 2.4 6
\n", "
" ], "text/plain": [ " x y z\n", "1 2 2.4 6" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[data['y'] > 2.0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The syntax for specifying multiple conditions for subsetting is to include each condition in `()` and separate them with the `&` symbol." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xyz
1 2 2.4 6
\n", "
" ], "text/plain": [ " x y z\n", "1 2 2.4 6" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[(data['x'] > 1) & (data['y'] > 2)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Math" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Math is done by treating columsn from the data from just like they are variables." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 8.0\n", "1 16.4\n", "2 17.2\n", "dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['y'] * data['z'] + 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to use mathematical functions use the ones in `numpy` not the ones in `math`." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1.200566\n", "1 2.144452\n", "2 1.815437\n", "dtype: float64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "np.log(data['y']) * np.sqrt(data['z'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grouping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To work on chunks of data in automated ways we can use grouping. First, let's grab some data." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
orderfamilyGenusspeciesmass(g)gestation(mo)newborn(g)weaning(mo)wean mass(g)AFR(mo)max. life(mo)litter sizelitters/yearrefs
0 Artiodactyla Antilocapridae Antilocapra americana 45375 8.13 3246.36 3.00 8900 13.53 142 1.85 1.00 1,2,6,9,23,26,27
1 Artiodactyla Bovidae Addax nasomaculatus 182375 9.39 5480.00 6.50 -999 27.27 308 1.00 0.99 1,2,17,23,26
2 Artiodactyla Bovidae Aepyceros melampus 41480 6.35 5093.00 5.63 15900 16.66 213 1.00 0.95 1,2,8,9,23,29
3 Artiodactyla Bovidae Alcelaphus buselaphus 150000 7.90 10166.67 6.50 -999 23.02 240 1.00-999.00 1,2,17,23
4 Artiodactyla Bovidae Ammodorcas clarkei 28500 6.80 -999.00-999.00 -999-999.00-999 1.00-999.00 1,2
\n", "
" ], "text/plain": [ " order family Genus species mass(g) \\\n", "0 Artiodactyla Antilocapridae Antilocapra americana 45375 \n", "1 Artiodactyla Bovidae Addax nasomaculatus 182375 \n", "2 Artiodactyla Bovidae Aepyceros melampus 41480 \n", "3 Artiodactyla Bovidae Alcelaphus buselaphus 150000 \n", "4 Artiodactyla Bovidae Ammodorcas clarkei 28500 \n", "\n", " gestation(mo) newborn(g) weaning(mo) wean mass(g) AFR(mo) \\\n", "0 8.13 3246.36 3.00 8900 13.53 \n", "1 9.39 5480.00 6.50 -999 27.27 \n", "2 6.35 5093.00 5.63 15900 16.66 \n", "3 7.90 10166.67 6.50 -999 23.02 \n", "4 6.80 -999.00 -999.00 -999 -999.00 \n", "\n", " max. life(mo) litter size litters/year refs \n", "0 142 1.85 1.00 1,2,6,9,23,26,27 \n", "1 308 1.00 0.99 1,2,17,23,26 \n", "2 213 1.00 0.95 1,2,8,9,23,29 \n", "3 240 1.00 -999.00 1,2,17,23 \n", "4 -999 1.00 -999.00 1,2 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = \"http://esapubs.org/archive/ecol/E084/093/Mammal_lifehistories_v2.txt\"\n", "data = pd.read_csv(url, delimiter=\"\\t\")\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The command `data.head()` shows us the first few rows of the dataset.\n", "\n", "The following code groups the data based on the value in the `order` column, calculates the average of the `mass(g)` column for every order, and prints out the results." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The average mass of Artiodactyla is 112939.74354 grams\n", "The average mass of Carnivora is 42705.6617766 grams\n", "The average mass of Cetacea is 9115442.46673 grams\n", "The average mass of Dermoptera is 0.5 grams\n", "The average mass of Hyracoidea is 3030.835 grams\n", "The average mass of Insectivora is 58.0276923077 grams\n", "The average mass of Lagomorpha is 1316.13261905 grams\n", "The average mass of Macroscelidea is 11.522 grams\n", "The average mass of Perissodactyla is 694486.666667 grams\n", "The average mass of Pholidota is 7980.0 grams\n", "The average mass of Primates is 5066.5575641 grams\n", "The average mass of Proboscidea is 3342500.0 grams\n", "The average mass of Rodentia is 496.813082707 grams\n", "The average mass of Scandentia is 190.357142857 grams\n", "The average mass of Sirenia is 1169400.0 grams\n", "The average mass of Tubulidentata is 60000.0 grams\n", "The average mass of Xenarthra is 7238.5 grams\n" ] } ], "source": [ "data_by_order = data.groupby('order')\n", "for order, order_data in data_by_order:\n", " avg_mass = np.mean(order_data['mass(g)'])\n", " print \"The average mass of {} is {} grams\".format(order, avg_mass)" ] } ], "metadata": { "anaconda-cloud": {} }, "nbformat": 4, "nbformat_minor": 0 }