{
 "metadata": {
  "name": "DataCleanup"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Here's how I used [pandas](http://pandas.pydata.org) to rearrange and clean up the [raw tide data from NOAA](http://tidesandcurrents.noaa.gov/data_menu.shtml?plot_backup=1&bdate=20121029&edate=20121030&datum=6&unit=1&shift=d&stn=8518750+The+Battery%2C+NY&type=Tide+Data&format=View+Data)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Read the data via pandas' flexible [read_table function](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.parsers.read_table.html). This returns a [DataFrame](http://pandas.pydata.org/pandas-docs/dev/dsintro.html#dataframe):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "d = pd.read_table('BatteryParkTideData_Cleaned.txt', sep='\\s+', parse_dates=[[1, 2]])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "d.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "pyout",
       "prompt_number": 3,
       "text": [
        "             Date_Time  ?Station  Pred6  Backup  Acoustc\n",
        "0  2012-10-29 00:00:00   8518750   1.59    4.68     4.65\n",
        "1  2012-10-29 00:06:00   8518750   1.50    4.55     4.54\n",
        "2  2012-10-29 00:12:00   8518750   1.40    4.46     4.44\n",
        "3  2012-10-29 00:18:00   8518750   1.31    4.36     4.33\n",
        "4  2012-10-29 00:24:00   8518750   1.22    4.28     4.26"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In the data I output I don't want dates, I just want the time since the first measurement, which is a simple, easy to work with floating point number.\n",
      "So next I add a new column to `d` that's the difference between each timestamp and the first timestamp:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "d['TimeOffset'] = d['Date_Time'] - d['Date_Time'][0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "d.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "pyout",
       "prompt_number": 7,
       "text": [
        "             Date_Time  ?Station  Pred6  Backup  Acoustc TimeOffset\n",
        "0  2012-10-29 00:00:00   8518750   1.59    4.68     4.65    0:00:00\n",
        "1  2012-10-29 00:06:00   8518750   1.50    4.55     4.54    0:06:00\n",
        "2  2012-10-29 00:12:00   8518750   1.40    4.46     4.44    0:12:00\n",
        "3  2012-10-29 00:18:00   8518750   1.31    4.36     4.33    0:18:00\n",
        "4  2012-10-29 00:24:00   8518750   1.22    4.28     4.26    0:24:00"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now I have the time since the first measurement, but it's still in a [datetime](http://docs.python.org/2/library/datetime.html) format, specifically a [timedelta](http://docs.python.org/2/library/datetime.html#timedelta-objects). To convert that to hours I use the [total_seconds](http://docs.python.org/2/library/datetime.html#datetime.timedelta.total_seconds) methods and divide by 3600. I add that to `d` as another new column:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "d['TimeOffsetHours'] = pd.Series(to.total_seconds() / 3600. for to in d['TimeOffset'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 14
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Finally I write a CSV using the [to_csv](http://pandas.pydata.org/pandas-docs/dev/io.html#writing-to-csv-format) method. I'm writing only the time since first measurement, predicted measurement, and measurement columns:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "d.to_csv('BatteryParkTideData.csv', na_rep='NA', cols=['TimeOffsetHours', 'Pred6', 'Backup', 'Acoustc'], index=False)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 15
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}