{ "metadata": { "name": "DataCleanup" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Here's how I used [pandas](http://pandas.pydata.org) to rearrange and clean up the [raw tide data from NOAA](http://tidesandcurrents.noaa.gov/data_menu.shtml?plot_backup=1&bdate=20121029&edate=20121030&datum=6&unit=1&shift=d&stn=8518750+The+Battery%2C+NY&type=Tide+Data&format=View+Data)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the data via pandas' flexible [read_table function](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.parsers.read_table.html). This returns a [DataFrame](http://pandas.pydata.org/pandas-docs/dev/dsintro.html#dataframe):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "d = pd.read_table('BatteryParkTideData_Cleaned.txt', sep='\\s+', parse_dates=[[1, 2]])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "d.head()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 3, "text": [ " Date_Time ?Station Pred6 Backup Acoustc\n", "0 2012-10-29 00:00:00 8518750 1.59 4.68 4.65\n", "1 2012-10-29 00:06:00 8518750 1.50 4.55 4.54\n", "2 2012-10-29 00:12:00 8518750 1.40 4.46 4.44\n", "3 2012-10-29 00:18:00 8518750 1.31 4.36 4.33\n", "4 2012-10-29 00:24:00 8518750 1.22 4.28 4.26" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the data I output I don't want dates, I just want the time since the first measurement, which is a simple, easy to work with floating point number.\n", "So next I add a new column to `d` that's the difference between each timestamp and the first timestamp:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "d['TimeOffset'] = d['Date_Time'] - d['Date_Time'][0]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "d.head()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 7, "text": [ " Date_Time ?Station Pred6 Backup Acoustc TimeOffset\n", "0 2012-10-29 00:00:00 8518750 1.59 4.68 4.65 0:00:00\n", "1 2012-10-29 00:06:00 8518750 1.50 4.55 4.54 0:06:00\n", "2 2012-10-29 00:12:00 8518750 1.40 4.46 4.44 0:12:00\n", "3 2012-10-29 00:18:00 8518750 1.31 4.36 4.33 0:18:00\n", "4 2012-10-29 00:24:00 8518750 1.22 4.28 4.26 0:24:00" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now I have the time since the first measurement, but it's still in a [datetime](http://docs.python.org/2/library/datetime.html) format, specifically a [timedelta](http://docs.python.org/2/library/datetime.html#timedelta-objects). To convert that to hours I use the [total_seconds](http://docs.python.org/2/library/datetime.html#datetime.timedelta.total_seconds) methods and divide by 3600. I add that to `d` as another new column:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "d['TimeOffsetHours'] = pd.Series(to.total_seconds() / 3600. for to in d['TimeOffset'])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally I write a CSV using the [to_csv](http://pandas.pydata.org/pandas-docs/dev/io.html#writing-to-csv-format) method. I'm writing only the time since first measurement, predicted measurement, and measurement columns:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "d.to_csv('BatteryParkTideData.csv', na_rep='NA', cols=['TimeOffsetHours', 'Pred6', 'Backup', 'Acoustc'], index=False)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }