{
  "metadata": {
    "name": ""
  }, 
  "nbformat": 3, 
  "nbformat_minor": 0, 
  "worksheets": [
    {
      "cells": [
        {
          "source": [
            "# Homework 1. Which of two things is larger?\n", 
            "\n", 
            "Due: Thursday, September 19, 11:59 PM\n", 
            "\n", 
            "<a href=https://raw.github.com/cs109/content/master/HW1.ipynb download=HW1.ipynb> Download this assignment</a>\n", 
            "\n", 
            "---"
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "source": [
            "Useful libraries for this assignment\n", 
            "\n", 
            "* [numpy](http://docs.scipy.org/doc/numpy-dev/user/index.html), for arrays\n", 
            "* [pandas](http://pandas.pydata.org/), for data frames\n", 
            "* [matplotlib](http://matplotlib.org/), for plotting\n", 
            "* [requests](http://docs.python-requests.org/en/latest/), for downloading web content\n", 
            "* [pattern](http://www.clips.ua.ac.be/pages/pattern), for parsing html and xml pages\n", 
            "* [fnmatch](http://docs.python.org/2/library/fnmatch.html) (optional), for Unix-style string matching"
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "cell_type": "code", 
          "language": "python", 
          "outputs": [], 
          "collapsed": false, 
          "prompt_number": 1, 
          "input": [
            "# special IPython command to prepare the notebook for matplotlib\n", 
            "%matplotlib inline \n", 
            "\n", 
            "from fnmatch import fnmatch\n", 
            "\n", 
            "import numpy as np\n", 
            "import pandas as pd\n", 
            "import matplotlib.pyplot as plt\n", 
            "import requests\n", 
            "from pattern import web\n", 
            "\n", 
            "\n", 
            "# set some nicer defaults for matplotlib\n", 
            "from matplotlib import rcParams\n", 
            "\n", 
            "#these colors come from colorbrewer2.org. Each is an RGB triplet\n", 
            "dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),\n", 
            "                (0.8509803921568627, 0.37254901960784315, 0.00784313725490196),\n", 
            "                (0.4588235294117647, 0.4392156862745098, 0.7019607843137254),\n", 
            "                (0.9058823529411765, 0.1607843137254902, 0.5411764705882353),\n", 
            "                (0.4, 0.6509803921568628, 0.11764705882352941),\n", 
            "                (0.9019607843137255, 0.6705882352941176, 0.00784313725490196),\n", 
            "                (0.6509803921568628, 0.4627450980392157, 0.11372549019607843),\n", 
            "                (0.4, 0.4, 0.4)]\n", 
            "\n", 
            "rcParams['figure.figsize'] = (10, 6)\n", 
            "rcParams['figure.dpi'] = 150\n", 
            "rcParams['axes.color_cycle'] = dark2_colors\n", 
            "rcParams['lines.linewidth'] = 2\n", 
            "rcParams['axes.grid'] = True\n", 
            "rcParams['axes.facecolor'] = '#eeeeee'\n", 
            "rcParams['font.size'] = 14\n", 
            "rcParams['patch.edgecolor'] = 'none'"
          ], 
          "metadata": {}
        }, 
        {
          "source": [
            "## Introduction\n", 
            "\n", 
            "This was the [XKCD comic](http://xkcd.com/1131/) after the 2012 Presidential election:\n", 
            "\n", 
            "<img src=\"http://imgs.xkcd.com/comics/math.png\">"
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "source": [
            "The comic refers to the fact that Nate Silver's statistical model (which is based mostly on combining information from pre-election polls) correctly predicted the outcome of the 2012 presidential race in all 50 states. \n", 
            "\n", 
            "Polling data isn't a perfect predictor for the future, and some polls are more accurate than others. This means that election forecastors must consider prediction uncertainty when building models.\n", 
            "\n", 
            "In this first assignment, you will perform a simple analysis of polling data about the upcoming <a href=\"http://en.wikipedia.org/wiki/Governor_(United_States)\">Governor races</a>. The assignment has three main parts:\n", 
            "\n", 
            "**First** you will build some tools to download historical polling data from the web, and parse it into a more convenient format. \n", 
            "\n", 
            "**Next** you will use these tools to aggregate and visualize several past Governor races\n", 
            "\n", 
            "**Finally** you will run a bootstrap analysis to estimate the probable outcome of current Governor races, given the level of precision of historical polls.\n", 
            "\n", 
            "---"
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "source": [
            "\n", 
            "\n", 
            "## Part 1: Collect and Clean\n", 
            "\n", 
            "The [Real Clear Politics](http://www.realclearpolitics.com) website archives many political polls. In addition, they combine related polls to form an \"RCP average\" estimate of public opinion over time. For example, the chart on [this page](http://www.realclearpolitics.com/epolls/2012/president/us/general_election_romney_vs_obama-1171.html) shows historical polling data for the Obama-Romney presidential race. The chart is an average of the polling data table below the chart.\n", 
            "\n", 
            "The data used to generate plots like this are stored as XML pages, with URLs like:\n", 
            "\n", 
            "http://charts.realclearpolitics.com/charts/[id].xml\n", 
            "\n", 
            "Here, [id] is a unique integer, found at the end of the URL of the page that displays the graph. The id for the Obama-Romney race is 1171:\n", 
            "\n", 
            "http://charts.realclearpolitics.com/charts/1171.xml\n", 
            "\n", 
            "Opening this page in Google Chrome or Firefox will show you the XML content in an easy-to-read format. Notice that XML tags are nested inside each other, hierarchically (the jargony term for this is the \"Document Object Model\", or \"DOM\"). The first step of webscraping is almost always exploring the HTML/XML source in a browser, and getting a sense of this hierarchy.\n", 
            "\n", 
            "---\n", 
            "\n", 
            "#### Problem 0\n", 
            "\n", 
            "The above XML page includes 5 distinct tags (one, for example, is `chart`). List these tags, and depict how they nest inside each other using an indented list. For example:\n", 
            "\n", 
            "* Page\n", 
            "  * Section\n", 
            "     * Paragraph\n", 
            "  * Conclusion"
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "source": [
            "*Your Answer Here*"
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "source": [
            "---\n", 
            "#### Problem 1\n", 
            "\n", 
            "We want to download and work with poll data like this. Like most programming tasks, we will break this into many smaller, easier pieces\n", 
            "\n", 
            "Fill in the code for the `get_poll_xml` function, that finds and downloads an XML page discussed above\n", 
            "\n", 
            "**Hint** \n", 
            "\n", 
            "`requests.get(\"http://www.google.com\").text` downloads the text from Google's homepage"
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "cell_type": "code", 
          "language": "python", 
          "outputs": [], 
          "collapsed": false, 
          "prompt_number": 2, 
          "input": [
            "\"\"\"\n", 
            "Function\n", 
            "--------\n", 
            "get_poll_xml\n", 
            "\n", 
            "Given a poll_id, return the XML data as a text string\n", 
            "\n", 
            "Inputs\n", 
            "------\n", 
            "poll_id : int\n", 
            "    The ID of the poll to fetch\n", 
            "\n", 
            "Returns\n", 
            "-------\n", 
            "xml : str\n", 
            "    The text of the XML page for that poll_id\n", 
            "\n", 
            "Example\n", 
            "-------\n", 
            ">>> get_poll_xml(1044)\n", 
            "u'<?xml version=\"1.0\" encoding=\"UTF-8\"?><chart><series><value xid=\\'0\\'>1/27/2009</value>\n", 
            "...etc...\n", 
            "\"\"\"    \n", 
            "#your code here    \n"
          ], 
          "metadata": {}
        }, 
        {
          "source": [
            "Here are some other functions we'll use later. `plot_colors` contains hints about parsing XML data."
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "cell_type": "code", 
          "language": "python", 
          "outputs": [], 
          "collapsed": false, 
          "prompt_number": 3, 
          "input": [
            "# \"r\"egular \"e\"xpressions is kind of a mini-language to\n", 
            "# do pattern matching on text\n", 
            "import re\n", 
            "\n", 
            "def _strip(s):\n", 
            "    \"\"\"This function removes non-letter characters from a word\n", 
            "    \n", 
            "    for example _strip('Hi there!') == 'Hi there'\n", 
            "    \"\"\"\n", 
            "    return re.sub(r'[\\W_]+', '', s)\n", 
            "\n", 
            "def plot_colors(xml):\n", 
            "    \"\"\"\n", 
            "    Given an XML document like the link above, returns a python dictionary\n", 
            "    that maps a graph title to a graph color.\n", 
            "    \n", 
            "    Both the title and color are parsed from attributes of the <graph> tag:\n", 
            "    <graph title=\"the title\", color=\"#ff0000\"> -> {'the title': '#ff0000'}\n", 
            "    \n", 
            "    These colors are in \"hex string\" format. This page explains them:\n", 
            "    http://coding.smashingmagazine.com/2012/10/04/the-code-side-of-color/\n", 
            "    \n", 
            "    Example\n", 
            "    -------\n", 
            "    >>> plot_colors(get_poll_xml(1044))\n", 
            "    {u'Approve': u'#000000', u'Disapprove': u'#FF0000'}\n", 
            "    \"\"\"\n", 
            "    dom = web.Element(xml)\n", 
            "    result = {}\n", 
            "    for graph in dom.by_tag('graph'):\n", 
            "        title = _strip(graph.attributes['title'])\n", 
            "        result[title] = graph.attributes['color']\n", 
            "    return result"
          ], 
          "metadata": {}
        }, 
        {
          "source": [
            "---\n", 
            "\n", 
            "#### Problem 2\n", 
            "\n", 
            "Even though `get_poll_xml` pulls data from the web into Python, it does so as a block of text. This still isn't very useful. Use the `web` module in `pattern` to parse this text, and extract data into a pandas DataFrame.\n", 
            "\n", 
            "**Hints**\n", 
            "\n", 
            "* You might want create python lists for each column in the XML. Then, to turn these lists into a DataFrame, run\n", 
            "\n", 
            "`pd.DataFrame({'column_label_1': list_1, 'column_label_2':list_2, ...})`\n", 
            "\n", 
            "* use the pandas function `pd.to_datetime` to convert strings into dates"
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
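        {
          "source": [
            "The two hints above can be sketched as follows. This is only a minimal illustration, assuming the column values have already been pulled out of the XML: the lists `dates` and `approve` are made-up stand-ins, and `sort_values` assumes a reasonably recent pandas release.\n", 
            "\n", 
            "```python\n", 
            "import pandas as pd\n", 
            "\n", 
            "# Hypothetical column lists, standing in for values parsed from the XML\n", 
            "dates = ['1/28/2009', '1/27/2009']\n", 
            "approve = [63.1, 63.3]\n", 
            "\n", 
            "# Build the frame from one list per column, converting the strings to dates\n", 
            "df = pd.DataFrame({'date': pd.to_datetime(dates), 'Approve': approve})\n", 
            "\n", 
            "# Sort chronologically and renumber the rows\n", 
            "df = df.sort_values('date').reset_index(drop=True)\n", 
            "```\n", 
            "\n", 
            "After sorting, the January 27 row comes first, and the `date` column holds timestamps rather than strings."
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 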
        {
          "cell_type": "code", 
          "language": "python", 
          "outputs": [], 
          "collapsed": false, 
          "prompt_number": 4, 
          "input": [
            "\"\"\"\n", 
            "    Function\n", 
            "    ---------\n", 
            "    rcp_poll_data\n", 
            "\n", 
            "    Extract poll information from an XML string, and convert to a DataFrame\n", 
            "\n", 
            "    Parameters\n", 
            "    ----------\n", 
            "    xml : str\n", 
            "        A string, containing the XML data from a page like \n", 
            "        get_poll_xml(1044)\n", 
            "        \n", 
            "    Returns\n", 
            "    -------\n", 
            "    A pandas DataFrame with the following columns:\n", 
            "        date: The date for each entry\n", 
            "        title_n: The data value for the gid=n graph (take the column name from the `title` tag)\n", 
            "        \n", 
            "    This DataFrame should be sorted by date\n", 
            "        \n", 
            "    Example\n", 
            "    -------\n", 
            "    Consider the following simple xml page:\n", 
            "    \n", 
            "    <chart>\n", 
            "    <series>\n", 
            "    <value xid=\"0\">1/27/2009</value>\n", 
            "    <value xid=\"1\">1/28/2009</value>\n", 
            "    </series>\n", 
            "    <graphs>\n", 
            "    <graph gid=\"1\" color=\"#000000\" balloon_color=\"#000000\" title=\"Approve\">\n", 
            "    <value xid=\"0\">63.3</value>\n", 
            "    <value xid=\"1\">63.3</value>\n", 
            "    </graph>\n", 
            "    <graph gid=\"2\" color=\"#FF0000\" balloon_color=\"#FF0000\" title=\"Disapprove\">\n", 
            "    <value xid=\"0\">20.0</value>\n", 
            "    <value xid=\"1\">20.0</value>\n", 
            "    </graph>\n", 
            "    </graphs>\n", 
            "    </chart>\n", 
            "    \n", 
            "    Given this string, rcp_poll_data should return\n", 
            "    result = pd.DataFrame({'date': pd.to_datetime(['1/27/2009', '1/28/2009']), \n", 
            "                           'Approve': [63.3, 63.3], 'Disapprove': [20.0, 20.0]})\n", 
            "\"\"\"\n", 
            "#your code here\n"
          ], 
          "metadata": {}
        }, 
        {
          "source": [
            "The output from `rcp_poll_data` is much more useful for analysis. For example, we can plot with it:"
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "cell_type": "code", 
          "language": "python", 
          "outputs": [], 
          "collapsed": false, 
          "prompt_number": 5, 
          "input": [
            "def poll_plot(poll_id):\n", 
            "    \"\"\"\n", 
            "    Make a plot of an RCP Poll over time\n", 
            "    \n", 
            "    Parameters\n", 
            "    ----------\n", 
            "    poll_id : int\n", 
            "        An RCP poll identifier\n", 
            "    \"\"\"\n", 
            "\n", 
            "    # hey, you wrote two of these functions. Thanks for that!\n", 
            "    xml = get_poll_xml(poll_id)\n", 
            "    data = rcp_poll_data(xml)\n", 
            "    colors = plot_colors(xml)\n", 
            "\n", 
            "    #remove characters like apostrophes\n", 
            "    data = data.rename(columns = {c: _strip(c) for c in data.columns})\n", 
            "\n", 
            "    #normalize poll numbers so they add to 100%    \n", 
            "    norm = data[colors.keys()].sum(axis=1) / 100    \n", 
            "    for c in colors.keys():\n", 
            "        data[c] /= norm\n", 
            "    \n", 
            "    for label, color in colors.items():\n", 
            "        plt.plot(data.date, data[label], color=color, label=label)        \n", 
            "        \n", 
            "    plt.xticks(rotation=70)\n", 
            "    plt.legend(loc='best')\n", 
            "    plt.xlabel(\"Date\")\n", 
            "    plt.ylabel(\"Normalized Poll Percentage\")"
          ], 
          "metadata": {}
        }, 
        {
          "source": [
            "If you've done everything right so far, the following code should reproduce the graph on [this page](http://www.realclearpolitics.com/epolls/other/president_obama_job_approval-1044.html)"
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "cell_type": "code", 
          "language": "python", 
          "outputs": [], 
          "collapsed": false, 
          "prompt_number": 6, 
          "input": [
            "poll_plot(1044)\n", 
            "plt.title(\"Obama Job Approval\")"
          ], 
          "metadata": {}
        }, 
        {
          "source": [
            "---\n", 
            "\n", 
            "## Part 2: Aggregate and Visualize\n"
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "source": [
            "#### Problem 3\n", 
            "\n", 
            "Unfortunately, these data don't have any error bars. If a candidate leads by 10% in the RCP average, is she a shoo-in to win? Or is this number too close to call? Does a 10% poll lead mean more 1 day before a race than it does 1 week before? Without error estimates, these questions are impossible to answer.\n", 
            "\n", 
            "To get a sense of how accurate the RCP polls are, you will gather data from many previous Governor races, where the outcome is known.\n", 
            "\n", 
            "This url has links to many governer races. \n", 
            "\n", 
            "http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html\n", 
            "\n", 
            "Notice that each link to a governor race has the following URL pattern:\n", 
            "\n", 
            "http://www.realclearpolitics.com/epolls/[YEAR]/governor/[STATE]/[TITLE]-[ID].html\n", 
            "\n", 
            "\n", 
            "Write a function that scans html for links to URLs like this\n", 
            "\n", 
            "**Hint** The [fnmatch](http://docs.python.org/2/library/fnmatch.html) function is useful for simple string matching tasks."
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
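        {
          "source": [
            "As a small illustration of the hint, `fnmatch` compares a string against a shell-style pattern in which `*` matches any run of characters. The pattern below is an assumption built from the URL template above, not the only possible choice; the two URLs are taken from this assignment.\n", 
            "\n", 
            "```python\n", 
            "from fnmatch import fnmatch\n", 
            "\n", 
            "# Assumed pattern: '*' stands in for [YEAR], [STATE], [TITLE] and [ID]\n", 
            "pattern = 'http://www.realclearpolitics.com/epolls/*/governor/*/*-*.html'\n", 
            "\n", 
            "race = 'http://www.realclearpolitics.com/epolls/2010/governor/ca/california_governor_whitman_vs_brown-1113.html'\n", 
            "overview = 'http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html'\n", 
            "\n", 
            "# fnmatch(string, pattern) returns True when the whole string fits the pattern\n", 
            "is_race = fnmatch(race, pattern)\n", 
            "is_overview = fnmatch(overview, pattern)\n", 
            "```\n", 
            "\n", 
            "Here `is_race` is `True` and `is_overview` is `False`, because the map page has neither a state directory nor a `-[ID]` suffix."
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 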
        {
          "cell_type": "code", 
          "language": "python", 
          "outputs": [], 
          "collapsed": false, 
          "prompt_number": 7, 
          "input": [
            "\"\"\"\n", 
            "    Function\n", 
            "    --------\n", 
            "    find_governor_races\n", 
            "\n", 
            "    Find and return links to RCP races on a page like\n", 
            "    http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html\n", 
            "    \n", 
            "    Parameters\n", 
            "    ----------\n", 
            "    html : str\n", 
            "        The HTML content of a page to scan\n", 
            "        \n", 
            "    Returns\n", 
            "    -------\n", 
            "    A list of urls for Governer race pages\n", 
            "    \n", 
            "    Example\n", 
            "    -------\n", 
            "    For a page like\n", 
            "    \n", 
            "    <html>\n", 
            "    <body>\n", 
            "    <a href=\"http://www.realclearpolitics.com/epolls/2010/governor/ma/massachusetts_governor_baker_vs_patrick_vs_cahill-1154.html\"></a>\n", 
            "    <a href=\"http://www.realclearpolitics.com/epolls/2010/governor/ca/california_governor_whitman_vs_brown-1113.html\"></a>\n", 
            "    </body>\n", 
            "    </html>\n", 
            "    \n", 
            "    find_governor_races would return\n", 
            "    ['http://www.realclearpolitics.com/epolls/2010/governor/ma/massachusetts_governor_baker_vs_patrick_vs_cahill-1154.html',\n", 
            "     'http://www.realclearpolitics.com/epolls/2010/governor/ca/california_governor_whitman_vs_brown-1113.html']\n", 
            "\"\"\"\n", 
            "#your code here\n"
          ], 
          "metadata": {}
        }, 
        {
          "source": [
            "#### Problem 4\n", 
            "\n", 
            "At this point, you have functions to find a collection of governor races, download historical polling data from each one,\n", 
            "parse them into a numerical DataFrame, and plot this data.\n", 
            "\n", 
            "The main question we have about these data are how accurately they predict election outcomes. To answer this question, we\n", 
            "need to grab the election outcome data.\n", 
            "\n", 
            "Write a function that looks up and returns the election result on a page like [this one](http://www.realclearpolitics.com/epolls/2010/governor/ca/california_governor_whitman_vs_brown-1113.html). \n", 
            "\n", 
            "**Remember to look at the HTML source!**\n", 
            "\n", 
            "You can do this by selection `view->developer->view source` in Chrome, or `Tools -> web developer -> page source` in Firefox. Altenatively, you can right-click on a part of the page, and select \"inspect element\""
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "cell_type": "code", 
          "language": "python", 
          "outputs": [], 
          "collapsed": false, 
          "prompt_number": 8, 
          "input": [
            "\"\"\"\n", 
            "    Function\n", 
            "    --------\n", 
            "    race_result\n", 
            "\n", 
            "    Return the actual voting results on a race page\n", 
            "    \n", 
            "    Parameters\n", 
            "    ----------\n", 
            "    url : string\n", 
            "        The website to search through\n", 
            "        \n", 
            "    Returns\n", 
            "    -------\n", 
            "    A dictionary whose keys are candidate names,\n", 
            "    and whose values is the percentage of votes they received.\n", 
            "    \n", 
            "    If necessary, normalize these numbers so that they add up to 100%.\n", 
            "    \n", 
            "    Example\n", 
            "    --------\n", 
            "    >>> url = 'http://www.realclearpolitics.com/epolls/2010/governor/ca/california_governor_whitman_vs_brown-1113.html'\n", 
            "    >>> race_result(url)\n", 
            "    {'Brown': 56.0126582278481, 'Whitman': 43.9873417721519}\n", 
            "\"\"\"\n", 
            "#your code here\n"
          ], 
          "metadata": {}
        }, 
        {
          "source": [
            "Here are some more utility functions that take advantage of what you've done so far."
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "cell_type": "code", 
          "language": "python", 
          "outputs": [], 
          "collapsed": false, 
          "prompt_number": 9, 
          "input": [
            "def id_from_url(url):\n", 
            "    \"\"\"Given a URL, look up the RCP identifier number\"\"\"\n", 
            "    return url.split('-')[-1].split('.html')[0]\n", 
            "\n", 
            "\n", 
            "def plot_race(url):\n", 
            "    \"\"\"Make a plot summarizing a senate race\n", 
            "    \n", 
            "    Overplots the actual race results as dashed horizontal lines\n", 
            "    \"\"\"\n", 
            "    #hey, thanks again for these functions!\n", 
            "    id = id_from_url(url)\n", 
            "    xml = get_poll_xml(id)    \n", 
            "    colors = plot_colors(xml)\n", 
            "\n", 
            "    if len(colors) == 0:\n", 
            "        return\n", 
            "    \n", 
            "    #really, you shouldn't have\n", 
            "    result = race_result(url)\n", 
            "    \n", 
            "    poll_plot(id)\n", 
            "    plt.xlabel(\"Date\")\n", 
            "    plt.ylabel(\"Polling Percentage\")\n", 
            "    for r in result:\n", 
            "        plt.axhline(result[r], color=colors[_strip(r)], alpha=0.6, ls='--')\n"
          ], 
          "metadata": {}
        }, 
        {
          "source": [
            "Now that this is done, we can easily visualize many historical Governer races. The solid line plots the poll history, the dotted line reports the actual result.\n", 
            "\n", 
            "If this code block fails, you probably have a bug in one of your functions."
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "cell_type": "code", 
          "language": "python", 
          "outputs": [], 
          "collapsed": false, 
          "prompt_number": 10, 
          "input": [
            "page = requests.get('http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html').text.encode('ascii', 'ignore')\n", 
            "\n", 
            "for race in find_governor_races(page):\n", 
            "    plot_race(race)\n", 
            "    plt.show()"
          ], 
          "metadata": {}
        }, 
        {
          "source": [
            "Briefly summarize these graphs -- how accurate is the typical poll a day before the election? How often does a prediction one month before the election mispredict the actual winner?"
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "source": [
            "**Your summary here**"
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "source": [
            "---\n", 
            "\n", 
            "## Part 3: Analysis\n", 
            "\n", 
            "#### Problem 5\n", 
            "\n", 
            "You are (finally!) in a position to do some quantitative analysis.\n", 
            "\n", 
            "We have provided an `error_data` function that builds upon the functions you have written. It computes a new DataFrame with information about polling errors.\n", 
            "\n", 
            "Use `error_data`, `find_governer_races`, and `pd.concat` to construct a Data Frame summarizing the forecast errors\n", 
            "from all the Governor races\n", 
            "\n", 
            "**Hint** \n", 
            "\n", 
            "It's best to set `ignore_index=True` in `pd.concat`"
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
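        {
          "source": [
            "To see what the hint is about, here is a small sketch of `pd.concat` with and without `ignore_index=True`. The two frames are made-up stand-ins for per-race output, not real `error_data` results.\n", 
            "\n", 
            "```python\n", 
            "import pandas as pd\n", 
            "\n", 
            "# Two hypothetical per-race frames, each with its own 0-based index\n", 
            "a = pd.DataFrame({'candidate': ['Brown', 'Whitman'], 'error': [1.5, -1.5]})\n", 
            "b = pd.DataFrame({'candidate': ['Patrick'], 'error': [0.5]})\n", 
            "\n", 
            "# Default behaviour keeps each frame's original row labels\n", 
            "kept = pd.concat([a, b])\n", 
            "\n", 
            "# ignore_index=True discards them and numbers the rows 0..n-1\n", 
            "fresh = pd.concat([a, b], ignore_index=True)\n", 
            "```\n", 
            "\n", 
            "In `kept` the row labels are 0, 1, 0, so a label-based lookup of row 0 returns two rows; in `fresh` they are 0, 1, 2."
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 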
        {
          "cell_type": "code", 
          "language": "python", 
          "outputs": [], 
          "collapsed": false, 
          "prompt_number": 11, 
          "input": [
            "def party_from_color(color):\n", 
            "    if color in ['#0000CC', '#3B5998']:\n", 
            "        return 'democrat'\n", 
            "    if color in ['#FF0000', '#D30015']:\n", 
            "        return 'republican'\n", 
            "    return 'other'\n", 
            "\n", 
            "\n", 
            "def error_data(url):\n", 
            "    \"\"\"\n", 
            "    Given a Governor race URL, download the poll data and race result,\n", 
            "    and construct a DataFrame with the following columns:\n", 
            "    \n", 
            "    candidate: Name of the candidate\n", 
            "    forecast_length: Number of days before the election\n", 
            "    percentage: The percent of poll votes a candidate has.\n", 
            "                Normalized to that the canddidate percentages add to 100%\n", 
            "    error: Difference between percentage and actual race reulst\n", 
            "    party: Political party of the candidate\n", 
            "    \n", 
            "    The data are resampled as necessary, to provide one data point per day\n", 
            "    \"\"\"\n", 
            "    \n", 
            "    id = id_from_url(url)\n", 
            "    xml = get_poll_xml(id)\n", 
            "    \n", 
            "    colors = plot_colors(xml)\n", 
            "    if len(colors) == 0:\n", 
            "        return pd.DataFrame()\n", 
            "    \n", 
            "    df = rcp_poll_data(xml)\n", 
            "    result = race_result(url)\n", 
            "    \n", 
            "    #remove non-letter characters from columns\n", 
            "    df = df.rename(columns={c: _strip(c) for c in df.columns})\n", 
            "    for k, v in result.items():\n", 
            "        result[_strip(k)] = v \n", 
            "    \n", 
            "    candidates = [c for c in df.columns if c is not 'date']\n", 
            "        \n", 
            "    #turn into a timeseries...\n", 
            "    df.index = df.date\n", 
            "    \n", 
            "    #...so that we can resample at regular, daily intervals\n", 
            "    df = df.resample('D')\n", 
            "    df = df.dropna()\n", 
            "    \n", 
            "    #compute forecast length in days\n", 
            "    #(assuming that last forecast happens on the day of the election, for simplicity)\n", 
            "    forecast_length = (df.date.max() - df.date).values\n", 
            "    forecast_length = forecast_length / np.timedelta64(1, 'D')  # convert to number of days\n", 
            "    \n", 
            "    #compute forecast error\n", 
            "    errors = {}\n", 
            "    normalized = {}\n", 
            "    poll_lead = {}\n", 
            "    \n", 
            "    for c in candidates:\n", 
            "        #turn raw percentage into percentage of poll votes\n", 
            "        corr = df[c].values / df[candidates].sum(axis=1).values * 100.\n", 
            "        err = corr - result[_strip(c)]\n", 
            "        \n", 
            "        normalized[c] = corr\n", 
            "        errors[c] = err\n", 
            "        \n", 
            "    n = forecast_length.size\n", 
            "    \n", 
            "    result = {}\n", 
            "    result['percentage'] = np.hstack(normalized[c] for c in candidates)\n", 
            "    result['error'] = np.hstack(errors[c] for c in candidates)\n", 
            "    result['candidate'] = np.hstack(np.repeat(c, n) for c in candidates)\n", 
            "    result['party'] = np.hstack(np.repeat(party_from_color(colors[_strip(c)]), n) for c in candidates)\n", 
            "    result['forecast_length'] = np.hstack(forecast_length for _ in candidates)\n", 
            "    \n", 
            "    result = pd.DataFrame(result)\n", 
            "    return result"
          ], 
          "metadata": {}
        }, 
        {
          "cell_type": "code", 
          "language": "python", 
          "outputs": [], 
          "collapsed": false, 
          "prompt_number": 12, 
          "input": [
            "\"\"\"\n", 
            "function\n", 
            "---------\n", 
            "all_error_data\n", 
            "\n", 
            "Calls error_data on all races from find_governer_races(page),\n", 
            "and concatenates into a single DataFrame\n", 
            "\n", 
            "Parameters\n", 
            "-----------\n", 
            "None\n", 
            "\n", 
            "Examples\n", 
            "--------\n", 
            "df = all_error_data()\n", 
            "\"\"\"\n", 
            "#your code here\n"
          ], 
          "metadata": {}
        }, 
        {
          "cell_type": "code", 
          "language": "python", 
          "outputs": [], 
          "collapsed": false, 
          "prompt_number": 13, 
          "input": [
            "errors = all_error_data()"
          ], 
          "metadata": {}
        }, 
        {
          "source": [
            "Here's a histogram of the error of every polling measurement in the data"
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "cell_type": "code", 
          "language": "python", 
          "outputs": [], 
          "collapsed": false, 
          "prompt_number": 14, 
          "input": [
            "errors.error.hist(bins=50)\n", 
            "plt.xlabel(\"Polling Error\")\n", 
            "plt.ylabel('N')"
          ], 
          "metadata": {}
        }, 
        {
          "source": [
            "### Problem 6\n", 
            "\n", 
            "Compute the standard deviation of the polling errors. How much uncertainty is there in the typical RCP poll?"
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "cell_type": "code", 
          "language": "python", 
          "outputs": [], 
          "collapsed": false, 
          "prompt_number": 15, 
          "input": [
            "#your code here\n"
          ], 
          "metadata": {}
        }, 
        {
          "source": [
            "### Problem 7\n", 
            "\n", 
            "Repeat this calculation for the data where `errors.forecast_length < 7` (i.e. the polls within a week of an election). How much more/less accurate are they? How about the data where `errors.forecast_length > 30`? \n", 
            "\n", 
            "**Comment on this in 1 or 2 sentences**. Does this make sense?"
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "cell_type": "code", 
          "language": "python", 
          "outputs": [], 
          "collapsed": false, 
          "prompt_number": 16, 
          "input": [
            "#your code here\n"
          ], 
          "metadata": {}
        }, 
        {
          "source": [
            "### Problem 8\n", 
            "\n", 
            "**Bootstrap resampling** is a general purpose way to use empirical data like the `errors` DataFrame to estimate uncertainties. For example, consider the [Viriginia Governor Race](http://www.realclearpolitics.com/epolls/2013/governor/va/virginia_governor_cuccinelli_vs_mcauliffe-3033.html). If we wanted to estimate how likey it is that McAuliffe will win given the current RCP data, the approch would be:\n", 
            "\n", 
            "1. Pick a large number N of experiments to run (say N=1000).\n", 
            "2. For each experiment, randomly select a value from `errors.error`. We are assuming that these numbers represent a reasonable error distribution for the current poll data.\n", 
            "3. Assume that the error on McAullife's current polling score is given by this number (and, by extension, the error on Cuccinelli's poll score is the opposite). Calculate who actually wins the election in this simulation.\n", 
            "4. Repeat N times, and calculate the percentage of simulations where either candidate wins.\n", 
            "\n", 
            "Bootstrapping isn't foolproof: it makes the assumption that the previous Governor race errors are representative of the Virginia race, and it does a bad job at estimating very rare events (with only ~30 races in the errors DataFrame, it would be hard to accurately predict probabilities for 1-in-a-million scenarios). Nevertheless, it's a versatile technique.\n", 
            "\n", 
            "Use bootstrap resampling to estimate how likely it is that each candidate could win the following races.\n", 
            "\n", 
            " * [Virginia Governor](http://www.realclearpolitics.com/epolls/2013/governor/va/virginia_governor_cuccinelli_vs_mcauliffe-3033.html)\n", 
            " * [New Jersey Governor](http://www.realclearpolitics.com/epolls/2013/governor/nj/new_jersey_governor_christie_vs_buono-3411.html)\n", 
            " \n", 
            "**Summarize your results in a paragraph. What conclusions do you draw from the bootstrap analysis, and what assumptions did you make in reaching this conclusion. What are some limitations of this analysis?**\n", 
            " "
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "cell_type": "code", 
          "language": "python", 
          "outputs": [], 
          "collapsed": false, 
          "prompt_number": 17, 
          "input": [
            "#your code here\n"
          ], 
          "metadata": {}
        }, 
        {
          "source": [
            "**Your summary here**"
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "source": [
            "## Parting Thoughts\n", 
            "\n", 
            "For comparison, most of the predictions in Nate Silver's [presidental forecast](http://fivethirtyeight.blogs.nytimes.com/fivethirtyeights-2012-forecast/) had confidences of >95%. This is more precise than what we can estimate from the RCP poll alone. His approach, however, is the same basic idea (albeit he used many more polls, and carefully calibrated each based on demographic and other information). Homework 2 will dive into some of his techniques further.\n", 
            "\n", 
            "\n", 
            "## How to submit\n", 
            "\n", 
            "To submit your homework, create a folder named lastname_firstinitial_hw0 and place this notebook file in the folder. If your notebook requires any additional data files to run (it shouldn't), add them to this directory as well. Compress the folder (please use .zip compression) and submit to the CS109 dropbox in the appropriate folder. If we cannot access your work because these directions are not followed correctly, we will not grade your work."
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }, 
        {
          "source": [
            "---\n", 
            "*css tweaks in this cell*\n", 
            "<style>\n", 
            "div.text_cell_render {\n", 
            "    line-height: 150%;\n", 
            "    font-size: 110%;\n", 
            "    width: 800px;\n", 
            "    margin-left:50px;\n", 
            "    margin-right:auto;\n", 
            "    }\n", 
            "</style>"
          ], 
          "cell_type": "markdown", 
          "metadata": {}
        }
      ], 
      "metadata": {}
    }
  ]
}