{ "metadata": { "name": "Chapter5" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Mining the Social Web, 1st Edition - The Tweet, The Whole Tweet, and Nothing But the Tweet (Chapter 5)\n", "\n", "If you only have 10 seconds...\n", "\n", "Twitter's new API will prevent you from running much of the code from _Mining the Social Web_, and this IPython Notebook shows you how to roll with the changes and adapt as painlessly as possible until an updated printing is available. In particular, it shows you how to authenticate before executing any API requests illustrated in this chapter and how to use the new search API amongst other things. It is highly recommended that you read the IPython Notebook file for Chapter 1 before attempting the examples in this chapter if you haven't already. One of the examples also presumes that you've run an example from Chapter 4 and stored some data in Redis that is recycled into this chapter.\n", "\n", "If you have a couple of minutes...\n", "\n", "Twitter is officially retiring v1.0 of their API as of March 2013 with v1.1 of the API being the new status quo. There are a few fundamental differences that social web miners that should consider (see Twitter's blog at https://dev.twitter.com/blog/changes-coming-to-twitter-api and https://dev.twitter.com/docs/api/1.1/overview) with the two changes that are most likely to affect an existing workflow being that authentication is now mandatory for *all* requests, rate-limiting being on a per resource basis (as opposed to an overall rate limit based on a fixed number of requests per unit time), various platform objects changing (for the better), and search semantics changing to a \"pageless\" approach. All in all, the v1.1 API looks much cleaner and more consistent, and it should be a good thing longer-term although it may cause interim pains for folks migrating to it.\n", "\n", "The latest printing of Mining the Social Web (2012-02-22, Third release) reflects v1.0 of the API, and this document is intended to provide readers with updated examples from Chapter 5 of the book until a new printing provides updates.\n", "\n", "Unlike the IPython Notebook for Chapter 1, there is no filler in this notebook at this time. See the Chapter 1 notebook for a good introduction to using the Twitter API and all that it entails.\n", "\n", "As a reader of my book, I want you to know that I'm committed to helping you in any way that I can, so please reach out on Facebook at https://www.facebook.com/MiningTheSocialWeb or on Twitter at http://twitter.com/SocialWebMining if you have any questions or concerns in the meanwhile. I'd also love your feedback on whether or not you think that IPython Notebook is a good tool for tinkering with the source code for the book, because I'm strongly considering it as a supplement for each chapter.\n", "\n", "Regards - Matthew A. Russell\n", "\n", "\n", "## A Brief Technical Preamble\n", "\n", "* You will need to set your PYTHONPATH environment variable to point to the 'python_code' folder for the GitHub source code when launching this notebook or some of the examples won't work, because they import utility code that's located there\n", "\n", "* Note that this notebook doesn't repeatedly redefine a connection to the Twitter API. It creates a connection one time and resuses it throughout the remainder of the examples in the notebook\n", "\n", "* Arguments that are typically passed in through the command line are hardcoded in the examples for convenience. CLI arguments are typically in ALL_CAPS, so they're easy to spot and change as needed\n", "\n", "* For simplicity, examples that harvest data are limited to small numbers so that it's easier to use experiment with this notebook (given that @timoreilly, the principal subject of the examples, has vast numbers of followers)\n", "\n", "* The parenthetical file names at the end of the captions for the examples correspond to files in the 'python_code' folder of the GitHub repository\n", "\n", "* Just like you'd learn from reading the book, you'll need to have a CouchDB server running because several of the examples in this chapter store and fetch data from it\n", "\n", "* The package twitter_text that is illustrated in some examples for extracting \"tweet entities\" is no longer necessary because the v1.1 API provides tweet entities, but the code still reflects it for compatibility with the current discussion in the book" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 5-2. Extracting tweet entities with a little help from the twitter_text package (the_tweet__extract_tweet_entities.py)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#################################################################################\n", "# NOTE: The opt-in \"include_entities\" flag can be passed in as a keyword \n", "# argument to to t.statuses.show to have Twitter's API extract the entities \n", "# instead of using the getEntities function as described in this example like so:\n", "#\n", "# tweet = t.statuses.show(id=TWEET_ID, include_entities=1)\n", "# \n", "# This is a case-in-point of Twitter's API constantly evolving to make the lives\n", "# of developers easier. Their API slowly evolved quite a bit over the course of\n", "# 2010 as Mining the Social Web was being written, and will no doubt continue\n", "# to evolve and obsolete additional examples. Still, however, not all Twitter \n", "# APIs provide an opt-in parameter for extracting tweet entities (as of early\n", "# January 2010 anyway), and it is likely the case that you'll need to perform \n", "# this work manually for histroical or archived data that was collected prior \n", "# to mid- to late-2010 unless 3rd party data providers perform the work for you.\n", "#################################################################################\n", "\n", "import sys\n", "import json\n", "import twitter_text # easy_install twitter-text-py\n", "import twitter\n", "from twitter__login import login\n", "\n", "# Get a tweet id by clicking on status \"Details\" right off of twitter.com. \n", "# For example, http://twitter.com/#!/timoreilly/status/17386521699024896\n", "\n", "TWEET_ID = '17386521699024896' # XXX: IPython Notebook cannot prompt for input\n", "\n", "def getEntities(tweet):\n", "\n", " # Now extract various entities from it and build up a familiar structure\n", "\n", " extractor = twitter_text.Extractor(tweet['text'])\n", "\n", " # Note that the production Twitter API contains a few additional fields in\n", " # the entities hash that would require additional API calls to resolve\n", "\n", " entities = {}\n", " entities['user_mentions'] = []\n", " for um in extractor.extract_mentioned_screen_names_with_indices():\n", " entities['user_mentions'].append(um)\n", "\n", " entities['hashtags'] = []\n", " for ht in extractor.extract_hashtags_with_indices():\n", "\n", " # massage field name to match production twitter api\n", "\n", " ht['text'] = ht['hashtag']\n", " del ht['hashtag']\n", " entities['hashtags'].append(ht)\n", "\n", " entities['urls'] = []\n", " for url in extractor.extract_urls_with_indices():\n", " entities['urls'].append(url)\n", "\n", " return entities\n", "\n", "\n", "# Fetch a tweet using an API method of your choice and mixin the entities\n", "\n", "t = twitter.Twitter(domain='api.twitter.com', api_version='1.1')\n", "\n", "tweet = t.statuses.show(id=TWEET_ID)\n", "\n", "tweet['entities'] = getEntities(tweet)\n", "\n", "print json.dumps(tweet, indent=4)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 5-3. Harvesting tweets from a user or public timeline (the_tweet__harvest_timeline.py)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "import time\n", "import twitter\n", "import couchdb\n", "from couchdb.design import ViewDefinition\n", "from twitter__login import login\n", "from twitter__util import makeTwitterRequest\n", "from twitter__util import getNextQueryMaxIdParam\n", "\n", "\n", "TIMELINE_NAME = 'user' # XXX: IPython Notebook cannot prompt for input\n", "MAX_PAGES = 2 # XXX: IPython Notebook cannot prompt for input\n", "USER = 'timoreilly' # XXX: IPython Notebook cannot prompt for input\n", "\n", "KW = { # For the Twitter API call\n", " 'count': 200,\n", " 'trim_user': 'true',\n", " 'include_rts' : 'true',\n", " 'since_id' : 1,\n", " }\n", "\n", "if TIMELINE_NAME == 'user':\n", " USER = sys.argv[3]\n", " KW['screen_name'] = USER\n", "if TIMELINE_NAME == 'home' and MAX_PAGES > 4:\n", " MAX_PAGES = 4\n", "if TIMELINE_NAME == 'user' and MAX_PAGES > 16:\n", " MAX_PAGES = 16\n", "\n", "t = login()\n", "\n", "# Establish a connection to a CouchDB database\n", "server = couchdb.Server('http://localhost:5984')\n", "DB = 'tweets-%s-timeline' % (TIMELINE_NAME, )\n", "\n", "if USER:\n", " DB = '%s-%s' % (DB, USER)\n", "\n", "try:\n", " db = server.create(DB)\n", "except couchdb.http.PreconditionFailed, e:\n", "\n", " # Already exists, so append to it, keeping in mind that duplicates could occur\n", "\n", " db = server[DB]\n", "\n", " # Try to avoid appending duplicate data into the system by only retrieving tweets \n", " # newer than the ones already in the system. A trivial mapper/reducer combination \n", " # allows us to pull out the max tweet id which guards against duplicates for the \n", " # home and user timelines. This is best practice for the Twitter v1.1 API\n", " # See https://dev.twitter.com/docs/working-with-timelines\n", "\n", "\n", " def idMapper(doc):\n", " yield (None, doc['id'])\n", "\n", "\n", " def maxFindingReducer(keys, values, rereduce):\n", " return max(values)\n", "\n", "\n", " view = ViewDefinition('index', 'max_tweet_id', idMapper, maxFindingReducer,\n", " language='python')\n", " view.sync(db)\n", "\n", " KW['since_id'] = int([_id for _id in db.view('index/max_tweet_id')][0].value)\n", "\n", "api_call = getattr(t.statuses, TIMELINE_NAME + '_timeline')\n", "tweets = makeTwitterRequest(api_call, **KW)\n", "db.update(tweets, all_or_nothing=True)\n", "print 'Fetched %i tweets' % len(tweets)\n", "\n", "page_num = 1\n", "while page_num < MAX_PAGES and len(tweets) > 0:\n", "\n", " # Necessary for traversing the timeline in Twitter's v1.1 API.\n", " # See https://dev.twitter.com/docs/working-with-timelines\n", " KW['max_id'] = getNextQueryMaxIdParam(tweets)\n", "\n", " api_call = getattr(t.statuses, TIMELINE_NAME + '_timeline')\n", " tweets = makeTwitterRequest(api_call, **KW)\n", " db.update(tweets, all_or_nothing=True)\n", " print 'Fetched %i tweets' % len(tweets)\n", " page_num += 1" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 5-4. Extracting entities from tweets and performing simple frequency analysis (the_tweet__count_entities_in_tweets.py)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Note: The Twitter v1.1 API includes tweet entities by default, so the use of the\n", "# twitter_text package for parsing out tweet entities in this chapter is no longer\n", "# relevant, but included for continuity with the text of the book.\n", "\n", "import sys\n", "import couchdb\n", "from couchdb.design import ViewDefinition\n", "from prettytable import PrettyTable\n", "\n", "DB = 'tweets-user-timeline-timoreilly' # XXX: IPython Notebook cannot prompt for input\n", "\n", "server = couchdb.Server('http://localhost:5984')\n", "db = server[DB]\n", "\n", "FREQ_THRESHOLD = 3 # XXX: IPython Notebook cannot prompt for input\n", "\n", "\n", "# Map entities in tweets to the docs that they appear in\n", "\n", "def entityCountMapper(doc):\n", " if not doc.get('entities'):\n", " import twitter_text\n", "\n", " def getEntities(tweet):\n", "\n", " # Now extract various entities from it and build up a familiar structure\n", "\n", " extractor = twitter_text.Extractor(tweet['text'])\n", "\n", " # Note that the production Twitter API contains a few additional fields in\n", " # the entities hash that would require additional API calls to resolve\n", "\n", " entities = {}\n", " entities['user_mentions'] = []\n", " for um in extractor.extract_mentioned_screen_names_with_indices():\n", " entities['user_mentions'].append(um)\n", "\n", " entities['hashtags'] = []\n", " for ht in extractor.extract_hashtags_with_indices():\n", "\n", " # Massage field name to match production twitter api\n", "\n", " ht['text'] = ht['hashtag']\n", " del ht['hashtag']\n", " entities['hashtags'].append(ht)\n", "\n", " entities['urls'] = []\n", " for url in extractor.extract_urls_with_indices():\n", " entities['urls'].append(url)\n", "\n", " return entities\n", "\n", " doc['entities'] = getEntities(doc)\n", "\n", " if doc['entities'].get('user_mentions'):\n", " for user_mention in doc['entities']['user_mentions']:\n", " yield ('@' + user_mention['screen_name'].lower(), [doc['_id'], doc['id']])\n", " if doc['entities'].get('hashtags'):\n", " for hashtag in doc['entities']['hashtags']:\n", " yield ('#' + hashtag['text'], [doc['_id'], doc['id']])\n", " if doc['entities'].get('urls'):\n", " for url in doc['entities']['urls']:\n", " yield (url['url'], [doc['_id'], doc['id']])\n", "\n", "\n", "def summingReducer(keys, values, rereduce):\n", " if rereduce:\n", " return sum(values)\n", " else:\n", " return len(values)\n", "\n", "\n", "view = ViewDefinition('index', 'entity_count_by_doc', entityCountMapper,\n", " reduce_fun=summingReducer, language='python')\n", "view.sync(db)\n", "\n", "# Print out a nicely formatted table. Sorting by value in the client is cheap and easy\n", "# if you're dealing with hundreds or low thousands of tweets\n", "\n", "entities_freqs = sorted([(row.key, row.value) for row in\n", " db.view('index/entity_count_by_doc', group=True)],\n", " key=lambda x: x[1], reverse=True)\n", "\n", "field_names = ['Entity', 'Count']\n", "pt = PrettyTable(field_names=field_names)\n", "pt.align = 'l'\n", "\n", "for (entity, freq) in entities_freqs:\n", " if freq > FREQ_THRESHOLD:\n", " pt.add_row([entity, freq])\n", "\n", "print pt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 5-5. Finding @mention tweet entities that are also friends (the_tweet__how_many_user_entities_are_friends.py)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import json\n", "import redis\n", "import couchdb\n", "import sys\n", "from twitter__util import getRedisIdByScreenName\n", "from twitter__util import getRedisIdByUserId\n", "\n", "SCREEN_NAME = 'timoreilly' # XXX: IPython Notebook cannot prompt for input\n", "THRESHOLD = 15 # XXX: IPython Notebook cannot prompt for input\n", "\n", "# Connect using default settings for localhost\n", "\n", "r = redis.Redis()\n", "\n", "# Compute screen_names for friends\n", "\n", "friend_ids = r.smembers(getRedisIdByScreenName(SCREEN_NAME, 'friend_ids'))\n", "friend_screen_names = []\n", "for friend_id in friend_ids:\n", " try:\n", " friend_screen_names.append(json.loads(r.get(getRedisIdByUserId(friend_id,\n", " 'info.json')))['screen_name'].lower())\n", " except TypeError, e:\n", " continue # not locally available in Redis - look it up or skip it\n", "\n", "# Pull the list of (entity, frequency) tuples from CouchDB\n", "\n", "server = couchdb.Server('http://localhost:5984')\n", "db = server['tweets-user-timeline-' + SCREEN_NAME]\n", "\n", "entities_freqs = sorted([(row.key, row.value) for row in\n", " db.view('index/entity_count_by_doc', group=True)],\n", " key=lambda x: x[1])\n", "\n", "# Keep only user entities with insufficient frequencies\n", "\n", "user_entities = [(ef[0])[1:] for ef in entities_freqs if ef[0][0] == '@'\n", " and ef[1] >= THRESHOLD]\n", "\n", "# Do a set comparison\n", "\n", "entities_who_are_friends = \\\n", " set(user_entities).intersection(set(friend_screen_names))\n", "\n", "entities_who_are_not_friends = \\\n", " set(user_entities).difference(entities_who_are_friends)\n", "\n", "print 'Number of user entities in tweets: %s' % (len(user_entities), )\n", "print 'Number of user entities in tweets who are friends: %s' \\\n", " % (len(entities_who_are_friends), )\n", "for e in entities_who_are_friends:\n", " print '\\t' + e\n", "print 'Number of user entities in tweets who are not friends: %s' \\\n", " % (len(entities_who_are_not_friends), )\n", "for e in entities_who_are_not_friends:\n", " print '\\t' + e" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 5-7. Using couchdb-lucene to query tweet data (the_tweet__couchdb_lucene.py)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "import httplib\n", "from urllib import quote\n", "import json\n", "import couchdb\n", "\n", "DB = 'tweets-user-timeline-timoreilly' # XXX: IPython Notebook cannot prompt for input\n", "QUERY = 'data' # XXX: IPython Notebook cannot prompt for input\n", "\n", "# The body of a JavaScript-based design document we'll create\n", "\n", "dd = \\\n", " {'fulltext': {'by_text': {'index': '''function(doc) { \n", " var ret=new Document(); \n", " ret.add(doc.text); \n", " return ret \n", " }'''}}}\n", "\n", "try:\n", " server = couchdb.Server('http://localhost:5984')\n", " db = server[DB]\n", "except couchdb.http.ResourceNotFound, e:\n", " print \"\"\"CouchDB database '%s' not found. \n", "Please check that the database exists and try again.\"\"\" % DB\n", " sys.exit(1)\n", "\n", "try:\n", " conn = httplib.HTTPConnection('localhost', 5984)\n", " conn.request('GET', '/%s/_design/lucene' % (DB, ))\n", " response = conn.getresponse()\n", "finally:\n", " conn.close()\n", "\n", "# If the design document did not exist create one that'll be\n", "# identified as \"_design/lucene\". The equivalent of the following \n", "# in a terminal:\n", "# $ curl -X PUT http://localhost:5984/DB/_design/lucene -d @dd.json\n", "if response.status == 404:\n", " try:\n", " conn = httplib.HTTPConnection('localhost', 5984)\n", " conn.request('PUT', '/%s/_design/lucene' % (DB, ), json.dumps(dd))\n", " response = conn.getresponse()\n", " \n", " if response.status != 201:\n", " print 'Unable to create design document: %s %s' % (response.status,\n", " response.reason)\n", " sys.exit(1)\n", " finally:\n", " conn.close()\n", "\n", "# Querying the design document is nearly the same as usual except that you reference\n", "# couchdb-lucene's _fti HTTP handler\n", "# $ curl http://localhost:5984/DB/_fti/_design/lucene/by_subject?q=QUERY\n", "\n", "try:\n", " conn.request('GET', '/%s/_fti/_design/lucene/by_text?q=%s' % (DB,\n", " quote(QUERY)))\n", " response = conn.getresponse()\n", " if response.status == 200:\n", " response_body = json.loads(response.read())\n", " else:\n", " print 'An error occurred fetching the response: %s %s' \\\n", " % (response.status, response.reason)\n", " print 'Make sure your couchdb-lucene server is running.'\n", " sys.exit(1)\n", "finally:\n", " conn.close()\n", "\n", "doc_ids = [row['id'] for row in response_body['rows']]\n", "\n", "# pull the tweets from CouchDB and extract the text for display\n", "\n", "tweets = [db.get(doc_id)['text'] for doc_id in doc_ids]\n", "for tweet in tweets:\n", " print tweet\n", " print" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 5-9. Reconstructing tweet discussion threads (the_tweet__reassemble_discussion_thread.py)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "import httplib\n", "from urllib import quote\n", "import json\n", "import couchdb\n", "from twitter__login import login\n", "from twitter__util import makeTwitterRequest\n", "\n", "DB = 'tweets-user-timeline-timoreilly' # XXX: IPython Notebook cannot prompt for input\n", "USER = 'n2vip' # XXX: IPython Notebook cannot prompt for input\n", "\n", "try:\n", " server = couchdb.Server('http://localhost:5984')\n", " db = server[DB]\n", "except couchdb.http.ResourceNotFound, e:\n", " print >> sys.stderr, \"\"\"CouchDB database '%s' not found. \n", "Please check that the database exists and try again.\"\"\" % DB\n", " sys.exit(1)\n", "\n", "# query by term\n", "\n", "try:\n", " conn = httplib.HTTPConnection('localhost', 5984)\n", " conn.request('GET', '/%s/_fti/_design/lucene/by_text?q=%s' % (DB,\n", " quote(USER)))\n", " response = conn.getresponse()\n", " if response.status == 200:\n", " response_body = json.loads(response.read())\n", " else:\n", " print >> sys.stderr, 'An error occurred fetching the response: %s %s' \\\n", " % (response.status, response.reason)\n", " sys.exit(1)\n", "finally:\n", " conn.close()\n", "\n", "doc_ids = [row['id'] for row in response_body['rows']]\n", "\n", "# pull the tweets from CouchDB\n", "\n", "tweets = [db.get(doc_id) for doc_id in doc_ids]\n", "\n", "# mine out the in_reply_to_status_id_str fields and fetch those tweets as a batch request\n", "\n", "conversation = sorted([(tweet['_id'], int(tweet['in_reply_to_status_id_str']))\n", " for tweet in tweets if tweet['in_reply_to_status_id_str']\n", " is not None], key=lambda x: x[1])\n", "min_conversation_id = min([int(i[1]) for i in conversation if i[1] is not None])\n", "max_conversation_id = max([int(i[1]) for i in conversation if i[1] is not None])\n", "\n", "# Pull tweets from other user using user timeline API to minimize API expenses...\n", "\n", "t = login()\n", "\n", "reply_tweets = []\n", "results = []\n", "page = 1\n", "while True:\n", " results = makeTwitterRequest(t.statuses.user_timeline,\n", " count=200,\n", " # Per , some\n", " # caveats apply with the oldest id you can fetch using \"since_id\"\n", " since_id=min_conversation_id,\n", " max_id=max_conversation_id,\n", " skip_users='true',\n", " screen_name=USER,\n", " page=page)\n", " reply_tweets += results\n", " page += 1\n", " if len(results) == 0: \n", " break\n", "\n", "# During testing, it was observed that some tweets may not resolve or possibly\n", "# even come back with null id values -- possibly a temporary fluke. Workaround.\n", "missing_tweets = []\n", "for (doc_id, in_reply_to_id) in conversation:\n", " try:\n", " print [rt for rt in reply_tweets if rt['id'] == in_reply_to_id][0]['text']\n", " except Exception, e:\n", " print >> sys.stderr, 'Refetching <>' % (in_reply_to_id, )\n", " results = makeTwitterRequest(t.statuses.show, id=in_reply_to_id)\n", " print results['text']\n", "\n", " # These tweets are already on hand\n", " print db.get(doc_id)['text']\n", " print" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 5-11. Counting the number of times Twitterers have been retweeted by someone (the_tweet__count_retweets_of_other_users.py)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Note: As pointed out in the text, there are now additional/better ways to process retweets\n", "# as the Twitter API has evolved. In particular, take a look at the retweet_count field of the\n", "# status object. See https://dev.twitter.com/docs/platform-objects/tweets. However, the technique\n", "# illustrated in this code is still relevant as some Twitter clients may not follow best practices\n", "# and still use the \"RT\" or \"via\" conventions to tweet as opposed to using the Twitter API to issue\n", "# a retweet.\n", "\n", "import sys\n", "import couchdb\n", "from couchdb.design import ViewDefinition\n", "from prettytable import PrettyTable\n", "\n", "DB = 'tweets-user-timeline-timoreilly' # XXX: IPython Notebook cannot prompt for input\n", "FREQ_THRESHOLD = 3 # XXX: IPython Notebook cannot prompt for input\n", "\n", "try:\n", " server = couchdb.Server('http://localhost:5984')\n", " db = server[DB]\n", "except couchdb.http.ResourceNotFound, e:\n", " print \"\"\"CouchDB database '%s' not found. \n", "Please check that the database exists and try again.\"\"\" % DB\n", " sys.exit(1)\n", "\n", "# Map entities in tweets to the docs that they appear in\n", "\n", "def entityCountMapper(doc):\n", " if doc.get('text'):\n", " import re\n", " m = re.search(r\"(RT|via)((?:\\b\\W*@\\w+)+)\", doc['text'])\n", " if m:\n", " entities = m.groups()[1].split()\n", " for entity in entities:\n", " yield (entity.lower(), [doc['_id'], doc['id']])\n", " else:\n", " yield ('@', [doc['_id'], doc['id']])\n", "\n", "\n", "def summingReducer(keys, values, rereduce):\n", " if rereduce:\n", " return sum(values)\n", " else:\n", " return len(values)\n", "\n", "\n", "view = ViewDefinition('index', 'retweet_entity_count_by_doc', entityCountMapper,\n", " reduce_fun=summingReducer, language='python')\n", "view.sync(db)\n", "\n", "# Sorting by value in the client is cheap and easy\n", "# if you're dealing with hundreds or low thousands of tweets\n", "\n", "entities_freqs = sorted([(row.key, row.value) for row in\n", " db.view('index/retweet_entity_count_by_doc',\n", " group=True)], key=lambda x: x[1], reverse=True)\n", "\n", "field_names = ['Entity', 'Count']\n", "pt = PrettyTable(field_names=field_names)\n", "pt.align = 'l'\n", "\n", "for (entity, freq) in entities_freqs:\n", " if freq > FREQ_THRESHOLD and entity != '@':\n", " pt.add_row([entity, freq])\n", "\n", "print pt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 5-12. Finding the tweets that have been retweeted most often (the_tweet_count_retweets_by_others.py)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "import couchdb\n", "from couchdb.design import ViewDefinition\n", "from prettytable import PrettyTable\n", "from twitter__util import pp\n", "\n", "DB = 'tweets-user-timeline-timoreilly' # XXX: IPython Notebook cannot prompt for input\n", "\n", "try:\n", " server = couchdb.Server('http://localhost:5984')\n", " db = server[DB]\n", "except couchdb.http.ResourceNotFound, e:\n", " print \"\"\"CouchDB database '%s' not found. \n", "Please check that the database exists and try again.\"\"\" % DB\n", " sys.exit(1)\n", "\n", "# Map entities in tweets to the docs that they appear in\n", "\n", "def retweetCountMapper(doc):\n", " if doc.get('id') and doc.get('text'):\n", " yield (doc['retweet_count'], 1)\n", "\n", "def summingReducer(keys, values, rereduce):\n", " return sum(values)\n", "\n", "view = ViewDefinition('index', 'retweets_by_id', retweetCountMapper, \n", " reduce_fun=summingReducer, language='python')\n", "\n", "view.sync(db)\n", "\n", "field_names = ['Num Tweets', 'Retweet Count']\n", "pt = PrettyTable(field_names=field_names)\n", "pt.align = 'l'\n", "\n", "retweet_total, num_tweets, num_zero_retweets = 0, 0, 0\n", "for (k,v) in sorted([(row.key, row.value) for row in \n", " db.view('index/retweets_by_id', group=True)\n", " if row.key is not None],\n", " key=lambda x: x[0], reverse=True):\n", " pt.add_row([k, v])\n", "\n", " if k == \"100+\":\n", " retweet_total += 100*v\n", " elif k == 0:\n", " num_zero_retweets += v\n", " else:\n", " retweet_total += k*v\n", "\n", " num_tweets += v\n", "\n", "print pt\n", "\n", "print '\\n%s of %s authored tweets were retweeted at least once' % \\\n", " (pp(num_tweets - num_zero_retweets), pp(num_tweets),)\n", "print '\\t(%s tweet/retweet ratio)\\n' % \\\n", " (1.0*(num_tweets - num_zero_retweets)/num_tweets,)\n", "\n", "print 'Those %s authored tweets generated %s retweets' % (pp(num_tweets), pp(retweet_total),)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 5-13. Counting hashtag entities in tweets (the_tweet__avg_hashtags_per_tweet.py)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "import couchdb\n", "from couchdb.design import ViewDefinition\n", "\n", "DB = 'tweets-user-timeline-timoreilly' # XXX: IPython Notebook cannot prompt for input\n", "\n", "try:\n", " server = couchdb.Server('http://localhost:5984')\n", " db = server[DB]\n", "except couchdb.http.ResourceNotFound, e:\n", " print \"\"\"CouchDB database '%s' not found. \n", "Please check that the database exists and try again.\"\"\" % DB\n", " sys.exit(1)\n", "\n", "# Emit the number of hashtags in a document\n", "\n", "def entityCountMapper(doc):\n", " if not doc.get('entities'):\n", " import twitter_text\n", "\n", " def getEntities(tweet):\n", "\n", " # Now extract various entities from it and build up a familiar structure\n", "\n", " extractor = twitter_text.Extractor(tweet['text'])\n", "\n", " # Note that the production Twitter API contains a few additional fields in\n", " # the entities hash that would require additional API calls to resolve\n", "\n", " entities = {}\n", " entities['user_mentions'] = []\n", " for um in extractor.extract_mentioned_screen_names_with_indices():\n", " entities['user_mentions'].append(um)\n", "\n", " entities['hashtags'] = []\n", " for ht in extractor.extract_hashtags_with_indices():\n", "\n", " # Massage field name to match production twitter api\n", "\n", " ht['text'] = ht['hashtag']\n", " del ht['hashtag']\n", " entities['hashtags'].append(ht)\n", "\n", " entities['urls'] = []\n", " for url in extractor.extract_urls_with_indices():\n", " entities['urls'].append(url)\n", "\n", " return entities\n", "\n", " doc['entities'] = getEntities(doc)\n", "\n", " if doc['entities'].get('hashtags'):\n", " yield (None, len(doc['entities']['hashtags']))\n", "\n", "\n", "def summingReducer(keys, values, rereduce):\n", " return sum(values)\n", "\n", "\n", "view = ViewDefinition('index', 'count_hashtags', entityCountMapper,\n", " reduce_fun=summingReducer, language='python')\n", "view.sync(db)\n", "\n", "num_hashtags = [row for row in db.view('index/count_hashtags')][0].value\n", "\n", "# Now, count the total number of tweets that aren't direct replies\n", "\n", "def entityCountMapper(doc):\n", " if doc.get('text')[0] == '@':\n", " yield (None, 0)\n", " else:\n", " yield (None, 1)\n", "\n", "\n", "view = ViewDefinition('index', 'num_docs', entityCountMapper,\n", " reduce_fun=summingReducer, language='python')\n", "view.sync(db)\n", "\n", "num_docs = [row for row in db.view('index/num_docs')][0].value\n", "\n", "# Finally, compute the average\n", "\n", "print 'Avg number of hashtags per tweet for %s: %s' % \\\n", " (DB.split('-')[-1], 1.0 * num_hashtags / num_docs,)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exmaple 5-14. Harvesting tweets for a given query (the_tweet__search.py)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "import twitter\n", "import couchdb\n", "from couchdb.design import ViewDefinition\n", "from twitter__util import makeTwitterRequest\n", "from twitter__login import login\n", "\n", "Q = 'OpenGov' # XXX: IPython Notebook cannot accept input\n", "MAX_PAGES = 5\n", "\n", "server = couchdb.Server('http://localhost:5984')\n", "DB = 'search-%s' % (Q.lower().replace('#', '').replace('@', ''), )\n", "\n", "t = login()\n", "search_results = t.search.tweets(q=Q, count=100)\n", "tweets = search_results['statuses']\n", "\n", "for _ in range(MAX_PAGES-1): # Get more pages\n", " next_results = search_results['search_metadata']['next_results']\n", "\n", " # Create a dictionary from the query string params\n", " kwargs = dict([ kv.split('=') for kv in next_results[1:].split(\"&\") ]) \n", "\n", " search_results = t.search.tweets(**kwargs)\n", " tweets += search_results['statuses']\n", "\n", " if len(search_results['statuses']) == 0:\n", " break\n", "\n", " print 'Fetched %i tweets so far' % (len(tweets),)\n", "\n", "# Store the data\n", "try:\n", " db = server.create(DB)\n", "except couchdb.http.PreconditionFailed, e:\n", " # Already exists, so append to it (but be mindful of appending duplicates with repeat searches.)\n", " # The refresh_url in the search_metadata or streaming API might also be\n", " # appropriate to use here.\n", " db = server[DB]\n", "\n", "db.update(tweets, all_or_nothing=True)\n", "print 'Done. Stored data to CouchDB - http://localhost:5984/_utils/database.html?%s' % (DB,)" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }