{ "metadata": { "name": "04-bloom-filters" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "See http://ivory.idyll.org/blog/2013-pycon-awesome-big-data-algorithms-talk.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## IPython Blocks\n", "\n", "Below, I will use IPython Blocks (https://github.com/jiffyclub/ipythonblocks) to demonstrate Bloom filters.\n", "\n", "IPython Blocks is a nifty little visualization tool built by Matt Davis @jiffyclub for use in teaching Python and programming basics. Here's a quick little demo --" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from ipythonblocks import BlockGrid" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "grid = BlockGrid(10,10, fill=(0,0,128))\n", "\n", "x = grid.shape[0]\n", "y = grid.shape[1]\n", "\n", "for block in grid:\n", " r = block.row * 255 / float(x)\n", " g = block.col * 255 / float(y)\n", " block.red = r\n", " block.green = g\n", " \n", "grid" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "output_type": "pyout", "prompt_number": 2, "text": [ "" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bloom filters\n", "\n", "I'll implement Bloom filters using multiple individual hash tables. (The canonical way to implement them is to use multiple hash functions with one big hash table, but I find that a bit harder to understand.)\n", "\n", "First, let's build a simple hash table object that doesn't track collisions. Note, to get 'num' from a string you'd just use a hash function." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import math\n", "\n", "class Hash(object):\n", " def __init__(self, size):\n", " self.size = size\n", " self.bits = [0]*size\n", " \n", " def add(self, num):\n", " num = num % self.size\n", " self.bits[num] = 1\n", " \n", " def get(self, num):\n", " num = num % self.size\n", " return self.bits[num]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, build a full Bloom filter that creates multiple hash tables; here, 'add' inserts the element into each hash table, and 'get' checks to see if the element is in all the hash tables.\n", "\n", "I've also included three utility methods: fp(), empirical_fp(), and show(). The first calculates the predicted false positive rate based on hash table occupancy; the second calculates the actual false positive rate for a range of numbers; and the third shows the hash table occupancy using IPython Blocks. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "class BloomFilter(object):\n", " def __init__(self, *sizes):\n", " self.hashes = [ Hash(size) for size in sizes ]\n", " \n", " def add(self, num):\n", " for h in self.hashes:\n", " h.add(num)\n", " \n", " def get(self, num):\n", " for h in self.hashes:\n", " if not h.get(num):\n", " return 0\n", " return 1\n", " \n", " def fp(self):\n", " total = 0.\n", " for h in self.hashes:\n", " occupancy = sum(h.bits)\n", " f = occupancy / float(h.size)\n", " total += math.log(f, 2)\n", " \n", " return 2**total\n", " \n", " def empirical_fp(self, actual, max):\n", " found_true = 0\n", " found_false = 0\n", " for i in range(max):\n", " if self.get(i):\n", " if i in actual:\n", " found_true += 1\n", " else:\n", " found_false += 1\n", " \n", " return found_false / float(max)\n", " \n", " \n", " def show(self):\n", " rows = len(self.hashes)\n", " cols = max([ h.size for h in self.hashes ])\n", " grid = BlockGrid(cols, rows, fill=(0,0,0))\n", " for i, h in enumerate(self.hashes):\n", " for pos in range(h.size, cols):\n", " grid[i, pos] = (255, 255, 255)\n", " for j, bit in enumerate(h.bits):\n", " if bit:\n", " grid[i, j] = (255, 0, 0)\n", " return grid.show()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create a Bloom filter with three hash tables, size 5, 7, and 11, and then show the occupied cells in the three hash tables after adding '253' and '8132' (no special significance to the numbers)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "x = BloomFilter(5, 7, 11)\n", "x.show()\n", "x.add(253)\n", "x.show()\n", "print x.get(253)\n", "\n", "###\n", "\n", "x.add(8132)\n", "x.show()\n", "print x.get(8132)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "output_type": "display_data", "text": [ "" ] }, { "html": [ "
" ], "output_type": "display_data", "text": [ "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "1\n" ] }, { "html": [ "
" ], "output_type": "display_data", "text": [ "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "1\n" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's check out what happens when you start filling up the hash tables with lots and lots of entries." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import random, time\n", "\n", "x = BloomFilter(5, 7, 11)\n", "actual = set()\n", "for _ in range(10):\n", " num = random.randint(0, 255)\n", " actual.add(num)\n", " x.add(num)\n", " x.show()\n", " theory, empirical = x.fp(), x.empirical_fp(actual, 255)\n", " print 'inserting', num\n", " print 'predicted FP:', theory, 'diff from actual:', abs(theory-empirical)\n", " time.sleep(1)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "output_type": "display_data", "text": [ "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "inserting 2\n", "predicted FP: 0.0025974025974 diff from actual: 0.0025974025974\n" ] }, { "html": [ "
" ], "output_type": "display_data", "text": [ "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "inserting 49\n", "predicted FP: 0.0207792207792 diff from actual: 0.00509294626942\n" ] }, { "html": [ "
" ], "output_type": "display_data", "text": [ "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "inserting 141\n", "predicted FP: 0.0701298701299 diff from actual: 0.0034632034632\n" ] }, { "html": [ "
" ], "output_type": "display_data", "text": [ "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "inserting 151\n", "predicted FP: 0.124675324675 diff from actual: 0.0187929717341\n" ] }, { "html": [ "
" ], "output_type": "display_data", "text": [ "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "inserting 26\n", "predicted FP: 0.194805194805 diff from actual: 0.0183346065699\n" ] }, { "html": [ "
" ], "output_type": "display_data", "text": [ "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "inserting 54\n", "predicted FP: 0.233766233766 diff from actual: 0.0220015278839\n" ] }, { "html": [ "
" ], "output_type": "display_data", "text": [ "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "inserting 0\n", "predicted FP: 0.363636363636 diff from actual: 0.0185383244207\n" ] }, { "html": [ "
" ], "output_type": "display_data", "text": [ "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "inserting 96\n", "predicted FP: 0.363636363636 diff from actual: 0.0224598930481\n" ] }, { "html": [ "
" ], "output_type": "display_data", "text": [ "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "inserting 248\n", "predicted FP: 0.623376623377 diff from actual: 0.0390628978864\n" ] }, { "html": [ "
" ], "output_type": "display_data", "text": [ "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "inserting 54\n", "predicted FP: 0.623376623377 diff from actual: 0.0390628978864\n" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## One-sided error\n", "\n", "Bloom filters have what's called one-sided error: they may report elements that were never inserted as being members of the set -- false positives -- but they will never miss reporting elements that WERE inserted as being members -- false negatives. This is a straightforward consequence of the \"no collision tracking\" aspect of the hash tables: collisions lead to false reporting, but you never miss something you've already inserted.\n", "\n", "If you know the hash table sizes, it's easy to predict the collisions. One simple way is to test something that's 5*7*11 times added to something you've already inserted, i.e. to force a collision based on the modulus of the hash table sizes." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# here, we add '3' and test for '3 + 5*7*11', or 388.\n", "x = BloomFilter(5, 7, 11)\n", "x.add(3)\n", "x.show()\n", "\n", "print x.get(3), x.get(3 + (5*7*11))\n", "print \"oh noes! 388 is falsely said to be present in the data set!\"" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "output_type": "display_data", "text": [ "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "1 1\n", "oh noes! 388 is falsely said to be present in the data set!\n" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 } ], "metadata": {} } ] }