{ "metadata": { "name": "03-hyper-log-log-counter" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "See http://ivory.idyll.org/blog/2013-pycon-awesome-big-data-algorithms-talk.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# HyperLogLog counter\n", "\n", "The HyperLogLog counter uses bit patterns in hash functions to estimate \n", "\n", "Read:\n", " \n", "http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/\n", "\n", "The code was mostly taken from\n", "\n", "https://github.com/svpcom/hyperloglog\n", "\n", "which is Vasily Evseenko's rewrite of Nelson Gon\u00e7alves's Log-Log-Sketch repo (https://github.com/goncalvesnelson/). I'm much indebted to them!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# derived constants etc. These are just utility functions that you can ignore.\n", "\n", "def _get_alpha(b):\n", " if not (4 <= b <= 16):\n", " raise ValueError(\"b=%d should be in range [4 : 16]\" % b)\n", "\n", " if b == 4:\n", " return 0.673\n", "\n", " if b == 5:\n", " return 0.697\n", "\n", " if b == 6:\n", " return 0.709\n", "\n", " return 0.7213 / (1.0 + 1.079 / (1 << b))\n", "\n", "def estimate_cardinality(alpha, bits, bins):\n", " # harmonic mean\n", " E = alpha * float(len(bins) ** 2) / sum(math.pow(2.0, -x) for x in bins)\n", " \n", " if E <= 2.5 * bits: # Small range correction \n", " V = bins.count(0) #count number or registers equal to 0\n", " return bits * math.log(bins/ float(V)) if V > 0 else E\n", " elif E <= float(1L << 160) / 30.0:\n", " return E\n", " else:\n", " return -(1L << 160) * math.log(1.0 - E / (1L << 160))\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 148 }, { "cell_type": "code", "collapsed": false, "input": [ "# choose the precision by choosing how many estimators to track.\n", "bits = 8\n", "alpha = _get_alpha(bits)\n", "num_bins = 1 << bits\n", "bit_bins = [ 1L << i for i in range(160 - bits + 1) ]\n", "\n", "print 'num bins:', num_bins" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "num bins: 256\n" ] } ], "prompt_number": 149 }, { "cell_type": "code", "collapsed": false, "input": [ "# 'rho' function to calculate the bit pattern to watch (string of 0s)\n", "import bisect\n", "\n", "# here, 'rho' is the number of 0s to the left of the first 'accuracy' bits.\n", "def rho(w):\n", " r = len(bit_bins) - bisect.bisect_right(bit_bins, w)\n", " return r" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 150 }, { "cell_type": "code", "collapsed": false, "input": [ "print 1, len(bit_bins) - rho(1)\n", "print 2, len(bit_bins) - rho(2)\n", "print 3, len(bit_bins) - rho(3)\n", "print 4, len(bit_bins) - rho(4)\n", "print 8, len(bit_bins) - rho(8)\n", "print 2**152, len(bit_bins) - rho(2**152)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 1\n", "2 2\n", "3 2\n", "4 3\n", "8 4\n", "5708990770823839524233143877797980545530986496 153\n" ] } ], "prompt_number": 151 }, { "cell_type": "code", "collapsed": false, "input": [ "print 'initializing', num_bins, 'estimators'\n", "estimators = [0]*num_bins" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "initializing 256 estimators\n" ] } ], "prompt_number": 152 }, { "cell_type": "code", "collapsed": false, "input": [ "from hashlib import sha1\n", "\n", "# to add a number into the counter:\n", "def add(num):\n", " # take the hash of 'num'\n", " num = str(num)\n", " hash = long(sha1(num).hexdigest(), 16)\n", " \n", " # here, 'bin' is determined by the first 'bits' bits of hash\n", " bin = hash & ((1 << bits) - 1)\n", " \n", " # now count the number of 0s in the remaining bits\n", " remaining_bits = hash >> bits\n", " count = rho(remaining_bits)\n", " \n", " # take max of currently stored estimation & this one\n", " estimators[bin] = max(estimators[bin], count)\n", " \n", "import random\n", "for i in range(100000):\n", " num = random.randint(0, int(1e9))\n", " add(num)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 153 }, { "cell_type": "code", "collapsed": false, "input": [ "print 'estimate cardinality as', estimate_cardinality(alpha, bits, estimators)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "estimate cardinality as 98837.2382097\n" ] } ], "prompt_number": 154 }, { "cell_type": "code", "collapsed": false, "input": [ "import random\n", "for i in range(100000):\n", " num = random.randint(0, int(1e9))\n", " add(num)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 155 }, { "cell_type": "code", "collapsed": false, "input": [ "print 'estimate cardinality as', estimate_cardinality(alpha, bits, estimators)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "estimate cardinality as 187751.540573\n" ] } ], "prompt_number": 156 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }