{ "metadata": { "name": "recitation10" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CS1001.py\n", "\n", "## Extended Introduction to Computer Science with Python, Tel-Aviv University, Spring 2013\n", "\n", "# Recitation 10 - 16-20.5.2013\n", "\n", "## Last update: 16.5.2013" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Rabin-Karp algorithm for string matching\n", "\n", "We have a text of length *n* and a pattern of length *m* and we want to find all occurences of the pattern in the text.\n", "\n", "Naively we would compare all *n-m* substrings in the text to the pattern, but this will have a complexity of $O((n-m)m)=O(nm)$.\n", "\n", "We want to improve this using a *hashtable* (as was done for finding repeating substrings in [recitation 8](http://nbviewer.ipython.org/urls/raw.github.com/yoavram/CS1001.py/master/recitation8.ipynb). Then we can calculate the hash of the pattern and of all substrings in the text and compare the hashes. Calculating each hash, though, takes $O(m) \\;$ and therefore we still have a complexity of $O(mn)$.\n", "\n", "## Rolling hash\n", "\n", "The trick of the **Rabin-Karp** algorithm is using a **rolling hash**:\n", "\n", "if we already calculated the rolling hash of the substring of length *m* that started at position *i*, calculating the rolling hash of the substring of length *m* that started at position *i+1* will take $O(1)$. \n", "\n", "The complexity now will be much better - we need to calculate the hash on the pattern and on the substring that starts at position 0, each with complexity $O(m)$. In addition, we need to calculate the hash on the rest of the *n-m* substrings, each taking $O(1)$ because of the rolling hash. Altogether this is $O(n+m) \\;$, much smaller than $O(nm)$. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "def arithmetize(text, basis=2**16, r=2**32-3):\n", " \"\"\" convert substring to number using basis powers\n", " employs Horner method modulo r \"\"\"\n", " partial_sum = 0\n", " for ch in text:\n", " partial_sum = (partial_sum * basis + ord(ch)) % r\n", " return partial_sum" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "ord(\"b\"), ord(\"e\"), ord(\"n\")" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 4, "text": [ "(98, 101, 110)" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "arithmetize(\"b\"), arithmetize(\"be\"), arithmetize(\"ben\")" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 5, "text": [ "(98, 6422629, 6619540)" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "def arithmetize_text_naive(text, m, basis=2**16, r=2**32-3):\n", " \"\"\" computes arithmization of all m long substrings\n", " of text, using basis powers \"\"\"\n", " return [arithmetize(text[i:i + m], basis, r) for i in range(len(text) - m + 1)]\n", "\n", "def arithmetize_text(text, m, basis=2**16, r=2**32-3):\n", " \"\"\" efficiently computes arithmetization of all m long\n", " substrings of text, using basis powers \"\"\"\n", " b_power = basis ** (m - 1)\n", " lst = [None]*(len(text) - m + 1) \n", " # lst[0] equals first word arithmetization\n", " lst[0] = arithmetize(text[0:m], basis, r)\n", " for i in range(1, len(lst)):\n", " rolling_hash = (lst[i - 1] - ord(text[i - 1])* b_power) * basis + ord(text[i + m - 1])\n", " rolling_hash %= r\n", " lst[i] = rolling_hash # append new_number to existing lst\n", " return lst" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "song = '''\u05d0\u05e0\u05d9 \u05db\u05dc \u05db\u05da \u05e2\u05e6\u05d5\u05d1 \u05dc\u05d9 \u05d5\u05e9\u05de\u05e9 \u05e2\u05dc \u05d4\u05e2\u05d9\u05e8\n", "\u05d5\u05d3\u05d9\u05d6\u05e0\u05d2\u05d5\u05e3 \u05e0\u05e8\u05d0\u05d4 \u05dc\u05d9 \u05db\u05de\u05d5 \u05e8\u05db\u05d1\u05ea \u05dc\u05d9\u05dc\u05d4 \u05dc\u05e7\u05d4\u05d9\u05e8 \n", "\u05d1\u05d9\u05df \u05db\u05dc \u05d4\u05e6\u05dc\u05d9\u05dc\u05d9\u05dd \u05de\u05d7\u05e4\u05e9 \u05e1\u05d9\u05de\u05df \n", "\u05d9\u05d5\u05e9\u05d1 \u05d1\u05e6\u05d3 \u05d4\u05d3\u05e8\u05da \n", "\u05d9\u05d5\u05e9\u05d1 \u05dc\u05d9\u05d3 \u05d4\u05d6\u05de\u05df \n", "\n", "\u05d1\u05d7\u05d3\u05e8 \u05d4\u05e1\u05d2\u05d5\u05e8 \u05d1\u05d3\u05d9\u05d3\u05d5\u05ea \u05de\u05e7\u05d9\u05e8 \u05dc\u05e7\u05d9\u05e8 \n", "\u05d5\u05d0\u05dd \u05d0\u05e6\u05d0 \u05d4\u05d7\u05d5\u05e6\u05d4 \u05d4\u05de\u05e6\u05d1 \u05e8\u05e7 \u05d9\u05d7\u05de\u05d9\u05e8 \n", "\u05d0\u05d6 \u05d0\u05e0\u05d9 \u05e9\u05d5\u05de\u05e2 \u05e6\u05dc\u05d9\u05dc \u05de\u05d0\u05d5\u05d3 \u05de\u05d5\u05db\u05e8 \n", "\u05de\u05d2\u05d9\u05e2 \u05de\u05dc\u05de\u05d8\u05d4 \u05e8\u05d7\u05d5\u05e7 \u05de\u05d4\u05db\u05d9\u05db\u05e8'''" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "print(arithmetize_text(song, 1) == arithmetize_text_naive(song, 1))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "True\n" ] } ], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "print(arithmetize_text(song, 2)[:10])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[97519072, 98567641, 98107424, 2098651, 98239964, 98304032, 2098651, 98239962, 98172960, 2098658]\n" ] } ], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "print(arithmetize_text(song, 2, 2**16 - 1)[:10])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[97517584, 98566137, 98105927, 2098619, 98238465, 98302532, 2098619, 98238463, 98171462, 2098626]\n" ] } ], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "%timeit -n 5 arithmetize_text_naive(song, 5)\n", "%timeit -n 5 arithmetize_text(song, 5)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "5 loops, best of 3: 1.19 ms per loop\n", "5 loops, best of 3: 417 us per loop\n" ] } ], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "import urllib\n", "with urllib.request.urlopen(\"http://www.gutenberg.org/cache/epub/2701/pg2701.txt\") as f:\n", " book = f.read().decode('utf8')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "print(book[:book.index('\\n\\r\\n\\r')])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\ufeffThe Project Gutenberg EBook of Moby Dick; or The Whale, by Herman Melville\r\n", "\r\n", "This eBook is for the use of anyone anywhere at no cost and with\r\n", "almost no restrictions whatsoever. You may copy it, give it away or\r\n", "re-use it under the terms of the Project Gutenberg License included\r\n", "with this eBook or online at www.gutenberg.org\r\n" ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "len(book)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 14, "text": [ "1257258" ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "%timeit -n 3 arithmetize_text_naive(book[:len(book)//10], 3)\n", "%timeit -n 3 arithmetize_text_naive(book, 3)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "3 loops, best of 3: 310 ms per loop\n", "3 loops, best of 3: 3.14 s per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "%timeit -n 3 arithmetize_text(book[:len(book)//10], 3)\n", "%timeit -n 3 arithmetize_text(book, 3)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "3 loops, best of 3: 222 ms per loop\n", "3 loops, best of 3: 2.23 s per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "%timeit -n 3 arithmetize_text_naive(book[:len(book)//10], 3)\n", "%timeit -n 3 arithmetize_text_naive(book[:len(book)//10], 5)\n", "%timeit -n 3 arithmetize_text_naive(book[:len(book)//10], 10)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "3 loops, best of 3: 310 ms per loop\n", "3 loops, best of 3: 480 ms per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "3 loops, best of 3: 903 ms per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "%timeit -n 3 arithmetize_text(book[:len(book)//10], 3)\n", "%timeit -n 3 arithmetize_text(book[:len(book)//10], 5)\n", "%timeit -n 3 arithmetize_text(book[:len(book)//10], 10)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "3 loops, best of 3: 220 ms per loop\n", "3 loops, best of 3: 229 ms per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "3 loops, best of 3: 246 ms per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "def find_matches(pattern, text, basis=2**16, r=2**32-3):\n", " \"\"\" find all occurances of pattern in text\n", " using efficient arithmetization of text \"\"\"\n", " assert len(pattern) <= len(text)\n", " pattern_hash = arithmetize(pattern, basis, r)\n", " text_hashes = arithmetize_text(text, len(pattern), basis, r)\n", " return [i for i,hs in enumerate(text_hashes) if hs == pattern_hash]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "matches = find_matches(\"moby\", book.lower())\n", "print(matches[0], len(matches), book[matches[0]:matches[0] + 4])\n", "matches = find_matches(\"aye, aye, sir\", book.lower())\n", "print(matches[0], len(matches), book[matches[0]:matches[0] + 13])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "32 124 Moby\n", "383330" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 5 Aye, aye, sir\n" ] } ], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Safe R-K\n", "\n", "Now we must remember that a substring can have the same hash as the pattern even if it is not equal to it.\n", "\n", "Here is a safe version in which we check that the substring does indeed equals the pattern:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def find_matches_safe(pattern, text, basis=2**16, r=2**32-3):\n", " \"\"\" find all occurances of pattern in text\n", " using efficient arithmetization of text \"\"\"\n", " assert len(pattern) <= len(text)\n", " pattern_hash = arithmetize(pattern, basis, r)\n", " text_hashes = arithmetize_text(text, len(pattern), basis, r)\n", " matches = [i for i,hs in enumerate(text_hashes) if hs == pattern_hash and text[i:i+len(pattern)] == pattern]\n", " return matches" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "def foo(v,t):\n", " print(t)\n", " return v\n", "print(foo(True,\"foo\") and foo(True,\"bar\"))\n", "print(foo(False,\"foo\") and foo(True,\"bar\"))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "foo\n", "bar\n", "True\n", "foo\n", "False\n" ] } ], "prompt_number": 22 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Worst case\n", "\n", "The worst case is when everything matches:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "text = \"a\" * 10**5\n", "for pattern in [\"a\"*10**2, \"a\"*10**3, \"a\"*10**4]:\n", " for f in [find_matches, find_matches_safe]:\n", " %timeit -n 1 f(pattern, text)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 loops, best of 3: 479 ms per loop\n", "1 loops, best of 3: 567 ms per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "1 loops, best of 3: 3.09 s per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "1 loops, best of 3: 3.33 s per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "1 loops, best of 3: 25.9 s per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "1 loops, best of 3: 28.9 s per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 23 }, { "cell_type": "markdown", "metadata": {}, "source": [ "But for normal texts there is hardly any difference:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for pattern in [\"moby\", \"aye, aye, sir\", \"moby dick was his name\"]:\n", " for f in [find_matches, find_matches_safe]:\n", " %timeit -n 1 f(pattern, book.lower())" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 loops, best of 3: 2.7 s per loop\n", "1 loops, best of 3: 2.69 s per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "1 loops, best of 3: 2.9 s per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "1 loops, best of 3: 2.88 s per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "1 loops, best of 3: 3.08 s per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "1 loops, best of 3: 3.1 s per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 24 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Choice of `r`" ] }, { "cell_type": "code", "collapsed": false, "input": [ "find_matches(\"da\",\"abracadabra\", basis=2**16, r=2**16)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 26, "text": [ "[2, 4, 6, 9]" ] } ], "prompt_number": 26 }, { "cell_type": "code", "collapsed": false, "input": [ "find_matches_safe(\"da\",\"abracadabra\", basis=2**16, r=2**16)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 27, "text": [ "[6]" ] } ], "prompt_number": 27 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because $b=r$, $mod r$ just takes the rightmost character.\n", "When searching for \"da\" we actually seach for \"?a\".\n", "\n", "`r` better not be a power of the base used for arithmetization of the string. \n", "We prefer that `r` is not a power of small primes.\n", "Large primes are a good choice." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Alternative rolling hash\n", "We will try a different hash function in which the hash is the sum of the characters of the string." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def arithmetize_sum(text, r=2**32-3):\n", " partial_sum = 0\n", " for ch in text:\n", " partial_sum = (partial_sum + ord(ch)) % r\n", " return partial_sum" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 28 }, { "cell_type": "code", "collapsed": false, "input": [ "def arithmetize_text_sum(text, m, r=2**32-3):\n", " lst = []\n", " lst.append(arithmetize_sum(text[:m], r))\n", " for i in range(1, len(text) - m + 1):\n", " rolling_hash = (lst[i-1] - ord(text[i-1]) + ord(text[i + m - 1]))\n", " rolling_hash %= r\n", " lst.append(rolling_hash)\n", " return lst " ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 29 }, { "cell_type": "code", "collapsed": false, "input": [ "%timeit -n 3 arithmetize_text(song, 3)\n", "%timeit -n 3 arithmetize_text_sum(song, 3)\n", "%timeit -n 3 arithmetize_text(song, 30)\n", "%timeit -n 3 arithmetize_text_sum(song, 30)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "3 loops, best of 3: 411 us per loop\n", "3 loops, best of 3: 309 us per loop\n", "3 loops, best of 3: 540 us per loop\n", "3 loops, best of 3: 290 us per loop\n" ] } ], "prompt_number": 30 }, { "cell_type": "code", "collapsed": false, "input": [ "%timeit -n 3 arithmetize_text(book, 3)\n", "%timeit -n 3 arithmetize_text_sum(book, 3)\n", "%timeit -n 3 arithmetize_text(book, 10)\n", "%timeit -n 3 arithmetize_text_sum(book, 10)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "3 loops, best of 3: 2.36 s per loop\n", "3 loops, best of 3: 1.76 s per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "3 loops, best of 3: 2.66 s per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "3 loops, best of 3: 1.73 s per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 31 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This rolling hash function is efficient - seems to be at least as good as the previous one if not even faster, probably due to less arithmetics being involved." ] }, { "cell_type": "code", "collapsed": false, "input": [ "arithmetize_sum(\"I am Lord Voldemort\".lower())" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 32, "text": [ "1828" ] } ], "prompt_number": 32 }, { "cell_type": "code", "collapsed": false, "input": [ "arithmetize_sum(\"Tom Marvolo Riddle \".lower())" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 33, "text": [ "1828" ] } ], "prompt_number": 33 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem is that **anagrams collide** - permutations of the same string get the same hash value." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fin\n", "This notebook is part of the [Extended introduction to computer science](http://tau-cs1001-py.wikidot.com/) course at Tel-Aviv University.\n", "\n", "The notebook was written using Python 3.2 and IPython 0.13.1.\n", "\n", "The code is available at .\n", "\n", "The notebook can be viewed online at .\n", "\n", "This work is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/)." ] } ], "metadata": {} } ] }