{ "metadata": { "name": "recitation12" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CS1001.py\n", "\n", "## Extended Introduction to Computer Science with Python, Tel-Aviv University, Spring 2013\n", "\n", "# Recitation 12 - 30.5-3.6.2013\n", "\n", "## Last update: 30.6.2013" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# LZ compression\n", "The LZ77 compression, published by Lempel and Ziv in 1977, is widely used today is formats such as GIF, PNG, DEFLATE and COMPRESS.\n", "\n", "The basic idea is that repetitions are common in meaningful information and that space can be saved by replacing occurrences of information that was already seen by a reference to the first occurrence. \n", "The references are coded as `(offset, length)` pairs (tuples).\n", "\n", "For example, the text 'abracadabra' can be compressed as 'abracad(7,4)'. The `7` refers to the substring 7 position behind, and the `4` defines the repetition length.\n", "\n", "Counter-intuitively, the repetition length can be larger than the offset, creating an illusion of a loop or self-reference:\n", "\n", "![Eshcer drawing](http://upload.wikimedia.org/wikipedia/en/thumb/b/ba/DrawingHands.jpg/250px-DrawingHands.jpg)\n", "\n", "However, as the algorithm is implemented using a state machine (DFA) there is no real problem here.\n", "\n", "We begin by writing the repetition finding funciton. \n", "\n", "The function recieves a text and the current position in the text, and looks at the text to find a previous occurence of the substring that starts at the current position.\n", "\n", "The maximum window length and the maximum repetition length, given with default values, are used to fine tune the memory-running time tradeoff of the algorithm." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def find_max_repetition(text, index, window_length=2**12-1, max_rep_length=2**5-1):\n", " \"\"\" finds a maximum repetition in the text.\n", " Returns offset and length of the longest repetition.\n", " Returned offset is smallest one (closest to index) among all max matches\"\"\"\n", " assert isinstance(text, str)\n", " repitition_length = 0 # length of max repitition\n", " repitition_offset = 0 # distance backwards to begining of max repitition\n", " for offset in range(1, min(index + 1, window_length)): \n", " current_length = 0\n", " while current_length < min(max_rep_length, len(text) - index) and text[index - offset + current_length] == text[index + current_length]:\n", " current_length += + 1\n", " if repitition_length < current_length: \n", " repitition_length = current_length\n", " repitition_offset = offset\n", " return repitition_offset, repitition_length\n", "\n", "def test_find_max_repetition(text, index):\n", " offset, length = find_max_repetition(text, index)\n", " print(text[index:index + length])\n", " print(text[index-offset:index - offset + length])\n", " return text[index:index + length] == text[index - offset:index - offset + length]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "find_max_repetition(\"abracadabra\", 7)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 2, "text": [ "(7, 4)" ] } ], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "find_max_repetition(\"xyzxyzwxyzw\", 3)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 3, "text": [ "(3, 3)" ] } ], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "find_max_repetition(\"xyzxyzwxyzw\", 7)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 4, "text": [ "(4, 4)" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "find_max_repetition('aaaaa', 1)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 5, "text": [ "(1, 4)" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next step is to write the compression algorithm. The flow of the compression is described by this graph:\n", "\n", "![LZ compression flow](https://raw.github.com/yoavram/CS1001.py/master/lz_flow.PNG)\n", "\n", "We will start with the compression function. It will recieve a text and will iterate over the characters of the text, looking for a repetition at each character." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def lz77_compress(text, window_length=2**12-1, max_rep_length=2**5-1):\n", " \"\"\"LZ77 compression of ASCII text. \n", " Produces a list comprising of either ascii character\n", " or by a pair [m,k] where m is an offset and\n", " k is a match (both are non negative integers)\"\"\"\n", " result = []\n", " index = 0\n", " while index < len(text):\n", " if ord(text[index]) >= 128:\n", " # non-ascii\n", " index += 1\n", " continue\n", " offset, length = find_max_repetition(text, index, window_length, max_rep_length)\n", " if length < 3:\n", " # no repetition or a single character repetition\n", " result.append(text[index]) \n", " index += 1\n", " else:\n", " result.append((offset, length)) # two or more chars in repetition\n", " index += length\n", " return result # produces a list composed of chars and pairs" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "print(lz77_compress(\"You know what I hate, you know what I hate, you know what I hate? Repetitions!\"))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['Y', 'o', 'u', ' ', 'k', 'n', 'o', 'w', ' ', 'w', 'h', 'a', 't', ' ', 'I', ' ', (6, 3), 'e', ',', ' ', 'y', (22, 31), (22, 10), '?', ' ', 'R', 'e', 'p', 'e', 't', 'i', 't', 'i', 'o', 'n', 's', '!']\n" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "welcome = '''Welcome my son, welcome to the machine.\n", "Where have you been? It's alright we know where you've been.\n", "You've been in the pipeline, filling in time, \n", "Provided with toys and Scouting for Boys.\n", "You bought a guitar to punish your ma,\n", "And you didn't like school, and you know you're nobody's fool,\n", "So welcome to the machine.\n", "Welcome my son, welcome to the machine.\n", "What did you dream? It's alright we told you what to dream.\n", "You dreamed of a big star, he played a mean guitar,\n", "He always ate in the Steak Bar. He loved to drive in his Jaguar.\n", "So welcome to the machine.'''" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "welcome_intermediate = lz77_compress(welcome)\n", "print(welcome_intermediate)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['W', 'e', 'l', 'c', 'o', 'm', 'e', ' ', 'm', 'y', ' ', 's', 'o', 'n', ',', ' ', 'w', (16, 7), 't', 'o', ' ', 't', 'h', (23, 3), 'a', 'c', 'h', 'i', 'n', 'e', '.', '\\n', 'W', 'h', 'e', 'r', 'e', ' ', 'h', 'a', 'v', 'e', ' ', 'y', 'o', 'u', ' ', 'b', 'e', 'e', 'n', '?', ' ', 'I', 't', \"'\", 's', ' ', 'a', 'l', 'r', 'i', 'g', 'h', 't', (58, 3), ' ', 'k', 'n', 'o', 'w', ' ', 'w', (42, 5), (37, 3), \"'\", (44, 3), (40, 4), '.', '\\n', 'Y', (13, 10), ' ', 'i', 'n', (89, 5), 'p', 'i', 'p', 'e', 'l', (90, 3), ',', ' ', 'f', 'i', 'l', (9, 3), 'g', (25, 5), 'i', 'm', (17, 3), '\\n', 'P', 'r', 'o', 'v', 'i', 'd', 'e', 'd', ' ', 'w', 'i', 't', 'h', (138, 3), 'y', (101, 3), 'n', 'd', ' ', 'S', 'c', 'o', 'u', 't', (42, 4), 'f', 'o', 'r', ' ', 'B', (22, 3), (89, 5), ' ', 'b', 'o', 'u', (127, 4), 'a', ' ', 'g', 'u', 'i', 't', 'a', 'r', (186, 4), 'p', 'u', 'n', 'i', 's', 'h', (132, 4), 'r', (194, 3), ',', '\\n', 'A', (62, 3), (182, 4), 'd', 'i', 'd', 'n', \"'\", 't', ' ', 'l', 'i', 'k', 'e', ' ', 's', 'c', 'h', 'o', 'o', 'l', ',', (90, 5), (28, 4), (188, 5), (182, 4), (189, 3), 'n', 'o', 'b', 'o', 'd', 'y', (220, 3), 'f', (35, 4), '\\n', 'S', 'o', (279, 26), (319, 31), (319, 10), 'a', 't', (127, 4), (135, 6), 'r', 'e', 'a', 'm', (318, 18), 't', 'o', 'l', (32, 6), 'w', (45, 4), (66, 3), (40, 5), (229, 6), (11, 5), (274, 3), 'o', 'f', (233, 3), 'b', 'i', 'g', ' ', 's', (235, 3), ',', ' ', (329, 4), 'l', 'a', 'y', (25, 3), 'a', ' ', 'm', 'e', 'a', 'n', (260, 7), ',', '\\n', 'H', 'e', (90, 3), 'w', 'a', (314, 4), 't', 'e', (372, 8), 'S', 't', 'e', 'a', 'k', ' ', 'B', 'a', 'r', '.', ' ', (32, 3), 'l', 'o', 'v', (56, 3), (103, 5), 'i', (413, 3), (36, 3), 'h', 'i', 's', ' ', 'J', 'a', 'g', 'u', (33, 3), (244, 27)]\n" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we need to code this list of characters and tuples to a bit string to allow for saving it to a file or transmiting it on the web.\n", "\n", "First we need an ASCII coding function:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def char2bits(ch):\n", " ''' convert a character to int and then format it to binary \n", " using exactly 7 bits '''\n", " return '{:07b}'.format(ord(ch))\n", "print(char2bits('1'))\n", "print(char2bits('z'))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0110001\n", "1111010\n" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the difference between this and:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(bin(ord('1'))[2:])\n", "print(char2bits('1'))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "110001\n", "0110001\n" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we write the function to convert the intermediate format to the binary format.\n", "\n", "Each word is either a literal (a character) of an offset-length pair (a tuple). \n", "\n", "- Literals are converted using 7-bit ASCII (the function `char2bits` above).\n", "- Offset-length tuples are writen as binary numbers using the same number of bits always. This number is given by the maximum window length and the maximum repetition length.\n", "- Before each word we add a bit to denote is the word is a literal (`0`) or an offset-length pair (`1`)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import math\n", "def intermediate2bits(compressed, window_length=2**12-1, max_rep_length=2**5-1):\n", " \"\"\" converts intermediate format compressed list\n", " to a string of bits\"\"\"\n", " offset_width = math.ceil(math.log(window_length, 2))\n", " length_width = math.ceil(math.log(max_rep_length,2))\n", "\n", " result = []\n", " for item in compressed:\n", " if isinstance(item, str):\n", " result.append(\"0\")\n", " result.append('{:07b}'.format(ord(item)))\n", " elif isinstance(item, (tuple,list)):\n", " result.append(\"1\")\n", " offset,length = item\n", " result.append('{num:0{width}b}'.format\n", " (num=offset, width=offset_width))\n", " result.append('{num:0{width}b}'.\n", " format(num=length, width=length_width)) \n", " return \"\".join(result)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "welcome_bits = intermediate2bits(welcome_intermediate)\n", "print(welcome_bits)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "010101110110010101101100011000110110111101101101011001010010000001101101011110010010000001110011011011110110111000101100001000000111011110000000100000011101110100011011110010000001110100011010001000000010111000110110000101100011011010000110100101101110011001010010111000001010010101110110100001100101011100100110010100100000011010000110000101110110011001010010000001111001011011110111010100100000011000100110010101100101011011100011111100100000010010010111010000100111011100110010000001100001011011000111001001101001011001110110100001110100100000011101000011001000000110101101101110011011110111011100100000011101111000000101010001011000000100101000110010011110000001011000001110000001010000010000101110000010100101100110000000011010101000100000011010010110111010000010110010010101110000011010010111000001100101011011001000001011010000110010110000100000011001100110100101101100100000000100100011011001111000000011001001010110100101101101100000001000100011000010100101000001110010011011110111011001101001011001000110010101100100001000000111011101101001011101000110100010000100010100001101111001100000110010100011011011100110010000100000010100110110001101101111011101010111010010000001010100010001100110011011110111001000100000010000101000000010110000111000001011001001010010000001100010011011110111010110000011111110010001100001001000000110011101110101011010010111010001100001011100101000010111010001000111000001110101011011100110100101110011011010001000010000100001000111001010000110000100001100101100000010100100000110000001111100001110000101101100010001100100011010010110010001101110001001110111010000100000011011000110100101101011011001010010000001110011011000110110100001101111011011110110110000101100100000101101000101100000001110000100100001011110000101100001011011000100100001011110100011011011100110111101100010011011110110010001111001100001101110000011011001101000000100011001000000101001010011011011111000100010111110101000100111111111111000100111111010100110000101110100100000111111100100100001000011100110011100100110010101100001011011011000100111110100100111010001101111011011001000000100000001100111011110000001011010010010000010000100001110000001010000010110000111001010011010000000010110010110001000100100001101101111011001101000011101001000110110001001101001011001110010000001110011100001110101100011001011000010000010001010010010010001101100011000010111100110000000110010001101100001001000000110110101100101011000010110111010001000001000011100101100000010100100100001100101100000101101000011011101110110000110001001110100010001110100011001011000101110100010000101001101110100011001010110000101101011001000000100001001100001011100100010111000100000100000010000000011011011000110111101110110100000011100000011100000110011100101011010011000110011101000111000000100100000110110100001101001011100110010000001001010011000010110011101110101100000010000100011100001111010011011\n" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now write this bitstring to binary file:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def bitstring_to_file(bitstring, filename):\n", " bytestring = ( bitstring[i:i + 8] for i in range(0, len(bitstring), 8) )\n", " intstring = (int(x, 2) for x in bytestring)\n", " chrstring = (chr(x) for x in intstring)\n", " output = ''.join(chrstring)\n", " with open(filename,'wb') as f:\n", " f.write(bytes(output, 'utf-8'))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "bitstring_to_file(welcome_bits, 'welcome.lz')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "!cat welcome.lz" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Welcome my son, w\u05b2\u20ac\u05b2\ufffd\u05b3\ufffd\u001b\u05b3\u02c6\u001d", "\u001a .6\u00166\u05b2\u2020\u05b2\u2013\u05b3\u00a6R\u05b3\u00a0\u05b2\u00a5v\u05b2\u2020W&R\u0006\u05b2\u2020\u0017fR\u0007\u05b2\u2013\u05b3\u00b7R\u0006&VV\u05b3\u00a3\u05b3\u00b2\u0004\u05b2\u2014Bw2\u0006\u0016\u05b3\u2021&\u05b2\u2013v\u05b2\u2021H\u001d", "\f", "\u05b2\ufffd\u05b2\u00ad\u05b2\u00b9\u05b2\u00bd\u05b3\ufffd\u05b2\ufffd\u05b3\ufffd\u0005E\u05b2\ufffd(\u05b3\u2030\u05b3\u00a0X8\u0014\u0010\u05b2\u00b8)f\u0001\u05b2\u00d7 in\u05b2\u201a\u05b3\u2030\\\u001a\\\u0019[ \u05b2\u00b42\u05b3\u201a\u0006f\u05b2\u2013\u05b3\u02c6\u0004\u05b2\ufffd\u05b2\ufffd\u0003%im\u05b2\u20ac\u05b2\u02c6\u05b3\u201a\u05b2\u201d\u001c", "\u05b2\u203a\u05b3\ufffd\u05b2\ufffdY\u0019Y\b\u001d", "\u05b3\ufffd]\u001a!\u00147\u05b2\u02dc2\u05b2\ufffd\u05b2\u00b9\u05b2\ufffd\u05b2\ufffdM\u05b2\ufffd\u05b2\u00bd\u05b3\u2022\u05b3\u2019\u0005Dfor B\u05b2\u20ac\u05b2\u00b0\u05b3\u00a0\u05b2\u00b2R\u0006&\u05b3\u00b7X?\u05b2\u2018\u05b2\u201e\u05b2\ufffd\u05b2\ufffd\u05b3\u2022\u05b2\u00a5\u05b3\u2018\u05b2\u2026\u05b3\ufffd\u0017Dpunish\u05b2\u201e!\u001c", "\u05b2\u00a1\u05b2\u201e2\u05b3\u20ac\u05b2\u20aa\u0018\u001f\u000e\u0016\u05b3\u201edidn't like school,\u05b2\u201a\u05b3\u2018`8H^\u0016\u0016\u05b3\u201e\u05b2\u2026\u05b3\u00a8\u05b3\u203a\u05b2\u203a\u05b3\u02dc\u05b2\u203a\u05b3\u2122\u001e", "a\u05b2\u00b86h\u0011\u05b2\ufffd)M\u05b2\u00be\"\u05b3\u00f7\u05b2\u2030\u05b3\u00bf\u05b3\u00a2~\u05b2\u00a6\u0017H?\u05b2\u2019\u0010\u05b3\u00a6ream\u05b2\u2030\u05b3\u00b4\u05b2\ufffd\u001b\u05b3\u203a @gx\u0016\u05b2\u2019\bC\u05b2\ufffdAa\u05b3\ufffdh\u0005\u05b2\u2013\"Cof\u05b2\u2021H\u05b3\u02dc\u05b2\ufffdY\u05b3\u02c6\u001c", "\u05b3\u00a1\u05b3\u20132\u05b3\u201a\b\u05b2\u20aa\u05b2\u2018\u05b2\u00b1\u05b2\u2026\u05b3\u00a6\u0003#a mean\u05b2\u02c6!\u05b3\u2039\u0002\u05b2\u2019\u0019`\u05b2\u00b47v\u0018\u05b2\ufffd\u0011\u05b3\u2018\u05b2\u2013.\u05b2\u02c6Steak Bar. \u05b2\ufffd\u0000\u05b3\u203a\u001b\u05b3\ufffd\u05b2\u00a0p83\u05b2\u2022\u05b2\u00a63\u05b2\u00a3\u05b2\ufffd \u05b3\ufffd\u001a\\\u05b3\u02c6\u0012\u05b2\u02dcY\u05b3\ufffd`B8z\u001b\n" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we want to do the reverse - decompress the bitstring to the text:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def bits2intermediate(bitstring, window_length=2**12-1, max_rep_length=2**5-1):\n", " \"\"\" converts a compressed string of bits\n", " to intermediate compressed format \"\"\"\n", " offset_width = math.ceil(math.log(window_length, 2))\n", " length_width = math.ceil(math.log(max_rep_length, 2))\n", "\n", " result=[]\n", " i = 0\n", " while i < len(bitstring):\n", " if bitstring[i] == \"0\": # single ascii char\n", " i += 1\n", " ch = chr(int(bitstring[i:i + 7], 2))\n", " result.append(ch)\n", " i += 7\n", " elif bitstring[i] == \"1\": # repeat of length >= 2\n", " i += 1\n", " offset = int(bitstring[i:i + offset_width], 2)\n", " i += offset_width\n", " length = int(bitstring[i:i + length_width], 2)\n", " i += length_width\n", " result.append((offset,length))\n", " return result" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "bits2intermediate(welcome_bits)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 19, "text": [ "['W',\n", " 'e',\n", " 'l',\n", " 'c',\n", " 'o',\n", " 'm',\n", " 'e',\n", " ' ',\n", " 'm',\n", " 'y',\n", " ' ',\n", " 's',\n", " 'o',\n", " 'n',\n", " ',',\n", " ' ',\n", " 'w',\n", " (16, 7),\n", " 't',\n", " 'o',\n", " ' ',\n", " 't',\n", " 'h',\n", " (23, 3),\n", " 'a',\n", " 'c',\n", " 'h',\n", " 'i',\n", " 'n',\n", " 'e',\n", " '.',\n", " '\\n',\n", " 'W',\n", " 'h',\n", " 'e',\n", " 'r',\n", " 'e',\n", " ' ',\n", " 'h',\n", " 'a',\n", " 'v',\n", " 'e',\n", " ' ',\n", " 'y',\n", " 'o',\n", " 'u',\n", " ' ',\n", " 'b',\n", " 'e',\n", " 'e',\n", " 'n',\n", " '?',\n", " ' ',\n", " 'I',\n", " 't',\n", " \"'\",\n", " 's',\n", " ' ',\n", " 'a',\n", " 'l',\n", " 'r',\n", " 'i',\n", " 'g',\n", " 'h',\n", " 't',\n", " (58, 3),\n", " ' ',\n", " 'k',\n", " 'n',\n", " 'o',\n", " 'w',\n", " ' ',\n", " 'w',\n", " (42, 5),\n", " (37, 3),\n", " \"'\",\n", " (44, 3),\n", " (40, 4),\n", " '.',\n", " '\\n',\n", " 'Y',\n", " (13, 10),\n", " ' ',\n", " 'i',\n", " 'n',\n", " (89, 5),\n", " 'p',\n", " 'i',\n", " 'p',\n", " 'e',\n", " 'l',\n", " (90, 3),\n", " ',',\n", " ' ',\n", " 'f',\n", " 'i',\n", " 'l',\n", " (9, 3),\n", " 'g',\n", " (25, 5),\n", " 'i',\n", " 'm',\n", " (17, 3),\n", " '\\n',\n", " 'P',\n", " 'r',\n", " 'o',\n", " 'v',\n", " 'i',\n", " 'd',\n", " 'e',\n", " 'd',\n", " ' ',\n", " 'w',\n", " 'i',\n", " 't',\n", " 'h',\n", " (138, 3),\n", " 'y',\n", " (101, 3),\n", " 'n',\n", " 'd',\n", " ' ',\n", " 'S',\n", " 'c',\n", " 'o',\n", " 'u',\n", " 't',\n", " (42, 4),\n", " 'f',\n", " 'o',\n", " 'r',\n", " ' ',\n", " 'B',\n", " (22, 3),\n", " (89, 5),\n", " ' ',\n", " 'b',\n", " 'o',\n", " 'u',\n", " (127, 4),\n", " 'a',\n", " ' ',\n", " 'g',\n", " 'u',\n", " 'i',\n", " 't',\n", " 'a',\n", " 'r',\n", " (186, 4),\n", " 'p',\n", " 'u',\n", " 'n',\n", " 'i',\n", " 's',\n", " 'h',\n", " (132, 4),\n", " 'r',\n", " (194, 3),\n", " ',',\n", " '\\n',\n", " 'A',\n", " (62, 3),\n", " (182, 4),\n", " 'd',\n", " 'i',\n", " 'd',\n", " 'n',\n", " \"'\",\n", " 't',\n", " ' ',\n", " 'l',\n", " 'i',\n", " 'k',\n", " 'e',\n", " ' ',\n", " 's',\n", " 'c',\n", " 'h',\n", " 'o',\n", " 'o',\n", " 'l',\n", " ',',\n", " (90, 5),\n", " (28, 4),\n", " (188, 5),\n", " (182, 4),\n", " (189, 3),\n", " 'n',\n", " 'o',\n", " 'b',\n", " 'o',\n", " 'd',\n", " 'y',\n", " (220, 3),\n", " 'f',\n", " (35, 4),\n", " '\\n',\n", " 'S',\n", " 'o',\n", " (279, 26),\n", " (319, 31),\n", " (319, 10),\n", " 'a',\n", " 't',\n", " (127, 4),\n", " (135, 6),\n", " 'r',\n", " 'e',\n", " 'a',\n", " 'm',\n", " (318, 18),\n", " 't',\n", " 'o',\n", " 'l',\n", " (32, 6),\n", " 'w',\n", " (45, 4),\n", " (66, 3),\n", " (40, 5),\n", " (229, 6),\n", " (11, 5),\n", " (274, 3),\n", " 'o',\n", " 'f',\n", " (233, 3),\n", " 'b',\n", " 'i',\n", " 'g',\n", " ' ',\n", " 's',\n", " (235, 3),\n", " ',',\n", " ' ',\n", " (329, 4),\n", " 'l',\n", " 'a',\n", " 'y',\n", " (25, 3),\n", " 'a',\n", " ' ',\n", " 'm',\n", " 'e',\n", " 'a',\n", " 'n',\n", " (260, 7),\n", " ',',\n", " '\\n',\n", " 'H',\n", " 'e',\n", " (90, 3),\n", " 'w',\n", " 'a',\n", " (314, 4),\n", " 't',\n", " 'e',\n", " (372, 8),\n", " 'S',\n", " 't',\n", " 'e',\n", " 'a',\n", " 'k',\n", " ' ',\n", " 'B',\n", " 'a',\n", " 'r',\n", " '.',\n", " ' ',\n", " (32, 3),\n", " 'l',\n", " 'o',\n", " 'v',\n", " (56, 3),\n", " (103, 5),\n", " 'i',\n", " (413, 3),\n", " (36, 3),\n", " 'h',\n", " 'i',\n", " 's',\n", " ' ',\n", " 'J',\n", " 'a',\n", " 'g',\n", " 'u',\n", " (33, 3),\n", " (244, 27)]" ] } ], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "def lz77_decompress(compressed, window_length=2**12-1, max_rep_length=2**5-1):\n", " \"\"\"LZ77 decompression from intermediate format to ASCII text\"\"\"\n", " result = ''\n", " i = 0\n", " while i < len(compressed):\n", " if isinstance(compressed[i], str): # char, as opposed to a pair\n", " result += compressed[i]\n", " i += 1\n", " else:\n", " offset,rep_length = compressed[i]\n", " i += 1\n", " for j in range(rep_length):\n", " result += result[-offset] # fixed offset\"to the left\" as result itself grows\n", " return result" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "print(lz77_decompress(welcome_intermediate))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Welcome my son, welcome to the machine.\n", "Where have you been? It's alright we know where you've been.\n", "You've been in the pipeline, filling in time, \n", "Provided with toys and Scouting for Boys.\n", "You bought a guitar to punish your ma,\n", "And you didn't like school, and you know you're nobody's fool,\n", "So welcome to the machine.\n", "Welcome my son, welcome to the machine.\n", "What did you dream? It's alright we told you what to dream.\n", "You dreamed of a big star, he played a mean guitar,\n", "He always ate in the Steak Bar. He loved to drive in his Jaguar.\n", "So welcome to the machine.\n" ] } ], "prompt_number": 21 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Full compression in one command and a function to calculate the compression ratio:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "compress = lambda text: intermediate2bits(lz77_compress(text))\n", "lz_ratio = lambda text: len(compress(text)) / (len(text)*7) " ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 22 }, { "cell_type": "markdown", "metadata": {}, "source": [ "And to check the compression ratio:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "lz_ratio(welcome)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 23, "text": [ "0.733604473817997" ] } ], "prompt_number": 23 }, { "cell_type": "code", "collapsed": false, "input": [ "from urllib.request import urlopen\n", "with urlopen(\"http://www.gutenberg.org/cache/epub/28233/pg28233.txt\") as f:\n", " newton = f.read().decode('utf-8')\n", "print(newton[:newton.index('\\n\\r')])\n", "newton = ''.join(ch for ch in newton if ord(ch) < 128)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\ufeffThe Project Gutenberg EBook of Philosophiae Naturalis Principia Mathematica, by \r\n", "Isaac Newton\r\n" ] } ], "prompt_number": 24 }, { "cell_type": "code", "collapsed": false, "input": [ "len(newton), lz_ratio(newton[:len(newton)//100])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 25, "text": [ "(835497, 0.6752282909812237)" ] } ], "prompt_number": 25 }, { "cell_type": "code", "collapsed": false, "input": [ "%timeit compress(welcome)\n", "%timeit -n 1 compress(newton[:len(newton)//1000])\n", "%timeit -n 1 compress(newton[:len(newton)//100])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "10 loops, best of 3: 92.6 ms per loop\n", "1 loops, best of 3: 282 ms per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "1 loops, best of 3: 12.3 s per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 26 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some more interesting examples:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "lz77_compress(\"a\"*40)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 27, "text": [ "['a', (1, 31), (1, 8)]" ] } ], "prompt_number": 27 }, { "cell_type": "markdown", "metadata": {}, "source": [ "So you need 8 bits for the initial `a`, then 1+12+5=18 bits for the first offset-length pair of length 32 and another 18 bits for the second offset-length pair for a total of 44 bits:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "len(compress(\"a\"*40))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 28, "text": [ "44" ] } ], "prompt_number": 28 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can be reduced by increasing the max repetition length:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "lz77_compress(\"a\"*40, max_rep_length=2**6-1)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 30, "text": [ "['a', (1, 39)]" ] } ], "prompt_number": 30 }, { "cell_type": "markdown", "metadata": {}, "source": [ "For 100 repetitions of `a`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "lz77_compress(\"a\"*100)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 31, "text": [ "['a', (1, 31), (1, 31), (1, 31), (1, 6)]" ] } ], "prompt_number": 31 }, { "cell_type": "code", "collapsed": false, "input": [ "len(compress(\"a\"*100))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 32, "text": [ "80" ] } ], "prompt_number": 32 }, { "cell_type": "markdown", "metadata": {}, "source": [ "So in general if we have `k` repetitions of `a`, then we need $(k-1)//31 + 1$ offset-length pairs plus the initial `a`, for a total of $8+18((k-1)//31+1)$:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "k = 1000\n", "print(len(compress('a'*k)))\n", "print(18*((k-1)//31+1)+8)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "602\n", "602\n" ] } ], "prompt_number": 33 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Try at home** - compare LZ and Huffman compression ratios on you favorite text." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Homework\n", "\n", "# `k_select`\n", "\n", "In homework 4, you were asked to implement recursively a function k_select, which finds a rank-k element in an orderable set. To reinforce your understanding of recursion, let's solve it together on board. We'll also analyze the number of comparisons which k_select performs in the best case and the worst case." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def k_select(lst, k):\n", "\tn = len(lst)\n", "\tassert 0 <= k < n\n", "\tpivot = lst[0]\n", "\tsmall = [x for x in lst if x < pivot]\n", "\tlarge = [x for x in lst if x > pivot]\n", "\tif k < len(small):\n", "\t\treturn k_select(small, k)\n", "\telif k >= n - len(large):\n", "\t\treturn k_select(large, k - (n - len(large)))\n", "\telse:\n", "\t\treturn pivot" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 34 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Best case\n", "After one partition the pivot (left most element) is the k-th element and you are done - $O(n)$.\n", "\n", "#### Worst case\n", " \u05db\u05dc \u05e4\u05e2\u05dd \u05de\u05e7\u05d8\u05d9\u05e0\u05d9\u05dd \u05d0\u05ea \u05d4\u05d2\u05d5\u05d3\u05dc \u05d1- 1 (\u05d6\u05d4 \u05e7\u05d5\u05e8\u05d4 \u05db\u05d0\u05e9\u05e8 \u05d4\u05e8\u05e9\u05d9\u05de\u05d4 \u05de\u05de\u05d5\u05d9\u05e0\u05ea / \u05de\u05de\u05d5\u05d9\u05e0\u05ea \u05d4\u05e4\u05d5\u05da / \u05d0\u05d5 \u05d1\u05d0\u05d5\u05e4\u05df \u05db\u05dc\u05dc\u05d9 \u05db\u05d0\u05e9\u05e8 \u05d4\u05d0\u05d9\u05d1\u05e8 \u05d4\u05e9\u05de\u05d0\u05dc\u05d9 \u05d1\u05db\u05dc \u05e4\u05e2\u05dd \u05d4\u05d5\u05d0 \u05d4\u05de\u05d9\u05e0/\u05de\u05e7\u05e1. \u05d1\u05e0\u05d5\u05e1\u05e3, \u05d0\u05e0\u05d7\u05e0\u05d5 \u05de\u05d2\u05d9\u05e2\u05d9\u05dd \u05dc\u05d0\u05d9\u05d1\u05e8 \u05e9\u05de\u05d7\u05e4\u05e9\u05d9\u05dd \u05e8\u05e7 \u05d1\u05e1\u05d5\u05e3. \u05e1\u05d4\"\u05db O(n^2).\n", "At each call you reduce the list length by one. In addition, you reach the number you are looking for only at the end - $O(n^2)$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recursive reality show\n", "In homework 5, question 2 reality show contestents were given a list of missions and a time to do the missions. \n", "\n", "The missions list is a list of tuples of missions points and time to complete the mission:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "missions = [ (8, 7) , (4, 6) , (9, 8) , (13, 10) ] " ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 35 }, { "cell_type": "markdown", "metadata": {}, "source": [ "You had to write a function that outputs the maximum number for a given missions list and a time interval.\n", "\n", "The recursion formula is, for a given list of points $p$ and list ot times $t$ of length $n$ and for a time interval $T$:\n", "$$\n", "A(n,0) = 0 \\\\\\\\\n", "A(0,T) = 0 \\\\\\\\\n", "A(n,T) = max(A(n-1,T-t_n) + p_n \\; , \\; A(n-1,T)) \\; if \\; t_n \\le T \\\\\\\\\n", "A(n,T) = A(n-1,T) \\; if \\; t_n > T\n", "$$\n", "\n", "The recursive step is made by either doing the n-th mission or not doing it.\n", "\n", "The code for this question is:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def choose_mem(missions, time, mem, i):\n", " key = (time, i)\n", " if key not in mem:\n", " if time == 0 or i < 0:\n", " mem[key] = 0\n", " else:\n", " p, t = missions[i]\n", " if t > time:\n", " mem[key] = choose_mem(missions, time, mem, i - 1)\n", " else: \n", " mem[key] = max(choose_mem(missions, time, mem, i - 1), p + choose_mem(missions, time - t, mem, i - 1))\n", " return mem[key] \n", "\n", "def choose(missions, time):\n", " return choose_mem(missions, time, {}, len(missions) - 1)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 36 }, { "cell_type": "code", "collapsed": false, "input": [ "choose([ (7, 8) , (6, 4) , (8, 9) , (10, 13) ], 15)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 37, "text": [ "14" ] } ], "prompt_number": 37 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fin\n", "This notebook is part of the [Extended introduction to computer science](http://tau-cs1001-py.wikidot.com/) course at Tel-Aviv University.\n", "\n", "The notebook was written using Python 3.2 and IPython 0.13.1.\n", "\n", "The code is available at .\n", "\n", "The notebook can be viewed online at .\n", "\n", "This work is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/)." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }