{ "metadata": { "name": "recitation11" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CS1001.py\n", "\n", "## Extended Introduction to Computer Science with Python, Tel-Aviv University, Spring 2013\n", "\n", "# Recitation 11 - 23-27.5.2013\n", "\n", "## Last update: 28.5.2013" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Huffman codes\n", "\n", "Unicode and [ASCII](http://www.asciitable.com/) code map characters to bits, but all characters (letters, numbers, punctuation marks) map to the same number of bits (7 for ASCII).\n", "\n", "The **Huffman code** is a *variable length code*. The code length (in bits) for each datum (data point) is proportional to its frequency in the data. This is effectively a *loss-less compression*. The frequencies are calculated from a **corpus** - a big representative data set. \n", "\n", "To preserve the reversability of the code (one-to-one) we use **prefix free** codes - no code is a prefix of another code, i.e. if 0110 is a code, then there is no other code theat begins with 0110.\n", "\n", "![Huffman code flow](https://raw.github.com/yoavram/CS1001.py/master/huffman_flow.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start with a simple code for creating a histogram of the data:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def char_count(text):\n", " histogram = {}\n", " for ch in text:\n", " histogram[ch] = histogram.get(ch, 0) + 1\n", " return histogram" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "letter_count = char_count(\"live and let live\")\n", "letter_count" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 2, "text": [ "{' ': 3, 'a': 1, 'd': 1, 'e': 3, 'i': 2, 'l': 3, 'n': 1, 't': 1, 'v': 2}" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we need to build the Huffman tree." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def by_value(tup):\n", " return tup[1]\n", "\n", "def build_huffman_tree(letter_count):\n", " \"\"\" recieves dictionary with char:count entries\n", " generates a list of letters representing the binary Huffman encoding tree\"\"\"\n", " queue = list(letter_count.items())\n", " while len(queue) >= 2:\n", " queue.sort(key=by_value) # sort by the count\n", " # combine two least frequent elements\n", " elem1, freq1 = queue.pop(0)\n", " elem2, freq2 = queue.pop(0) \n", " elems = (elem1, elem2)\n", " freqs = freq1 + freq2\n", " queue.append((elems,freqs))\n", " return queue[0][0]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "tree = build_huffman_tree(letter_count)\n", "tree" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 4, "text": [ "(('l', ('i', 'v')), ((('a', 'd'), ('n', 't')), (' ', 'e')))" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Huffman tree example](https://raw.github.com/yoavram/CS1001.py/master/huffman_tree.png)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def build_codebook(huff_tree, prefix=''):\n", " \"\"\" receives a Huffman tree with embedded encoding and a prefix of encodings.\n", " returns a dictionary where characters are keys and associated binary strings are values.\"\"\"\n", " if isinstance(huff_tree, str): # a leaf\n", " return {huff_tree: prefix}\n", " else:\n", " left, right = huff_tree\n", " codebook = {}\n", " codebook.update( build_codebook(left, prefix + '0'))\n", " codebook.update( build_codebook(right, prefix + '1'))\n", " return codebook" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "codebook = build_codebook(tree)\n", "for ch,freq in sorted(letter_count.items(), key=by_value, reverse=True):\n", " print(ch, freq, codebook[ch])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " 3 110\n", "e 3 111\n", "l 3 00\n", "i 2 010\n", "v 2 011\n", "a 1 1000\n", "d 1 1001\n", "n 1 1010\n", "t 1 1011\n" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "def compress(text, codebook):\n", " ''' compress text using codebook dictionary '''\n", " return ''.join(codebook[ch] for ch in text if ord(ch) <= 128)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "ch,ord(ch),'{:08b}'.format(ord(ch))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 13, "text": [ "('t', 116, '01110100')" ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "bits_ascii = ''.join(['{:07b}'.format(ord(ch)) for ch in \"live and let live\" if ord(ch) <= 128])\n", "huff_bits = compress(\"live and let live\", codebook)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "print(len(bits_ascii),bits_ascii)\n", "print(len(huff_bits),huff_bits)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "119 11011001101001111011011001010100000110000111011101100100010000011011001100101111010001000001101100110100111101101100101\n", "52 0001001111111010001010100111000111101111000010011111\n" ] } ], "prompt_number": 22 }, { "cell_type": "code", "collapsed": false, "input": [ "def build_decoding_dict(codebook):\n", " \"\"\"build the \"reverse\" of encoding dictionary\"\"\"\n", " return {v:k for k,v in codebook.items()}" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "decodebook = build_decoding_dict(codebook)\n", "decodebook" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 13, "text": [ "{'00': 'l',\n", " '010': 'i',\n", " '011': 'v',\n", " '1000': 'a',\n", " '1001': 'd',\n", " '1010': 'n',\n", " '1011': 't',\n", " '110': ' ',\n", " '111': 'e'}" ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "def decompress(bits, decodebook):\n", " prefix = ''\n", " result = []\n", " for bit in bits:\n", " prefix += bit\n", " if prefix not in decodebook:\n", " continue\n", " result.append(decodebook[prefix])\n", " prefix = ''\n", " assert prefix == '' # must finish last codeword\n", " return ''.join(result) # converts list of chars to a string" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "decompress(huff_bits, decodebook)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 15, "text": [ "'live and let live'" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examples" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def huffman_code(corpus):\n", " counts = char_count(corpus)\n", " tree = build_huffman_tree(counts)\n", " return build_codebook(tree)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is only one way to create the codebook for this corpus:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "huffman_code('aaabbc')" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 17, "text": [ "{'a': '0', 'b': '11', 'c': '10'}" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are different ways to create a codebook for this corpus - `b` and `c` are interchangeable:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "huffman_code('aaabbcc')" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 18, "text": [ "{'a': '0', 'b': '11', 'c': '10'}" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the frequency of all characters is equal, so the code is actually a fixed length code - most characters " ] }, { "cell_type": "code", "collapsed": false, "input": [ "codebook = huffman_code('qwertuioplkjhgfdsazxcvbnm')\n", "codebook" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 19, "text": [ "{'a': '01110',\n", " 'b': '10000',\n", " 'c': '01111',\n", " 'd': '10010',\n", " 'e': '10001',\n", " 'f': '10100',\n", " 'g': '10011',\n", " 'h': '10110',\n", " 'i': '10101',\n", " 'j': '11000',\n", " 'k': '10111',\n", " 'l': '11010',\n", " 'm': '11001',\n", " 'n': '11100',\n", " 'o': '11011',\n", " 'p': '11110',\n", " 'q': '11101',\n", " 'r': '0000',\n", " 's': '11111',\n", " 't': '0010',\n", " 'u': '0001',\n", " 'v': '0100',\n", " 'w': '0011',\n", " 'x': '0101',\n", " 'z': '0110'}" ] } ], "prompt_number": 19 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Average length of 4.72, most codes are of length 5:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "mean([len(v) for v in codebook.values()])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 20, "text": [ "4.7199999999999998" ] } ], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next corpus every letter appears a different number of times:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "corpus = ''.join([ch*(i+1) for i,ch in enumerate('qwertuioplkjhgfdsazxcvbnm') ])\n", "corpus" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 21, "text": [ "'qwweeerrrrtttttuuuuuuiiiiiiioooooooopppppppppllllllllllkkkkkkkkkkkjjjjjjjjjjjjhhhhhhhhhhhhhggggggggggggggfffffffffffffffddddddddddddddddsssssssssssssssssaaaaaaaaaaaaaaaaaazzzzzzzzzzzzzzzzzzzxxxxxxxxxxxxxxxxxxxxcccccccccccccccccccccvvvvvvvvvvvvvvvvvvvvvvbbbbbbbbbbbbbbbbbbbbbbbnnnnnnnnnnnnnnnnnnnnnnnnmmmmmmmmmmmmmmmmmmmmmmmmm'" ] } ], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "huffman_code(corpus)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 22, "text": [ "{'a': '0100',\n", " 'b': '1011',\n", " 'c': '1000',\n", " 'd': '0010',\n", " 'e': '1101110',\n", " 'f': '0000',\n", " 'g': '11111',\n", " 'h': '11110',\n", " 'i': '00010',\n", " 'j': '11010',\n", " 'k': '10011',\n", " 'l': '10010',\n", " 'm': '1110',\n", " 'n': '1100',\n", " 'o': '00011',\n", " 'p': '01010',\n", " 'q': '11011110',\n", " 'r': '010110',\n", " 's': '0011',\n", " 't': '010111',\n", " 'u': '110110',\n", " 'v': '1010',\n", " 'w': '11011111',\n", " 'x': '0111',\n", " 'z': '0110'}" ] } ], "prompt_number": 22 }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do we do if some characters in the text did not appear in the corpus?\n", "\n", "- We can add them to the corpus - not very reasonable for large text\n", "- We can add all characters to the corpus with a small initial frequency:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(''.join([chr(c) for c in range(128)]))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007\b\t\n", "\u000b", "\f", "\r", "\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c", "\u001d", "\u001e", "\u001f !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u007f\n" ] } ], "prompt_number": 23 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compression ratio\n", "\n", "This is a measure of how much space is saved by using the Huffman code to compress the text:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def compression_ratio(text, corpus):\n", " codebook = huffman_code(corpus)\n", " len_compress = len(compress(text, codebook))\n", " len_ascii = len(text) * 7\n", " return len_compress / len_ascii" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 24 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets see how the language of the corpus affects the compression ratio:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from urllib.request import urlopen\n", "with urlopen(\"http://www.gutenberg.org/cache/epub/42745/pg42745.txt\") as f:\n", " gauss = f.read().decode('utf-8')\n", "print(gauss[:gauss.index('\\n\\r')])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\ufeffThe Project Gutenberg EBook of Gauss, by Friedrich August Theodor Winnecke\r\n" ] } ], "prompt_number": 25 }, { "cell_type": "code", "collapsed": false, "input": [ "with urlopen(\"http://www.gutenberg.org/files/25447/25447-0.txt\") as f:\n", " russell = f.read().decode('utf-8')\n", "print(russell[:russell.index('\\n\\r')])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\ufeffProject Gutenberg's Mysticism and Logic and Other Essays, by Bertrand Russell\r\n" ] } ], "prompt_number": 26 }, { "cell_type": "code", "collapsed": false, "input": [ "with urlopen(\"http://www.gutenberg.org/files/42743/42743-0.txt\") as f:\n", " table_ronde = f.read().decode('utf-8')\n", "print(table_ronde[:table_ronde.index('\\n\\r')])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\ufeffProject Gutenberg's Les Romans de la Table Ronde (1 / 5), by Anonyme\r\n" ] } ], "prompt_number": 27 }, { "cell_type": "code", "collapsed": false, "input": [ "with urlopen(\"http://www.gutenberg.org/cache/epub/97/pg97.txt\") as f:\n", " flatland_book = f.read().decode('utf-8')\n", "print(flatland_book[:flatland_book.index('\\n\\r')])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\ufeffThe Project Gutenberg EBook of Flatland, by Edwin A. Abbott\r\n" ] } ], "prompt_number": 28 }, { "cell_type": "code", "collapsed": false, "input": [ "print(\"self\",compression_ratio(flatland_book, flatland_book))\n", "print(\"en\",compression_ratio(flatland_book, russell))\n", "print(\"de\",compression_ratio(flatland_book, gauss))\n", "print(\"fr\",compression_ratio(flatland_book, table_ronde))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "self 0.6564487342348225\n", "en" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 0.6600341125397221\n", "de" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 0.6740207814752815\n", "fr" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 0.6887171389789191\n" ] } ], "prompt_number": 29 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the Huffman tree for \"Flatland\", with some minor modifications to the text to make the tree more clear:\n", "\n", "![Flatland tree](https://raw.github.com/yoavram/CS1001.py/master/flatland_tree.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Supplementry\n", "\n", "The following code was used to generate the Huffman tree image:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pydot # https://github.com/nlhepler/pydot-py3, http://pyparsing.wikispaces.com/Download+and+Installation" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Couldn't import dot_parser, loading of dot files will not be possible.\n" ] } ], "prompt_number": 30 }, { "cell_type": "code", "collapsed": false, "input": [ "def tree_to_str(tree):\n", " if isinstance(tree, str):\n", " return tree\n", " return ''.join([tree_to_str(item) for item in tree])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 31 }, { "cell_type": "code", "collapsed": false, "input": [ "def add_to_node(tree, node, graph):\n", " left, right = tree\n", " left_node, right_node = pydot.Node(tree_to_str(left)), pydot.Node(tree_to_str(right))\n", " graph.add_node(left_node)\n", " graph.add_node(right_node)\n", " graph.add_edge(pydot.Edge(node, left_node))\n", " graph.add_edge(pydot.Edge(node, right_node))\n", " if not isinstance(left, str):\n", " add_to_node(left, left_node, graph)\n", " if not isinstance(right, str):\n", " add_to_node(right, right_node, graph)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 32 }, { "cell_type": "code", "collapsed": false, "input": [ "def tree_to_graph(tree):\n", " graph = pydot.Dot(graph_type='digraph',strict=True)\n", " root = pydot.Node(tree_to_str(tree))\n", " graph.add_node(root)\n", " add_to_node(tree, root, graph)\n", " return graph" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 33 }, { "cell_type": "code", "collapsed": false, "input": [ "tree_to_str(tree)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 34, "text": [ "'livadnt e'" ] } ], "prompt_number": 34 }, { "cell_type": "code", "collapsed": false, "input": [ "graph = tree_to_graph(tree)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 71 }, { "cell_type": "code", "collapsed": false, "input": [ "with open(\"tmp.dot\",\"w\") as f:\n", " f.write(graph.to_string())\n", "!dot tmp.dot -Tpng -ohuffman_tree.png \n", "!huffman_tree.png" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 72 }, { "cell_type": "code", "collapsed": false, "input": [ "for c in flatland_book:\n", " if ord(c) >= 128:\n", " print(ord(c))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "65279\n" ] } ], "prompt_number": 35 }, { "cell_type": "code", "collapsed": false, "input": [ "flatland_book_clean = flatland_book.replace(chr(65279), '').replace('\"','').replace(\"'\",'').replace(\",\",'').replace(\";\",'').replace(\":\",'').replace(\".\",'')\n", "flatland_tree = build_huffman_tree(char_count(flatland_book_clean.lower()))\n", "tree_to_str(flatland_tree)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 38, "text": [ "'rucesnk-vwpgiadloty\\n\\rf854/1z)x($@#%[]729*qj?!063bmh '" ] } ], "prompt_number": 38 }, { "cell_type": "code", "collapsed": false, "input": [ "graph = tree_to_graph(flatland_tree)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 39 }, { "cell_type": "code", "collapsed": false, "input": [ "with open(\"tmp.dot\",\"w\") as f:\n", " f.write(graph.to_string())\n", "!dot tmp.dot -Tpng -oflatland_tree.png \n", "!flatland_tree.png" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 44 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fin\n", "This notebook is part of the [Extended introduction to computer science](http://tau-cs1001-py.wikidot.com/) course at Tel-Aviv University.\n", "\n", "The notebook was written using Python 3.2 and IPython 0.13.1.\n", "\n", "The code is available at .\n", "\n", "The notebook can be viewed online at .\n", "\n", "This work is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/)." ] } ], "metadata": {} } ] }