{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# `numpy`\n",
    "## A multidimensional array framework and more"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.,  0.,  0.,  0.,  0.],\n",
       "       [ 0.,  0.,  0.,  0.,  0.],\n",
       "       [ 0.,  0.,  0.,  0.,  0.],\n",
       "       [ 0.,  0.,  0.,  0.,  0.],\n",
       "       [ 0.,  0.,  0.,  0.,  0.],\n",
       "       [ 0.,  0.,  0.,  0.,  0.],\n",
       "       [ 0.,  0.,  0.,  0.,  0.],\n",
       "       [ 0.,  0.,  0.,  0.,  0.],\n",
       "       [ 0.,  0.,  0.,  0.,  0.],\n",
       "       [ 0.,  0.,  0.,  0.,  0.]])"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A = np.zeros((10,5)) # create an array filled with zeros\n",
    "A"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dtype('float64')"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A.dtype # default data type is f8, aka double-precision floating point (8 bytes per number)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "B = np.zeros((10,5), dtype='i4') # 4 byte (32 bit) integer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dtype('int32')"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "B.dtype"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Some common data types:\n",
    "- `f` floating-point number: `f4` single precision, `f8` double precision\n",
    "- `i` (signed) integer number: `i4` 32-bit integer, `i8` 64-bit integer\n",
    "- `u` unsigned integer\n",
    "- `c` complex floating-point: `c16` double-precision for real and imaginary part\n",
    "- `S` string\n",
    "- `O` arbitrary Python objects (inefficent but flexible)"
   ]
  },
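  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small illustration of the type codes above (a sketch; the names `codes`, `sizes` and `x` are made up for this example), the `itemsize` attribute of a dtype reports how many bytes one element occupies:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Map a few numeric type codes from the list above to their size in bytes.\n",
    "codes = ['f4', 'f8', 'i4', 'i8', 'u1', 'c16']\n",
    "sizes = {code: np.dtype(code).itemsize for code in codes}\n",
    "print(sizes)\n",
    "\n",
    "# For numeric codes the trailing number is the element size in bytes.\n",
    "x = np.zeros(3, dtype='u1')\n",
    "print(x.dtype, x.nbytes)\n",
    "```\n",
    "\n",
    "Choosing a smaller type reduces memory use at the cost of range or precision."
   ]
  },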
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Arrays can be initialized using anything iterable. Most commonly this is a (nested) list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1, 2, 3],\n",
       "       [4, 5, 6],\n",
       "       [7, 8, 9]])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "C = np.array([[1,2,3],[4,5,6],[7,8,9]])\n",
    "C"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "C[0,2] # indices always start at 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Array Operations\n",
    "Many operators are overloaded so that operations are applied element-wise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 2,  4,  6],\n",
       "       [ 8, 10, 12],\n",
       "       [14, 16, 18]])"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "2 * C"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 2,  6, 12],\n",
       "       [20, 30, 42],\n",
       "       [56, 72, 90]])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "C + C**2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 2,  6, 10],\n",
       "       [ 6, 10, 14],\n",
       "       [10, 14, 18]])"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "C + C.T # .T transposes the array"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Shapes\n",
    "Arrays can easily be flattened (converted to 1D) or reshaped, provided that the total size does not change."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3, 3)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "C.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 2, 3, 4, 5, 6, 7, 8, 9])"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "C.flatten()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,\n",
       "       17, 18, 19])"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "D = np.arange(20) # like range but returns an array instead of an iterator\n",
    "D"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0,  1,  2,  3,  4],\n",
       "       [ 5,  6,  7,  8,  9],\n",
       "       [10, 11, 12, 13, 14],\n",
       "       [15, 16, 17, 18, 19]])"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "D.reshape((4,5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Broadcasting\n",
    "For operations between two arrays to succeed the corresponding dimensions have to be equal or one of them has to be one. In the latter case the size-one dimension is broadcast over the entries of the other array in that dimension."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "E = np.array([10, 20, 30])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1, 2, 3],\n",
       "       [4, 5, 6],\n",
       "       [7, 8, 9]])"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "C"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([10, 20, 30])"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "E"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[11, 22, 33],\n",
       "       [14, 25, 36],\n",
       "       [17, 28, 39]])"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "C + E"
   ]
  },
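  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the broadcasting rule, re-creating `C` and `E` so that it runs on its own; `W` is a made-up array with an incompatible shape. NumPy compares shapes starting from the last dimension and raises a `ValueError` when they cannot be broadcast:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "C = np.arange(1, 10).reshape((3, 3))   # shape (3, 3)\n",
    "E = np.array([10, 20, 30])             # shape (3,), treated like (1, 3)\n",
    "print((C + E).shape)\n",
    "\n",
    "# Shapes (3, 3) and (4,) are incompatible: the last dimensions 3 and 4 differ.\n",
    "W = np.array([1, 2, 3, 4])\n",
    "try:\n",
    "    C + W\n",
    "    compatible = True\n",
    "except ValueError:\n",
    "    compatible = False\n",
    "print(compatible)\n",
    "```\n",
    "\n",
    "Checking `.shape` of both operands is usually the quickest way to debug such an error."
   ]
  },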
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "To facilitate efficient broadcasting, empty dimensions can be inserted using `None`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "F = E[None,:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[10, 20, 30]])"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "F"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[11, 22, 33],\n",
       "       [14, 25, 36],\n",
       "       [17, 28, 39]])"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "C + F"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Broadcasting can create large arrays from 1D arrays.\n",
    "\n",
    "For example, compute radius on a 2D grid from 1D coordinate arrays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X = np.arange(10)[:,None]\n",
    "Y = np.arange(10)[None,:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((10, 1), (1, 10))"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.shape, Y.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(10, 10)"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(X**2 + Y**2).shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[  0.        ,   1.        ,   2.        ,   3.        ,\n",
       "          4.        ,   5.        ,   6.        ,   7.        ,\n",
       "          8.        ,   9.        ],\n",
       "       [  1.        ,   1.41421356,   2.23606798,   3.16227766,\n",
       "          4.12310563,   5.09901951,   6.08276253,   7.07106781,\n",
       "          8.06225775,   9.05538514],\n",
       "       [  2.        ,   2.23606798,   2.82842712,   3.60555128,\n",
       "          4.47213595,   5.38516481,   6.32455532,   7.28010989,\n",
       "          8.24621125,   9.21954446],\n",
       "       [  3.        ,   3.16227766,   3.60555128,   4.24264069,\n",
       "          5.        ,   5.83095189,   6.70820393,   7.61577311,\n",
       "          8.54400375,   9.48683298],\n",
       "       [  4.        ,   4.12310563,   4.47213595,   5.        ,\n",
       "          5.65685425,   6.40312424,   7.21110255,   8.06225775,\n",
       "          8.94427191,   9.8488578 ],\n",
       "       [  5.        ,   5.09901951,   5.38516481,   5.83095189,\n",
       "          6.40312424,   7.07106781,   7.81024968,   8.60232527,\n",
       "          9.43398113,  10.29563014],\n",
       "       [  6.        ,   6.08276253,   6.32455532,   6.70820393,\n",
       "          7.21110255,   7.81024968,   8.48528137,   9.21954446,\n",
       "         10.        ,  10.81665383],\n",
       "       [  7.        ,   7.07106781,   7.28010989,   7.61577311,\n",
       "          8.06225775,   8.60232527,   9.21954446,   9.89949494,\n",
       "         10.63014581,  11.40175425],\n",
       "       [  8.        ,   8.06225775,   8.24621125,   8.54400375,\n",
       "          8.94427191,   9.43398113,  10.        ,  10.63014581,\n",
       "         11.3137085 ,  12.04159458],\n",
       "       [  9.        ,   9.05538514,   9.21954446,   9.48683298,\n",
       "          9.8488578 ,  10.29563014,  10.81665383,  11.40175425,\n",
       "         12.04159458,  12.72792206]])"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.sqrt(X**2 + Y**2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Fancy Indexing\n",
    "Indexes for `numpy` arrays can be more than simple numbers and slices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-5, -4, -3, -2, -1,  0,  1,  2,  3,  4])"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "G = np.arange(10) - 5\n",
    "G"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([False, False, False, False, False, False,  True,  True,  True,  True], dtype=bool)"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "M = G > 0 # This creates a boolean array. It can be used as a mask.\n",
    "M"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 2, 3, 4])"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "G[M] # This picks only the elements, for which the mask is True.\n",
    "# Very easy way to filter data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-3,  2])"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "G[[2,7]] # Pick only few indices using a list or array as the index."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All this also works on multiple dimensions."
   ]
  },
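  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A short sketch of masks and index lists on a 2D array. The array `H` and the other names are made up for this illustration:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "H = np.arange(12).reshape((3, 4)) - 5   # hypothetical 2D example array\n",
    "\n",
    "# A boolean mask of the same shape selects the matching elements as a 1D array.\n",
    "positives = H[H > 0]\n",
    "\n",
    "# Index lists for each dimension are paired up: this picks H[0, 1] and H[2, 3].\n",
    "picked = H[[0, 2], [1, 3]]\n",
    "\n",
    "# A mask along one dimension can be combined with a slice along the other.\n",
    "rows = H[np.array([True, False, True]), :]\n",
    "print(positives, picked, rows.shape)\n",
    "```\n",
    "\n",
    "Note that a full-shape mask always returns a flat array, because the selected elements need not form a rectangle."
   ]
  },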
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Structured Arrays\n",
    "Arrays can hold different data types in their columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# a list of people\n",
    "names = [ \"Aaron\", \"Freddy\", \"Xavier\", \"Kyong\", \"Carole\", \"Arla\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# Let's generate some sample data\n",
    "from numpy.random import normal # normal distribution\n",
    "height = normal(loc=1.75, scale=0.2, size=len(names))\n",
    "weight = normal(loc=75, scale=15, size=len(names))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('Aaron', 1.6415660400240297, 50.060908347007967),\n",
       " ('Freddy', 1.377722441962725, 52.688480059609802),\n",
       " ('Xavier', 1.5195474593946803, 70.869882382122924),\n",
       " ('Kyong', 1.9918895922796507, 99.821972101433801),\n",
       " ('Carole', 1.7294163329715906, 90.685158799677012),\n",
       " ('Arla', 1.7767299770080927, 51.06830862621657)]"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# generate a combined list\n",
    "L = list(zip(names, height, weight))\n",
    "L"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([['Aaron', '1.64156604002', '50.060908347'],\n",
       "       ['Freddy', '1.37772244196', '52.6884800596'],\n",
       "       ['Xavier', '1.51954745939', '70.8698823821'],\n",
       "       ['Kyong', '1.99188959228', '99.8219721014'],\n",
       "       ['Carole', '1.72941633297', '90.6851587997'],\n",
       "       ['Arla', '1.77672997701', '51.0683086262']], \n",
       "      dtype='<U13')"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A = np.array(L)\n",
    "A"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This goes back to the most general common datatype, a string in this case."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "We want a mixed data type to be able to treat numbers as numbers. Also fields should have some kind of description."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "A = np.array(L, dtype=[('name','O'), ('height','f8'), ('weight','f8')])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([('Aaron',  1.64156604,  50.06090835),\n",
       "       ('Freddy',  1.37772244,  52.68848006),\n",
       "       ('Xavier',  1.51954746,  70.86988238),\n",
       "       ('Kyong',  1.99188959,  99.8219721 ),\n",
       "       ('Carole',  1.72941633,  90.6851588 ),\n",
       "       ('Arla',  1.77672998,  51.06830863)], \n",
       "      dtype=[('name', 'O'), ('height', '<f8'), ('weight', '<f8')])"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "We store the strings as generic \"objects\". This avoids setting an upper limit on the string length beforehand and also solves several portability issues between Python 2 and 3.\n",
    "\n",
    "Reconsider this choice for very large datasets, where total memory is an issue."
   ]
  },
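  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For comparison, a sketch of the fixed-width alternative, assuming names of at most 10 characters. The sample rows and the name `fixed` are made up for this illustration:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "people = [('Aaron', 1.64, 50.1), ('Freddy', 1.38, 52.7)]   # made-up sample rows\n",
    "\n",
    "# 'U10' stores up to 10 Unicode characters per name directly in the array.\n",
    "fixed = np.array(people, dtype=[('name', 'U10'), ('height', 'f8'), ('weight', 'f8')])\n",
    "print(fixed.dtype.itemsize)   # bytes per record: 10 * 4 + 8 + 8\n",
    "\n",
    "# Longer strings are silently truncated on assignment.\n",
    "fixed['name'][0] = 'Bartholomew-James'\n",
    "print(fixed['name'][0])\n",
    "```\n",
    "\n",
    "The fixed width makes the memory footprint predictable, but the silent truncation is the price for it."
   ]
  },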
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('Freddy',  1.37772244,  52.68848006)"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A[1] # Prints all entries from the second row. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 1.64156604,  1.37772244,  1.51954746,  1.99188959,  1.72941633,\n",
       "        1.77672998])"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A['height'] # just the height column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Kyong\n",
      "Kyong\n"
     ]
    }
   ],
   "source": [
    "# Arbitrary combinations are possible.\n",
    "print(A[3]['name'])\n",
    "print(A['name'][3])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([('Aaron',  1.64156604), ('Freddy',  1.37772244),\n",
       "       ('Xavier',  1.51954746)], \n",
       "      dtype=[('name', 'O'), ('height', '<f8')])"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Indexing rules work as expected.\n",
    "A[['name','height']][:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Sorting\n",
    "You can sort by individual fields."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([('Freddy',  1.37772244,  52.68848006),\n",
       "       ('Xavier',  1.51954746,  70.86988238),\n",
       "       ('Aaron',  1.64156604,  50.06090835),\n",
       "       ('Carole',  1.72941633,  90.6851588 ),\n",
       "       ('Arla',  1.77672998,  51.06830863),\n",
       "       ('Kyong',  1.99188959,  99.8219721 )], \n",
       "      dtype=[('name', 'O'), ('height', '<f8'), ('weight', '<f8')])"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.sort(A, order='height') # This creates a new sorted array and leaves the original one untouched."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "A.sort(order='weight') # This changes the order of the original array."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "#### Extending Structured Arrays\n",
    "Arrays have a fixed types. To extend the fields we need to create a new array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 18.57727489,  16.17739593,  27.7582578 ,  30.69256431,\n",
       "        30.32055213,  25.15913009])"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "BMI = A['weight'] / A['height']**2\n",
    "BMI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/tux/.local/lib/python3.5/site-packages/ipykernel_launcher.py:1: FutureWarning: Assignment between structured arrays with different field names will change in numpy 1.13.\n",
      "\n",
      "Previously fields in the dst would be set to the value of the identically-named field in the src. In numpy 1.13 fields will instead be assigned 'by position': The Nth field of the dst will be set to the Nth field of the src array.\n",
      "\n",
      "See the release notes for details\n",
      "  \"\"\"Entry point for launching an IPython kernel.\n"
     ]
    }
   ],
   "source": [
    "B = np.array(A, dtype=[('name','O'), ('height','f8'), ('weight','f8'), ('BMI','f8')])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([('Aaron',  1.64156604,  50.06090835,  0.),\n",
       "       ('Arla',  1.77672998,  51.06830863,  0.),\n",
       "       ('Freddy',  1.37772244,  52.68848006,  0.),\n",
       "       ('Xavier',  1.51954746,  70.86988238,  0.),\n",
       "       ('Carole',  1.72941633,  90.6851588 ,  0.),\n",
       "       ('Kyong',  1.99188959,  99.8219721 ,  0.)], \n",
       "      dtype=[('name', 'O'), ('height', '<f8'), ('weight', '<f8'), ('BMI', '<f8')])"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "B"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "B['BMI'] = BMI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['Freddy', 'Xavier', 'Carole', 'Kyong'], dtype=object)"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "B['name'][B['BMI'] > 25.] # Selecting data is easy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Record Arrays\n",
    "These provide a convenient interface to structured arrays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "C = np.rec.array(B)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 18.57727489,  16.17739593,  27.7582578 ,  30.69256431,\n",
       "        30.32055213,  25.15913009])"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "C.weight / C.height**2 # access via attributes possible"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Input / Output\n",
    "`numpy` has many routines for reading and writing data in plain text and binary form."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "R = np.random.rand(30, 4) * 100 - 50"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "np.savetxt('mydata.dat', R) # writes the table into a plain text file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "def printlines(fname, n=4):\n",
    "    \"\"\"Print the first n lines of the file fname.\"\"\"\n",
    "    with open('mydata.dat','r') as f:\n",
    "        for i in range(n):\n",
    "            print(f.readline().strip())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.784805495288934196e+01 4.173971821362576407e+01 4.998161922760758102e+01 1.783026450910197980e+01\n",
      "-4.022866923187805810e+01 1.001728759925784829e+01 -4.126332461183306322e+01 -1.907670533747464248e+01\n",
      "-3.051301990960874022e+01 1.498539902764959209e+01 2.293480495792766760e+01 4.709002459397736118e+01\n",
      "-3.253390068379028577e+01 5.470338543828347611e+00 -2.605682913043973059e+01 1.833121045153937700e+00\n"
     ]
    }
   ],
   "source": [
    "printlines('mydata.dat')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "The default format is `'%.18e'`, 18 decimal digits in exponential notation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "np.savetxt('mydata.dat', R, fmt='%.2g', delimiter='\\t')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "18\t42\t50\t18\n",
      "-40\t10\t-41\t-19\n",
      "-31\t15\t23\t47\n",
      "-33\t5.5\t-26\t1.8\n"
     ]
    }
   ],
   "source": [
    "printlines('mydata.dat')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "This can be used to create simple CSV (comma-separated values). The `csv` package offers more comprehensive support, specifically also for data exchange with spreadsheet software."
   ]
  },
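  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of writing CSV with `savetxt`. It writes to an in-memory buffer rather than a file, and the table `data` and the column names `a,b` are made up for this illustration:\n",
    "\n",
    "```python\n",
    "import io\n",
    "import numpy as np\n",
    "\n",
    "data = np.array([[1.5, 2.0], [3.25, 4.0]])   # small made-up table\n",
    "\n",
    "# Write to an in-memory text buffer instead of a file on disk.\n",
    "buf = io.StringIO()\n",
    "# comments='' suppresses the '# ' prefix that savetxt puts before the header.\n",
    "np.savetxt(buf, data, fmt='%.2f', delimiter=',', header='a,b', comments='')\n",
    "text = buf.getvalue()\n",
    "print(text)\n",
    "```\n",
    "\n",
    "This only covers purely numeric tables; quoting and embedded commas are where the `csv` module becomes necessary."
   ]
  },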
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "`savetxt` also takes open file handles as an argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "with open('mydata.dat','wb') as f:\n",
    "    # First write a header field.\n",
    "    f.write(b\"#A\\tB\\tfield3\\tfour\\n\")\n",
    "    # Now save the data to the open file.\n",
    "    np.savetxt(f, R, fmt=\"%.2g\", delimiter='\\t')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "#A\tB\tfield3\tfour\n",
      "18\t42\t50\t18\n",
      "-40\t10\t-41\t-19\n",
      "-31\t15\t23\t47\n"
     ]
    }
   ],
   "source": [
    "printlines('mydata.dat')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Reading plain text files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 18.  ,  42.  ,  50.  ,  18.  ],\n",
       "       [-40.  ,  10.  , -41.  , -19.  ],\n",
       "       [-31.  ,  15.  ,  23.  ,  47.  ],\n",
       "       [-33.  ,   5.5 , -26.  ,   1.8 ],\n",
       "       [ 49.  ,  31.  ,   0.92, -18.  ],\n",
       "       [ 11.  ,  37.  ,   4.4 ,  -5.4 ],\n",
       "       [ 25.  ,  46.  , -20.  ,   5.9 ],\n",
       "       [ 18.  , -38.  , -19.  ,  -5.7 ],\n",
       "       [ 29.  ,  21.  ,  44.  ,  17.  ],\n",
       "       [-36.  ,  15.  ,  13.  ,  -9.1 ],\n",
       "       [-30.  ,  -9.9 , -26.  ,  20.  ],\n",
       "       [-12.  ,   4.8 , -48.  ,  31.  ],\n",
       "       [ -4.2 ,   4.  , -34.  ,  32.  ],\n",
       "       [-31.  ,  29.  ,  48.  ,  18.  ],\n",
       "       [ 13.  , -25.  ,  45.  ,  35.  ],\n",
       "       [ 39.  , -19.  ,  11.  ,  -8.  ],\n",
       "       [-44.  , -39.  , -37.  , -38.  ],\n",
       "       [  5.1 , -23.  ,  11.  , -19.  ],\n",
       "       [-44.  , -34.  ,  20.  , -13.  ],\n",
       "       [ -6.1 ,  -5.8 ,   6.6 ,  26.  ],\n",
       "       [ 33.  , -48.  , -43.  ,  11.  ],\n",
       "       [ 19.  ,  36.  , -37.  ,  29.  ],\n",
       "       [-18.  , -34.  , -14.  , -14.  ],\n",
       "       [ 42.  ,  38.  ,  48.  , -25.  ],\n",
       "       [ 14.  , -43.  , -20.  ,   3.6 ],\n",
       "       [-33.  , -35.  ,  11.  ,  24.  ],\n",
       "       [ 48.  , -33.  ,  31.  , -24.  ],\n",
       "       [ 17.  ,  34.  ,   6.5 ,  27.  ],\n",
       "       [-48.  , -49.  , -24.  ,  23.  ],\n",
       "       [ 26.  ,  40.  ,  18.  ,  37.  ]])"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "E = np.loadtxt('mydata.dat')\n",
    "E"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "The comment character (`#`) can also be changed. The keyword argument `skiprows` is useful when skipping information at the beginning of a file."
   ]
  },
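  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A short sketch of both keyword arguments, reading from an in-memory buffer. The file content, including the `%` comment character, is made up for this illustration:\n",
    "\n",
    "```python\n",
    "import io\n",
    "import numpy as np\n",
    "\n",
    "# Made-up file content: one title line, a '%' comment line, then two data rows.\n",
    "text = 'measurement run 7\\n% x y\\n1 2\\n3 4\\n'\n",
    "\n",
    "# skiprows=1 drops the title line; comments='%' ignores the '%' line.\n",
    "table = np.loadtxt(io.StringIO(text), comments='%', skiprows=1)\n",
    "print(table)\n",
    "```\n",
    "\n",
    "Without `skiprows=1` the title line would be parsed as data and the call would fail."
   ]
  },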
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "`genfromtxt` is a more powerful version of loadtxt. It can read field directly from a header line and can apply arbitrary conversions to columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "F = np.genfromtxt('mydata.dat', names=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([( 18. ,  42. ,  50.  ,  18. ), (-40. ,  10. , -41.  , -19. ),\n",
       "       (-31. ,  15. ,  23.  ,  47. ), (-33. ,   5.5, -26.  ,   1.8),\n",
       "       ( 49. ,  31. ,   0.92, -18. ), ( 11. ,  37. ,   4.4 ,  -5.4),\n",
       "       ( 25. ,  46. , -20.  ,   5.9), ( 18. , -38. , -19.  ,  -5.7),\n",
       "       ( 29. ,  21. ,  44.  ,  17. ), (-36. ,  15. ,  13.  ,  -9.1),\n",
       "       (-30. ,  -9.9, -26.  ,  20. ), (-12. ,   4.8, -48.  ,  31. ),\n",
       "       ( -4.2,   4. , -34.  ,  32. ), (-31. ,  29. ,  48.  ,  18. ),\n",
       "       ( 13. , -25. ,  45.  ,  35. ), ( 39. , -19. ,  11.  ,  -8. ),\n",
       "       (-44. , -39. , -37.  , -38. ), (  5.1, -23. ,  11.  , -19. ),\n",
       "       (-44. , -34. ,  20.  , -13. ), ( -6.1,  -5.8,   6.6 ,  26. ),\n",
       "       ( 33. , -48. , -43.  ,  11. ), ( 19. ,  36. , -37.  ,  29. ),\n",
       "       (-18. , -34. , -14.  , -14. ), ( 42. ,  38. ,  48.  , -25. ),\n",
       "       ( 14. , -43. , -20.  ,   3.6), (-33. , -35. ,  11.  ,  24. ),\n",
       "       ( 48. , -33. ,  31.  , -24. ), ( 17. ,  34. ,   6.5 ,  27. ),\n",
       "       (-48. , -49. , -24.  ,  23. ), ( 26. ,  40. ,  18.  ,  37. )], \n",
       "      dtype=[('A', '<f8'), ('B', '<f8'), ('field3', '<f8'), ('four', '<f8')])"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "F"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Binary Data\n",
    "Larger datasets take a lot of memory and a long time to read and write if saves as plain text. Using binary is much more efficient but we need additional information to read it."
   ]
  }
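  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the difference, using in-memory buffers instead of files. Raw bytes carry no metadata, whereas the `.npy` format written by `np.save` records data type and shape itself. The names `raw`, `restored` and `loaded` are made up for this illustration:\n",
    "\n",
    "```python\n",
    "import io\n",
    "import numpy as np\n",
    "\n",
    "R = np.random.rand(30, 4) * 100 - 50\n",
    "\n",
    "# Raw bytes: dtype and shape must be supplied again when reading.\n",
    "raw = R.tobytes()\n",
    "restored = np.frombuffer(raw, dtype='f8').reshape((30, 4))\n",
    "\n",
    "# The .npy format stores dtype and shape in a header.\n",
    "buf = io.BytesIO()\n",
    "np.save(buf, R)\n",
    "buf.seek(0)\n",
    "loaded = np.load(buf)\n",
    "print(np.array_equal(R, restored), np.array_equal(R, loaded))\n",
    "```\n",
    "\n",
    "With file names instead of buffers, `np.save('mydata.npy', R)` and `np.load('mydata.npy')` work the same way."
   ]
  }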
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}