# `numpy`
## A multidimensional array framework and more

In [1]:
import numpy as np

In [2]:
A = np.zeros((10,5)) # create an array filled with zeros
A

array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

In [3]:
A.dtype # default data type is f8, aka double-precision floating point (8 bytes per number)

dtype('float64')

In [4]:
B = np.zeros((10,5), dtype='i4') # 4 byte (32 bit) integer

In [5]:
B.dtype

dtype('int32')

Some common data types:
- `f` floating-point number: `f4` single precision, `f8` double precision
- `i` (signed) integer number: `i4` 32-bit integer, `i8` 64-bit integer
- `u` unsigned integer
- `c` complex floating-point: `c16` double-precision for real and imaginary part
- `S` string
- `O` arbitrary Python objects (inefficient but flexible)
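
The type codes listed above can be inspected interactively. The following is a minimal sketch; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# Each type code maps to a dtype with a fixed size in bytes.
for code in ['f4', 'f8', 'i4', 'i8', 'u1', 'c16']:
    dt = np.dtype(code)
    print(code, dt.name, dt.itemsize)

# An existing array can be converted with astype, which returns a copy.
x = np.array([1.7, -2.3, 3.9])
xi = x.astype('i4')  # conversion to integer truncates toward zero
print(xi)
```

Note that the number in a type code is the size in bytes, so `c16` holds two 8-byte floats.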

Arrays can be initialized from any (nested) sequence. Most commonly this is a (nested) list.

In [6]:
C = np.array([[1,2,3],[4,5,6],[7,8,9]])
C

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [7]:
C[0,2] # indices always start at 0

3

### Array Operations
Many operators are overloaded so that operations are applied element-wise.

In [8]:
2 * C

array([[ 2,  4,  6],
       [ 8, 10, 12],
       [14, 16, 18]])

In [9]:
C + C**2

array([[ 2,  6, 12],
       [20, 30, 42],
       [56, 72, 90]])

In [10]:
C + C.T # .T transposes the array

array([[ 2,  6, 10],
       [ 6, 10, 14],
       [10, 14, 18]])
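
Because `*` acts element-wise, it is not the matrix product. As a small sketch of the difference (the helper array `eye` is introduced here only for illustration), compare `*` with the `@` operator:

```python
import numpy as np

C = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
eye = np.eye(3, dtype=int)  # 3x3 identity matrix

elementwise = C * eye  # multiplies entry by entry: only the diagonal of C survives
matrix_prod = C @ eye  # matrix product: multiplying by the identity returns C
print(elementwise)
print(matrix_prod)
```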

### Shapes
Arrays can easily be flattened (converted to 1D) or reshaped, provided that the total size does not change.

In [11]:
C.shape

(3, 3)

In [12]:
C.flatten()

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [13]:
D = np.arange(20) # like range but returns an array instead of an iterator
D

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [14]:
D.reshape((4,5))

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
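
The size constraint mentioned above can be seen directly. This is a short sketch, assuming the same `D` as in the cell above; the names `D2` and `flat` are illustrative.

```python
import numpy as np

D = np.arange(20)

# -1 asks numpy to infer that dimension from the total size.
D2 = D.reshape((4, -1))
print(D2.shape)

# A shape with a different total size is rejected.
try:
    D.reshape((3, 7))
except ValueError as err:
    print("reshape failed:", err)

# flatten() always returns a copy, so changing it leaves D2 untouched.
flat = D2.flatten()
flat[0] = 99
print(D2[0, 0])
```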

### Broadcasting
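For an operation between two arrays to succeed, each pair of corresponding dimensions must either be equal or one of them must have size one. In the latter case the size-one dimension is broadcast (repeated) over the entries of the other array in that dimension. Shapes are compared starting from the last dimension, and missing leading dimensions are treated as size one.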

In [15]:
E = np.array([10, 20, 30])

In [16]:
C

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [17]:
E

array([10, 20, 30])

In [18]:
C + E

array([[11, 22, 33],
       [14, 25, 36],
       [17, 28, 39]])

To control which dimension is broadcast, size-one dimensions can be inserted by indexing with `None`.

In [19]:
F = E[None,:]

In [20]:
F

array([[10, 20, 30]])

In [21]:
C + F

array([[11, 22, 33],
       [14, 25, 36],
       [17, 28, 39]])
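
Where the new dimension is inserted decides the direction of the broadcast. The sketch below reuses `C` and `E` from above; `row` and `col` are illustrative names, and `np.newaxis` is simply an alias for `None`.

```python
import numpy as np

C = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
E = np.array([10, 20, 30])

row = E[np.newaxis, :]  # shape (1, 3): the same values are added to every row
col = E[:, np.newaxis]  # shape (3, 1): the same values are added to every column

print(C + row)
print(C + col)
```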

Broadcasting can create large arrays from 1D arrays.

For example, compute the radius on a 2D grid from 1D coordinate arrays.

In [22]:
X = np.arange(10)[:,None]
Y = np.arange(10)[None,:]

In [23]:
X.shape, Y.shape

((10, 1), (1, 10))

In [24]:
(X**2 + Y**2).shape

(10, 10)

In [25]:
np.sqrt(X**2 + Y**2)

array([[  0.        ,   1.        ,   2.        ,   3.        ,
          4.        ,   5.        ,   6.        ,   7.        ,
          8.        ,   9.        ],
       [  1.        ,   1.41421356,   2.23606798,   3.16227766,
          4.12310563,   5.09901951,   6.08276253,   7.07106781,
          8.06225775,   9.05538514],
       [  2.        ,   2.23606798,   2.82842712,   3.60555128,
          4.47213595,   5.38516481,   6.32455532,   7.28010989,
          8.24621125,   9.21954446],
       [  3.        ,   3.16227766,   3.60555128,   4.24264069,
          5.        ,   5.83095189,   6.70820393,   7.61577311,
          8.54400375,   9.48683298],
       [  4.        ,   4.12310563,   4.47213595,   5.        ,
          5.65685425,   6.40312424,   7.21110255,   8.06225775,
          8.94427191,   9.8488578 ],
       [  5.        ,   5.09901951,   5.38516481,   5.83095189,
          6.40312424,   7.07106781,   7.81024968,   8.60232527,
          9.43398113,  10.29563014],
       [  

## Fancy Indexing
Indexes for `numpy` arrays can be more than simple numbers and slices.

In [26]:
G = np.arange(10) - 5
G

array([-5, -4, -3, -2, -1,  0,  1,  2,  3,  4])

In [27]:
M = G > 0 # This creates a boolean array. It can be used as a mask.
M

array([False, False, False, False, False, False,  True,  True,  True,  True], dtype=bool)

In [28]:
G[M] # Pick only the elements for which the mask is True.
# This is an easy way to filter data.

array([1, 2, 3, 4])

In [29]:
G[[2,7]] # Pick only few indices using a list or array as the index.

array([-3,  2])

All this also works in multiple dimensions.
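
As a brief sketch of the multidimensional case, using the 3x3 array from earlier (the copy `C2` is introduced here only for illustration):

```python
import numpy as np

C = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(C[C % 2 == 0])      # a boolean mask on a 2D array returns a 1D array
print(C[[0, 2]])          # a list on the first axis selects whole rows
print(C[[0, 2], [1, 2]])  # paired lists select the elements (0, 1) and (2, 2)

# Masks can also be used for assignment.
C2 = C.copy()
C2[C2 > 5] = 0
print(C2)
```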

## Structured Arrays
Arrays can hold different data types in their columns.

In [30]:
# a list of people
names = [ "Aaron", "Freddy", "Xavier", "Kyong", "Carole", "Arla"]

In [31]:
# Let's generate some sample data
from numpy.random import normal # normal distribution
height = normal(loc=1.75, scale=0.2, size=len(names))
weight = normal(loc=75, scale=15, size=len(names))

In [32]:
# generate a combined list
L = list(zip(names, height, weight))
L

[('Aaron', 1.6415660400240297, 50.060908347007967),
 ('Freddy', 1.377722441962725, 52.688480059609802),
 ('Xavier', 1.5195474593946803, 70.869882382122924),
 ('Kyong', 1.9918895922796507, 99.821972101433801),
 ('Carole', 1.7294163329715906, 90.685158799677012),
 ('Arla', 1.7767299770080927, 51.06830862621657)]

In [33]:
A = np.array(L)
A

array([['Aaron', '1.64156604002', '50.060908347'],
       ['Freddy', '1.37772244196', '52.6884800596'],
       ['Xavier', '1.51954745939', '70.8698823821'],
       ['Kyong', '1.99188959228', '99.8219721014'],
       ['Carole', '1.72941633297', '90.6851587997'],
       ['Arla', '1.77672997701', '51.0683086262']], 
      dtype='<U13')

Without an explicit data type, `numpy` falls back to the most general common data type, a string in this case.

We want a mixed data type so that numbers are treated as numbers. The fields should also have descriptive names.

In [34]:
A = np.array(L, dtype=[('name','O'), ('height','f8'), ('weight','f8')])

In [35]:
A

array([('Aaron',  1.64156604,  50.06090835),
       ('Freddy',  1.37772244,  52.68848006),
       ('Xavier',  1.51954746,  70.86988238),
       ('Kyong',  1.99188959,  99.8219721 ),
       ('Carole',  1.72941633,  90.6851588 ),
       ('Arla',  1.77672998,  51.06830863)], 
      dtype=[('name', 'O'), ('height', '<f8'), ('weight', '<f8')])

We store the strings as generic "objects". This avoids setting an upper limit on the string length beforehand and also solves several portability issues between Python 2 and 3.

Reconsider this choice for very large datasets, where total memory is an issue.
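
One possible alternative for large datasets is a fixed-width string field. The sketch below uses made-up sample values and an assumed maximum length of 10 characters (`U10`); longer names would be truncated.

```python
import numpy as np

rows = [('Aaron', 1.64, 50.1), ('Freddy', 1.38, 52.7)]  # made-up sample values

obj = np.array(rows, dtype=[('name', 'O'), ('height', 'f8'), ('weight', 'f8')])
fixed = np.array(rows, dtype=[('name', 'U10'), ('height', 'f8'), ('weight', 'f8')])

# Bytes per record inside the array. For 'O' this counts only the pointer;
# the Python string objects themselves live outside the array.
print(obj.dtype.itemsize)
print(fixed.dtype.itemsize)
```

Which variant uses less memory overall depends on the typical string length, so this is a trade-off rather than a fixed rule.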

In [36]:
A[1] # Prints all entries from the second row. 

('Freddy',  1.37772244,  52.68848006)

In [37]:
A['height'] # just the height column

array([ 1.64156604,  1.37772244,  1.51954746,  1.99188959,  1.72941633,
        1.77672998])

In [38]:
# Arbitrary combinations are possible.
print(A[3]['name'])
print(A['name'][3])

Kyong
Kyong


In [39]:
# Indexing rules work as expected.
A[['name','height']][:3]

array([('Aaron',  1.64156604), ('Freddy',  1.37772244),
       ('Xavier',  1.51954746)], 
      dtype=[('name', 'O'), ('height', '<f8')])

### Sorting
You can sort by individual fields.

In [40]:
np.sort(A, order='height') # This creates a new sorted array and leaves the original one untouched.

array([('Freddy',  1.37772244,  52.68848006),
       ('Xavier',  1.51954746,  70.86988238),
       ('Aaron',  1.64156604,  50.06090835),
       ('Carole',  1.72941633,  90.6851588 ),
       ('Arla',  1.77672998,  51.06830863),
       ('Kyong',  1.99188959,  99.8219721 )], 
      dtype=[('name', 'O'), ('height', '<f8'), ('weight', '<f8')])

In [41]:
A.sort(order='weight') # This changes the order of the original array.

#### Extending Structured Arrays
Arrays have a fixed data type. To add fields, we need to create a new array.

In [42]:
BMI = A['weight'] / A['height']**2
BMI

array([ 18.57727489,  16.17739593,  27.7582578 ,  30.69256431,
        30.32055213,  25.15913009])

In [43]:
B = np.array(A, dtype=[('name','O'), ('height','f8'), ('weight','f8'), ('BMI','f8')])


`numpy` emits a warning for this conversion: previously, fields in the destination array were set to the value of the identically named field in the source array. In `numpy` 1.13, fields are instead assigned by position: the Nth field of the destination is set to the Nth field of the source. See the `numpy` release notes for details.


In [44]:
B

array([('Aaron',  1.64156604,  50.06090835,  0.),
       ('Arla',  1.77672998,  51.06830863,  0.),
       ('Freddy',  1.37772244,  52.68848006,  0.),
       ('Xavier',  1.51954746,  70.86988238,  0.),
       ('Carole',  1.72941633,  90.6851588 ,  0.),
       ('Kyong',  1.99188959,  99.8219721 ,  0.)], 
      dtype=[('name', 'O'), ('height', '<f8'), ('weight', '<f8'), ('BMI', '<f8')])

In [45]:
B['BMI'] = BMI

In [46]:
B['name'][B['BMI'] > 25.] # Selecting data is easy.

array(['Freddy', 'Xavier', 'Carole', 'Kyong'], dtype=object)
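
Since the conversion above depends on how `numpy` matches fields between the two data types, a more explicit route is to allocate the new array and copy the fields by name. This is a minimal sketch with made-up sample data standing in for `A`; it is one way to do it, not necessarily the approach intended in the original notebook.

```python
import numpy as np

# Made-up sample data standing in for the array A above.
A = np.array([('Aaron', 1.64, 50.1), ('Kyong', 1.99, 99.8)],
             dtype=[('name', 'O'), ('height', 'f8'), ('weight', 'f8')])

# Build the extended dtype from the existing one plus the new field.
new_dtype = A.dtype.descr + [('BMI', 'f8')]
B = np.zeros(A.shape, dtype=new_dtype)

# Copy the existing columns field by field, then fill the new one.
for field in A.dtype.names:
    B[field] = A[field]
B['BMI'] = A['weight'] / A['height']**2
print(B)
```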

### Record Arrays
These provide a convenient interface to structured arrays.

In [47]:
C = np.rec.array(B)

In [48]:
C.weight / C.height**2 # access via attributes possible

array([ 18.57727489,  16.17739593,  27.7582578 ,  30.69256431,
        30.32055213,  25.15913009])

## Input / Output
`numpy` has many routines for reading and writing data in plain text and binary form.

In [49]:
R = np.random.rand(30, 4) * 100 - 50

In [50]:
np.savetxt('mydata.dat', R) # writes the table into a plain text file

In [51]:
def printlines(fname, n=4):
    """Print the first n lines of the file fname."""
    with open(fname, 'r') as f:
        for i in range(n):
            print(f.readline().strip())

In [52]:
printlines('mydata.dat')

1.784805495288934196e+01 4.173971821362576407e+01 4.998161922760758102e+01 1.783026450910197980e+01
-4.022866923187805810e+01 1.001728759925784829e+01 -4.126332461183306322e+01 -1.907670533747464248e+01
-3.051301990960874022e+01 1.498539902764959209e+01 2.293480495792766760e+01 4.709002459397736118e+01
-3.253390068379028577e+01 5.470338543828347611e+00 -2.605682913043973059e+01 1.833121045153937700e+00


The default format is `'%.18e'`, 18 decimal digits in exponential notation.

In [53]:
np.savetxt('mydata.dat', R, fmt='%.2g', delimiter='\t')

In [54]:
printlines('mydata.dat')

18	42	50	18
-40	10	-41	-19
-31	15	23	47
-33	5.5	-26	1.8


With `delimiter=','`, this can be used to create simple CSV (comma-separated values) files. The `csv` module in the standard library offers more comprehensive support, in particular for data exchange with spreadsheet software.

`savetxt` also takes open file handles as an argument.

In [55]:
with open('mydata.dat','wb') as f:
    # First write a header field.
    f.write(b"#A\tB\tfield3\tfour\n")
    # Now save the data to the open file.
    np.savetxt(f, R, fmt="%.2g", delimiter='\t')

In [56]:
printlines('mydata.dat')

#A	B	field3	four
18	42	50	18
-40	10	-41	-19
-31	15	23	47
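
As an alternative to writing the header by hand, `savetxt` accepts `header` and `comments` keyword arguments. The sketch below writes a small CSV file to a temporary directory; the file name, column names and stand-in data are assumptions made for illustration.

```python
import os
import tempfile
import numpy as np

R = np.arange(8).reshape(4, 2) / 4  # small stand-in for the random data
path = os.path.join(tempfile.mkdtemp(), 'mydata.csv')  # hypothetical file name

# header is written as the first line, prefixed by comments (default '# ').
# Passing comments='' gives a plain CSV header line.
np.savetxt(path, R, fmt='%.2f', delimiter=',', header='a,b', comments='')

with open(path) as f:
    text = f.read()
print(text)
```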


### Reading plain text files

In [57]:
E = np.loadtxt('mydata.dat')
E

array([[ 18.  ,  42.  ,  50.  ,  18.  ],
       [-40.  ,  10.  , -41.  , -19.  ],
       [-31.  ,  15.  ,  23.  ,  47.  ],
       [-33.  ,   5.5 , -26.  ,   1.8 ],
       [ 49.  ,  31.  ,   0.92, -18.  ],
       [ 11.  ,  37.  ,   4.4 ,  -5.4 ],
       [ 25.  ,  46.  , -20.  ,   5.9 ],
       [ 18.  , -38.  , -19.  ,  -5.7 ],
       [ 29.  ,  21.  ,  44.  ,  17.  ],
       [-36.  ,  15.  ,  13.  ,  -9.1 ],
       [-30.  ,  -9.9 , -26.  ,  20.  ],
       [-12.  ,   4.8 , -48.  ,  31.  ],
       [ -4.2 ,   4.  , -34.  ,  32.  ],
       [-31.  ,  29.  ,  48.  ,  18.  ],
       [ 13.  , -25.  ,  45.  ,  35.  ],
       [ 39.  , -19.  ,  11.  ,  -8.  ],
       [-44.  , -39.  , -37.  , -38.  ],
       [  5.1 , -23.  ,  11.  , -19.  ],
       [-44.  , -34.  ,  20.  , -13.  ],
       [ -6.1 ,  -5.8 ,   6.6 ,  26.  ],
       [ 33.  , -48.  , -43.  ,  11.  ],
       [ 19.  ,  36.  , -37.  ,  29.  ],
       [-18.  , -34.  , -14.  , -14.  ],
       [ 42.  ,  38.  ,  48.  , -25.  ],
       [ 14.  , 

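By default, lines starting with the comment character (`#`) are ignored, which is why the header line is skipped here. The comment character can be changed with the `comments` keyword argument. The keyword argument `skiprows` is useful for skipping information at the beginning of a file.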
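
A small sketch of both keyword arguments, reading from an in-memory string instead of a file. The file content is made up for illustration.

```python
import io
import numpy as np

# Made-up file content: two title lines, then data with '%' comments.
text = """measurement log
units: arbitrary
% x y
1.0 2.0
3.0 4.0  % trailing comment
"""

# Skip the two title lines and treat '%' as the comment character.
data = np.loadtxt(io.StringIO(text), comments='%', skiprows=2)
print(data)
```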

`genfromtxt` is a more powerful version of `loadtxt`. It can read field names directly from a header line and can apply arbitrary conversions to columns.

In [58]:
F = np.genfromtxt('mydata.dat', names=True)

In [59]:
F

array([( 18. ,  42. ,  50.  ,  18. ), (-40. ,  10. , -41.  , -19. ),
       (-31. ,  15. ,  23.  ,  47. ), (-33. ,   5.5, -26.  ,   1.8),
       ( 49. ,  31. ,   0.92, -18. ), ( 11. ,  37. ,   4.4 ,  -5.4),
       ( 25. ,  46. , -20.  ,   5.9), ( 18. , -38. , -19.  ,  -5.7),
       ( 29. ,  21. ,  44.  ,  17. ), (-36. ,  15. ,  13.  ,  -9.1),
       (-30. ,  -9.9, -26.  ,  20. ), (-12. ,   4.8, -48.  ,  31. ),
       ( -4.2,   4. , -34.  ,  32. ), (-31. ,  29. ,  48.  ,  18. ),
       ( 13. , -25. ,  45.  ,  35. ), ( 39. , -19. ,  11.  ,  -8. ),
       (-44. , -39. , -37.  , -38. ), (  5.1, -23. ,  11.  , -19. ),
       (-44. , -34. ,  20.  , -13. ), ( -6.1,  -5.8,   6.6 ,  26. ),
       ( 33. , -48. , -43.  ,  11. ), ( 19. ,  36. , -37.  ,  29. ),
       (-18. , -34. , -14.  , -14. ), ( 42. ,  38. ,  48.  , -25. ),
       ( 14. , -43. , -20.  ,   3.6), (-33. , -35. ,  11.  ,  24. ),
       ( 48. , -33. ,  31.  , -24. ), ( 17. ,  34. ,   6.5 ,  27. ),
       (-48. , -49. , -24.  ,  23.
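
The result is a structured array, so columns can be accessed by the names read from the header. The sketch below uses a short made-up string in the same layout as the file above and also shows the `converters` argument; the conversion chosen (taking the absolute value of the first column) is purely illustrative.

```python
import io
import numpy as np

# Made-up content in the same layout as the file written above.
text = "#A\tB\tfield3\tfour\n18\t42\t50\t18\n-40\t10\t-41\t-19\n"

F = np.genfromtxt(io.StringIO(text), names=True)
print(F.dtype.names)
print(F['field3'])

# converters maps a column index to a function applied to each raw entry.
G = np.genfromtxt(io.StringIO(text), names=True,
                  converters={0: lambda s: abs(float(s))})
print(G['A'])
```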

### Binary Data
Larger datasets take a lot of memory and a long time to read and write if saved as plain text. Using a binary format is much more efficient, but we need additional information (such as the data type and shape) to read it.
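
As a rough sketch of what "additional information" means, compare raw binary output with the `.npy` format. The file names and the small stand-in array are assumptions made for illustration.

```python
import os
import tempfile
import numpy as np

R = np.arange(12, dtype='f8').reshape(3, 4)  # small stand-in array
tmp = tempfile.mkdtemp()

# Raw binary: only the bytes are stored, so dtype and shape must be known.
raw_path = os.path.join(tmp, 'mydata.bin')  # hypothetical file name
R.tofile(raw_path)
R_raw = np.fromfile(raw_path, dtype='f8').reshape(3, 4)

# The .npy format stores dtype and shape in a header alongside the data.
npy_path = os.path.join(tmp, 'mydata.npy')  # hypothetical file name
np.save(npy_path, R)
R_npy = np.load(npy_path)

print(np.array_equal(R, R_raw), np.array_equal(R, R_npy))
```

The raw file contains exactly 12 values of 8 bytes each, with nothing that records the shape or data type.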