Monday, February 4, 2019

Stanislas Morbieu: Dense matrices implementation in Python

Machine learning algorithms often use matrices to store data and compute operations such as multiplications or singular value decomposition. The purpose of this article is to see how matrices are implemented in Python: how the data is stored and how much memory it consumes.

Numpy and SciPy are two major libraries which provide respectively dense and sparse matrix implementations. Both rely on Numpy ndarrays, so it is important to first look at how ndarrays are implemented. We will then compare this implementation (on the storage aspect only) to implementations in pure Python, without any external library. Last, we will see how Numpy dense matrices compare to ndarrays.

Internals of a Numpy ndarray

Numpy is a C extension Python library. It therefore provides a C API and code which will help us understand how the data is stored. For instance, in C, every ndarray is a pointer to a PyArrayObject C structure. This structure is defined as follows:

typedef struct PyArrayObject {
    PyObject_HEAD
    char *data;
    int nd;
    npy_intp *dimensions;
    npy_intp *strides;
    PyObject *base;
    PyArray_Descr *descr;
    int flags;
    PyObject *weakreflist;
} PyArrayObject;

The PyArrayObject implementation helps us better understand what the fields are used for:

  • PyObject_HEAD is a macro used for Python's internal object bookkeeping.
  • data is a pointer to the raw data buffer, i.e. the values contained in the array.
  • nd is the number of dimensions.
  • dimensions is the size in each dimension, i.e. the shape of the array.
  • strides gives how far in bytes to step, for each dimension, to get to the next element of the array (next row, or next column).
  • base is a pointer to the original array in the case of a view.
  • descr is a pointer to the type of the elements stored in the array.
  • flags gives information about how the array is stored in memory.
  • weakreflist is used for weak references.

Let's consider an example of an ndarray to explain the fields in more detail:

import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

We will first look at the nd and dimensions fields which give general information about the shape of the ndarray. Then we will see how the elements of the array are stored. References (the base and weakreflist fields) are discussed later. Last, the PyObject_HEAD field is presented.

Shape of the ndarray

The number of dimensions (nd in C) is obtained in Python with:

a.ndim

which outputs 2. The dimensions are the rows and the columns in this case. The size of the array in each dimension (dimensions in C) is obtained in Python with:

a.shape

which outputs (2, 3). It corresponds to the number of rows and the number of columns respectively. These two fields (ndim and shape) can be changed independently of the elements of the array. For instance, consider a one-dimensional array x and an array y of shape (3, 2) with the same elements:

x = np.array([1, 2, 3, 4, 5, 6])
y = a.reshape(3, 2)

their data is the same:

y.data.tobytes() == a.data.tobytes() == x.data.tobytes()

evaluates to True. Let's dig deeper to see how the data is stored.

How is the data stored?

The data is stored in a.data. Let's set b = a.data.tobytes(). The value a.itemsize gives the number of bytes used to store one element of the array. len(b) returns 48, which corresponds to the total number of bytes a.nbytes, and also to the product of a.itemsize (8) and the number of elements (6). The number of bytes used to store one element depends on the type of the elements.
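
A quick check in the interpreter (with the int64 array a defined above):

b = a.data.tobytes()  # raw data buffer as a bytes object
len(b)                # 48
a.nbytes              # 48: total memory used by the elements
a.itemsize            # 8 bytes per int64 element
a.size                # 6 elements, and 6 * 8 == 48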

Types of the elements

We can see the type used (descr in C) with:

a.dtype

which outputs dtype('int64'). It means that the elements are ints stored with 64 bits each. We can change this type, and then see the data as bytes with:

a = a.astype('int16')
b = a.data.tobytes()
print(b)

It outputs b'\x01\x00\x02\x00\x03\x00\x04\x00\x05\x00\x06\x00'. Here, \x01 is the hexadecimal string representation of the bits contained in one byte: for instance 0 is represented by \x00 and 255 is represented by \xff (we can store \(256 = 2^8\) values in one byte). Since the values are now stored with 16 bits, i.e. 2 bytes, we can retrieve the first two elements of the array (1 and 2) with:

int.from_bytes(b[:2], 'little')   # 1
int.from_bytes(b[2:4], 'little')  # 2

Two questions arise:

  • What does little mean?
  • And why is 2 the second element and not 4?

Little endian or big endian

Here, little corresponds to little-endian. It means that the least significant byte comes first: the leftmost byte in the buffer carries the smallest part of the number. If instead we want the data to be stored in big-endian, we can change it and see the data with:

a = a.astype('>i2')
print(a.data.tobytes())

It now outputs b'\x00\x01\x00\x02\x00\x03\x00\x04\x00\x05\x00\x06'. Note here that each pair of bytes (corresponding to one element of the array) is inverted. We can read the array without changing the data but reading it as if it were stored in little-endian:

a.newbyteorder('<')

which outputs:

array([[ 256,  512,  768],
       [1024, 1280, 1536]], dtype=int16)

Let's understand what happened by looking at each element of the array (i.e. two bytes). With big-endian:

bytes in hexadecimal | number (big-endian)  | bits
\x00 \x01            | \(2^0 = 1\)          | 00000000 00000001
\x00 \x02            | \(2^1 = 2\)          | 00000000 00000010
\x00 \x03            | \(2^0 + 2^1 = 3\)    | 00000000 00000011
\x00 \x04            | \(2^2 = 4\)          | 00000000 00000100
\x00 \x05            | \(2^2 + 2^0 = 5\)    | 00000000 00000101
\x00 \x06            | \(2^2 + 2^1 = 6\)    | 00000000 00000110

In this table, we wrote the least significant bit on the right (i.e. if this bit is 1 then it corresponds to \(2^0\) in the decomposition, the second from the right corresponds to \(2^1\) and so on). In big-endian, the most significant byte comes first, so the bytes read naturally from left to right. If instead we read the same bytes in little-endian (i.e. the least significant byte is on the left), we have the following numbers:

bytes in hexadecimal | number (little-endian)  | bits (bytes are reordered)
\x00 \x01            | \(2^8 = 256\)           | 00000001 00000000
\x00 \x02            | \(2^9 = 512\)           | 00000010 00000000
\x00 \x03            | \(2^8 + 2^9 = 768\)     | 00000011 00000000
\x00 \x04            | \(2^{10} = 1024\)       | 00000100 00000000
\x00 \x05            | \(2^8 + 2^{10} = 1280\) | 00000101 00000000
\x00 \x06            | \(2^9 + 2^{10} = 1536\) | 00000110 00000000

Note that we inverted the order of the bytes, not the order of all the bits as a whole: for \x00\x01 we get 00000001 00000000 in little-endian, not 10000000 00000000.
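
We can verify both readings of the same two bytes directly with int.from_bytes:

int.from_bytes(b'\x00\x01', 'big')     # 1
int.from_bytes(b'\x00\x01', 'little')  # 256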

We saw how each element is stored individually. Let's see why the second element in the data buffer is 2 (first row and second column) and not 4 (first column and second row).

Column major or row major

The second element is 2 because the rows are laid out as contiguous blocks: 1, 2, 3, 4, 5, 6 (the first row is 1, 2, 3, which are contiguous, and the second row is 4, 5, 6). This layout is called "row major" or "C-style contiguous". The flags property (a.flags) tells us how the data is actually stored in memory:

C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False

In this case the data is stored in C-style. We can also store it as contiguous blocks of columns, called "Fortran-style contiguous" or "column major" with:

y = np.asfortranarray(a)

The flags changed:

C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False

but the logical data didn't: a.data == y.data returns True. The array is still indexed C-style, even if the underlying data is stored in memory Fortran-style. This enables us to manipulate the array in the same way regardless of how it is stored in memory.

Storing it as row major or column major does, however, influence the efficiency of operations on the array: for a C contiguous memory layout, operations such as a.sum(axis=1) (which sums each row) are usually faster than column-wise operations, since they access memory addresses which are next to each other. a.strides gives how far in bytes to step to get to the next value, respectively for the next row and the next column. Here it is (6, 2) for C-style and (2, 4) for Fortran-style.
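
A quick check of these numbers (a is the 2x3 int16 array from above, y its Fortran-ordered copy):

a.strides    # (6, 2): step 6 bytes to the next row, 2 bytes to the next column
y.strides    # (2, 4): step 2 bytes to the next row, 4 bytes to the next column
a.T.strides  # (2, 6): transposing simply swaps the strides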

We saw how each data element is stored and how the elements are laid out, along with the corresponding fields in the C structure and their Python counterparts. Let's move on to references.

References

When we create a view on an array, the data is not copied: a reference to the underlying data is stored. For instance, if we set the variable c as the two last columns of the array a with c = a[:, -2:], we can access the array a with c.base. The value c.flags.owndata is therefore False whereas a.flags.owndata is True. The base field also points to the buffer object if the array is constructed from a buffer.
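
A short check of this behaviour:

c = a[:, -2:]    # view on the last two columns: the data is not copied
c.base is a      # True: c references the array which owns the memory
c.flags.owndata  # False
a.flags.owndata  # True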

The C field weakreflist is a pointer to weak references. More information is given by the documentation of the weakref module.
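
Since an ndarray carries this weakreflist field, it supports weak references; here is a minimal illustration (on a fresh array, to leave a untouched):

import weakref

arr = np.arange(3)
r = weakref.ref(arr)  # creating a weak reference does not increase the refcount
r() is arr            # True while arr is alive
del arr
r()                   # None: the array has been deallocated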

PyObject_HEAD

There is one last thing in the C structure we didn't cover yet: PyObject_HEAD. It is a macro defined as follows (see the source code of PyObject_HEAD):

PyObject ob_base;

It is used for objects without a varying length, as is the case with an ndarray.

PyObject is defined as (see source code of PyObject):

typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;

The macro _PyObject_HEAD_EXTRA is empty, except if you turn on reference tracing (Py_TRACE_REFS).

ob_refcnt is used for reference counting: if this count hits zero, then CPython will free all the resources used by the object.
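
We can observe this count from Python with sys.getrefcount; note that the call itself temporarily adds one reference:

import sys

x = np.arange(3)
sys.getrefcount(x)  # 2: one reference held by x, one by the argument of the call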

The ob_type member of this structure contains a pointer to the type (see the source code of the definition of PyTypeObject: PyTypeObject is an alias of struct _typeobject).

Python implementations without external libraries

We saw how data is stored in a Numpy ndarray. If we want to store an array in Python, we could either use the array module which stores data compactly and thus requires the type of each element to be the same, or use other containers such as lists.

The drawback of Python containers is that each element is an object and thus is implemented in C as a PyObject, with the following structure:

typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;

Each object adds fields to this structure to store its data (as is the case for the ndarray). So extra memory is used per element to store the pointer to the type and the reference count; moreover, the elements are not stored contiguously, so access is slower due to poorer cache efficiency.
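
A rough comparison of the three options for 1000 integers (the exact numbers depend on the Python and Numpy versions):

import sys
from array import array
import numpy as np

values = list(range(1000))

# list: the container stores pointers, and each int is a full PyObject
list_size = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

compact = array('q', values)           # array module: 8 bytes per element
nd = np.array(values, dtype=np.int64)  # ndarray: contiguous raw buffer

print(list_size)               # tens of kilobytes
print(sys.getsizeof(compact))  # about 8000 bytes plus a small header
print(nd.nbytes)               # exactly 8000 bytes for the data buffer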

From ndarray to Numpy dense matrix

The Numpy dense matrix is a thin layer on top of ndarray which provides facilities for linear algebra. The differences between dense matrices and ndarrays are given in the Numpy documentation, and we can see their implementation in the source code of the Numpy dense matrix. Their use is now deprecated: one of their main advantages was matrix multiplication with the * operator, but PEP 465 introduced the @ operator to provide it for ndarrays.
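
A small illustration (np.matrix is shown only for comparison, since its use is discouraged):

m = np.matrix([[1, 2], [3, 4]])
m * m  # matrix multiplication with the * operator

n = np.array([[1, 2], [3, 4]])
n * n  # element-wise multiplication for ndarrays
n @ n  # matrix multiplication, introduced by PEP 465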

Take away

Here are a few things to remember from this blog post:

  • An ndarray is way more efficient to store a dense matrix than a Python list of values.
  • We can compute the size of an ndarray a priori by multiplying the number of elements by the number of bytes used to store one element (plus a few bytes of overhead, which are negligible).
  • We can reduce the size of an array by selecting the number of bytes to encode each entry.
  • Transposition is very efficient since it only swaps the strides: the data itself does not change (see the example below).
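
For instance, for the second and fourth points:

a = np.zeros((1000, 500), dtype=np.int64)
a.nbytes       # 1000 * 500 * 8 == 4000000 bytes, computable a priori
a.T.strides    # (8, 4000): the strides of a, swapped
a.T.base is a  # True: the transpose is a view, no data was copied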

