Basic Cholesky Implementation

I spent a bunch of time talking about using lower-level libraries (LAPACK directly and via LAPACKE or hand-written wrappers). My next set of posts is going to look at manually implementing these techniques. Mostly, I wanted to spend some time with the numerical algorithms; I last visited them about 15 years ago in graduate school. There are also several performance lessons along the way (demonstrating the limits of a "just compile it" approach to optimization). For starters, we'll implement the Cholesky decomposition, which is very easy to describe mathematically.

Spin Up

In [2]:
from math import sqrt
import numpy as np
import numba
import itertools as it
import functools as ft
import operator as op
import timeit

Some Testing Helpers

I really need to factor these testing helpers out into a testing_utils.py module for all these posts. The problem with that is the posts are no longer self-contained for reading and running. *sigh* Nothing's perfect. It turns out there was enough "testing"-related code that I did factor it out into a standalone post. In turn, I placed the Python code from that post in chol_test_post.py and I'm going to import it here.

In [3]:
from chol_test_post import (time_routines, same_results, expand, compose, 
                           make_random_vecs_with_cond, make_spd_opt)
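
(If you don't have chol_test_post.py handy, here's a rough sketch of the three helpers actually used below. The names match the import, but the bodies are my guesses based on how they're called in this post, not the originals.)

def compose(f, g):
    # compose(f, g)(x) == f(g(x))
    return lambda *args: f(g(*args))

def same_results(routines, args, atol=1e-8):
    # run every routine on the same arguments and check the answers agree
    results = [r(*args) for r in routines]
    return all(np.allclose(results[0], r, atol=atol) for r in results[1:])

def time_routines(routines, args, number=1000):
    # time each routine and print the results fastest-first
    timings = sorted((timeit.timeit(lambda: r(*args), number=number), r.__name__)
                     for r in routines)
    for secs, name in timings:
        print("%s : %.4f" % (name, secs))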

Basic Cholesky

As I said, the mathematical description of Cholesky is pretty simple. For a symmetric, positive-definite matrix (so says Atkinson, Intro. to Numerical Analysis — I forget the corner case of positive-semidefinite), we can factor it: \(A \rightarrow LL^T\) where \(L\) is a lower-triangular matrix. Sometimes, this is called a "matrix square root". Element-wise, the diagonal and off-diagonal entries of \(L\) are:

\[l_{kk} = \sqrt{a_{kk} - \sum_{j=1}^{k-1} l_{kj}^2}\] \[l_{ik} = \frac{1}{l_{kk}} \left( a_{ik} - \sum_{j=1}^{k-1} l_{ij} l_{kj} \right)\]
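
As a quick sanity check on those formulas (a toy example of my own), take \(A = \begin{pmatrix} 4 & 2 \\ 2 & 3 \end{pmatrix}\). Then \(l_{11} = \sqrt{a_{11}} = 2\), \(l_{21} = a_{21}/l_{11} = 1\), and \(l_{22} = \sqrt{a_{22} - l_{21}^2} = \sqrt{2}\); multiplying back out, \(LL^T\) recovers \(A\).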

In [4]:
def cholesky0(A):
    """
       Performs a Cholesky decomposition of a symmetric, pos-def A.
       Returns lower-triangular L (full sized, zeroed above diag)
    """
    n = A.shape[0]
    L = np.zeros_like(A)

    # Perform the Cholesky decomposition
    for row in xrange(n):
        for col in xrange(row+1):
            tmp_sum = np.dot(L[row,:col], L[col,:col])         
            if (row == col): # Diagonal elements
                L[row, col] = sqrt(max(A[row,row] - tmp_sum, 0))
            else:
                L[row,col] = (1.0 / L[col,col]) * (A[row,col] - tmp_sum)
    return L

We’re doing as direct an implementation as possible. But that directness lets us call on ContinuumIO’s great numba project to JIT-compile (just-in-time) our code. It helps (in some ways), as we’ll see below. I’ll throw a note out: I’m not a jit-jedi. More specifically, I’ve only really toyed around with numba, so nothing I say here is necessarily current (numba moves pretty fast in terms of features) or even best-practice.

In [5]:
@numba.jit(numba.double[:,:](numba.double[:,:]))
def cholesky1(A):
    """
       Performs a Cholesky decomposition of a symmetric, pos-def A.
       Returns lower-triangular L (full sized, zeroed above diag)
    """
    n = A.shape[0]
    L = np.zeros_like(A)

    # Perform the Cholesky decomposition
    for row in xrange(n):
        for col in xrange(row+1):
            tmp_sum = np.dot(L[row,:col], L[col,:col])         
            if (row == col): # Diagonal elements
                L[row, col] = sqrt(max(A[row,row] - tmp_sum, 0))
            else:
                L[row,col] = (1.0 / L[col,col]) * (A[row,col] - tmp_sum)
    return L

I’m not a numba wizard, but it seemed that with my version, I had to eliminate np.dot to enable numba‘s nopython-mode (full-speed-ahead-numba).

In [6]:
@numba.jit(numba.double[:,:](numba.double[:,:]), nopython=True)
def cholesky2(A):
    """
       Performs a Cholesky decomposition of a symmetric, pos-def A.
       Returns lower-triangular L (full sized, zeroed above diag)
    """
    n = A.shape[0]
    L = np.zeros_like(A)

    # Perform the Cholesky decomposition
    for row in xrange(n):
        for col in xrange(row+1):

            # eliminating np.dot allows us to go numba nopython
            # but, the release notes for numba 0.23.0
            # reference:  https://github.com/numba/numba/pull/1560
            tmp_sum = 0.0
            for j in xrange(col):
                tmp_sum += L[row,j] * L[col,j]

            if (row == col): 
                # diag elts.
                L[row,col] = sqrt(A[row,row] - tmp_sum)
            else:
                # off-diag elts.
                L[row,col] = (1.0 / L[col,col] * (A[row,col] - tmp_sum))
    return L

One of the key rules of numerical linear algebra is that you gain performance when you don’t do work. Ha! Actually, that’s a general rule of life. But here, it means that (1) when we know the structure of a matrix, (2) we take advantage of that structure to do less work. Practically speaking, instead of working with a zeroed starting matrix, we’ll start with raw (arbitrary) memory and only use values that we know have been set properly. We’ll zero out the super-diagonal pieces separately.
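
As a quick (and unscientific) check that skipping the zero-fill really does save something, here's the kind of micro-timing I have in mind (numbers will vary by machine):

# zeros_like has to touch every element; empty_like just grabs memory
A = np.zeros((1024, 1024))
print(timeit.timeit(lambda: np.zeros_like(A), number=1000))
print(timeit.timeit(lambda: np.empty_like(A), number=1000))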

In [7]:
@numba.jit(numba.double[:,:](numba.double[:,:]), nopython=True)
def cholesky3(A):
    """
       Performs a Cholesky decomposition of a symmetric, pos-def A.
       Returns lower-triangular L (full sized, zeroed above diag)
    """
    n = A.shape[0]
    L = np.empty_like(A)

    # Perform the Cholesky decomposition
    for row in xrange(n):
        for col in xrange(row+1):
            tmp_sum = 0.0
            for j in xrange(col):
                tmp_sum += L[row,j] * L[col,j]
            
            if (row == col): 
                # diag elts.
                L[row,col] = sqrt(A[row,row] - tmp_sum)
            else:
                # off diag elts.
                L[row,col] = (1.0 / L[col,col] * (A[row,col] - tmp_sum))
        L[row, row+1:] = 0.0
    return L

Packed Storage Calculations

It turns out we were really only scratching the surface of respecting the structure of the matrix. We were still using an \(n^2\) piece of memory for a triangular matrix when a \(\frac{n(n+1)}{2}\) piece of memory will do. Sure, it’s still \(O(n^2)\), but savings is savings. And saving 50% is a good thing. We’ll use a long, flat array to store the rows of the triangular matrix one after another. To help us out, we’ll define some helpers to convert from row-and-column "full" indices to a "straightline" packed index.
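
Concretely, if we pack the rows of the lower triangle one after another, element \((r, c)\) (with \(c \le r\)) of the full matrix lands at flat index

\[\mathrm{idx}(r, c) = \frac{r(r+1)}{2} + c\]

since rows \(0, \ldots, r-1\) contribute \(1 + 2 + \cdots + r = \frac{r(r+1)}{2}\) elements before row \(r\) starts.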

In [8]:
#if we were column packing
#@numba.jit(numba.uint(numba.uint, numba.uint, numba.uint), nopython=True)
#def col_packed_lt(r,c,n):
#    return r + (c * (-c + 2*n + 1) * .5)

# this would be row-packed upper triangular
#@numba.jit(numba.uint(numba.uint, numba.uint, numba.uint), nopython=True)
#def row_packed(r,c,n):
#    return col_packed(c,r,n)

@numba.jit(numba.uint(numba.uint, numba.uint, numba.uint), nopython=True)
def row_packed_lt(r,c,n):
    # flat index of element (r,c) in a row-packed lower-triangular matrix
    return c + (r * (r+1)) * .5

And, for testing purposes, we may want to expand our packed form back to a full, normal form.

In [9]:
def expand_row_packed(L, n, cols=None):
    E = np.empty((n,cols)) if cols else np.empty((n,n))
    E[np.tril_indices(n,  m=cols)] = L
    E[np.triu_indices(n,1,m=cols)] = 0.0
    return E

Here’s a conversion of [0, 1, 2, 3, 4, 5] interpreted as a 3×3 lower triangular matrix:

In [10]:
L = np.arange(6)
expand_row_packed(L, 3)
Out[10]:
array([[ 0.,  0.,  0.],
       [ 1.,  2.,  0.],
       [ 3.,  4.,  5.]])

The elements in column 0 of a lower-triangular matrix stored in row-packed form will be at:

In [11]:
N = 3
for i in range(N):
    print row_packed_lt(i, 0, N)
0
1
3

Packed Cholesky Routines

In [12]:
@numba.jit(numba.double[:](numba.double[:,:]), nopython=True)
def cholesky_packed1(A):
    """
       Performs a Cholesky decomposition of a symmetric, pos-def A.
       Returns lower-triangular L packed (row-wise) in a vector
    """
    n = A.shape[0]
    numElts = n * (n+1) / 2
    L = np.empty(numElts)
    
    for row in xrange(n):
        for col in xrange(row+1):
            if (row == col): 
                # diag elts
                tmp_sum = 0.0
                base = row_packed_lt(col, 0, n)
                for j in xrange(col):
                    tmp_sum += L[base + j] ** 2

                L[row_packed_lt(row,col,n)] = sqrt(A[row,row] - tmp_sum)
            else:
                # off diag elts
                tmp_sum = 0.0
                for j in xrange(col):
                    tmp_sum += L[row_packed_lt(row, 0, n) + j] * L[row_packed_lt(col, 0, n) + j]
                L[row_packed_lt(row,col,n)] = ((1.0 / L[row_packed_lt(col,col,n)]) * 
                                               (A[row][col] - tmp_sum))
    return L    

And now, let’s hoist some common computations out of the inner loops:

In [13]:
@numba.jit(numba.double[:](numba.double[:,:]), nopython=True)
def cholesky_packed2(A):
    """
       Performs a Cholesky decomposition of a symmetric, pos-def A.
       Returns lower-triangular L packed (row-wise) in a vector
    """
    n = A.shape[0]
    numElts = n * (n+1) / 2
    L = np.empty(numElts)
    
    for row in xrange(n):
        for col in xrange(row+1):
            if (row == col): # Diagonal elements
                tmp_sum = 0.0
                base = row_packed_lt(col, 0, n)
                for j in xrange(col):
                    tmp_sum += L[base + j] ** 2
                L[row_packed_lt(row,col,n)] = sqrt(A[row,row] - tmp_sum)
            else:
                tmp_sum = 0.0
                base1 = row_packed_lt(row, 0, n)
                base2 = row_packed_lt(col, 0, n)
                for j in xrange(col):
                    tmp_sum += L[base1 + j] * L[base2 + j]
                #+= base1 and base2 here *crushed* the timing
                L[base1+col] = ((1.0 / L[base2+col]) * (A[row][col] - tmp_sum))
    return L    
In [14]:
@numba.jit(numba.double[:](numba.double[:,:]), nopython=True)
def cholesky_packed3(A):
    """
       Performs a Cholesky decomposition of a symmetric, pos-def A.
       Returns lower-triangular L packed (row-wise) in a vector
    """
    n = A.shape[0]
    numElts = n * (n+1) / 2
    L = np.empty(numElts)
    
    for row in xrange(n):
        for col in xrange(row+1):
            if (row == col): # Diagonal elements
                base = row_packed_lt(col, 0, n)
                tmp_sum = np.sum(L[base:base+col]**2)
                L[row_packed_lt(row,col,n)] = sqrt(A[row,row] - tmp_sum)
            else:
                base1 = row_packed_lt(row, 0, n)
                base2 = row_packed_lt(col, 0, n)
                b1p = base1 + col
                b2p = base2 + col
                tmp_sum = np.sum(L[base1:b1p] * L[base2:b2p])
                L[b1p] = ((1.0 / L[b2p]) * (A[row][col] - tmp_sum))
    return L    

Testing (Correctness)

In [15]:
A = np.array([[6, 3, 4, 8], 
              [3, 6, 5, 1], 
              [4, 5, 10, 7], 
              [8, 1, 7, 25]], dtype=np.float64)
print "A:"
print A

print "L (np.linalg):"
print np.linalg.cholesky(A)
A:
[[  6.   3.   4.   8.]
 [  3.   6.   5.   1.]
 [  4.   5.  10.   7.]
 [  8.   1.   7.  25.]]
L (np.linalg):
[[ 2.44948974  0.          0.          0.        ]
 [ 1.22474487  2.12132034  0.          0.        ]
 [ 1.63299316  1.41421356  2.30940108  0.        ]
 [ 3.26598632 -1.41421356  1.58771324  3.13249102]]
In [16]:
routines = [np.linalg.cholesky, cholesky0, cholesky1, cholesky2, cholesky3]
same_results(routines, (A,))
Out[16]:
True
In [17]:
expander = ft.partial(expand_row_packed, n=A.shape[0])
routines = [np.linalg.cholesky] + [compose(expander, r)for r in (cholesky_packed1, 
                                                                cholesky_packed2, 
                                                                cholesky_packed3)]
same_results(routines, (A,))
Out[17]:
True

Testing (Performance)

In [18]:
A = np.array([[6, 3, 4, 8], 
              [3, 6, 5, 1], 
              [4, 5, 10, 7], 
              [8, 1, 7, 25]], dtype=np.float64)
print "A:"
print A
A:
[[  6.   3.   4.   8.]
 [  3.   6.   5.   1.]
 [  4.   5.  10.   7.]
 [  8.   1.   7.  25.]]
In [19]:
time_routines([np.linalg.cholesky], (A,), number=10000)
cholesky : 0.1380
In [20]:
time_routines([cholesky0, cholesky1, cholesky2, cholesky3], (A,), number=10000)
cholesky2 : 0.0064
cholesky3 : 0.0065
cholesky0 : 0.2947
cholesky1 : 0.3358
In [21]:
time_routines([cholesky_packed1, cholesky_packed2, cholesky_packed3], (A,), number=10000)
cholesky_packed1 : 0.0073
cholesky_packed2 : 0.0076
cholesky_packed3 : 0.0179
In [22]:
_all = [np.linalg.cholesky, 
        cholesky0, cholesky1, cholesky2, cholesky3,
        cholesky_packed1, cholesky_packed2, cholesky_packed3]
time_routines(_all, (A,), number=1000)
       cholesky2 : 0.0006
       cholesky3 : 0.0006
cholesky_packed1 : 0.0007
cholesky_packed2 : 0.0007
cholesky_packed3 : 0.0017
        cholesky : 0.0138
       cholesky0 : 0.0296
       cholesky1 : 0.0336

Bigger Fish

What happens when we scale up to bigger input matrices? Well, since we need SPD (symmetric, positive-definite) matrices, we have to construct them. Doing that in a naive way has a major issue: we can get a horribly conditioned matrix (and Cholesky is sensitive to the square of the condition number). We can improve that by adding (relatively) large elements on the diagonal to push the matrix towards being "diagonally dominant".

In [23]:
n = np.random.random_integers(1,10,(5,1))
outer_n = np.outer(n,n)
print outer_n
print np.real_if_close(np.linalg.eig(outer_n)[0]) # remove small imaginary parts
print np.linalg.cond(outer_n) # can go to float("inf")

# condition normalizer; not-so-regular Tikhonov regularization
# pushes towards diagonally dominant
outer_n = np.outer(n,n) + 5 * np.eye(5) 
print outer_n
print np.real_if_close(np.linalg.eig(outer_n)[0])
print np.linalg.cond(outer_n)
[[ 49  35  70  14  70]
 [ 35  25  50  10  50]
 [ 70  50 100  20 100]
 [ 14  10  20   4  20]
 [ 70  50 100  20 100]]
[  0.00000000e+00   2.78000000e+02   1.84889275e-32   8.20146785e-17
   0.00000000e+00]
3.80552400456e+49
[[  54.   35.   70.   14.   70.]
 [  35.   30.   50.   10.   50.]
 [  70.   50.  105.   20.  100.]
 [  14.   10.   20.    9.   20.]
 [  70.   50.  100.   20.  105.]]
[   5.  283.    5.    5.    5.]
56.6

But, we’re going to make use of the make_spd_opt routine we defined in our testing helpers. Here’s what it looks like (3 is the length of the diagonal, 10 is the condition number of the result).

In [24]:
A = make_spd_opt(3, 10)
print A
print np.linalg.eig(A)[0]
[[ 3.02040816  2.75510204  2.20408163]
 [ 2.75510204  8.43877551  0.55102041]
 [ 2.20408163  0.55102041  5.04081633]]
[  1.   10.    5.5]
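
For completeness, here's a rough sketch of how a helper like make_spd_opt could be built (this is only my guess at the recipe; the real definition is in the testing-helpers post): pick eigenvalues spanning [1, cond], pick a random orthogonal basis, and rotate a diagonal matrix into that basis.

def make_spd_sketch(n, cond):
    # eigenvalues from 1 to cond give a condition number of exactly cond
    eigs = np.linspace(1.0, cond, n)
    # a random orthogonal matrix via QR of a random Gaussian matrix
    Q, _ = np.linalg.qr(np.random.randn(n, n))
    # Q * diag(eigs) * Q.T is symmetric positive-definite by construction
    return Q.dot(np.diag(eigs)).dot(Q.T)
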
In [24]:
A = make_spd_opt(2**10, 10)
_all = [np.linalg.cholesky, 
        cholesky0, cholesky1, cholesky2, cholesky3,
        cholesky_packed1, cholesky_packed2, cholesky_packed3]
time_routines(_all, (A,), number=1)
        cholesky : 0.0240
       cholesky3 : 0.1463
       cholesky2 : 0.1465
cholesky_packed2 : 0.2107
cholesky_packed1 : 0.2144
cholesky_packed3 : 0.3146
       cholesky0 : 1.1020
       cholesky1 : 1.3282
In [25]:
A = make_spd_opt(2**12, 10)
drop_pokey = [np.linalg.cholesky,
              cholesky2, cholesky3]
time_routines(drop_pokey, (A,), number=1)
 cholesky : 0.8519
cholesky3 : 10.5039
cholesky2 : 10.5152
In [26]:
time_routines([cholesky_packed2], [A], number=1)
cholesky_packed2 : 14.2552
In [27]:
A = make_spd_opt(2**13, 10)
showdown = [np.linalg.cholesky, cholesky2, cholesky_packed2]
time_routines(showdown, [A], number=1)
        cholesky : 5.3662
       cholesky2 : 82.1272
cholesky_packed2 : 113.4598

When the data gets bigger, we can’t compete. Why not? Because while we are increasing the efficiency of the computational steps, we’re not increasing the efficiency of data movement. To solve that, we generally switch to "block algorithms" that rely on moving chunks (submatrices) of data around, instead of just single rows and/or columns. Here’s the fundamental issue laid out nicely and here’s how deep that rabbit hole goes.

If you want to implement a block-Cholesky, see:

  • http://www.netlib.org/utk/papers/factor/node9.html
  • http://www.cs.utexas.edu/users/flame/Notes/NotesOnCholReal.pdf
  • https://www.cs.utexas.edu/users/plapack/icpp98/node2.html

I may come back to it myself, but I’ve had my fun and there are other fish to fry.
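
That said, just to give a flavor of the blocked approach, here's a minimal, unoptimized sketch of a right-looking blocked Cholesky (it leans on np.linalg.cholesky for the diagonal blocks and np.linalg.solve for the panel solve; a real implementation would use a dedicated triangular solver and tuned block sizes):

def cholesky_blocked(A, block=64):
    # right-looking blocked Cholesky; returns lower-triangular L with A = L L^T
    A = np.array(A, dtype=np.float64, copy=True)
    n = A.shape[0]
    for k in range(0, n, block):
        end = min(k + block, n)
        # factor the diagonal block with an unblocked routine
        A[k:end, k:end] = np.linalg.cholesky(A[k:end, k:end])
        if end < n:
            L11 = A[k:end, k:end]
            # panel solve: L21 * L11^T = A21  =>  L21 = solve(L11, A21^T)^T
            A[end:, k:end] = np.linalg.solve(L11, A[end:, k:end].T).T
            L21 = A[end:, k:end]
            # trailing update: A22 <- A22 - L21 * L21^T (one big matrix multiply)
            A[end:, end:] -= np.dot(L21, L21.T)
    return np.tril(A)

On the small test matrix above, np.allclose(cholesky_blocked(A), np.linalg.cholesky(A)) should come back True; whether a pure-Python blocked version actually beats the row-by-row routines is exactly the tuning rabbit hole those references explore.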

Also, even though the packed routines save a lot of memory, the cost of indexing (and the lack of efficient memory transfer) outweighs the benefits. I was curious if the jit-ed calls could be inlined and I found this: https://github.com/numba/numba/issues/160, which seems to say the answer is "yes, it is done automatically, if LLVM's heuristics say to".

Additional Resources

You can grab a copy of this notebook.

Even better, you can view it using nbviewer.

License

Unless otherwise noted, the contents of this notebook are under the following license. The code in the notebook should be considered part of the text (i.e., licensed and treated as follows).

Creative Commons License
DrsFenner.org Blog And Notebooks by Mark and Barbara Fenner is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Permissions beyond the scope of this license may be available at drsfenner.org/blog/about-and-contacts.