I spent a bunch of time talking about using lower-level libraries (LAPACK directly and via LAPACKE or hand-rolled wrappers). My next set of posts is going to look at manually implementing these techniques. Mostly, I wanted to spend some time with the numerical algorithms; I last visited them about 15 years ago in graduate school. There are also several performance lessons along the way (demonstrating the limits of a "just compile it" approach to optimization). For starters, we'll implement the Cholesky decomposition, which is very easy to describe mathematically.
Spin Up
from math import sqrt
import numpy as np
import numba
import itertools as it
import functools as ft
import operator as op
import timeit
Some Testing Helpers
I really need to factor these testing helpers out into a testing_utils.py module for all these posts. The problem with that is the posts are no longer self-contained for reading and running. *sigh* Nothing's perfect. Turns out, there was enough testing-related code that I did factor it out into a standalone post. In turn, I placed the Python code from that post in chol_test_post.py and I'm going to import it here.
from chol_test_post import (time_routines, same_results, expand, compose,
                            make_random_vecs_with_cond, make_spd_opt)
Basic Cholesky
As I said, the mathematical description of Cholesky is pretty simple. For a symmetric, positive-definite matrix (so says Atkinson, Intro. to Numerical Analysis; I forget the corner case of positive-semidefinite), we can factor it: \(A \rightarrow LL^T\) where \(L\) is a lower-triangular matrix. Sometimes this is called a "matrix square root". Element-wise, the diagonal and off-diagonal entries of \(L\) are:
\[l_{kk} = \sqrt{a_{kk} - \sum_{j=1}^{k-1} l_{kj}^2}\] \[l_{ik} = \frac{1}{l_{kk}} \left( a_{ik} - \sum_{j=1}^{k-1} l_{ij} l_{kj} \right)\]
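As a quick sanity check of those formulas, take a \(2 \times 2\) example:
\[A = \begin{pmatrix} 4 & 2 \\ 2 & 3 \end{pmatrix}, \qquad l_{11} = \sqrt{4} = 2, \quad l_{21} = \frac{2}{l_{11}} = 1, \quad l_{22} = \sqrt{3 - l_{21}^2} = \sqrt{2},\]
and multiplying back out, \(L L^T = \begin{pmatrix} 2 & 0 \\ 1 & \sqrt{2} \end{pmatrix} \begin{pmatrix} 2 & 1 \\ 0 & \sqrt{2} \end{pmatrix} = A\).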
def cholesky0(A):
    """
    Performs a Cholesky decomposition of a symmetric, pos-def A.
    Returns lower-triangular L (full sized, zeroed above diag).
    """
    n = A.shape[0]
    L = np.zeros_like(A)
    # Perform the Cholesky decomposition
    for row in xrange(n):
        for col in xrange(row+1):
            tmp_sum = np.dot(L[row,:col], L[col,:col])
            if (row == col):  # Diagonal elements
                L[row, col] = sqrt(max(A[row,row] - tmp_sum, 0))
            else:
                L[row,col] = (1.0 / L[col,col]) * (A[row,col] - tmp_sum)
    return L
We're doing as direct an implementation as possible. But that also allows us to call on ContinuumIO's great numba project to JIT-compile (just-in-time) our code. It helps (in some ways), as we'll see below. One note: I'm not a jit-jedi. More specifically, I've only really toyed around with numba, so nothing I say here is necessarily current (numba moves pretty fast in terms of features) or even best-practice.
@numba.jit(numba.double[:,:](numba.double[:,:]))
def cholesky1(A):
    """
    Performs a Cholesky decomposition of a symmetric, pos-def A.
    Returns lower-triangular L (full sized, zeroed above diag).
    """
    n = A.shape[0]
    L = np.zeros_like(A)
    # Perform the Cholesky decomposition
    for row in xrange(n):
        for col in xrange(row+1):
            tmp_sum = np.dot(L[row,:col], L[col,:col])
            if (row == col):  # Diagonal elements
                L[row, col] = sqrt(max(A[row,row] - tmp_sum, 0))
            else:
                L[row,col] = (1.0 / L[col,col]) * (A[row,col] - tmp_sum)
    return L
I'm not a numba wizard, but it seemed that with my version, I had to eliminate np.dot to enable numba's nopython mode (full-speed-ahead numba).
@numba.jit(numba.double[:,:](numba.double[:,:]), nopython=True)
def cholesky2(A):
    """
    Performs a Cholesky decomposition of a symmetric, pos-def A.
    Returns lower-triangular L (full sized, zeroed above diag).
    """
    n = A.shape[0]
    L = np.zeros_like(A)
    # Perform the Cholesky decomposition
    for row in xrange(n):
        for col in xrange(row+1):
            # eliminating np.dot allows us to go numba nopython
            # but, see the release notes for numba 0.23.0
            # reference: https://github.com/numba/numba/pull/1560
            tmp_sum = 0.0
            for j in xrange(col):
                tmp_sum += L[row,j] * L[col,j]
            if (row == col):
                # diag elts.
                L[row,col] = sqrt(A[row,row] - tmp_sum)
            else:
                # off-diag elts.
                L[row,col] = (1.0 / L[col,col] * (A[row,col] - tmp_sum))
    return L
One of the key rules of numerical linear algebra is that you gain performance when you don't do work. Ha! Actually, that's a general rule of life. But here, it means that (1) when we know the structure of a matrix, (2) we take advantage of that structure to do less work. Practically speaking, instead of working with a zeroed starting matrix, we'll start with raw (arbitrary) memory and only use values that we know have been set properly. We'll zero out the super-diagonal pieces separately.
@numba.jit(numba.double[:,:](numba.double[:,:]), nopython=True)
def cholesky3(A):
    """
    Performs a Cholesky decomposition of a symmetric, pos-def A.
    Returns lower-triangular L (full sized, zeroed above diag).
    """
    n = A.shape[0]
    L = np.empty_like(A)
    # Perform the Cholesky decomposition
    for row in xrange(n):
        for col in xrange(row+1):
            tmp_sum = 0.0
            for j in xrange(col):
                tmp_sum += L[row,j] * L[col,j]
            if (row == col):
                # diag elts.
                L[row,col] = sqrt(A[row,row] - tmp_sum)
            else:
                # off-diag elts.
                L[row,col] = (1.0 / L[col,col] * (A[row,col] - tmp_sum))
        L[row, row+1:] = 0.0
    return L
Packed Storage Calculations
Turns out, we really were only scratching the surface of respecting the structure of the matrix. We were still using an \(n^2\) piece of memory for a triangular matrix when a \(\frac{n(n+1)}{2}\) piece of memory will do. Sure, it's still \(O(n^2)\), but savings is savings, and saving 50% is a good thing. We'll use a long, flat array to store each row of the triangular matrix. To help us out, we'll define some helpers to convert from row-and-column "full" indices to a "straightline" packed index.
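Before the helpers, it's worth seeing where the packed index comes from: for a row-packed lower triangle, rows \(0, 1, \ldots, r-1\) hold \(1, 2, \ldots, r\) entries respectively, so element \((r, c)\) lands at
\[\text{idx}(r, c) = c + \sum_{i=1}^{r} i = c + \frac{r(r+1)}{2},\]
which is exactly what row_packed_lt below computes (via a multiply by 0.5; the declared numba return type casts the result back to an unsigned integer).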
# if we were column packing
#@numba.jit(numba.uint(numba.uint, numba.uint, numba.uint), nopython=True)
#def col_packed_lt(r,c,n):
#    return r + (c * (-c + 2*n + 1) * .5)

# this would be row-packed upper triangular
#@numba.jit(numba.uint(numba.uint, numba.uint, numba.uint), nopython=True)
#def row_packed(r,c,n):
#    return col_packed_lt(c,r,n)

@numba.jit(numba.uint(numba.uint, numba.uint, numba.uint), nopython=True)
def row_packed_lt(r,c,n):
    return c + (r * (r+1)) * .5
And, for testing purposes, we may want to expand our packed form back to a full, normal form.
def expand_row_packed(L, n, cols=None):
    E = np.empty((n, cols)) if cols else np.empty((n, n))
    E[np.tril_indices(n, m=cols)] = L
    E[np.triu_indices(n, 1, m=cols)] = 0.0
    return E
Here's a conversion of [0, 1, 2, 3, 4, 5] interpreted as a 3×3 lower-triangular matrix:
L = np.arange(6)
expand_row_packed(L, 3)
The elements in column 0 of a lower-triangular matrix stored in row-packed form will be at:
N = 3
for i in range(N):
    print row_packed_lt(i, 0, N)
Packed Cholesky Routines
@numba.jit(numba.double[:](numba.double[:,:]), nopython=True)
def cholesky_packed1(A):
    """
    Performs a Cholesky decomposition of a symmetric, pos-def A.
    Returns lower-triangular L packed (row-wise) in a vector.
    """
    n = A.shape[0]
    numElts = n * (n+1) / 2
    L = np.empty(numElts)
    for row in xrange(n):
        for col in xrange(row+1):
            if (row == col):
                # diag elts.
                tmp_sum = 0.0
                base = row_packed_lt(col, 0, n)
                for j in xrange(col):
                    tmp_sum += L[base + j] ** 2
                L[row_packed_lt(row,col,n)] = sqrt(A[row,row] - tmp_sum)
            else:
                # off-diag elts.
                tmp_sum = 0.0
                for j in xrange(col):
                    tmp_sum += L[row_packed_lt(row, 0, n) + j] * L[row_packed_lt(col, 0, n) + j]
                L[row_packed_lt(row,col,n)] = ((1.0 / L[row_packed_lt(col,col,n)]) *
                                               (A[row][col] - tmp_sum))
    return L
And now, let's factor out some common computations:
@numba.jit(numba.double[:](numba.double[:,:]), nopython=True)
def cholesky_packed2(A):
    """
    Performs a Cholesky decomposition of a symmetric, pos-def A.
    Returns lower-triangular L packed (row-wise) in a vector.
    """
    n = A.shape[0]
    numElts = n * (n+1) / 2
    L = np.empty(numElts)
    for row in xrange(n):
        for col in xrange(row+1):
            if (row == col):  # Diagonal elements
                tmp_sum = 0.0
                base = row_packed_lt(col, 0, n)
                for j in xrange(col):
                    tmp_sum += L[base + j] ** 2
                L[row_packed_lt(row,col,n)] = sqrt(A[row,row] - tmp_sum)
            else:
                tmp_sum = 0.0
                base1 = row_packed_lt(row, 0, n)
                base2 = row_packed_lt(col, 0, n)
                for j in xrange(col):
                    tmp_sum += L[base1 + j] * L[base2 + j]
                # += base1 and base2 here *crushed* the timing
                L[base1+col] = ((1.0 / L[base2+col]) * (A[row][col] - tmp_sum))
    return L
@numba.jit(numba.double[:](numba.double[:,:]), nopython=True)
def cholesky_packed3(A):
    """
    Performs a Cholesky decomposition of a symmetric, pos-def A.
    Returns lower-triangular L packed (row-wise) in a vector.
    """
    n = A.shape[0]
    numElts = n * (n+1) / 2
    L = np.empty(numElts)
    for row in xrange(n):
        for col in xrange(row+1):
            if (row == col):  # Diagonal elements
                base = row_packed_lt(col, 0, n)
                tmp_sum = np.sum(L[base:base+col]**2)
                L[row_packed_lt(row,col,n)] = sqrt(A[row,row] - tmp_sum)
            else:
                base1 = row_packed_lt(row, 0, n)
                base2 = row_packed_lt(col, 0, n)
                b1p = base1 + col
                b2p = base2 + col
                tmp_sum = np.sum(L[base1:b1p] * L[base2:b2p])
                L[b1p] = ((1.0 / L[b2p]) * (A[row][col] - tmp_sum))
    return L
Testing (Correctness)
A = np.array([[6, 3, 4, 8],
              [3, 6, 5, 1],
              [4, 5, 10, 7],
              [8, 1, 7, 25]], dtype=np.float64)
print "A:"
print A
print "L (np.linalg):"
print np.linalg.cholesky(A)
routines = [np.linalg.cholesky, cholesky0, cholesky1, cholesky2, cholesky3]
same_results(routines, (A,))
expander = ft.partial(expand_row_packed, n=A.shape[0])
routines = [np.linalg.cholesky] + [compose(expander, r) for r in (cholesky_packed1,
                                                                  cholesky_packed2,
                                                                  cholesky_packed3)]
same_results(routines, (A,))
Testing (Performance)
A = np.array([[6, 3, 4, 8],
              [3, 6, 5, 1],
              [4, 5, 10, 7],
              [8, 1, 7, 25]], dtype=np.float64)
print "A:"
print A
time_routines([np.linalg.cholesky], (A,), number=10000)
time_routines([cholesky0, cholesky1, cholesky2, cholesky3], (A,), number=10000)
time_routines([cholesky_packed1, cholesky_packed2, cholesky_packed3], (A,), number=10000)
_all = [np.linalg.cholesky,
        cholesky0, cholesky1, cholesky2, cholesky3,
        cholesky_packed1, cholesky_packed2, cholesky_packed3]
time_routines(_all, (A,), number=1000)
Bigger Fish
What happens when we scale up to bigger input matrices? Well, since we need SPD (symmetric, positive-definite) matrices, we have to construct them. Doing that in a naive way has a major issue: we can get a horribly conditioned matrix (and Cholesky is sensitive to the square of the condition number). We can improve that by adding (relatively) large elements on the diagonal to push the matrix towards being "diagonally dominant".
n = np.random.random_integers(1,10,(5,1))
outer_n = np.outer(n,n)
print outer_n
print np.real_if_close(np.linalg.eig(outer_n)[0]) # remove small imaginary parts
print np.linalg.cond(outer_n) # can go to float("inf")
# condition normalizer; not-so-regular Tikhonov regularization
# pushes towards diagonally dominant
outer_n = np.outer(n,n) + 5 * np.eye(5)
print outer_n
print np.real_if_close(np.linalg.eig(outer_n)[0])
print np.linalg.cond(outer_n)
But, we're going to make use of the make_spd_opt routine we defined in our testing helpers. Here's what it looks like (3 is the length of the diagonal, 10 is the condition number of the result).
A = make_spd_opt(3, 10)
print A
print np.linalg.eig(A)[0]
A = make_spd_opt(2**10, 10)
_all = [np.linalg.cholesky,
        cholesky0, cholesky1, cholesky2, cholesky3,
        cholesky_packed1, cholesky_packed2, cholesky_packed3]
time_routines(_all, (A,), number=1)
A = make_spd_opt(2**12, 10)
drop_pokey = [np.linalg.cholesky,
              cholesky2, cholesky3]
time_routines(drop_pokey, (A,), number=1)
time_routines([cholesky_packed2], [A], number=1)
A = make_spd_opt(2**13, 10)
showdown = [np.linalg.cholesky, cholesky2, cholesky_packed2]
time_routines(showdown, [A], number=1)
When the data gets bigger, we can't compete. Why not? Because while we are increasing the efficiency of the computational steps, we're not increasing the efficiency of data movement. To solve that, we generally switch to "block algorithms" that move chunks (submatrices) of data around, instead of just single rows and/or columns. Here's the fundamental issue laid out nicely and here's how deep that rabbit hole goes.
If you want to implement a block-Cholesky, see the references below (and the rough sketch that follows them):
- http://www.netlib.org/utk/papers/factor/node9.html
- http://www.cs.utexas.edu/users/flame/Notes/NotesOnCholReal.pdf
- https://www.cs.utexas.edu/users/plapack/icpp98/node2.html
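For a taste of the structure those references describe, here's a minimal sketch of a right-looking blocked Cholesky in numpy/scipy. It is just an illustration, not the algorithm from any one of those papers: the block size of 64 is an arbitrary placeholder, and I'm assuming scipy.linalg.solve_triangular is available for the panel solve.

import numpy as np
from scipy.linalg import solve_triangular

def cholesky_blocked(A, block=64):
    """
    Sketch of a right-looking blocked Cholesky; returns lower-triangular L.
    Works on a copy of A; `block` is an untuned placeholder size.
    """
    A = np.array(A, dtype=np.float64, copy=True)
    n = A.shape[0]
    for k in range(0, n, block):
        e = min(k + block, n)
        # factor the (already-updated) diagonal block with an unblocked routine
        A[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])
        if e < n:
            # panel solve: L21 = A21 * L11^{-T}, via L11 * L21^T = A21^T
            A[e:, k:e] = solve_triangular(A[k:e, k:e], A[e:, k:e].T, lower=True).T
            # trailing update: A22 -= L21 * L21^T  (the BLAS-3 heavy lifting)
            A[e:, e:] -= np.dot(A[e:, k:e], A[e:, k:e].T)
    return np.tril(A)

Because the heavy work lands in matrix-matrix products over whole submatrices, the data movement looks much more like what a tuned LAPACK routine does under the hood.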
I may come back to it myself, but I’ve had my fun and there are other fish to fry.
Also, even though the packed routines save a lot of work, the cost of indexing (and the lack of efficient memory transfer) outweighs the benefits. I was curious whether the jit-ed calls could be inlined, and I found this: https://github.com/numba/numba/issues/160, which seems to say the answer is "yes, it is done automatically, if LLVM heuristics say to".
Additional Resources
You can grab a copy of this notebook.
Even better, you can view it using nbviewer.
License
Unless otherwise noted, the contents of this notebook are under the following license. The code in the notebook should be considered part of the text (i.e., licensed and treated as follows).
DrsFenner.org Blog And Notebooks by Mark and Barbara Fenner is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Permissions beyond the scope of this license may be available at drsfenner.org/blog/about-and-contacts.