In working through the Cholesky, QR, and SVD methods for solving the normal regression equations, I got curious about the underlying implementations of each. I did QR and Cholesky in grad school (a great Numerical Analysis sequence). I never got around to SVD (I think we did some variation of tridiagonalization). Anyway, Cholesky is so simple that I think testing Cholesky generated more interesting points for me. There were two big issues and the "testing" post is (I think) longer than the "implementation" post.
Executing the tests brought up an interesting software engineering (code reuse) issue, and creating good test inputs brought up an interesting mathematical issue. Read on!
Some Testing Helpers
import numpy as np
import itertools as it
import functools as ft
import timeit
# from python itertools docs:
def pairwise(iterable):
"s -> (s0,s1), (s1,s2), (s2, s3), ..."
a, b = it.tee(iterable)
next(b, None)
return it.izip(a, b)
def time_routines(routines, inputs, number=10):
    name_len = max(len(r.__name__) for r in routines)
    line_template = "%" + str(name_len) + "s : %6.4f"
    times = {}
    for r in routines:
        call = ft.partial(r, *inputs)
        times[r.__name__] = min(timeit.Timer(call).repeat(repeat=3, number=number))
    print "\n".join(line_template % (k, times[k]) for k in sorted(times, key=times.get))
def same_results(routines, inputs):
    try:
        results = {r.__name__:r(*inputs) for r in routines}
        return all(np.allclose(r1,r2) for r1,r2 in pairwise(results.values()))
    except: # quick-and-dirty
        return False
So, with these helper routines, we can take several different implementations and compare them in two ways: (1) do they all give the same answer (same_results) and (2) what are their run times (time_routines)? Note that if the inputs are mutated in place, we will be solving different problems in both of these; we'll address that in a second. same_results plays a little fast and loose because, in general, floating-point closeness is not transitive, but closeness treats us far better than floating-point equality. However, we'll be content enough to check that if close(a,b) and close(b,c), then we're "all close enough".
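As a quick illustration of the two helpers (my own toy example, not part of the Cholesky comparison), here is how they might be exercised on a pair of routines that should agree:

# toy sketch: two routines that should give the same answer
def square_dot(A):
    return A.dot(A)

def square_matmul(A):
    return np.dot(A, A)

A = np.arange(9.0).reshape(3,3)
print same_results([square_dot, square_matmul], (A,))    # True
time_routines([square_dot, square_matmul], (A,), number=100)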
The Reuse Issue
Now, two problems can come up when we use this testing code. The first is the in-place mutation mentioned above; since our implementations are not mutating, we don't have to worry about that. However, we do have a similar issue: the routines return two different data structures. Our "naive" method returns a full NumPy array; we also have "packed" versions that return a flat array holding a packed version of the triangular result. So, to check for equality of results we could either (1) define a custom comparison routine or (2) provide a converter that massages the packed versions into a full version. Here's code that shows the issue:
def makeFullLowerTri(n):
    X = np.empty((n,n))
    X[np.triu_indices(n)] = 0.0
    X[np.tril_indices(n)] = np.arange((n*(n+1))/2.0)
    return X
print makeFullLowerTri(3)
def makePackedLowerTri(n):
    return np.arange((n*(n+1))/2.0)
print makePackedLowerTri(3)
print same_results([makeFullLowerTri, makePackedLowerTri], (3,))
BTW, you might notice the 2.0 in the np.arange expression. If arange takes a float as its input, it produces an array of np.float64 (more technically, np.float_, the "machine floating point size"). If the input is an int, the dtype is np.int_. Anyway, it's just a trick to get the dtype squared away nicely.
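A two-line check (my addition) shows the dtype difference:

# the result dtype follows the argument type (int64/float64 on typical platforms)
print np.arange(3).dtype     # int64, i.e. np.int_
print np.arange(3.0).dtype   # float64, i.e. np.float_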
So, here’s a quick "expander" which turns packed triangular into full.
def expand(packedTri, n):
    X = np.empty((n,n))
    X[np.triu_indices(n)] = 0.0
    X[np.tril_indices(n)] = packedTri
    return X
print np.allclose(expand(makePackedLowerTri(3), 3), makeFullLowerTri(3))
But our same_results routine takes a single function to apply (not nested applications). That process of nesting applications of functions has a name: function composition. And you might remember it from studying such wonders as the chain rule for derivatives. But unless you are into functional programming and languages, you might not have played around with it much from a software engineering perspective. What we need is (1) a single function that is the result of (2) applying expand to the output of makePackedLowerTri. Fortunately, Python makes it relatively easy to create such a critter.
def helper(main, post):
    ''' help out by creating/returning a process that calls *main* to do the "work"
        and then *post* to do the "post-processing" '''
    def inner(A):
        ''' compute the composition of two functions on A '''
        return post(main(A))
    return inner
Now, the next issue is that we don't want to modify main, which will be our makePackedLowerTri. In particular, it returns one array and we want to keep it that way. That means that post (our expand) is going to be missing an argument: the n value. If only there were a way to "partially" apply a function to some arguments and leave the rest "to be filled in later". But wait! There is:
# above, we did: import functools as ft
expand3 = ft.partial(expand, n=3)
expand3 is a function of one argument that has its n argument fixed at 3. It is as if we had typed (without the typing and with many more reuse possibilities):
def expand3(packedTri):
    X = np.empty((3,3))
    X[np.triu_indices(3)] = 0.0
    X[np.tril_indices(3)] = packedTri
    return X
A couple of quick notes. (1) We'll see that we don't have to lock down the value of n; we could compute it inside of expand (one possible sketch follows this paragraph) or we could hack on helper or makePackedLowerTri. (2) Adding compositions is a problem if we want to directly compare the work (memory, time, etc.) of the base functions. We'll ignore that for now because we are comparing correctness.
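Here is that sketch (mine, not the post's code): an expand variant that infers n from the packed length, using len = n(n+1)/2:

# hypothetical variant: recover n from the packed length via n = (-1 + sqrt(1 + 8*len)) / 2
def expand_auto(packedTri):
    n = int(round((-1 + np.sqrt(1 + 8*packedTri.size)) / 2))
    X = np.empty((n,n))
    X[np.triu_indices(n)] = 0.0
    X[np.tril_indices(n)] = packedTri
    return X

print np.allclose(expand_auto(makePackedLowerTri(4)), makeFullLowerTri(4))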
Here's how the pieces work together. makeFullLowerTri is a "direct" function: we give it an arg and get our usable result. For the packed version, we do a bit more work. We call helper and it makes a function for us. The function it creates is the one we want to call. So, we have two calls: (1) we call helper and (2) we call the value returned from helper. The second call gives us a comparable result. Here they are:
# simple case
print makeFullLowerTri(3)
# complex case
r = helper(makePackedLowerTri, expand3)
print r(3)
And here's how we use it in same_results:
routines = [makeFullLowerTri, helper(makePackedLowerTri, expand3)]
print same_results(routines, (3,))
Generalizing
The process of composing functions, including composing more than two functions, seems like a nicely repeatable process. Can we make the above more general than just helper? The answer is "yes!". There used to be a third-party Python module called functional that had a compose routine. However, I can't find it anymore (it is referenced in the Python 3.1 Functional Programming HOWTO document). But a little hacking and Googling found some resources and resulted in the following code:
- A single arg solution
- A *arg and **kwarg solution
- A nice Python+functional writeup (although it doesn’t talk about composition explicitly)
def compose(*functions):
    ''' compose functions left-to-right being outer-to-inner
        NOTE: algebraists rejoice that I specified that to prevent headaches '''
    def identity(x):
        ''' just bulletproofing b/c we won't have an empty compose() call
            NOTE: it works like identity(composedFuncs())
            not the equiv. composedFuncs(identity())'''
        return x
    def applyTwo(remainder, current):
        ''' remainder1(remainder2(remainder3(current(finished))))
            finished is done
            remainder represents the composition of remainder_1 ... remainder_n'''
        def helper(*finished):
            return remainder(current(*finished))
        # helper.__name__ = "helper(%s)" % current.__name__
        return helper
    return ft.reduce(applyTwo, functions, identity)
def foo(x):
    return "outer(" + x + ")"

def bar(x):
    return "inner(" + x + ")"
print foo(bar("mark"))
print compose(foo, bar)("mark")
Basically, we call applyTwo many times (once for each function we are composing) and we get back a helper (uncomment the helper name assignment and print it out to see how it progresses) that interlocks the inner result with the outer composition.
I'll give bonus points to anyone who can show a way to pass keywordable arguments back from the inner function calls without requiring modifications to the return statements. It's "easy" to either return "the usual" or a tuple that will be detupled by the next layer out. But sending something that can be **kwarg-ed seems a bit messier.
The Testcase Issue
To test Cholesky, we need "symmetric, positive-definite" (SPD) matrices. Since Cholesky is a version of a "matrix square root", you can think about this as "we can only take the square root of a positive number" (ignoring \(i\)/complex values). I was inspired by a Stack Overflow thread to use the following method. In particular, Arnold Neumaier had a nice answer that I wanted to dig into (prove to myself that it worked and see the mathematics behind it). Here is what I came up with.
I’ll start with a few facts that you can gather from a linear algebra or numerical methods textbook (and/or the Web).
- The eigenvalues of a symmetric matrix are positive if, and only if, the matrix is SPD.
- The eigenvalues of a diagonal matrix are the values of the diagonal.
- The matrices \(A\) and \(B\) are called similar matrices if we can write \(A=P^{-1}BP\) for some invertible \(P\).
- Similar matrices have the same characteristic polynomial, and hence, the same eigenvalues.
- Orthogonal matrices have \(A^T=A^{-1}\)
- There is a special type of orthogonal matrix called a Householder matrix. It has the form \(H = I - 2vv^T\) for \(v\) a unit-length \(n\)-element column vector (\(H\) is \(n \times n\)); for a general nonzero \(v\), the \(2\) is replaced by \(\beta = 2/(v^Tv)\).
- Since Householder matrices are made up of \(I\) and \(vv^T\), we can see quickly that they must be symmetric \(H=H^T\).
One corollary: since \(H=H^{T}=H^{-1}\), writing \(A=HBH=H^{T}BH=H^{-1}BH\) shows that applying a Householder matrix on the left and right of \(B\) creates an \(A\) that is similar to \(B\). That's exactly what we will do to a diagonal matrix with positive entries on the diagonal. Thus, we start with a (simple) SPD matrix and create a similar matrix that is more interesting. Incidentally, one other fact:
- The condition number of a matrix \(A\) is equal to the (absolute value of the) ratio of the biggest to the smallest eigenvalues of \(A\). That is, \(cond(A)=\left| \frac{max(eigval(A))}{min(eigval(A))} \right|\).
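A quick numeric sanity check of that last fact (my addition), using a diagonal matrix where the eigenvalues are easy to read off:

# the (2-norm) condition number of a symmetric matrix is |lambda_max| / |lambda_min|
D = np.diag([1.0, 4.0, 10.0])
evals = np.linalg.eigvals(D)
print np.linalg.cond(D)                      # 10.0
print abs(evals).max() / abs(evals).min()    # 10.0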
Creating an SPD Matrix from Random Vectors
So, if we set the eigenvalues appropriately, we can control the condition number to make a particularly well-behaved or ill-behaved matrix. We can create an SPD test matrix \(T_{spd}\), with condition number \(C\), with the following steps:
- \(d=[1\ uniform(1,C)\ C]\), with the middle \(n-2\) elements sampled from the uniform distribution
- \(D_+ = diag(d)\), an \(n \times n\) matrix with diagonal \(d\)
- \(w=[uniform(0,1)]\) with \(n\) elements
- \(H=I-{\beta}ww^{T}\) with \(\beta = 2/(w^Tw)\) (so that \(H\) is orthogonal)
- \(T_{spd}=HD_{+}H\)
Note, I didn’t title this "Generating a Random SPD Matrix" (if I say it elsewhere, I deserve a slap on the wrist). I have no clue what the sampling properties of the matrices we are making are.
Applying Householder Matrices
A quick reminder about some matrix transpose rules: \((A+B)^T=A^T+B^T\) and \((AB)^T=B^TA^T\). Also, if \(D\) is diagonal (and square), then \(D=D^T\).
There are two interesting facts about Householder matrices. From above (using \(\beta\) for a general, non-unit \(v\)), \(H = I - {\beta}vv^T\). (1) Householder matrices are completely defined by the underlying vector used to create them (i.e., the \(v\) vector). (2) To multiply (or "apply") a Householder matrix \(H\) to \(A\) on the left or right, \(HA\) or \(AH\), we don't need to form the full \(n \times n\) matrix \(H\). We can just use the \(v\) vector. Golub and Van Loan (3rd edition, page 211) point out that (and you can derive in about two lines; a numeric spot-check follows the list):
- \(HA=(I-{\beta}vv^T)A=A-vw^T_L\) with \(w_L={\beta}A^Tv\), and,
- \(AH=A(I-{\beta}vv^T)=A-w_Rv^T\) with \(w_R={\beta}Av\)
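Here is that numeric spot-check (my own sketch, with random data and \(\beta=2/(v^Tv)\)):

# verify HA == A - outer(v, w_L) and AH == A - outer(w_R, v)
n = 4
v = np.random.uniform(0.0, 1.0, n)
A = np.random.uniform(0.0, 1.0, (n,n))
beta = 2.0 / v.dot(v)
H = np.eye(n) - beta*np.outer(v,v)
w_L = beta * A.T.dot(v)
w_R = beta * A.dot(v)
print np.allclose(H.dot(A), A - np.outer(v, w_L))    # True
print np.allclose(A.dot(H), A - np.outer(w_R, v))    # True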
Further, we’re going to be working with a diagonal matrix filling in for \(A\). This means that for our \(HDH\), we have:
\[HDH = (HD)H = \color{red}{(D-vw^T_L)}H = \color{red}{(D-vw^T_L)} - w_Rv^T = {\star}_1\]
In this case, \(w^T_L=({\beta}D^Tv)^T=({\beta}Dv)^T={\beta}v^TD\) and \(w_R={\beta}\color{red}{(D-vw^T_L)}v\). The red is meant to highlight the fact that the first multiplication/expansion plays an important role in the second multiplication/expansion.
Substituting in:
\[{\star}_1 = (D-{\beta}vv^TD) - {\beta}(D-vw^T_L)vv^T = {\star}_2\]
The second term is a bit tricky:
\[\begin{aligned}
{\beta}(D-vw^T_L)vv^T &= {\beta}(D-v({\beta}v^TD))vv^T \\
&= {\beta}Dvv^T-{\beta}v{\beta}v^TDvv^T \\
&= {\beta}Dvv^T-{\beta}^2vv^TDvv^T
\end{aligned}\]
So, we have: \[{\star}_2 = (D-{\beta}vv^TD) - {\beta}Dvv^T \color{red}{+} {\beta}^2vv^TDvv^T = D - {\beta}vv^TD - {\beta}Dvv^T \color{red}{+} {\beta}^2vv^TDvv^T\]
I highlighted the two pluses because, in a first draft, I missed them and had some flipped signs running around the end of my equations. When we expanded the second term, it had a minus in front of it. Can’t lose the negatives!
A Nicer Form?
At this point, you might be stuck. You might try various factorizations, completing the square, consulting algebra books for quadratic forms, etc. Not that I'd ever do that. However, you might wonder if the simple \(HA=A-vw^T\) form can be carried over to two applications of \(H\) (one left and one right). That is, can we say something like \(HAH = A - dot(\cdot,\cdot) - dot(\cdot,\cdot)\), for some appropriate arguments to \(dot\)? Looking at \({\star}_2\), we might have the brilliant idea that the \(v\)s should stay the same. If it ain't broke, don't fix it. Then, let's rearrange the scalars (they are pushovers). We may get to this form:
\[D - v\boxed{{\beta}v^TD} - \boxed{{\beta}Dv}v^T + {\beta}^2vv^TDvv^T\]
Things are looking hopeful. But! What do we do with the last term? (The first term, \(D\), stands alone, as in the single-multiplication rules from G&VL.) It has a \(v\) on the left and a \(v^T\) on the right. We want one or the other, not both! What do we do? We split the difference. Or, more to the point, we split it into two equal pieces (halve it) and use both! Also, \(v^TDv\) is a scalar (check the dimensions if you don't believe me); I'll rename it \(\alpha\).
\[
D - v\color{green}{\boxed{{\beta}v^TD}} - \color{red}{\boxed{{\beta}Dv}}v^T
+ \color{red}{\boxed{\frac{1}{2}{\beta}^2{\alpha}v}}v^T
+ v\color{green}{\boxed{\frac{1}{2}{\beta}^2{\alpha}v^T}}
\]
Now, look at the \(v\color{green}{\square}\) terms. If we factor out \(-v\) and call what's left \(w_\star\), we have \(w_{\star} = \color{green}{{\beta}v^TD - \frac{1}{2}{\beta}^2{\alpha}v^T}\).
Can we come up with something for \(\color{red}{\square}v^T\)? Let's see what \(w^T_\star\) is: \(w^T_{\star} = ({\beta}v^TD)^T - (\frac{1}{2}{\beta}^2{\alpha}v^T)^T = \color{red}{{\beta}Dv - \frac{1}{2}{\beta}^2{\alpha}v}\). Does it look familiar? (The color might help!) Fortunately, the answer is yes! And I'll take serendipity in my mathematics any day. Pulling \(v^T\) out to the right of those two terms, along with a leading minus sign, leaves exactly \(w^T_{\star}\). So, with all that, we have:
\[HDH = D – vw_{\star} – w^T_{\star}v^T\]
Lastly (really!), \(v\) is a column-vector, so by inference \(w_\star\) must be a row-vector. You can also see this because \(w_\star\) is made from \(v^T\). Good. Now, in NumPy, if we multiply two 1-D arrays, we get the element-wise result between the two. We want the np.outer product between the two vectors (np.dot would be great if we had multi-dimensional critters). We also don't have to worry about the transposes in the last term when we use np.outer (the 1-D arrays are effectively "unoriented" vectors).
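Before wiring it into the generator below, here is a direct numeric check of the derived identity (my addition):

# sketch: HDH == D - outer(v, w_star) - outer(w_star, v)
n = 5
d = np.random.uniform(1.0, 10.0, n)
v = np.random.uniform(0.0, 1.0, n)
beta = 2.0 / v.dot(v)
alpha = v.dot(d*v)                 # the scalar v^T D v
w_star = beta*(d*v) - .5*beta**2*alpha*v
H = np.eye(n) - beta*np.outer(v,v)
D = np.diag(d)
print np.allclose(H.dot(D).dot(H), D - np.outer(v, w_star) - np.outer(w_star, v))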
Some Code
Here we simply have basic code to either give us a fixed pair of vectors (for testing the tester, sort of like guarding the guardians) or to give us two random vectors with some conditioning constraints (for the generated matrix).
def make_random_vecs_with_cond(n, C, repeatable=True):
    if repeatable:
        d = np.linspace(1.0, C, n)
        v = np.arange(1.0, n+1) / (n+1)
    else:
        d = np.empty(n)
        d[0] = 1.0; d[n-1] = C
        d[1:n-1] = np.random.uniform(1.0, C, n-2)
        v = np.random.uniform(1.0, 100.0, n)
    return d,v
Dead Simple Implementation
Here is the super-simple-visually-verify implementation. We actually ignore all the hard work above and explicitly construct the Householder matrix … so that we can verify that our "clever" method produces the same numerical result. Always test yourself. Incidentally, to test the "gold standard", we'll compute the eigenvalues of the result and make sure they are the ones we initially used above (the np.linspace call will produce a diagonal of [1, 5.5, 10] for the call below).
def make_spd_explicit(n, C):
    d,v = make_random_vecs_with_cond(n, C)
    v = v[:,np.newaxis] # explicitly convert to column vector
    nrm = (2.0 / (v.T.dot(v))).reshape(1)
    # explicitly create house for testing purposes
    house = np.eye(n) - nrm*v.dot(v.T)
    D = np.diag(d)
    direct = house.dot(D).dot(house)
    return direct
expected = make_spd_explicit(3, 10)
print expected
print np.linalg.eig(expected)[0]
Better Method Using Our Hard Work
def make_spd(n,C):
    d,v = make_random_vecs_with_cond(n,C)
    # each of these should be O(n) steps
    beta = (2.0 / (v.T.dot(v)))
    dv = d*v # elem-wise version of Dv and v^TD
    alpha = v.dot(dv)
    w_star = (beta * dv) - (.5 * beta**2 * alpha * v)
    # not TOO clever; if we really want to avoid extra memory allocations,
    # we could use a call to BLAS dger (or dsyr2) to update the
    # outer products in place (i.e., A -= outer(v,w))
    clever = np.diag(d) - np.outer(v,w_star) - np.outer(w_star,v)
    return clever
actual = make_spd(3,10)
print np.allclose(expected, actual)
print np.linalg.eig(actual)[0]
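We can also point the helpers from the top of the post at the two generators; with repeatable=True they build the same matrix, so same_results applies (this snippet is my addition):

# compare the explicit and "clever" constructions for agreement and speed
gens = [make_spd_explicit, make_spd]
print same_results(gens, (200, 10))
time_routines(gens, (200, 10), number=5)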
And if you stuck with me through all of that, go have a beer, take a walk, or something relaxing!
Addendum: Even More Clever-er-er
import scipy.linalg.blas as slab
w = np.random.uniform(0.0, 1.0, 3)
v = np.random.uniform(0.0, 1.0, 3)
W = np.arange(9.0).reshape(3,3)
print W - np.outer(v,w) - np.outer(w,v)
# FAILS to update in place:
# see https://github.com/scipy/scipy/issues/5738
f = slab.get_blas_funcs('syr2', (W,v,w))
print f(-1,w,v,a=W,overwrite_a=True)
print W
So, back to a "custom" solution:
!cat myheader.h
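The output of that cat isn't reproduced here. As an assumption on my part, the header presumably declares something equivalent to the standard CBLAS prototype for dsyr2 (with the order/uplo enums passed as plain ints, matching the 101 and 122 used below); a stand-in version would look like:

# hypothetical stand-in for the contents of myheader.h (not the author's actual file)
MYHEADER_GUESS = """
void cblas_dsyr2(int Order, int Uplo, int N, double alpha,
                 const double *X, int incX,
                 const double *Y, int incY,
                 double *A, int lda);
"""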
from cffi import FFI
ffi = FFI()
with open("./myheader.h") as header_file:
ffi.cdef(header_file.read())
libblas = ffi.dlopen("/usr/lib64/libopenblas_openmp.so")
def get_dptr(arr):
    return ffi.cast("double *", arr.__array_interface__['data'][0])

def my_syr2(A,alpha,x,y):
    # A <- A + alpha x y.T + alpha y x.T (A is updated in place)
    # this is a special case that assumes A is symmetric (here, diagonal),
    # b/c the updated lower triangle is simply mirrored into the upper triangle
    N = A.shape[0]
    assert N == A.shape[1] and N == x.size and N == y.size
    libblas.cblas_dsyr2(101, 122, # RowMajor, LowerTriangular
                        N, alpha, get_dptr(x), 1, get_dptr(y), 1,
                        get_dptr(A), N)
    # propagate lower triangle to upper triangle
    A[np.triu_indices(N,1)] = A.T[np.triu_indices(N,1)]
w = np.random.uniform(0.0, 1.0, 3)
v = np.random.uniform(0.0, 1.0, 3)
W = np.arange(9.0).reshape(3,3)
expected = W - np.outer(v,w) - np.outer(w,v)
print "expected:\n", expected
# unlike the scipy wrapper above (see https://github.com/scipy/scipy/issues/5738),
# this DOES update W in place:
my_syr2(W, -1, w, v)
print W
print np.allclose(expected,W)
def make_spd_opt(n,C):
    d,v = make_random_vecs_with_cond(n,C)
    # each of these should be O(n) steps
    beta = (2.0 / (v.T.dot(v)))
    dv = d*v # elem-wise version of Dv and v^TD
    alpha = v.dot(dv)
    w_star = (beta * dv) - (.5 * beta**2 * alpha * v)
    clever = np.diag(d)
    my_syr2(clever, -1, v, w_star)
    return clever
expected = make_spd_explicit(3, 10)
actual = make_spd_opt(3,10)
print actual
print np.allclose(expected, actual)
print np.linalg.eig(actual)[0]
However, since \(D=D^T\), it isn't too confusing to hack the scipy.linalg.blas.dsyr2 call to work correctly. The call will only honor overwrite_a if a is in Fortran order, which a.T will be:
def make_spd_opt2(n,C):
    d,v = make_random_vecs_with_cond(n,C)
    # each of these should be O(n) steps
    beta = (2.0 / (v.T.dot(v)))
    dv = d*v # elem-wise version of Dv and v^TD
    alpha = v.dot(dv)
    w_star = (beta * dv) - (.5 * beta**2 * alpha * v)
    clever = np.diag(d)
    slab.dsyr2(-1, w_star, v, a=clever.T, overwrite_a=True)
    # have to propagate the lower triangle (unless we wanted to just use a packed rep.)
    N = clever.shape[0]
    clever[np.triu_indices(N,1)] = clever.T[np.triu_indices(N,1)]
    return clever
expected = make_spd_explicit(3, 10)
actual = make_spd_opt2(3,10)
print np.allclose(expected, actual)
Additional Resources
You can grab a copy of this notebook.
Even better, you can view it using nbviewer.
License
Unless otherwise noted, the contents of this notebook are under the following license. The code in the notebook should be considered part of the text (i.e., licensed and treated as follows).
DrsFenner.org Blog And Notebooks by Mark and Barbara Fenner is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Permissions beyond the scope of this license may be available at drsfenner.org/blog/about-and-contacts.