In working through the Cholesky, QR, and SVD methods for solving the normal regression equations, I got curious about the underlying implementations of each. I did QR and Cholesky in grad school (a great Numerical Analysis sequence). I never got around to SVD (I think we did some variation of tridiagonalization). Anyway, Cholesky is so simple that I think testing Cholesky generated more interesting points for me. There were two big issues and the "testing" post is (I think) longer than the "implementation" post.
Executing the tests brought up an interesting software engineering (code reuse) issue, and creating good test inputs brought up an interesting mathematical issue. Read on!
Some Testing Helpers
import numpy as np
import itertools as it
import functools as ft
import timeit
# from python itertools docs:
def pairwise(iterable):
"s -> (s0,s1), (s1,s2), (s2, s3), ..."
a, b = it.tee(iterable)
next(b, None)
return it.izip(a, b)
def time_routines(routines, inputs, number=10):
    name_len = max(len(r.__name__) for r in routines)
    line_template = "%" + str(name_len) + "s : %6.4f"
    times = {}
    for r in routines:
        call = ft.partial(r, *inputs)
        times[r.__name__] = min(timeit.Timer(call).repeat(repeat=3, number=number))
    print "\n".join(line_template % (k, times[k]) for k in sorted(times, key=times.get))
def same_results(routines, inputs):
    try:
        results = {r.__name__:r(*inputs) for r in routines}
        return all(np.allclose(r1,r2) for r1,r2 in pairwise(results.values()))
    except: # quick-and-dirty
        return False
So, with these helper routines, we can take several different implementations and compare them in two ways: (1) do they all give the same answer (same_results) and (2) what are their run times (time_routines)? Note that if the inputs are mutated in place, we will be solving different problems in both of these; we'll address that in a second. same_results plays a little fast and loose because, in general, floating-point closeness is not transitive, but closeness treats us far better than floating-point equality. However, we'll be content enough to check that if close(a,b) and close(b,c), then we're "all close enough".
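As a quick illustration of the two helpers (my own toy example, not part of the Cholesky comparison), here is how they might be exercised on a pair of routines that should agree:

# toy sketch: two routines that should give the same answer
def square_dot(A):
    return A.dot(A)

def square_matmul(A):
    return np.dot(A, A)

A = np.arange(9.0).reshape(3,3)
print same_results([square_dot, square_matmul], (A,))    # True
time_routines([square_dot, square_matmul], (A,), number=100)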
The Reuse Issue
Now, two problems can come up when we use this testing code. The first is the in-place mutation mentioned above; since our implementations are not mutating, we don't have to worry about that. However, we do have a similar issue: the routines return two different data structures. Our "naive" method returns a full NumPy array; we also have "packed" versions that return a flat array holding a packed version of the triangular result. So, to check for equality of results we could either (1) define a custom comparison routine or (2) provide a converter that massages the packed versions into a full version. Here's code that shows the issue:
def makeFullLowerTri(n):
    X = np.empty((n,n))
    X[np.triu_indices(n)] = 0.0
    X[np.tril_indices(n)] = np.arange((n*(n+1))/2.0)
    return X
print makeFullLowerTri(3)
def makePackedLowerTri(n):
    return np.arange((n*(n+1))/2.0)
print makePackedLowerTri(3)
print same_results([makeFullLowerTri, makePackedLowerTri], (3,))
BTW, you might notice the 2.0 in the np.arange expression. If arange takes a float as its input, it produces an array of np.float64 (more technically, np.float_, the "machine floating point size"). If the input is an int, the dtype is np.int_. Anyway, it's just a trick to get the dtype squared away nicely.
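A two-line check (my addition) shows the dtype difference:

# the result dtype follows the argument type (int64/float64 on typical platforms)
print np.arange(3).dtype     # int64, i.e. np.int_
print np.arange(3.0).dtype   # float64, i.e. np.float_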
So, here’s a quick "expander" which turns packed triangular into full.
def expand(packedTri, n):
    X = np.empty((n,n))
    X[np.triu_indices(n)] = 0.0
    X[np.tril_indices(n)] = packedTri
    return X
print np.allclose(expand(makePackedLowerTri(3), 3), makeFullLowerTri(3))
But our same_results routine takes a single function to apply (not nested applications). That process of nesting applications of functions has a name: function composition. And you might remember it from studying such wonders as the chain rule for derivatives. But unless you are into functional programming and languages, you might not have played around with it much from a software engineering perspective. What we need is (1) a single function that is the result of (2) applying expand to the output of makePackedLowerTri. Fortunately, Python makes it relatively easy to create such a critter.
def helper(main, post):
    ''' help out by creating/returning a process that calls *main* to do the "work"
        and then *post* to do the "post-processing" '''
    def inner(A):
        ''' compute the composition of two functions on A '''
        return post(main(A))
    return inner
Now, the next issue is that we don't want to modify main, which will be our makePackedLowerTri. In particular, it returns one array and we want to keep it that way. That means that post (our expand) is going to be missing an argument: the n value. If only there were a way to "partially" apply a function to some arguments and leave the rest "to be filled in later". But wait! There is:
# above, we did: import functools as ft
expand3 = ft.partial(expand, n=3)
expand3 is a function of one argument that has its n argument fixed at 3. It is as if we had typed (without the typing and with many more reuse possibilities):
def expand3(packedTri):
    X = np.empty((3,3))
    X[np.triu_indices(3)] = 0.0
    X[np.tril_indices(3)] = packedTri
    return X
A couple of quick notes. (1) We'll see that we don't have to lock down the value of n; we could compute it inside of expand (one possible sketch follows this paragraph) or we could hack on helper or makePackedLowerTri. (2) Adding compositions is a problem if we want to directly compare the work (memory, time, etc.) of the base functions. We'll ignore that for now because we are comparing correctness.
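Here is that sketch (mine, not the post's code): an expand variant that infers n from the packed length, using len = n(n+1)/2:

# hypothetical variant: recover n from the packed length via n = (-1 + sqrt(1 + 8*len)) / 2
def expand_auto(packedTri):
    n = int(round((-1 + np.sqrt(1 + 8*packedTri.size)) / 2))
    X = np.empty((n,n))
    X[np.triu_indices(n)] = 0.0
    X[np.tril_indices(n)] = packedTri
    return X

print np.allclose(expand_auto(makePackedLowerTri(4)), makeFullLowerTri(4))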
Here's how the pieces work together. makeFullLowerTri is a "direct" function: we give it an arg and get our usable result. For the packed version, we do a bit more work. We call helper and it makes a function for us. The function it creates is the one we want to call. So, we have two calls: (1) we call helper and (2) we call the value returned from helper. The second call gives us a comparable result. Here they are:
# simple case
print makeFullLowerTri(3)
# complex case
r = helper(makePackedLowerTri, expand3)
print r(3)
And here's how we use it in same_results:
routines = [makeFullLowerTri, helper(makePackedLowerTri, expand3)]
print same_results(routines, (3,))
Generalizing
The process of composing functions, including composing more than two functions, seems like a nicely repeatable process. Can we make the above more general than just helper? The answer is "yes!". There used to be a third-party Python module called functional that had a compose routine. However, I can't find it anymore (it is referenced in the Python 3.1 Functional Programming HOWTO document). But a little hacking and Googling found some resources and resulted in the following code:
- A single arg solution
- A *arg and **kwarg solution
- A nice Python+functional writeup (although it doesn’t talk about composition explicitly)
def compose(*functions):
    ''' compose functions left-to-right being outer-to-inner
        NOTE: algebraists rejoice that I specified that to prevent headaches '''
    def identity(x):
        ''' just bulletproofing b/c we won't have an empty compose() call
            NOTE: it works like identity(composedFuncs())
            not the equiv. composedFuncs(identity())'''
        return x
    def applyTwo(remainder, current):
        ''' remainder1(remainder2(remainder3(current(finished))))
            finished is done
            remainder represents the composition of remainder_1 ... remainder_n'''
        def helper(*finished):
            return remainder(current(*finished))
        # helper.__name__ = "helper(%s)" % current.__name__
        return helper
    return ft.reduce(applyTwo, functions, identity)
def foo(x):
    return "outer(" + x + ")"

def bar(x):
    return "inner(" + x + ")"
print foo(bar("mark"))
print compose(foo, bar)("mark")
Basically, we call applyTwo many times (once for each function we are composing) and we get back a helper (uncomment the helper name assignment and print it out to see how it progresses) that interlocks the inner result with the outer composition.
I'll give bonus points to anyone who can show a way to pass keywordable arguments back from the inner function calls without requiring modifications to the return statements. It's "easy" to either return "the usual" or a tuple that will be detupled by the next layer out. But sending something that can be **kwarg-ed seems a bit messier.
The Testcase Issue
To test Cholesky, we need "symmetric, positive-definite" (SPD) matrices. Since Cholesky is a version of a "matrix square root", you can think about this as "we can only take the square root of a positive number" (ignoring \(i\)/complex values). I was inspired by a Stack Overflow thread to use the following method. In particular, Arnold Neumaier had a nice answer that I wanted to dig into (prove to myself that it worked and see the mathematics behind it). Here is what I came up with.
I’ll start with a few facts that you can gather from a linear algebra or numerical methods textbook (and/or the Web).
- The eigenvalues of a symmetric matrix are positive if, and only if, the matrix is SPD.
- The eigenvalues of a diagonal matrix are the values of the diagonal.
- The matrices \(A\) and \(B\) are called similar matrices if we can write \(A=P^{-1}BP\) for some invertible \(P\).
- Similar matrices have the same characteristic polynomial, and hence, the same eigenvalues.
- Orthogonal matrices have \(A^T=A^{-1}\)
- There is a special type of orthogonal matrix called a Householder matrix. It has the form \(H = I - 2vv^T\) for \(v\) a unit-length \(n\)-element column vector (\(H\) is \(n \times n\)); for a general nonzero \(v\), the \(2\) is replaced by \(\beta = 2/(v^Tv)\).
- Since Householder matrices are made up of \(I\) and \(vv^T\), we can see quickly that they must be symmetric \(H=H^T\).
One corollary: since \(H=H^{T}=H^{-1}\), writing \(A=HBH=H^{T}BH=H^{-1}BH\) shows that applying a Householder matrix on the left and right of \(B\) creates an \(A\) that is similar to \(B\). That's exactly what we will do to a diagonal matrix with positive entries on the diagonal. Thus, we start with a (simple) SPD matrix and create a similar matrix that is more interesting. Incidentally, one other fact:
- The condition number of a matrix \(A\) is equal to the (absolute value of the) ratio of the biggest to the smallest eigenvalues of \(A\). That is, \(cond(A)=\left| \frac{max(eigval(A))}{min(eigval(A))} \right|\).
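A quick numeric sanity check of that last fact (my addition), using a diagonal matrix where the eigenvalues are easy to read off:

# the (2-norm) condition number of a symmetric matrix is |lambda_max| / |lambda_min|
D = np.diag([1.0, 4.0, 10.0])
evals = np.linalg.eigvals(D)
print np.linalg.cond(D)                      # 10.0
print abs(evals).max() / abs(evals).min()    # 10.0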
Creating an SPD Matrix from Random Vectors
So, if we set the eigenvalues appropriately, we can control the condition number to make a particularly well-behaved or ill-behaved matrix. We can create an SPD test matrix \(T_{spd}\), with condition number \(C\), with the following steps:
- \(d=[1\ uniform(1,C)\ C]\), with the middle \(n-2\) elements sampled from the uniform distribution
- \(D_+ = diag(d)\), an \(n \times n\) matrix with diagonal \(d\)
- \(w=[uniform(0,1)]\) with \(n\) elements
- \(H=I-{\beta}ww^{T}\) with \(\beta = 2/(w^Tw)\) (so that \(H\) is orthogonal)
- \(T_{spd}=HD_{+}H\)
Note, I didn’t title this "Generating a Random SPD Matrix" (if I say it elsewhere, I deserve a slap on the wrist). I have no clue what the sampling properties of the matrices we are making are.
Applying Householder Matrices
A quick reminder about some matrix transpose rules: \((A+B)^T=A^T+B^T\) and \((AB)^T=B^TA^T\). Also, if \(D\) is diagonal (and square), then \(D=D^T\).
There are two interesting facts about Householder matrices. From above (using \(\beta\) for a general, non-unit \(v\)), \(H = I - {\beta}vv^T\). (1) Householder matrices are completely defined by the underlying vector used to create them (i.e., the \(v\) vector). (2) To multiply (or "apply") a Householder matrix \(H\) to \(A\) on the left or right, \(HA\) or \(AH\), we don't need to form the full \(n \times n\) matrix \(H\). We can just use the \(v\) vector. Golub and Van Loan (3rd edition, page 211) point out that (and you can derive in about two lines; a numeric spot-check follows the list):
- \(HA=(I-{\beta}vv^T)A=A-vw^T_L\) with \(w_L={\beta}A^Tv\), and,
- \(AH=A(I-{\beta}vv^T)=A-w_Rv^T\) with \(w_R={\beta}Av\)
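Here is that numeric spot-check (my own sketch, with random data and \(\beta=2/(v^Tv)\)):

# verify HA == A - outer(v, w_L) and AH == A - outer(w_R, v)
n = 4
v = np.random.uniform(0.0, 1.0, n)
A = np.random.uniform(0.0, 1.0, (n,n))
beta = 2.0 / v.dot(v)
H = np.eye(n) - beta*np.outer(v,v)
w_L = beta * A.T.dot(v)
w_R = beta * A.dot(v)
print np.allclose(H.dot(A), A - np.outer(v, w_L))    # True
print np.allclose(A.dot(H), A - np.outer(w_R, v))    # True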
Further, we’re going to be working with a diagonal matrix filling in for \(A\). This means that for our \(HDH\), we have:
\[HDH = (HD)H = \color{red}{(D-vw^T_L)}H = \color{red}{(D-vw^T_L)} - w_Rv^T = {\star}_1\]
In this case, \(w^T_L=({\beta}D^Tv)^T=({\beta}Dv)^T={\beta}v^TD\) and \(w_R={\beta}\color{red}{(D-vw^T_L)}v\). The red is meant to highlight the fact that the first multiplication/expansion plays an important role in the second multiplication/expansion.
Substituting in:
\[{\star}_1 = (D-{\beta}vv^TD) - {\beta}(D-vw^T_L)vv^T = {\star}_2\]
The second term is a bit tricky:
\[\begin{aligned}
{\beta}(D-vw^T_L)vv^T &= {\beta}(D-v({\beta}v^TD))vv^T \\
&= {\beta}Dvv^T-{\beta}v{\beta}v^TDvv^T \\
&= {\beta}Dvv^T-{\beta}^2vv^TDvv^T
\end{aligned}\]
So, we have: \[{\star}_2 = (D-{\beta}vv^TD) - {\beta}Dvv^T \color{red}{+} {\beta}^2vv^TDvv^T = D - {\beta}vv^TD - {\beta}Dvv^T \color{red}{+} {\beta}^2vv^TDvv^T\]
I highlighted the two pluses because, in a first draft, I missed them and had some flipped signs running around the end of my equations. When we expanded the second term, it had a minus in front of it. Can’t lose the negatives!
A Nicer Form?
At this point, you might be stuck. You might try various factorizations, completing the square, consulting algebra books for quadratic forms, etc. Not that I'd ever do that. However, you might wonder if the simple \(HA=A-vw^T\) form can be carried over to two applications of \(H\) (one left and one right). That is, can we say something like \(HAH = A - dot(\cdot,\cdot) - dot(\cdot,\cdot)\), for some appropriate arguments to \(dot\)? Looking at \({\star}_2\), we might have the brilliant idea that the \(v\)s should stay the same. If it ain't broke, don't fix it. Then, let's rearrange the scalars (they are pushovers). We may get to this form:
\[D - v\boxed{{\beta}v^TD} - \boxed{{\beta}Dv}v^T + {\beta}^2vv^TDvv^T\]
Things are looking hopeful. But! What do we do with the last term? (The first term, \(D\), stands alone, as in the single-multiplication rules from G&VL.) It has a \(v\) on the left and a \(v^T\) on the right. We want one or the other, not both! What do we do? We split the difference. Or, more to the point, we split it into two equal pieces (halve it) and use both! Also, \(v^TDv\) is a scalar (check the dimensions if you don't believe me); I'll rename it \(\alpha\).
\[
D - v\color{green}{\boxed{{\beta}v^TD}} - \color{red}{\boxed{{\beta}Dv}}v^T
+ \color{red}{\boxed{\frac{1}{2}{\beta}^2{\alpha}v}}v^T
+ v\color{green}{\boxed{\frac{1}{2}{\beta}^2{\alpha}v^T}}
\]
Now, look at the \(v\color{green}{\square}\) terms. If we factor out \(-v\) and call what's left \(w_\star\), we have \(w_{\star} = \color{green}{{\beta}v^TD - \frac{1}{2}{\beta}^2{\alpha}v^T}\).
Can we come up with something for \(\color{red}{\square}v^T\)? Let's see what \(w^T_\star\) is: \(w^T_{\star} = ({\beta}v^TD)^T - (\frac{1}{2}{\beta}^2{\alpha}v^T)^T = \color{red}{{\beta}Dv - \frac{1}{2}{\beta}^2{\alpha}v}\). Does it look familiar? (The color might help!) Fortunately, the answer is yes! And I'll take serendipity in my mathematics any day. Pulling \(v^T\) out to the right of those two terms, along with a leading minus sign, leaves exactly \(w^T_{\star}\). So, with all that, we have:
\[HDH = D – vw_{\star} – w^T_{\star}v^T\]
Lastly (really!), \(v\) is a column-vector, so by inference \(w_\star\) must be a row-vector. You can also see this because \(w_\star\) is made from \(v^T\). Good. Now, in NumPy, if we multiply two 1-D arrays, we get the element-wise result between the two. We want the np.outer product between the two vectors (np.dot would be great if we had multi-dimensional critters). We also don't have to worry about the transposes in the last term when we use np.outer (the 1-D arrays are effectively "unoriented" vectors).
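Before wiring it into the generator below, here is a direct numeric check of the derived identity (my addition):

# sketch: HDH == D - outer(v, w_star) - outer(w_star, v)
n = 5
d = np.random.uniform(1.0, 10.0, n)
v = np.random.uniform(0.0, 1.0, n)
beta = 2.0 / v.dot(v)
alpha = v.dot(d*v)                 # the scalar v^T D v
w_star = beta*(d*v) - .5*beta**2*alpha*v
H = np.eye(n) - beta*np.outer(v,v)
D = np.diag(d)
print np.allclose(H.dot(D).dot(H), D - np.outer(v, w_star) - np.outer(w_star, v))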
Some Code
Here we simply have basic code to either give us a fixed pair of vectors (for testing the tester, sort of like guarding the guardians) or to give us two random vectors with some conditioning constraints (for the generated matrix).
def make_random_vecs_with_cond(n, C, repeatable=True):
    if repeatable:
        d = np.linspace(1.0, C, n)
        v = np.arange(1.0, n+1) / (n+1)
    else:
        d = np.empty(n)
        d[0] = 1.0; d[n-1] = C
        d[1:n-1] = np.random.uniform(1.0, C, n-2)
        v = np.random.uniform(1.0, 100.0, n)
    return d,v
Dead Simple Implementation
Here is the super-simple-visually-verify implementation. We actually ignore all the hard work above and explicitly construct the Householder matrix … so that we can verify that our "clever" method produces the same numerical result. Always test yourself. Incidentally, to test the "gold standard", we'll compute the eigenvalues of the result and make sure they are the ones we initially used above (the np.linspace call will produce a diagonal of [1, 5.5, 10] for the call below).
def make_spd_explicit(n, C):
    d,v = make_random_vecs_with_cond(n, C)
    v = v[:,np.newaxis] # explicitly convert to column vector
    nrm = (2.0 / (v.T.dot(v))).reshape(1)
    # explicitly create house for testing purposes
    house = np.eye(n) - nrm*v.dot(v.T)
    D = np.diag(d)
    direct = house.dot(D).dot(house)
    return direct
expected = make_spd_explicit(3, 10)
print expected
print np.linalg.eig(expected)[0]
Better Method Using Our Hard Work
def make_spd(n,C):
    d,v = make_random_vecs_with_cond(n,C)
    # each of these should be O(n) steps
    beta = (2.0 / (v.T.dot(v)))
    dv = d*v # elem-wise version of Dv and v^TD
    alpha = v.dot(dv)
    w_star = (beta * dv) - (.5 * beta**2 * alpha * v)
    # not TOO clever; if we really want to avoid extra memory allocations,
    # we could use a call to BLAS dger (or dsyr2) to update the
    # outer products in place (i.e., A -= outer(v,w))
    clever = np.diag(d) - np.outer(v,w_star) - np.outer(w_star,v)
    return clever
actual = make_spd(3,10)
print np.allclose(expected, actual)
print np.linalg.eig(actual)[0]
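We can also point the helpers from the top of the post at the two generators; with repeatable=True they build the same matrix, so same_results applies (this snippet is my addition):

# compare the explicit and "clever" constructions for agreement and speed
gens = [make_spd_explicit, make_spd]
print same_results(gens, (200, 10))
time_routines(gens, (200, 10), number=5)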
And if you stuck with me through all of that, go have a beer, take a walk, or something relaxing!
Addendum: Even More Clever-er-er
import scipy.linalg.blas as slab
w = np.random.uniform(0.0, 1.0, 3)
v = np.random.uniform(0.0, 1.0, 3)
W = np.arange(9.0).reshape(3,3)
print W - np.outer(v,w) - np.outer(w,v)
# FAILS to update in place:
# see https://github.com/scipy/scipy/issues/5738
f = slab.get_blas_funcs('syr2', (W,v,w))
print f(-1,w,v,a=W,overwrite_a=True)
print W
So, back to a "custom" solution:
!cat myheader.h
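The output of that cat isn't reproduced here. As an assumption on my part, the header presumably declares something equivalent to the standard CBLAS prototype for dsyr2 (with the order/uplo enums passed as plain ints, matching the 101 and 122 used below); a stand-in version would look like:

# hypothetical stand-in for the contents of myheader.h (not the author's actual file)
MYHEADER_GUESS = """
void cblas_dsyr2(int Order, int Uplo, int N, double alpha,
                 const double *X, int incX,
                 const double *Y, int incY,
                 double *A, int lda);
"""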
from cffi import FFI
ffi = FFI()
with open("./myheader.h") as header_file:
ffi.cdef(header_file.read())
libblas = ffi.dlopen("/usr/lib64/libopenblas_openmp.so")
def get_dptr(arr):
    return ffi.cast("double *", arr.__array_interface__['data'][0])

def my_syr2(A,alpha,x,y):
    # A <- A + alpha x y.T + alpha y x.T (A is updated in place)
    # this is a special case that assumes A is symmetric (here, diagonal),
    # b/c the updated lower triangle is simply mirrored into the upper triangle
    N = A.shape[0]
    assert N == A.shape[1] and N == x.size and N == y.size
    libblas.cblas_dsyr2(101, 122, # RowMajor, LowerTriangular
                        N, alpha, get_dptr(x), 1, get_dptr(y), 1,
                        get_dptr(A), N)
    # propagate lower triangle to upper triangle
    A[np.triu_indices(N,1)] = A.T[np.triu_indices(N,1)]
w = np.random.uniform(0.0, 1.0, 3)
v = np.random.uniform(0.0, 1.0, 3)
W = np.arange(9.0).reshape(3,3)
expected = W - np.outer(v,w) - np.outer(w,v)
print "expected:\n", expected
# unlike the scipy wrapper above (see https://github.com/scipy/scipy/issues/5738),
# this DOES update W in place:
my_syr2(W, -1, w, v)
print W
print np.allclose(expected,W)
def make_spd_opt(n,C):
    d,v = make_random_vecs_with_cond(n,C)
    # each of these should be O(n) steps
    beta = (2.0 / (v.T.dot(v)))
    dv = d*v # elem-wise version of Dv and v^TD
    alpha = v.dot(dv)
    w_star = (beta * dv) - (.5 * beta**2 * alpha * v)
    clever = np.diag(d)
    my_syr2(clever, -1, v, w_star)
    return clever
expected = make_spd_explicit(3, 10)
actual = make_spd_opt(3,10)
print actual
print np.allclose(expected, actual)
print np.linalg.eig(actual)[0]
However, since \(D=D^T\), it isn't too confusing to hack the scipy.linalg.blas.dsyr2 call to work correctly. The call will only honor overwrite_a if a is in Fortran order, which a.T will be:
def make_spd_opt2(n,C):
    d,v = make_random_vecs_with_cond(n,C)
    # each of these should be O(n) steps
    beta = (2.0 / (v.T.dot(v)))
    dv = d*v # elem-wise version of Dv and v^TD
    alpha = v.dot(dv)
    w_star = (beta * dv) - (.5 * beta**2 * alpha * v)
    clever = np.diag(d)
    slab.dsyr2(-1, w_star, v, a=clever.T, overwrite_a=True)
    # have to propagate the lower triangle (unless we wanted to just use a packed rep.)
    N = clever.shape[0]
    clever[np.triu_indices(N,1)] = clever.T[np.triu_indices(N,1)]
    return clever
expected = make_spd_explicit(3, 10)
actual = make_spd_opt2(3,10)
print np.allclose(expected, actual)
Additional Resources
You can grab a copy of this notebook.
Even better, you can view it using nbviewer.
License
Unless otherwise noted, the contents of this notebook are under the following license. The code in the notebook should be considered part of the text (i.e., licensed and treated as follows).
DrsFenner.org Blog And Notebooks by Mark and Barbara Fenner is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Permissions beyond the scope of this license may be available at drsfenner.org/blog/about-and-contacts.