In the prior installment, we talked about solving the basic problem of linear regression using a standard least-squares approach. Along the way, I showed a few methods that all arrived at the same least-squares solution. Today, I want to look at three methods for solving linear regression:
- solving the full normal equations via Cholesky decomposition
- solving a least-squares problem via QR decomposition
- solving a least-squares problem via SVD (singular value decomposition)
The following table may not make complete sense to start with, but consider it an outline of what we are going to discuss. nla, sla, and cla refer to numpy.linalg, scipy.linalg, and cvxopt.lapack, respectively. Each of the top-level routines makes use of high-quality LAPACK routines under the hood.
While we’ll use abbreviations for the LAPACK routine names, these abbreviated names don’t really exist in LAPACK. For example, there is no posv in LAPACK; posv is our shorthand for dposv, the double-precision floating-point version of _posv. There are also sposv, cposv, and zposv for the other numeric types. See, for example, the pair of rows labelled "symmetric/Hermitian positive definite" (the two row headers belong together; it’s just bad formatting) in Table 2.2 of the LUG.
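As a quick, hedged illustration of how those precision prefixes show up in practice, scipy’s low-level LAPACK interface exposes the prefixed wrappers directly. The tiny matrix here is a made-up example, not part of the data we time below.
import numpy as np
from scipy.linalg import lapack
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])      # a small symmetric positive definite matrix
b = np.array([[1.0],
              [2.0]])
# dposv is the double-precision variant of _posv; sposv/cposv/zposv are the others
c, x, info = lapack.dposv(A, b) # c: Cholesky factor, x: solution, info: 0 on success
print(x)
print(info)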
The LAPACK routines for these solutions are described generally under (Solving) Linear Equations and Linear Least Squares; the first set of routines is for square matrices. In some cases, the Intel MKL documentation (MKL includes an LAPACK implementation) has better descriptions of some of these routines; see its own pages on (Solving) Linear Equations and Linear Least Squares.
| Python Library Routine | Underlying LAPACK Routine | Matrix Type | Algorithm |
|---|---|---|---|
| sla.solve w/ sym_pos=True | posv | symmetric positive definite matrix | solve eqn using Cholesky |
| nla.solve, sla.solve | gesv | square matrix | solve eqn using LU |
| cla.gels | gels | general matrix (full rank) | least-squares via QR |
| sla.lstsq | gelss | general matrix | least-squares via SVD |
| nla.lstsq | gelsd | general matrix | least-squares via divide-and-conquer SVD |
Imports
Not too surprising, but we’ll do some imports to get us started. If you haven’t seen it before, cvxopt is a nice package for coding optimization problems. Because these problems often involve linear algebra, cvxopt has a (partial) interface to LAPACK, and it exposes some routines that numpy’s and scipy’s LAPACK interfaces miss. Annoyingly, none of these three LAPACK interfaces is complete. Even if you put them all together, there is still missing functionality. But we’ll deal with a workaround next time.
import operator as op
import numpy as np
import numpy.linalg as nla
import scipy.linalg as sla
from cvxopt import matrix as cvx_matrix
import cvxopt.lapack as cla
%load_ext memory_profiler
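As a rough, purely illustrative check on that incompleteness claim, you can count the public names each module exposes. This counts everything public, not just LAPACK wrappers, so treat it as a ballpark only.
# rough count of public names in each LAPACK-backed module (not all are LAPACK wrappers)
for mod in (nla, sla, cla):
    print mod.__name__, len([name for name in dir(mod) if not name.startswith('_')])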
Data Creation and a Memory Profiling Note
def create_data(n=5000, p=100):
'''
n is number cases/observations/examples
p is number of features/attributes/variables
'''
X = np.random.uniform(0,10, (n,p))
coeffs = np.random.uniform(0,10, p)
def f(X): return X.dot(coeffs)
noise = np.random.normal(0.0, 2.0, n)
Y = f(X) + noise
Y = Y.reshape((n,1))
return X,Y
You’ll notice that I used an IPython magic (%load_ext memory_profiler) above in the import section. I’m not going to jump into the details of installing the memory profiler (pip install memory_profiler might be all you need). However, we do have to discuss a few issues that come up with its use. If you want to recreate the memory-use values I gather below, you’ll need to be a bit careful.
If I say:
X,Y = create_data()
%memit X.T.dot(X)
and look at the result, I might see a memory increment of ~8MB. That is: the process of creating X.T.dot(X) required an additional 8MB of memory. But if I then run the dot-product again in the next cell:
%memit X.T.dot(X)
it may report an increment of ~0MB. What happened?
The first X.T.dot(X) keeps no reference to the result, so the result can be garbage collected and free()'d. However, the process might not hand that memory back to the OS, and/or Python might not release it right away (that’s a completely post-worthy topic of its own). Thus, the amount of memory our process is using is still at the +8MB level. So, when we run another dot-product, our process reuses that unused-by-a-variable, but not-returned-to-the-OS, memory and we see no increment.
The moral of that story is that if you want to see consistent increments, you need to either (1) keep a reference to the resulting object (XtX = X.T.dot(X)) so the memory is both not freed and not reclaimed (i.e., it is in use), or (2) after running the %memit line, restart your kernel before you do the next %memit (forcing a freed-and-reclaimed state for all of the process memory).
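For example, option (1) looks like this (a sketch; depending on your memory_profiler version the assignment may or may not persist after the magic finishes, but the reported increment is the point):
%memit XtX = X.T.dot(X)   # keep a reference so the result's memory stays in use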
The problem is even a little more severe: any large computation (not just memit expressions) that is evaluated and not retained via a reference will cause this memory-use masking, and it makes perfect sense when you think about it. So, all of my memits will be done in a fresh interpreter and I’ll simply leave comments about the results.
Create Some Data and Investigate Memory Use
n_obs = 2**16
n_ftrs = 2**10
X,Y = create_data(n_obs,n_ftrs)
print "about", 10**int(np.log10(n_obs * n_ftrs)), "entries"
print X.shape
print X.itemsize, X.size, reduce(op.mul, X.shape)
# memory(X) : 2**(16 + 10 + 3) [2**3 (8 byte)]
print (X.itemsize * X.size) / float(1024**2) # 1024**2 == 2**20 == 1 MiB
print 2**(16 + 10 + 3) / 2**20
print X.shape
XtX = X.T.dot(X)
print XtX.shape
print (XtX.itemsize * XtX.size) / float(1024**2)
np.dot() Tests
Let’s do two simple tests to get an idea of (1) the cost of some basic operations and (2) the portion of resources used in the Cholesky approach that is not due to the factorization and the backward solution.
#%memit X.T.dot(X)
#increment ~27.5MB
It’s possibly surprising that \(X^TX\) is so much smaller than \(X\). However, we’re dealing with a "tall-skinny" matrix — there are many more rows than columns (\(n_{obs} \gg n_{ftrs}\)). So, when we take the transpose and multiply we have a \((n_{ftr},n_{obs}) \times (n_{obs}, n_{ftr}) \rightarrow (n_{ftr},n_{ftr})\) result, which has far fewer elements than \((n_{obs}, n_{ftr})\).
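Concretely, with the shapes above: \(X\) has \(2^{16} \cdot 2^{10} = 2^{26}\) entries, which at 8 bytes each is \(2^{29}\) bytes = 512 MiB, while \(X^TX\) has only \(2^{10} \cdot 2^{10} = 2^{20}\) entries, or \(2^{23}\) bytes = 8 MiB (matching the size we printed above).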
%timeit X.T.dot(X)
Compressing data requires a trade-off in pre-processing. Here, we see that the dot-product takes a fair bit of time. Remember that each entry of the result is an inner product taken along the long \(n_{obs}\) dimension (down the columns of \(X\)).
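As a back-of-the-envelope count, forming \(X^TX\) needs roughly \(n_{ftr}^2 \cdot n_{obs} = 2^{20} \cdot 2^{16} = 2^{36} \approx 7 \times 10^{10}\) multiply-add pairs, so it is no surprise that this single dot-product dominates the comparatively cheap \((n_{ftr}, n_{ftr})\) factorization and solves that follow.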
Via Cholesky
Recall that that solving linear regression using least squares means that we want to perform the following optimization: \(\min RSS(\beta)=\min(\bf{y}-\bf{X}\beta)^T(\bf{y}-\bf{X}\beta)\). That means that we want: \(\frac{\partial RSS(\beta)}{\partial \beta} = -2 X^T(Y-X\beta) = 0\). We saw that this gives: \(X^TX\beta = X^TY\), which we want to solve for \(\beta\).
\(Cholesky(X^TX)\) gives \(X^TX=LL^T\) with \(L\) lower triangular.
Then, we have \(LL^T\beta = X^TY\). We let \(w=L^T\beta\) and solve \(Lw=X^TY\) for \(w\) using simple forward substitution. Now, we return to \(L^T\beta=w\) and we solve it for \(\beta\) using simple backward substitution.
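To make those two substitution steps concrete, here is a small sketch that spells them out with scipy.linalg; the timed sla.solve(..., sym_pos=True) call below hands the equivalent work to LAPACK’s posv in one shot, so this is illustration rather than the code we benchmark.
# spelled-out Cholesky route (a sketch, not the benchmarked code)
XtX, XtY = X.T.dot(X), X.T.dot(Y)
L = sla.cholesky(XtX, lower=True)                  # XtX = L L^T, with L lower triangular
w = sla.solve_triangular(L, XtY, lower=True)       # forward substitution:  L w = X^T Y
betas = sla.solve_triangular(L.T, w, lower=False)  # backward substitution: L^T beta = w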
#%%memit
#XtX, XtY = X.T.dot(X), X.T.dot(Y)
#betas = sla.solve(XtX,XtY, sym_pos=True, check_finite=False)
# increment ~28MB
%%timeit
XtX, XtY = X.T.dot(X), X.T.dot(Y)
betas = sla.solve(XtX,XtY, sym_pos=True, check_finite=False)
For comparison’s sake, here’s the same effective call, but using cvxopt’s LAPACK interface to access posv directly. Because cla.posv is so low-level, it has the same "modify the input" semantics as LAPACK’s posv. Thus, if we want to keep the comparison between the two methods (sla.solve(sym_pos=True) and cla.posv) fair, we have to explicitly make our own copy. There’s a second reason we have to make that copy: the cvxopt.lapack interface expects cvxopt.matrix objects.
%%timeit
XtX, XtY = X.T.dot(X), X.T.dot(Y)
cXtX, cXtY = cvx_matrix(XtX), cvx_matrix(XtY)
cla.posv(cXtX, cXtY)
betas = cXtY[:n_ftrs]
Via QR
The QR decomposition lets us write \(X=QR\), so that \(X\beta=Y\) becomes \(QR\beta=Y\). The QR decomposition gives a \(Q\) that is orthogonal, which means \(Q^T=Q^{-1}\). So, we can multiply on the left by \(Q^{-1}\): \(R \beta = Q^TY\). Lastly, \(R\) is upper triangular, so it is efficient and stable to use backward substitution to solve that system for \(\beta\), giving (the numerically stable equivalent of) \(\beta = R^{-1}Q^TY\).
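To make that algebra concrete, here is a small sketch using an explicit (reduced) QR factorization; it is only for illustration, since the benchmarked route below hands the whole problem to a single LAPACK least-squares routine that does the equivalent work internally and in-place.
# explicit QR route (a sketch, not the benchmarked code)
Q, R = np.linalg.qr(X)                       # reduced QR: Q is (n_obs, n_ftrs), R is upper triangular
betas = sla.solve_triangular(R, Q.T.dot(Y))  # backward substitution for R beta = Q^T Y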
Here, we will use cla.gels. As with cla.posv, we need to make our own copy to keep both the result and the input matrix available.
#%%memit
#cX, cY = cvx_matrix(X),cvx_matrix(Y) # makes a copy
# increment: ~512MB
#%%memit
#cla.gels(cX,cY) # betas are in cY[:p]
# increment: ~28 MiB
%%timeit
cX, cY = cvx_matrix(X),cvx_matrix(Y) # makes a copy; this cell times the copy alone
%%timeit
cX, cY = cvx_matrix(X),cvx_matrix(Y) # makes a copy
cla.gels(cX,cY) # betas are in cY[:p]
betas = cY[:n_ftrs]
Via SVD
Performing linear regression with SVD is quite easy, even if the theory behind it is slightly more complex. If \(X=U \Sigma V^T\), then the pseudo-inverse of \(X\) is \(X^{\dagger} = V {\Sigma}^{\dagger} U^T\). Multiplying \(X\beta=Y\) on the left by \(X^{\dagger}\) gives \(\beta = X^{\dagger} Y = V{\Sigma}^{\dagger} U^T Y\).
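Spelled out as a sketch (nla.lstsq below does essentially this internally, with the extra care of thresholding tiny singular values):
# pseudo-inverse route via the thin SVD (a sketch, not the benchmarked code)
U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(s) V^T
betas = Vt.T.dot(U.T.dot(Y) / s.reshape(-1, 1))   # V Sigma^+ U^T Y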
#%%memit
#betas = nla.lstsq(X,Y)[0]
# increment ~540MB
%%timeit
betas = nla.lstsq(X,Y)[0]
Comparing Betas
Perhaps we should have done this first, but we should make certain that all of our methods for producing the \(\beta\) coefficients give us the same results. Fortunately, it’s easy enough. You might wonder why we don’t just use the values we computed above. Well, like %memit
, %timeit
has quirks. And one of those quirks is that variables defined in a %%timeit
cell don’t leak out into the global environment (they are local to the %timeit
execution). Darn.
#
# Cholesky
#
XtX, XtY = X.T.dot(X), X.T.dot(Y)
betas_chol_1 = sla.solve(XtX,XtY, sym_pos=True, check_finite=False)
cXtX, cXtY = cvx_matrix(XtX), cvx_matrix(XtY)
cla.posv(cXtX, cXtY)
betas_chol_2 = cXtY[:n_ftrs]
#
# QR
#
cX, cY = cvx_matrix(X),cvx_matrix(Y) # makes a copy
cla.gels(cX,cY) # betas are in cY[:p]
betas_qr = cY[:n_ftrs]
#
# SVD
#
betas_svd = nla.lstsq(X,Y)[0]
betas = [betas_chol_1, betas_chol_2, betas_qr, betas_svd]
print "All same betas?:", all(np.allclose(b1, b2) for b1 in betas[:-1] for b2 in betas[1:])
Summary Results
For an np.array of ~67M entries (512MB) and with non-overwriting semantics, we have:
| Algorithm | Called Via | Memory Increment (MB) | ~Time (s) |
|---|---|---|---|
| dot-product | np.dot() | 28 | 1.5 |
| Cholesky | sla.solve(sym_pos=True), cla.posv() | 28 | 1.5 |
| QR | cla.gels() | 540 | 8 |
| SVD | nla.lstsq() | 540 | 8 |
Now, if you’ve read about the Cholesky, QR, and SVD methods as applied to linear regression, you might be a bit concerned. You probably read that, in that order, they go from fastest to slowest and from least to most numerically stable. We haven’t talked at all about the stability issue, but we don’t see any time difference between QR and SVD here. We also see that both QR and SVD are making full copies of the X matrix. In the case of QR, we saw that explicitly (we made the copy) because cla.gels is low-level enough (close enough to the LAPACK gels call) that it overwrites its inputs.
We’ll dig into these methods by directly using the LAPACK routines, including a (hopefully) lower-overhead gels, next time.
Additional Resources
You can grab a copy of this notebook.
Even better, you can view it using nbviewer.
License
Unless otherwise noted, the contents of this notebook are under the following license. The code in the notebook should be considered part of the text (i.e., licensed and treated as follows).
DrsFenner.org Blog And Notebooks by Mark and Barbara Fenner is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Permissions beyond the scope of this license may be available at drsfenner.org/blog/about-and-contacts.