Three Paths to Least Squares Linear Regression

In the prior installment, we talked about solving the basic problem of linear regression with a standard least-squares approach. Along the way, I showed a few methods that all gave the same solution to the least-squares problem. Today, I want to look at three methods for solving linear regression:

  1. solving the full normal equations via Cholesky decomposition
  2. solving a least-squares problem via QR decomposition
  3. solving a least-squares problem via SVD (singular value decomposition)

The following table may not make complete sense to start with, but consider it an outline of what we are going to discuss. nla/sla/cla refer to numpy.linalg, scipy.linalg, and cvxopt.lapack, respectively. Each of the top-level routines makes use of high quality LAPACK routines under the hood.

While we’ll use abbreviations for the LAPACK routine names, these abbreviated names don’t really exist in LAPACK. For example, there is no posv in LAPACK. posv is our shorthand for dposv, which is the double-precision floating-point version of _posv. There are also sposv, cposv, and zposv. For example, see the pair of rows labelled "symmetric/Hermitian positive definite" (the two row headers belong together; it’s just bad formatting) in Table 2.2 of the LUG.
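If you want to see this naming convention from Python, scipy.linalg.get_lapack_funcs will hand back the precision-specific routine that matches your arrays' dtype. A quick sketch (the returned wrappers carry a typecode attribute naming the prefix):

    import numpy as np
    import scipy.linalg as sla

    # float64 arrays get the 'd' (double precision) flavor of posv...
    posv64, = sla.get_lapack_funcs(('posv',), (np.eye(3),))
    print(posv64.typecode)                      # 'd'  -> dposv

    # ...and float32 arrays get the 's' (single precision) flavor.
    posv32, = sla.get_lapack_funcs(('posv',), (np.eye(3, dtype=np.float32),))
    print(posv32.typecode)                      # 's'  -> sposv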

The LAPACK routines for these solutions are described generally under (Solving) Linear Equations and Linear Least Squares. The first set of routines is for square matrices. In some cases, the Intel documentation for the MKL (which includes an LAPACK implementation) has better descriptions of these routines. See: (Solving) Linear Equations and Linear Least Squares.

Python Library Routine     | Underlying LAPACK Routine | Matrix Type                        | Algorithm
sla.solve w/ sym_pos=True  | posv                      | symmetric positive definite matrix | solve eqn using Cholesky
nla.solve, sla.solve       | gesv                      | square matrix                      | solve eqn using LU
cla.gels                   | gels                      | general matrix (full rank)         | least-squares via QR
sla.lstsq                  | gelss                     | general matrix                     | least-squares via SVD
nla.lstsq                  | gelsd                     | general matrix                     | least-squares via Divide-and-Conquer SVD

Imports

Not too surprisingly, we’ll do some imports to get us started. If you haven’t seen it before, cvxopt is a nice package for coding optimization problems. Because these problems often involve linear algebra, cvxopt has a (partial) interface to LAPACK, and it exposes some routines that numpy‘s and scipy‘s LAPACK interfaces miss. Annoyingly, none of these three LAPACK interfaces is complete. Even if you put them all together, there is still missing functionality. But we’ll deal with a workaround next time.

In [1]:
import operator as op

import numpy as np
import numpy.linalg as nla
import scipy.linalg as sla

from cvxopt import matrix as cvx_matrix
import cvxopt.lapack as cla

%load_ext memory_profiler

Data Creation and a Memory Profiling Note

In [2]:
def create_data(n=5000, p=100):
    ''' 
        n is number cases/observations/examples
        p is number of features/attributes/variables
    '''
    X      = np.random.uniform(0,10, (n,p))
    coeffs = np.random.uniform(0,10, p)
    def f(X): return X.dot(coeffs)

    noise = np.random.normal(0.0, 2.0, n)
    Y = f(X) + noise
    Y = Y.reshape((n,1))
    
    return X,Y

You’ll notice that I used an IPython magic (%load_ext memory_profiler) above in the import section. I’m not going to jump into the details of installing the memory profiler (pip install memory_profiler might be all you need). However, we do have to discuss a few issues that come up with its use. If you want to recreate the memory-use values I gather below, you’ll need to be a bit careful.

If I say:

    X,Y = create_data()
    %memit X.T.dot(X)

and look at the result, I might see a memory increment of ~8MB. That is: the process of creating X.T.dot(X) required an additional 8MB of memory. But, if I then go and run the dot-product again in the next cell:

    %memit X.T.dot(X)

it may report an increment of ~0MB. What happened?

The first X.T.dot(X) keeps no references to the result and the result can be free()‘d – garbage collected – back to the OS. However, the OS might not automatically reclaim that memory and/or Python’s garbage collector might not automatically release it (that’s a completely post worthy topic of its own). Thus, the amount of memory that our process is using is still at the +8MB level. So, when we run another dot-product, our process uses that unused-by-a-variable, but not reclaimed-by-the-OS, memory and we see no increment.

The moral of that story is that if you want to see consistent increments, you need to either (1) keep a reference to the resulting object (XtX = X.T.dot(X)) so the memory is both not freed and not reclaimed (i.e., it is in use) or (2) after running the %memit line, restart your kernel before you do the next %memit (forcing a freed and reclaimed state for -all- of the process memory).

The problem is even a little more severe: any large computation (not just memit expressions) that is evaluated and not retained via a reference will cause this memory use masking — and it makes perfect sense when you think about it. So, all of my memits will be done in a fresh interpreter and I’ll simply leave comments about the results.
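If you’d rather not babysit kernel restarts, memory_profiler also has a programmatic entry point, memory_usage, that samples the process while a call runs, so the peak shows up even when no reference to the result is kept afterwards. A rough sketch (run it in a fresh interpreter if you want numbers comparable to mine):

    from memory_profiler import memory_usage
    import numpy as np

    Xbig, Ybig = create_data(2**16, 2**10)

    # memory_usage((func, args, kwargs)) runs the call and returns a list of
    # memory samples (in MiB) taken while it executed.
    samples = memory_usage((np.dot, (Xbig.T, Xbig), {}))
    print(max(samples) - min(samples))   # rough increment for forming X^T X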

Create Some Data and Investigate Memory Use

In [3]:
n_obs = 2**16
n_ftrs = 2**10
X,Y = create_data(n_obs,n_ftrs)
print "about", 10**int(np.log10(n_obs * n_ftrs)), "entries"
about 10000000 entries
In [4]:
print X.shape
print X.itemsize, X.size, reduce(op.mul, X.shape)
# memory(X) : 2**(16 + 10 + 3) [2**3 (8 byte)]
print (X.itemsize * X.size) / float(1024**2)  # 1024**2 == 2**20 == 1 MiB
print 2**(16 + 10 + 3) / 2**20
(65536, 1024)
8 67108864 67108864
512.0
512
In [5]:
print X.shape
XtX = X.T.dot(X)
print XtX.shape
print (XtX.itemsize * XtX.size) / float(1024**2)
(65536, 1024)
(1024, 1024)
8.0

np.dot() Tests

Let’s do two simple tests to get an idea of (1) the cost of some basic operations and (2) the portion of resources used in the Cholesky method that is not due to the factorization and backward solution.

In [6]:
#%memit X.T.dot(X)
#increment ~27.5MB

It’s possibly surprising that \(X^TX\) is so much smaller than \(X\). However, we’re dealing with a "tall-skinny" matrix — there are many more rows than columns (\(n_{obs} \gg n_{ftrs}\)). So, when we take the transpose and multiply, we get \((n_{ftr},n_{obs}) \times (n_{obs}, n_{ftr}) \rightarrow (n_{ftr},n_{ftr})\), which has far fewer elements than \((n_{obs}, n_{ftr})\).

In [7]:
%timeit X.T.dot(X)
1 loops, best of 3: 1.39 s per loop

Compressing the data this way requires a trade-off in pre-processing: the dot-product takes a fair bit of time. Remember that it is doing its work (a vector multiplication for each output entry) along the long (\(n_{obs}\)) dimension.
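As a rough sanity check on that timing: forming \(X^TX\) takes about \(2\,n_{obs}\,n_{ftrs}^2\) floating-point operations (ignoring the roughly 2x savings a BLAS can get from symmetry), so the 1.39 s above implies the underlying BLAS is chewing through on the order of \(10^{11}\) flops per second:

    # back-of-the-envelope arithmetic, not a precise benchmark
    n, p = 2**16, 2**10
    flops = 2.0 * n * p * p          # ~1.4e11 multiply-adds to form X^T X
    print(flops / 1.39 / 1e9)        # implied GFLOP/s at the ~1.39 s timing above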

Via Cholesky

Recall that that solving linear regression using least squares means that we want to perform the following optimization: \(\min RSS(\beta)=\min(\bf{y}-\bf{X}\beta)^T(\bf{y}-\bf{X}\beta)\). That means that we want: \(\frac{\partial RSS(\beta)}{\partial \beta} = -2 X^T(Y-X\beta) = 0\). We saw that this gives: \(X^TX\beta = X^TY\), which we want to solve for \(\beta\).

\(Cholesky(X^TX)\) gives \(X^TX=LL^T\) with \(L\) lower triangular.

Then, we have \(LL^T\beta = X^TY\). We let \(w=L^T\beta\) and solve \(Lw=X^TY\) for \(w\) using simple forward substitution. Now, we return to \(L^T\beta=w\) and we solve it for \(\beta\) using simple backward substitution.
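Spelled out with scipy.linalg building blocks, here is a minimal sketch of those two triangular solves (posv does the equivalent work internally), using the X and Y created above:

    import scipy.linalg as sla

    XtX, XtY = X.T.dot(X), X.T.dot(Y)

    L = sla.cholesky(XtX, lower=True)                  # X^T X = L L^T
    w = sla.solve_triangular(L, XtY, lower=True)       # forward:  L w = X^T Y
    betas_by_hand = sla.solve_triangular(L.T, w)       # backward: L^T beta = w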

In [8]:
#%%memit
#XtX, XtY = X.T.dot(X), X.T.dot(Y)
#betas = sla.solve(XtX,XtY, sym_pos=True, check_finite=False)
# increment ~28MB
In [9]:
%%timeit 
XtX, XtY = X.T.dot(X), X.T.dot(Y)
betas = sla.solve(XtX,XtY, sym_pos=True, check_finite=False)
1 loops, best of 3: 1.45 s per loop

For comparison’s sake, here’s the same effective call, but using cvxopt‘s LAPACK interface to access posv directly. Because cla.posv is so low-level, it has the same "modify the input" semantics as LAPACK’s posv. Thus, if we want to keep the comparison between the two methods (sla.solve(sym_pos=True) and cla.posv) fair, we have to make our own copy explicitly. There’s a second reason we have to make that copy: the cvxopt.lapack interface expects cvxopt.matrix objects.

In [10]:
%%timeit
XtX, XtY = X.T.dot(X), X.T.dot(Y)
cXtX, cXtY = cvx_matrix(XtX), cvx_matrix(XtY)
cla.posv(cXtX, cXtY)
betas = cXtY[:n_ftrs]
1 loops, best of 3: 1.46 s per loop

Via QR

The QR decomposition lets us write \(X=QR\), so that \(X\beta=Y\) becomes \(QR\beta=Y\). The QR decomposition gives a \(Q\) that is orthogonal, which means \(Q^T=Q^{-1}\). So, we can multiply on the left by \(Q^{-1}\): \(R \beta = Q^TY\). Lastly, \(R\) is upper triangular, so it is efficient and stable to use backward substitution to solve that system for \(\beta\), giving (the numerically stable equivalent of) \(\beta = R^{-1}Q^TY\).
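As a minimal sketch of that recipe, here it is with SciPy’s own QR (economy mode, so \(Q\) is \((n_{obs}, n_{ftrs})\) with orthonormal columns) rather than the LAPACK gels call we time below:

    import scipy.linalg as sla

    Q, R = sla.qr(X, mode='economic')                     # X = Q R, with Q^T Q = I
    betas_by_qr = sla.solve_triangular(R, Q.T.dot(Y))     # backward: R beta = Q^T Y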

Here, we will use cla.gels. As with cla.posv, we need to make our own copy to keep both the result and the input matrix available.

In [11]:
#%%memit
#cX, cY = cvx_matrix(X),cvx_matrix(Y)  # makes a copy
# increment:  ~512MB
In [12]:
#%%memit
#cla.gels(cX,cY) # betas are in cY[:p]
# increment: ~28 MiB
In [13]:
%%timeit
cX, cY = cvx_matrix(X),cvx_matrix(Y)  # makes a copy
1 loops, best of 3: 954 ms per loop
In [14]:
%%timeit
cX, cY = cvx_matrix(X),cvx_matrix(Y)  # makes a copy
cla.gels(cX,cY) # betas are in cY[:p]
betas = cY[:n_ftrs]
1 loops, best of 3: 8.23 s per loop

Via SVD

Performing linear regression with SVD is quite easy – even if the theory behind it is slightly more complex. If \(X=U \Sigma V^T\), then the pseudo-inverse of \(X\) is \(X^{\dagger} = V {\Sigma}^{\dagger} U^T\). Since \(X^{\dagger}X = I\) when \(X\) has full column rank, we can solve \(X\beta=Y\) with \(\beta = X^{\dagger}Y = V{\Sigma}^{\dagger} U^T Y\).
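A minimal sketch of that pseudo-inverse recipe with numpy’s SVD (gelss/gelsd additionally zero out singular values below an rcond cutoff, which I skip here on the assumption that \(X\) is well conditioned):

    import numpy as np

    U, s, Vt = np.linalg.svd(X, full_matrices=False)           # X = U diag(s) V^T
    betas_by_svd = Vt.T.dot(U.T.dot(Y) / s.reshape(-1, 1))      # V Sigma^+ U^T Y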

In [15]:
#%%memit 
#betas = nla.lstsq(X,Y)[0]
# increment ~540MB
In [16]:
%%timeit
betas = nla.lstsq(X,Y)[0]
1 loops, best of 3: 8.16 s per loop

Comparing Betas

Perhaps we should have done this first, but we should make certain that all of our methods for producing the \(\beta\) coefficients give us the same results. Fortunately, it’s easy enough. You might wonder why we don’t just use the values we computed above. Well, like %memit, %timeit has quirks. And one of those quirks is that variables defined in a %%timeit cell don’t leak out into the global environment (they are local to the %timeit execution). Darn.

In [17]:
#
# Cholesky 
#
XtX, XtY = X.T.dot(X), X.T.dot(Y)
betas_chol_1 = sla.solve(XtX,XtY, sym_pos=True, check_finite=False)

cXtX, cXtY = cvx_matrix(XtX), cvx_matrix(XtY)
cla.posv(cXtX, cXtY)
betas_chol_2 = cXtY[:n_ftrs]


#
# QR
#
cX, cY = cvx_matrix(X),cvx_matrix(Y)  # makes a copy
cla.gels(cX,cY) # betas are in cY[:p]
betas_qr = cY[:n_ftrs]

#
# SVD
#
betas_svd = nla.lstsq(X,Y)[0]

betas = [betas_chol_1, betas_chol_2, betas_qr, betas_svd]
print "All same betas?:", all(np.allclose(b1, b2) for b1 in betas[:-1] for b2 in betas[1:])
All same betas?: True

Summary Results

For an np.array of ~67 million entries (and 512MB) and with non-overwriting semantics, we have:

Algorithm   | Called Via                          | Memory Increment (MB) | ~Time (s)
dot-product | np.dot()                            | 28                    | 1.5
Cholesky    | sla.solve(sym_pos=True), cla.posv() | 28                    | 1.5
QR          | cla.gels()                          | 540                   | 8
SVD         | nla.lstsq()                         | 540                   | 8

Now, if you’ve read about the Cholesky, QR, and SVD methods as applied to linear regression, you might be a bit concerned. You probably read that, in order, they go from fastest to slowest and from least to most numerically stable. We haven’t talked at all about the stability issue, but we don’t see any timing difference between QR and SVD here. We also see that both QR and SVD are making full copies of the X matrix. In the case of QR, we saw that explicitly (we made the copy) because cla.gels is low-level enough (close enough to the LAPACK gels call) that it overwrites its inputs.
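One quick way to see where that stability ordering comes from: the Cholesky route solves a system in \(X^TX\), and the condition number of \(X^TX\) is the square of the condition number of \(X\), so any ill conditioning in \(X\) gets amplified before we ever factor. A rough sketch (both calls compute an SVD, so they are not cheap):

    import numpy as np

    print(np.linalg.cond(X))            # kappa(X)
    print(np.linalg.cond(X.T.dot(X)))   # ~kappa(X)**2, up to rounding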

We’ll dig into these methods by using the LAPACK routines directly, including a (hopefully) lower-overhead gels, next time.

Additional Resources

You can grab a copy of this notebook.

Even better, you can view it using nbviewer.

License

Unless otherwise noted, the contents of this notebook are under the following license. The code in the notebook should be considered part of the text (i.e., licensed and treated as follows).

Creative Commons License
DrsFenner.org Blog And Notebooks by Mark and Barbara Fenner is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Permissions beyond the scope of this license may be available at drsfenner.org/blog/about-and-contacts.