In the last post, we tried to use lower-level LAPACK/LAPACKE calls to remove some of the copying overhead of our operations. Well, it turns out that it didn’t work out so well (see the table at the end of the post). The fundamental reason why is that LAPACKE took the easy way out, when they decided to wrap LAPACK functions. In particular, whenever there is a C-/Fortran-ordering issue (for a reminder), LAPACKE simply makes a copy.
Spin Up
import numpy as np
from shared import (lap, get_dptr, create_data1, create_data2, simple_args, my_dlalsd)
from shared_dgelsd import make_structures_dgelsd
%load_ext memory_profiler
And now for something completely different …
Here we have two re-worked methods for solving least squares using either Fortran (col-major) data or, here’s the fun part folks, solving the least squares problem with C (row-major) data by calling the LAPACK/E routines and lying. We tell the Fortran routines that we have Fortran data (remember, in memory it is in row-major order) and we solve a "transposed" problem. That turns out to be just about trivial for using the \(QR\) based method because the function call accepts a "T" (transposed) argument.
def sls_qr_via_dgels(X,Y,trans):
assert X.flags.f_contiguous or trans # don't allow the copy case although it should work
# assert np.linalg.matrix_rank(X) == np.shape[1] # "full column rank"
m,n,nrhs,ldX,ldY,la_order = simple_args(X,Y,trans)
X_p, Y_p = get_dptr(X), get_dptr(Y)
TN = "N" if not trans else "T"
info = lap.LAPACKE_dgels(la_order, TN, m,n,nrhs,
X_p, ldX, Y_p, ldY)
beta_shape = n if not trans else m
return Y[:beta_shape]
Manual Implementation of dgelsd
for C-order without Copies
However, for \(SVD\), things are a bit more tricky. I based this off of dgelsd
(the more "modern", slightly faster, slightly more memory hungry) \(SVD\) least squares solver. Since the outer dgelsd
call doesn’t know about solving a transposed problem, we have to do it by hand.
Here’s a small aside. The entire LAPACKE wrapper for LAPACK could adopt the techniques I’m using in this post and in many cases (if not all?), could remove unnecessary copies. The problem is that copying and transposing gives a direct way to use the preexisting implementation. The techniques I use here require rewriting a (mathematically) equivalent implementation. LAPACK is unweidly (from a software engineering perspective), as is. It’s an exact case study for the labor savings of OOP (the tradeoff for performance is the main competingn concern). Adding another set of implementations is sure to causes headaches.
The basic algorithm of dgelsd
given the regression problem \(X \beta = Y\) is:
- Decompose X to two orthogonal matrices and an (upper) bidiagonal: \(X \rightarrow QBP\)
- Rotate \(Y\) with \(Q^T\) (this is done inplace, so it updates \(Y \leftarrow Q^T Y\))
- Solve \(B\beta_{tmp} = Q^T Y\) (written in original terms, don’t rotate \(Y\) twice). Note, this process gives us both the \(\beta_{tmp}\) and the singular values of \(X\).
- Rotate \(\beta_{tmp}\) with \(P\): \(\beta \leftarrow P\beta_{tmp}\)
These steps are performed with these Fortran LAPACK calls:
dgebrd
dormbr
dlalsd
dormbr
The combination of dgebrd
and dlalsd
performs an "applied" \(SVD\). dgebrd
actually computes \(Q\) and \(P^T\) (which is status quo in many numerical systems — probably because they either call LAPACK or want to be compatible).
One other step in dgelsd
is dealing with the case when there are many more rows than columns. Actually, dgelsd
handles under- and over-determined systems differently, but we’re assuming (and simplifying) to consider that we are either (1) square or (2) tall-and-skinny (over-determined). With many more rows than columsn, dgelsd
first performs a \(QR\) decomposition (using dgeqrf
) on \(X\) and then proceeds as above using \(R\) in place of \(X\) and \(Q_{QR}^T Y\) for \(Y\) (the rotation is applied using dormqr
).
That is all the "easy" case. We still have talked about what it means to do dgelsd
in a transposed sense. Remember, dgels
had an explicit T
/N
flag that could control whether to transpose the system or not. Without futher ado, here’s the mathematical outline of steps:
- \(X^T \rightarrow (QBP)^T = P^TB^TQ^T\) (as above with \(Q \leftrightarrow P\); \(B\) was upper bidiagonal, \(B^T\) should be lower bidiagonal.
- Rotate \(Y\) with \(P^T\): \(Y \leftarrow P^TY\)
- Solve \(B^T \beta_{tmp} = P^TY\) giving \(\beta_{tmp}\) and the singular values
- Rotate \(\beta_{tmp}\) with \(Q\): \(\beta \leftarrow Q\beta_{tmp}\)
The analogue (tranpose) of the \(QR\) step for the many rows case, is to perform an \(LQ\) decomposition (the decomposition is done with dgelqf
and the rotation is applied with dormlq
). That is: \(QR(X)\) and \(LQ(X.T)\) give equivalent \(Q\) values.
Getting all of this encoded properly was a royal pain in the tookis. The row-/col-major ordering, the transposed problem, the transposes (or not) on the orthogonal matrices, the upper-/lower- bidiagonalization (which is only computed as an upper because we have more rows than columns), and the differences in how LAPACK’s \(QR\) and \(LQ\) store their results made me feel like I was caught in an Escher painting.
In additional to encoding the mathematics (properly), it turned out that dlalsd
is not wrapped by the current version of LAPACKE. So, I had to write a small wrapper for it.
dlalsd Aside
To make use of dlalsd
, I wrote a tiny C library. Then, I exposed it in Python using the cffi
technique. I then gave it a Python function name of my_dlalsd
. My wrapper is to the LAPACK (FORTRAN) function, so it doesn’t have the C-ORDER "capabilities" (automatic copy and transpose) of other LAPACKE routines.
Here’s a quick look at the .h, .c, and Makefile and the Python helper. The dlalsd_
declaration is necesssary because it is defined externally (in libreflapack.so
).
/* mylapack.h */
int MY_dlalsd_work(char UPLO, int SMLSIZ, int N, int NRHS,
double *D, double *E, double *B,int LDB,
double RCOND, int *rank, double *work, int *iwork);
/* mylapack.c */
#include <stdio.h>
void dlalsd_(char *UPLO, int *SMLSIZ, int *N, int *NRHS,
double *D, double *E, double *B,int *LDB,
double *RCOND, int *rank, double *work, int *iwork, int *info);
int MY_dlalsd_work(char UPLO, int SMLSIZ, int N, int NRHS,
double *D, double *E, double *B,int LDB,
double RCOND, int *rank, double *work, int *iwork){
int info = 0;
dlalsd_(&UPLO, &SMLSIZ, &N, &NRHS, D, E, B, &LDB, &RCOND, rank, work, iwork, &info);
if (info < 0)
info = info - 1;
return info;
}
This Makefile worked on my Linux box with gcc.
# Makefile
all:
gcc -shared -fPIC -o libmylapack.so mylapack.c -lreflapack
And a little Python shim.
from cffi import FFI
ffi = FFI()
with open("./mylapack.h") as header_file:
ffi.cdef(header_file.read())
mylapack = ffi.dlopen("./libmylapack.so")
my_dlalsd = mylapack.MY_dlalsd_work
The Code
def sls_svd_via_manual_dgelsd(X,Y,trans):
''' we essentially reimplement dgelsd and add the capability of solving a transposed problem
i.e., $X \Beta = Y$ --or-- $X.T \Beta = Y$ '''
assert X.flags.f_contiguous or trans
common_args, arrays, pointers = make_structures_dgelsd(X,Y,trans=trans)
m,n,nrhs,ldX,ldY,la_order = common_args
singular_values, E, Q, P, rank, M, I = arrays
X_p, Y_p, SV_p, E_p, Q_p, P_p, rank_p, M_p, I_p = pointers
# helpful names
minmn = min(m,n)
real_rows, real_cols = (m,n) if not trans else (n,m)
y_shape, beta_shape = real_rows, real_cols
UL_arg = 'U' # without preprocessing, both trans={T,F} need 'U'
# cross over for pre-QRing is defined at:
# tsiee.f line 576 ... thresh = min(m,n) * 1.6
lotsaRows = real_rows >= real_cols * 1.6
if lotsaRows:
if not trans:
decomp, mult, TN = lap.LAPACKE_dgeqrf_work, lap.LAPACKE_dormqr_work, 'T'
else:
decomp, mult, TN = lap.LAPACKE_dgelqf_work, lap.LAPACKE_dormlq_work, 'N'
UL_arg = 'L' # B/E need to be used to define a lower bi-di
info = decomp(la_order, m, n, X_p, ldX, E_p, M_p, M.size)
assert not info, "%s:%s" % (decomp.__name__, info)
# this can be done at the end (so says dgelsd.f)
info = mult(la_order, 'L', TN, y_shape, nrhs, real_cols, X_p, ldX, E_p, Y_p, ldY, M_p, M.size)
assert not info, "%s:%s" % (decomp.__name__, info)
y_shape = beta_shape = n = m = minmn
X[np.tril_indices(minmn, -1)] = 0.0
# print "calling dgebrd"
assert M.size >= max(m,n) # minimal workspace
info = lap.LAPACKE_dgebrd_work(la_order, m, n, X_p, ldX, SV_p, E_p, Q_p, P_p, M_p, M.size)
assert not info, "dgebrd bad arg %d" % info
pq, PQ_p, real_k = ('Q', Q_p, n) if not trans else ('P', P_p, m)
# print "calling dormbr (I)"
info = lap.LAPACKE_dormbr_work(la_order, pq, 'L', 'T', y_shape, nrhs, real_cols,
X_p, ldX, PQ_p, Y_p, ldY, M_p, M.size)
assert not info, "dormbr (I) bad arg %d" % info
# print "calling dlalsd"
# FIXME: this wraps a direct call to LAPACK (broken for C-ORDER)
# 25 is "small size" for this dlalsd
# -1 says to use machine eps for rcond parameter (det. small singular values)
info = my_dlalsd(UL_arg, 25, minmn, nrhs, SV_p, E_p, Y_p, ldY, -1, rank_p, M_p, I_p)
assert not info, "dlalsd bad arg %d" % info
pq, PQ_p = ('P', P_p) if not trans else ('Q', Q_p)
# print "calling dormbr (II)"
info = lap.LAPACKE_dormbr_work(la_order, pq, 'L', 'N', beta_shape, nrhs, n,
X_p, ldX, PQ_p, Y_p, ldY, M_p, M.size)
assert not info, "dormbr (II) bad arg %d" % info
return Y[:beta_shape], singular_values
Testing
There were more than enough corner cases (and missteps) while I was coding to actually want test cases so I didn’t "fix" problems with regressions (the software engineering kind) in other parts of the code. So, some unittests
to the rescue:
!python test_dgelsd.py
Same Results?
So, let’s run each of these routines with row- and column-major data and verify we get the same output.
#
# we have to copy the data so we can use the same data in the next call
# lapack/e overwrite its inputs with the results (decomps and betas)
#
def copy_and_do(func, X, Y, trans, order):
_X,_Y = np.copy(X,order=order), np.copy(Y,order=order)
return func(_X,_Y,trans)
n_obs, n_ftrs = 2**16, 2**10
#n_obs, n_ftrs = 10, 2
#n_obs, n_ftrs = 1024, 50
# X,Y = create_data2()
X,Y = create_data1(n_obs, n_ftrs)
betas_qr_f = copy_and_do(sls_qr_via_dgels, X, Y, False, "F")
betas_svd_f = copy_and_do(sls_svd_via_manual_dgelsd, X, Y, False, "F")[0]
betas_qr_c = copy_and_do(sls_qr_via_dgels, X, Y, True, "C")
betas_svd_c = copy_and_do(sls_svd_via_manual_dgelsd, X, Y, True, "C")[0]
betas_expected = np.linalg.lstsq(X,Y)[0]
# betas = [betas_qr_f, betas_svd_f[0], betas_qr_c, betas_svd_c[0]]
betas = [betas_expected, betas_qr_f, betas_qr_c, betas_svd_f, betas_svd_c]
#for b in betas:
# print b
print "All same betas?:", all(np.allclose(b1, b2) for b1 in betas[:-1] for b2 in betas[1:])
Memory Usage
n_obs, n_ftrs = 2**16, 2**10
X,Y = create_data1(n_obs, n_ftrs, "C")
# %memit sls_qr_via_dgels(X,Y,True) # peak memory: 600.68 MiB, increment: 3.19 MiB
# %memit sls_svd_via_manual_dgelsd(X,Y,True) # peak memory: 584.77 MiB, increment: 20.05 MiB
# for comparison:
# %memit np.linalg.lstsq(X,Y) # peak memory: 1082.57 MiB, increment: 517.70 MiB
And now, it’s time for a big happy dance with music: we avoided copies AND we stayed in row-major (C-land) order!
Timing
n_obs, n_ftrs = 2**16, 2**10
_X,_Y = create_data1(n_obs, n_ftrs, "C")
%timeit np.linalg.lstsq(_X,_Y)
# forcing copies since all of these calls place the results in the input (overwrite our data)
X,Y = np.copy(_X), np.copy(_Y)
%timeit sls_qr_via_dgels(X,Y,True)
X,Y = np.copy(_X), np.copy(_Y)
%timeit sls_svd_via_manual_dgelsd(X,Y,True)
import scipy.linalg as sla
X,Y = np.copy(_X), np.copy(_Y)
# NOTE: sla.lstsq uses dgelss insead of dgelsd
%timeit sla.lstsq(X,Y,overwrite_a=True,overwrite_b=True)
A few things stand out here.
Even though np.linalg.lstsq
makes a copy (and it also has additional memory overhead to return its other results – it doesn’t overwrite), it is still faster at this problem size. I’ll tinker with finding a tradeoff point, but remember the point here was (largely) about memory and less about processing time (although the two are intimately intertwined).
My Python sls
functions are not particularly optimized. The sls_qr_via_dgels
function is very small. The sls_svd_via_manual_dgelsd
, however, has a lot of machinery on the inside. I’ve left some of that machinery out of this post (which is already quite long and detailed). But, there is memory allocation, workspace (temporary) size calculation, and a variety of utility routines that I’ve only lightly tuned. Most of those are only executed once, so we’re not talking about inner loop performance hits.
Next, while scipy.linalg.lstsq
can operate with overwriting, it uses a different underlying algorithm dgelss
instead of dgelsd
. That algorithm is a bit slower (than dgelsd
), but it uses less memory.
LAPACK Glossary
There are many LAPACK routine names in the post. Here’s a quick guide to interpreting them.
Important abbreviations:
d
: the matrix is made ofdouble
elements (double precision floating point)ge
: the matrix is a "general" matrix. It has no more specific structure like diagonal, triangular, orthogonal, etc.or
: we are dealing with an "orthogonal" matrix (the nicest propery of orthogonal matrices is that the inverse is equal to the, easy to compute, transpose)lqf
,qrf
: \(LQ\) and \(QR\) "factorizations" (aka, decompositions)ls
: perform least squares
After some practice, you will start "seeing" the abbreviations embedded in the six letter FORTRAN names.
LAPACK Routine | expanded name | actions |
---|---|---|
dgels | double general least squares | solves least squares with \(QR\) |
dgebrd | double general bidiagonalization | $X QBP^T $ with Q,P orthogonal and B bidiagonal |
dormbr | double orthogonal multiply br |
multiply a matrix by the Q or P produced from dgebrd |
dlalsd | double la least squares svd (?) |
solve least squares problem on a bidiagonal matrix using \(SVD\) |
dgeqrf | double general \(QR\) factorization | ’nuff said |
dgelqf | double general \(LQ\) factorization | ’nuff said |
dormqr | double orthogonal multiple qr |
multiply a matrix by the Q produced from dgeqrf |
dormlq | double orthogonal multiple lq |
multiply a matrix by the Q produced from dgelqf |
Additional Resources
You can grab a copy of this notebook.
Even better, you can view it using nbviewer.
License
Unless otherwise noted, the contents of this notebook are under the following license. The code in the notebook should be considered part of the text (i.e., licensed and treated as as follows).
DrsFenner.org Blog And Notebooks by Mark and Barbara Fenner is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Permissions beyond the scope of this license may be available at drsfenner.org/blog/about-and-contacts.