Into the Weeds III – Hacking LAPACK Calls To Save Memory

In the last post, we tried to use lower-level LAPACK/LAPACKE calls to remove some of the copying overhead of our operations. It turns out that it didn’t work out so well (see the table at the end of that post). The fundamental reason is that LAPACKE took the easy way out when wrapping the LAPACK functions: whenever there is a C-/Fortran-ordering mismatch, LAPACKE simply makes a copy.
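As a quick refresher on what that ordering issue looks like from the NumPy side, here is a minimal illustration of the copy those wrappers make (this is only the NumPy analogue, not LAPACKE’s actual code path):

import numpy as np

X = np.arange(6.0).reshape(2, 3)        # NumPy's default layout is row-major (C order)
assert X.flags.c_contiguous and not X.flags.f_contiguous

# the "easy way out": materialize a column-major copy before handing it to LAPACK
X_f = np.asfortranarray(X)              # same values, new Fortran-ordered buffer
assert X_f.flags.f_contiguous
assert not np.may_share_memory(X, X_f)  # it really is a full copy of the data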

Spin Up

In [1]:
import numpy as np

from shared import (lap, get_dptr, create_data1, create_data2, simple_args, my_dlalsd)
from shared_dgelsd import make_structures_dgelsd

%load_ext memory_profiler

And now for something completely different …

Here we have two reworked methods for solving least squares: one uses Fortran (column-major) data as usual, and the other (here’s the fun part, folks) solves the problem with C (row-major) data by calling the LAPACK/LAPACKE routines and lying. We tell the Fortran routines that we have Fortran data (remember, in memory it is actually in row-major order) and we solve a "transposed" problem instead. For the \(QR\)-based method this turns out to be just about trivial, because the function call accepts a "T" (transposed) argument.

In [2]:
def sls_qr_via_dgels(X,Y,trans):
    assert X.flags.f_contiguous or trans # don't allow the copy case although it should work
    # assert np.linalg.matrix_rank(X) == X.shape[1] # "full column rank"

    m,n,nrhs,ldX,ldY,la_order = simple_args(X,Y,trans)
    X_p, Y_p = get_dptr(X), get_dptr(Y)

    TN = "N" if not trans else "T"
    info = lap.LAPACKE_dgels(la_order, TN, m,n,nrhs,
                             X_p, ldX, Y_p, ldY)    

    beta_shape = n if not trans else m
    return Y[:beta_shape]
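For concreteness, here is the call pattern this enables (a small sketch; the full head-to-head comparison appears later in the post). With Fortran-ordered data we solve the problem as given; with C-ordered data we pass trans=True and let LAPACK believe it received the transposed, Fortran-ordered problem. The copies are only there because dgels overwrites its inputs.

X, Y = create_data1(1024, 50)           # helper imported from shared in cell [1]

beta_f = sls_qr_via_dgels(np.copy(X, order="F"), np.copy(Y, order="F"), False)
beta_c = sls_qr_via_dgels(np.copy(X, order="C"), np.copy(Y, order="C"), True)

assert np.allclose(beta_f, beta_c)
assert np.allclose(beta_f, np.linalg.lstsq(X, Y)[0])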

Manual Implementation of dgelsd for C-order without Copies

However, for \(SVD\), things are a bit more tricky. I based this off of dgelsd, the more "modern" \(SVD\) least squares solver (slightly faster and slightly more memory hungry than dgelss). Since the outer dgelsd call doesn’t know anything about solving a transposed problem, we have to do it by hand.

Here’s a small aside. The entire LAPACKE wrapper for LAPACK could adopt the techniques I’m using in this post and, in many cases (if not all?), remove unnecessary copies. The problem is that copying and transposing gives a direct way to reuse the preexisting implementation, while the techniques I use here require writing a (mathematically) equivalent second implementation. LAPACK is unwieldy (from a software engineering perspective) as it is. It’s an exact case study for the labor savings of OOP (with performance being the main competing concern). Adding another set of implementations is sure to cause headaches.

The basic algorithm of dgelsd given the regression problem \(X \beta = Y\) is:

  1. Decompose \(X\) into two orthogonal matrices and an (upper) bidiagonal matrix: \(X \rightarrow QBP\)
  2. Rotate \(Y\) with \(Q^T\) (this is done in place, so it updates \(Y \leftarrow Q^T Y\))
  3. Solve \(B\beta_{tmp} = Q^T Y\) (written in original terms, don’t rotate \(Y\) twice). Note, this process gives us both the \(\beta_{tmp}\) and the singular values of \(X\).
  4. Rotate \(\beta_{tmp}\) with \(P\): \(\beta \leftarrow P\beta_{tmp}\)

These steps are performed with these Fortran LAPACK calls:

  1. dgebrd
  2. dormbr
  3. dlalsd
  4. dormbr

The combination of dgebrd and dlalsd performs an "applied" \(SVD\). Note that dgebrd actually computes \(Q\) and \(P^T\), which is the status quo in many numerical systems (probably because they either call LAPACK or want to be compatible with it).
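To make the four-step outline concrete, here is a minimal NumPy sketch with the same shape of computation. It substitutes a full np.linalg.svd for the dgebrd/dormbr/dlalsd machinery, so it is illustrative only and not what dgelsd does internally; the function name and the rcond-style cutoff are my own.

import numpy as np

def lstsq_via_applied_svd(X, Y):
    ''' same four-step outline as dgelsd, with a full SVD standing in
        for the bidiagonalization machinery (illustration only) '''
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # step 1: X -> U diag(s) V^T
    rhs = U.T.dot(Y)                                   # step 2: rotate Y
    cutoff = np.finfo(float).eps * max(X.shape) * s[0]
    inv_s = np.zeros_like(s)                           # step 3: "solve" the inner
    inv_s[s > cutoff] = 1.0 / s[s > cutoff]            #         (diagonal) system,
    beta_tmp = (inv_s * rhs.T).T                       #         dropping tiny s
    return Vt.T.dot(beta_tmp), s                       # step 4: rotate back

# X, Y = create_data1(1024, 50)
# np.allclose(lstsq_via_applied_svd(X, Y)[0], np.linalg.lstsq(X, Y)[0])   # should hold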

One other step in dgelsd is dealing with the case when there are many more rows than columns. Actually, dgelsd handles under- and over-determined systems differently, but we simplify by assuming that the system is either (1) square or (2) tall-and-skinny (over-determined). With many more rows than columns, dgelsd first performs a \(QR\) decomposition on \(X\) (using dgeqrf) and then proceeds as above, using \(R\) in place of \(X\) and \(Q_{QR}^T Y\) in place of \(Y\) (the rotation is applied using dormqr).
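Here is a small sketch of that pre-reduction, with np.linalg.qr standing in for the dgeqrf/dormqr pair (the helper name is mine):

import numpy as np

def pre_qr_reduce(X, Y):
    ''' reduce a tall-and-skinny problem to an n-by-n one: afterwards,
        R plays the role of X and Q^T Y plays the role of Y '''
    Q, R = np.linalg.qr(X)          # X is m-by-n with m >> n; Q: m-by-n, R: n-by-n
    return R, Q.T.dot(Y)

# R, QtY = pre_qr_reduce(X, Y)
# np.allclose(np.linalg.lstsq(X, Y)[0], np.linalg.lstsq(R, QtY)[0])   # should hold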

That was all the "easy" case. We still haven’t talked about what it means to do dgelsd in a transposed sense. Remember, dgels had an explicit T/N flag that controlled whether or not to transpose the system. Without further ado, here’s the mathematical outline of the steps (a small sketch follows the list):

  1. \(X^T \rightarrow (QBP)^T = P^T B^T Q^T\) (as above, with \(Q \leftrightarrow P\); \(B\) was upper bidiagonal, so \(B^T\) is lower bidiagonal)
  2. Rotate \(Y\) with \(P^T\): \(Y \leftarrow P^TY\)
  3. Solve \(B^T \beta_{tmp} = P^TY\) giving \(\beta_{tmp}\) and the singular values
  4. Rotate \(\beta_{tmp}\) with \(Q\): \(\beta \leftarrow Q\beta_{tmp}\)
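Continuing the full-SVD stand-in from above (again, illustrative only, not the LAPACK code path): the transposed solve falls out of the same factorization of \(X\) with the two orthogonal factors swapping roles.

import numpy as np

def lstsq_transposed_via_applied_svd(X, Y):
    ''' solve the *transposed* problem X^T beta = Y from an SVD of X itself:
        X = U diag(s) V^T implies X^T = V diag(s) U^T, so U and V swap roles '''
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # step 1: factor X once
    rhs = Vt.dot(Y)                                    # step 2: rotate Y (the P^T role)
    cutoff = np.finfo(float).eps * max(X.shape) * s[0]
    inv_s = np.zeros_like(s)
    inv_s[s > cutoff] = 1.0 / s[s > cutoff]            # step 3
    beta_tmp = (inv_s * rhs.T).T
    return U.dot(beta_tmp), s                          # step 4: rotate back (the Q role)

# for X of shape (m, n) and Y of shape (n,) or (n, nrhs):
# np.allclose(lstsq_transposed_via_applied_svd(X, Y)[0], np.linalg.lstsq(X.T, Y)[0])  # should hold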

The analogue (transpose) of the \(QR\) step for the many-rows case is to perform an \(LQ\) decomposition (the decomposition is done with dgelqf and the rotation is applied with dormlq). That is, \(QR(X)\) and \(LQ(X^T)\) give equivalent \(Q\) values.
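NumPy doesn’t expose an lq routine, but we can sanity-check that claim by building an \(LQ\) factorization of \(X^T\) straight from the \(QR\) factorization of \(X\) (a quick numerical check with arbitrary sizes):

import numpy as np

X = np.random.randn(1000, 20)
Q, R = np.linalg.qr(X)            # X = Q R, with Q 1000-by-20 and R 20-by-20
L, Q_lq = R.T, Q.T                # then X^T = L Q_lq is an LQ factorization

assert np.allclose(X.T, L.dot(Q_lq))
assert np.allclose(np.tril(L), L)                  # L really is lower triangular
assert np.allclose(Q_lq.dot(Q_lq.T), np.eye(20))   # rows of Q_lq are orthonormal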

Getting all of this encoded properly was a royal pain in the tookis. The row-/col-major ordering, the transposed problem, the transposes (or not) on the orthogonal matrices, the upper-/lower- bidiagonalization (which is only computed as an upper because we have more rows than columns), and the differences in how LAPACK’s \(QR\) and \(LQ\) store their results made me feel like I was caught in an Escher painting.

In addition to encoding the mathematics (properly), it turned out that dlalsd is not wrapped by the current version of LAPACKE. So, I had to write a small wrapper for it.

dlalsd Aside

To make use of dlalsd, I wrote a tiny C library and exposed it to Python using the cffi technique, giving it the Python name my_dlalsd. My wrapper calls the LAPACK (Fortran) function directly, so it doesn’t have the C-ORDER "capabilities" (automatic copy and transpose) of the other LAPACKE routines.

Here’s a quick look at the .h, .c, Makefile, and the Python helper. The dlalsd_ declaration is necessary because that symbol is defined externally (in libreflapack.so).

/* mylapack.h */
int MY_dlalsd_work(char UPLO, int SMLSIZ, int N, int NRHS, 
                   double *D, double *E, double *B, int LDB,
                   double RCOND, int *rank, double *work, int *iwork);

/* mylapack.c */
#include <stdio.h>
void dlalsd_(char *UPLO, int *SMLSIZ, int *N, int *NRHS,
             double *D, double *E, double *B, int *LDB,
             double *RCOND, int *rank, double *work, int *iwork, int *info);

int MY_dlalsd_work(char UPLO, int SMLSIZ, int N, int NRHS,
                   double *D, double *E, double *B, int LDB,
                   double RCOND, int *rank, double *work, int *iwork){
  int info = 0;
  dlalsd_(&UPLO, &SMLSIZ, &N, &NRHS, D, E, B, &LDB, &RCOND, rank, work, iwork, &info);
  if (info < 0)
    info = info - 1;

  return info;
}

This Makefile worked on my Linux box with gcc (remember that a Makefile recipe line must be indented with a literal tab).

# Makefile
all:
    gcc -shared -fPIC -o libmylapack.so mylapack.c -lreflapack

And a little Python shim.

from cffi import FFI
ffi = FFI()
with open("./mylapack.h") as header_file:
    ffi.cdef(header_file.read())          # declare MY_dlalsd_work to cffi
mylapack = ffi.dlopen("./libmylapack.so") # load the shared library built above
my_dlalsd = mylapack.MY_dlalsd_work       # and give it a friendlier Python name

The Code

In [3]:
def sls_svd_via_manual_dgelsd(X,Y,trans):
    r''' we essentially reimplement dgelsd and add the capability of solving a transposed problem
         i.e., $X \beta = Y$ --or-- $X^T \beta = Y$ '''
    assert X.flags.f_contiguous or trans    
    common_args, arrays, pointers = make_structures_dgelsd(X,Y,trans=trans)

    m,n,nrhs,ldX,ldY,la_order                       = common_args
    singular_values, E, Q, P, rank, M, I            = arrays
    X_p, Y_p, SV_p, E_p, Q_p, P_p, rank_p, M_p, I_p = pointers

    # helpful names
    minmn = min(m,n)
    real_rows, real_cols = (m,n) if not trans else (n,m)
    y_shape, beta_shape = real_rows, real_cols

    UL_arg = 'U' # without preprocessing, both trans={T,F} need 'U'

    # cross over for pre-QRing is defined at:
    # tsiee.f line 576 ... thresh = min(m,n) * 1.6
    lotsaRows = real_rows >= real_cols * 1.6
    if lotsaRows:
        if not trans:
            decomp, mult, TN = lap.LAPACKE_dgeqrf_work, lap.LAPACKE_dormqr_work, 'T'
        else:
            decomp, mult, TN = lap.LAPACKE_dgelqf_work, lap.LAPACKE_dormlq_work, 'N'
            UL_arg = 'L' # B/E need to be used to define a lower bi-di

        info = decomp(la_order, m, n, X_p, ldX, E_p, M_p, M.size)
        assert not info, "%s:%s" % (decomp.__name__, info)

        # this can be done at the end (so says dgelsd.f) 
        info = mult(la_order, 'L', TN, y_shape, nrhs, real_cols, X_p, ldX, E_p, Y_p, ldY, M_p, M.size)
        assert not info, "%s:%s" % (decomp.__name__, info)
            
        y_shape = beta_shape = n = m = minmn
        X[np.tril_indices(minmn, -1)] = 0.0

    # print "calling dgebrd"
    assert M.size >= max(m,n) # minimal workspace
    info = lap.LAPACKE_dgebrd_work(la_order, m, n, X_p, ldX, SV_p, E_p, Q_p, P_p, M_p, M.size)
    assert not info, "dgebrd bad arg %d" % info

    pq, PQ_p, real_k = ('Q', Q_p, n)  if not trans else ('P', P_p, m)
    # print "calling dormbr (I)"  
    info = lap.LAPACKE_dormbr_work(la_order, pq, 'L', 'T', y_shape, nrhs, real_cols,
                                   X_p, ldX, PQ_p, Y_p, ldY, M_p, M.size)
    assert not info, "dormbr (I) bad arg %d" % info
    
    # print "calling dlalsd"
    # FIXME: this wraps a direct call to LAPACK (broken for C-ORDER)
    # 25 is "small size" for this dlalsd
    # -1 says to use machine eps for rcond parameter (det. small singular values)
    info = my_dlalsd(UL_arg, 25, minmn, nrhs, SV_p, E_p, Y_p, ldY, -1, rank_p, M_p, I_p)
    assert not info, "dlalsd bad arg %d" % info
    
    pq, PQ_p = ('P', P_p) if not trans else ('Q', Q_p)
    # print "calling dormbr (II)"
    info = lap.LAPACKE_dormbr_work(la_order, pq, 'L', 'N', beta_shape, nrhs, n,
                                   X_p, ldX, PQ_p, Y_p, ldY, M_p, M.size)
    assert not info, "dormbr (II) bad arg %d" % info

    return Y[:beta_shape], singular_values

Testing

There were more than enough corner cases (and missteps) while I was coding that I actually wanted test cases, to make sure I didn’t "fix" problems by introducing regressions (the software engineering kind) elsewhere in the code. So, some unittests to the rescue:
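The actual test_dgelsd.py isn’t reproduced in this post, but the flavor of each check is roughly the following (a hypothetical sketch with made-up sizes, assuming the functions defined above are importable):

import unittest
import numpy as np

class TestManualDgelsdSketch(unittest.TestCase):
    ''' hypothetical sketch -- the real test_dgelsd.py is not reproduced here '''
    def test_transposed_corder_matches_lstsq(self):
        X, Y = create_data1(1024, 50)                 # helper from shared
        betas, _ = sls_svd_via_manual_dgelsd(np.copy(X, order="C"),
                                             np.copy(Y, order="C"), True)
        expected = np.linalg.lstsq(X, Y)[0]
        self.assertTrue(np.allclose(betas, expected))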

In [4]:
!python test_dgelsd.py
test_heavy_pure_fortran_1 (__main__.TestManual_dgelsd) ... ok
test_heavy_pure_fortran_2 (__main__.TestManual_dgelsd) ... ok
test_heavy_tranposed_corder_1 (__main__.TestManual_dgelsd) ... ok
test_heavy_tranposed_corder_2 (__main__.TestManual_dgelsd) ... ok
test_medium_pure_fortran_1 (__main__.TestManual_dgelsd) ... ok
test_medium_pure_fortran_2 (__main__.TestManual_dgelsd) ... ok
test_medium_tranposed_corder_1 (__main__.TestManual_dgelsd) ... ok
test_medium_tranposed_corder_2 (__main__.TestManual_dgelsd) ... ok
test_simple_pure_fortran (__main__.TestManual_dgelsd) ... ok
test_simple_transposed_corder (__main__.TestManual_dgelsd) ... ok

----------------------------------------------------------------------
Ran 10 tests in 40.326s

OK

Same Results?

So, let’s run each of these routines with row- and column-major data and verify we get the same output.

In [5]:
#
# we have to copy the data so we can use the same data in the next call
# lapack/e overwrites its inputs with the results (decomps and betas)
#
def copy_and_do(func, X, Y, trans, order):
    _X,_Y = np.copy(X,order=order), np.copy(Y,order=order)
    return func(_X,_Y,trans)

n_obs, n_ftrs = 2**16, 2**10
#n_obs, n_ftrs = 10, 2
#n_obs, n_ftrs = 1024, 50

# X,Y = create_data2()
X,Y = create_data1(n_obs, n_ftrs)
betas_qr_f  = copy_and_do(sls_qr_via_dgels,          X, Y, False, "F")
betas_svd_f = copy_and_do(sls_svd_via_manual_dgelsd, X, Y, False, "F")[0]

betas_qr_c  = copy_and_do(sls_qr_via_dgels,          X, Y, True, "C")
betas_svd_c = copy_and_do(sls_svd_via_manual_dgelsd, X, Y, True, "C")[0]

betas_expected = np.linalg.lstsq(X,Y)[0]

# betas = [betas_qr_f, betas_svd_f[0], betas_qr_c, betas_svd_c[0]]
betas = [betas_expected, betas_qr_f, betas_qr_c, betas_svd_f, betas_svd_c]
#for b in betas:
#    print b
print "All same betas?:", all(np.allclose(b1, b2) for b1 in betas[:-1] for b2 in betas[1:])
All same betas?: True

Memory Usage

In [6]:
n_obs, n_ftrs = 2**16, 2**10
X,Y = create_data1(n_obs, n_ftrs, "C")
# %memit sls_qr_via_dgels(X,Y,True)           # peak memory: 600.68 MiB, increment: 3.19 MiB
# %memit sls_svd_via_manual_dgelsd(X,Y,True)  # peak memory: 584.77 MiB, increment: 20.05 MiB
# for comparison:
# %memit np.linalg.lstsq(X,Y)                 # peak memory: 1082.57 MiB, increment: 517.70 MiB

And now, it’s time for a big happy dance with music: we avoided copies AND we stayed in row-major (C-land) order!

Timing

In [7]:
n_obs, n_ftrs = 2**16, 2**10
_X,_Y = create_data1(n_obs, n_ftrs, "C")

%timeit np.linalg.lstsq(_X,_Y)

# forcing copies since all of these calls place the results in the input (overwrite our data)
X,Y = np.copy(_X), np.copy(_Y)
%timeit sls_qr_via_dgels(X,Y,True)

X,Y = np.copy(_X), np.copy(_Y)
%timeit sls_svd_via_manual_dgelsd(X,Y,True)
1 loops, best of 3: 8.72 s per loop
1 loops, best of 3: 11.9 s per loop
1 loops, best of 3: 11.7 s per loop
In [8]:
import scipy.linalg as sla

X,Y = np.copy(_X), np.copy(_Y)
# NOTE: sla.lstsq uses dgelss instead of dgelsd
%timeit sla.lstsq(X,Y,overwrite_a=True,overwrite_b=True)
1 loops, best of 3: 28.7 s per loop

A few things stand out here.

Even though np.linalg.lstsq makes a copy (and has additional memory overhead to return its other results, since it doesn’t overwrite its inputs), it is still faster at this problem size. I’ll tinker with finding a tradeoff point, but remember that the point here was (largely) about memory and less about processing time (although the two are intimately intertwined).

My Python sls functions are not particularly optimized. The sls_qr_via_dgels function is very small. The sls_svd_via_manual_dgelsd, however, has a lot of machinery on the inside. I’ve left some of that machinery out of this post (which is already quite long and detailed). But, there is memory allocation, workspace (temporary) size calculation, and a variety of utility routines that I’ve only lightly tuned. Most of those are only executed once, so we’re not talking about inner loop performance hits.

Next, while scipy.linalg.lstsq can operate with overwriting, it uses a different underlying algorithm, dgelss, instead of dgelsd. That algorithm is a bit slower than dgelsd, but it uses less memory.
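As an aside that depends on a newer SciPy than the one used here: SciPy 0.17 (if I remember correctly) added a lapack_driver argument to lstsq, so the dgelss/dgelsd choice can be made directly without dropping down to LAPACKE:

import scipy.linalg as sla

# available in SciPy >= 0.17; the drivers are 'gelsd', 'gelsy', and 'gelss'
# beta_gelsd = sla.lstsq(X, Y, lapack_driver="gelsd")[0]
# beta_gelss = sla.lstsq(X, Y, lapack_driver="gelss")[0]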

LAPACK Glossary

There are many LAPACK routine names in the post. Here’s a quick guide to interpreting them.

Important abbreviations:

  • d: the matrix is made of double elements (double precision floating point)
  • ge: the matrix is a "general" matrix. It has no more specific structure like diagonal, triangular, orthogonal, etc.
  • or: we are dealing with an "orthogonal" matrix (the nicest property of orthogonal matrices is that the inverse is equal to the easy-to-compute transpose; see the quick check after this list)
  • lqf, qrf: \(LQ\) and \(QR\) "factorizations" (aka, decompositions)
  • ls: perform least squares
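Here’s the quick check promised above (a tiny sketch with a random orthogonal matrix built from a \(QR\) factorization):

import numpy as np

Q, _ = np.linalg.qr(np.random.randn(5, 5))    # a random 5x5 orthogonal matrix
assert np.allclose(Q.T.dot(Q), np.eye(5))     # Q^T Q = I
assert np.allclose(np.linalg.inv(Q), Q.T)     # the inverse really is the transpose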

After some practice, you will start "seeing" the abbreviations embedded in the six letter FORTRAN names.

LAPACK routine | Expanded name | Action
dgels | double general least squares | solves least squares with \(QR\)
dgebrd | double general bidiagonalization | \(X \rightarrow QBP^T\) with \(Q, P\) orthogonal and \(B\) bidiagonal
dormbr | double orthogonal multiply br | multiply a matrix by the \(Q\) or \(P\) produced from dgebrd
dlalsd | double la least squares svd (?) | solve a least squares problem on a bidiagonal matrix using \(SVD\)
dgeqrf | double general \(QR\) factorization | ’nuff said
dgelqf | double general \(LQ\) factorization | ’nuff said
dormqr | double orthogonal multiply qr | multiply a matrix by the \(Q\) produced from dgeqrf
dormlq | double orthogonal multiply lq | multiply a matrix by the \(Q\) produced from dgelqf

Additional Resources

You can grab a copy of this notebook.

Even better, you can view it using nbviewer.

License

Unless otherwise noted, the contents of this notebook are under the following license. The code in the notebook should be considered part of the text (i.e., licensed and treated as follows).

DrsFenner.org Blog And Notebooks by Mark and Barbara Fenner is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Permissions beyond the scope of this license may be available at drsfenner.org/blog/about-and-contacts.