In the last post, we tried to use lower-level LAPACK/LAPACKE calls to remove some of the copying overhead of our operations. Well, it turns out that it didn’t work out so well (see the table at the end of that post). The fundamental reason is that LAPACKE took the easy way out when it wrapped the LAPACK functions: whenever there is a C-/Fortran-ordering mismatch (see the last post for a reminder), LAPACKE simply makes a copy.
Diatribe
Honestly, I can’t believe it. LAPACK is it when it comes to high-performance numerical libraries. And the "major, official" C-interface for LAPACK PUNTS on performance and makes a copy? In the modern world, Fortran is more and more of a historical footnote (outside of numerical analysis), C is almost a lingua franca underneath many other programming systems, data is more likely to live in row-major order ("natively"), and "FLOPS are free" (memory is the bottleneck in many cases) … well, I won’t go on. The point is, if we – as engineers – want to make use of LAPACK proper and we live in C-land (which includes, at least, CPython), we have to make adjustments. Here, we’ll look at what those adjustments get us.
BTW, we (in C-land) don’t have to clutch at LAPACK forever. There are other libraries out there – unfortunately, some of them are so wedded to LAPACK that they, too, keep an emphasis on column-major data. Two that are more flexible – or at least allow row- or column-major data to be used with high-performance numerical algorithms without copying – are libFLAME and FLENS. Unfortunately, FLENS missed a great opportunity to move beyond the FORTRAN77 names (six characters only) of raw BLAS and LAPACK. *sigh* Someone should build a readable interface around the FLENS library. One other, older project that looked interesting while I was researching these posts is Meschach, which is a pure C numerical linear algebra library.
One last comment: all hope is not lost for C-to-LAPACK. Here are some hints of progress in the slow-moving world of "building a better mousetrap" for numerical linear algebra. Two are from the GSL (GNU Scientific Library) mailing list and one from the LAPACK mailing list itself:
- https://sourceware.org/ml/gsl-discuss/2008-q3/msg00028.html
- https://sourceware.org/ml/gsl-discuss/2008-q3/msg00029.html
- and, especially, http://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=14&t=4469
I think part of the issue is that the best people in the world for implementing (new) numerical methods – numerical methods researchers developing new, faster, more efficient, more stable algorithms – don’t actually use those methods very much. They don’t go on to write other, bigger systems in which the numerical method is "just" one cog. They quickly move on to making new, better numerical methods. And we should be thankful! (I am!) But, they implement the new methods in the language they’ve always used: F77. Then, they dump them in LAPACK and move on to the next research project. I’m not actually being critical. It’s just how the game works. But, for those of us on the application/user side of the coin, we’re left wanting.
Enough talk. Let’s fight. Err, sorry about that. What I meant was: let’s see if we can actually remove the copy. Here’s some example code from LAPACKE’s _work layer that demonstrates what happens (see lapacke_dgelsd_work, lines 43-86).
if( matrix_layout == LAPACK_COL_MAJOR ) {
    /* Call LAPACK function and adjust info */
    ...
} else if( matrix_layout == LAPACK_ROW_MAJOR ) {
    ...
    /* Allocate memory for temporary array(s) */
    a_t = (double*)LAPACKE_malloc( sizeof(double) * lda_t * MAX(1,n) );
    ...
    b_t = (double*)LAPACKE_malloc( sizeof(double) * ldb_t * MAX(1,nrhs) );
    ...
    /* Transpose input matrices */
    LAPACKE_dge_trans( matrix_layout, m, n, a, lda, a_t, lda_t );
    LAPACKE_dge_trans( matrix_layout, MAX(m,n), nrhs, b, ldb, b_t, ldb_t );
    /* Call LAPACK function and adjust info */
    ...
}
So, the punch-line is: if we want to avoid a transpose-copy, then we must call LAPACKE with a LAPACK_COL_MAJOR matrix. The easiest way to do that is to create our data in column-major (Fortran) order. In some respects, this also punts on the issue. If we have a large, C-based system, we can expect to work with our data in row-major order. But we’ll take a mulligan there and look at it in another post. As a hint: can we compute the decomposition of a transposed matrix and still answer questions about the original, non-transposed matrix? But that’s for next time. Let’s see what happens when we use Fortran-happy data.
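A minimal aside before we spin up (nothing here is specific to our LAPACK binding): NumPy will happily hand us column-major data, and the flags attribute tells us which layout we actually have.
import numpy as np
A = np.random.uniform(0, 10, (4, 3))   # C-ordered (row-major) by default
A_f = np.asfortranarray(A)             # copies into column-major storage
print A.flags['C_CONTIGUOUS']          # True
print A_f.flags['F_CONTIGUOUS']        # True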
Spin Up
import operator as op
import numpy as np
import numpy.linalg as nla
import scipy.linalg as sla
from cvxopt import matrix as cvx_matrix
import cvxopt.lapack as cla
from liblapack import liblapack as lap
from cffi import FFI
ffi = FFI()
#
# numpy array to underlying C array
#
def get_dptr(arr):
    return ffi.cast("double *", arr.__array_interface__['data'][0])
def get_iptr(arr):
    return ffi.cast("int *", arr.__array_interface__['data'][0])
%load_ext memory_profiler
Data Creation (with an Ecumenical Flavor)
I was going to call this agnostic data creation, but really, it is creating data in either row- or column-major order. So, ecumenical felt right. Here goes. We have to rely on the correct interactions between np.dot and NumPy’s shape rules to properly apply our model (i.e., the coefficients) to our data. On lines 14 and 18, the net effect of swapping X.dot(coeffs) for coeffs.dot(X) is that we get \(row \times col\) in the first case and \(row \times row\) in the second. Two other notes on the code: (1) the single-line def’s are not often seen in Python – that form also works for any :-terminated, compound-statement starter – and (2) on line #22, the argument to Y.reshape is selected by a ternary expression that picks the shape. This keeps a nicely parallel X.T, Y.T in the return statement. If you can’t see line numbers when you read this post, just load the raw notebook (there’s a link at the bottom), navigate to this cell, and hit Control-M followed by L.
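Here’s that shape arithmetic in miniature (throwaway sizes, illustrative only):
p, n = 3, 6
coeffs = np.ones(p)
print np.ones((n, p)).dot(coeffs).shape   # (6,)  row x col: (n,p) . (p,)
print coeffs.dot(np.ones((p, n))).shape   # (6,)  row x row: (p,) . (p,n)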
def create_data(n=5000, p=100, order="C"):
    '''
    n is number of cases/observations/examples
    p is number of features/attributes/variables
    order = "C" for row-major data; "F" or "T" for fortran/transposed
            (column-major) data
    '''
    assert order in {"F", "T", "C"}
    coeffs = np.random.uniform(0, 10, p)
    if order == "C":
        X = np.random.uniform(0, 10, (n, p))
        # row x col
        def f(X): return X.dot(coeffs)
    else:
        X = np.random.uniform(0, 10, (p, n))
        # row x row
        def f(X): return coeffs.dot(X)
    noise = np.random.normal(0.0, 2.0, n)
    Y = f(X) + noise
    Y = Y.reshape((n, 1) if order == "C" else (1, n))  # too cute
    return (X, Y) if order == "C" else (X.T, Y.T)
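A quick check with toy sizes (my own illustration): both branches hand back the same shapes, just with different underlying layouts. Remember that the transpose of a C-contiguous array is an F-contiguous view, so the Fortran path gets column-major data without any copying.
Xt, Yt = create_data(n=6, p=3, order="C")
Xf, Yf = create_data(n=6, p=3, order="F")
print Xt.shape == Xf.shape and Yt.shape == Yf.shape        # True: (6, 3) and (6, 1) both ways
print Xt.flags['C_CONTIGUOUS'], Xf.flags['F_CONTIGUOUS']   # True True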
#
# Don't Forget Us!
#
n_obs = 2**16
n_ftrs = 2**10
Go Time
There are a few arguments that we have to pull out of X and Y for each LAPACKE call. Pulling them out in one helper makes life easier below, and it also lets us massage away order problems in one spot.
#
# don't repeat yourself (not lazy - efficient! at least for the programmer)
#
def make_common_args(X, Y, order="C"):
    m, n = X.shape
    nrhs = Y.shape[1] if Y.ndim == 2 else 1
    if order == "C":
        # row-major: the leading dimension is the length of one row
        ldX, ldY, la_order = n, nrhs, lap.LAPACK_ROW_MAJOR
    else:
        # column-major: the leading dimension is the length of one column
        ldX, ldY, la_order = m, m, lap.LAPACK_COL_MAJOR
    return m, n, nrhs, ldX, ldY, la_order
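For example (toy sizes again, illustrative only), the leading dimensions come out like this:
Xt, Yt = create_data(n=6, p=3, order="C")
print make_common_args(Xt, Yt, "C")[:5]   # (6, 3, 1, 3, 1): ldX = n, ldY = nrhs
Xf, Yf = create_data(n=6, p=3, order="F")
print make_common_args(Xf, Yf, "F")[:5]   # (6, 3, 1, 6, 6): ldX = ldY = m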
And now, a couple of short routines that use the underlying decompositions via LAPACK routines whose primary duty is solving least-squares. In sls_svd, we allocate some memory on the heap to store the singular values. LAPACK itself (not LAPACKE) goes to great lengths to avoid this: "all" (as near as I can tell) memory must be allocated by the parent caller – hence the need for WORK arrays.
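To see that WORK-array dance concretely, here’s a sketch of the classic two-phase workspace query. It assumes our liblapack binding also exposes LAPACKE_dgels_work (a standard LAPACKE entry point, but one I haven’t wrapped in this notebook, so treat this as illustrative):
def sls_qr_explicit_work(X, Y, order="F"):
    m, n, nrhs, ldX, ldY, la_order = make_common_args(X, Y, order)
    # phase 1: lwork = -1 is a workspace query; dgels writes its
    # preferred workspace size into work[0] and does no real work
    query = np.empty(1)
    info = lap.LAPACKE_dgels_work(la_order, "N", m, n, nrhs,
                                  get_dptr(X), ldX, get_dptr(Y), ldY,
                                  get_dptr(query), -1)
    assert not info
    # phase 2: *we* allocate the workspace and call again for real;
    # LAPACK itself never mallocs behind our back
    work = np.empty(int(query[0]))
    info = lap.LAPACKE_dgels_work(la_order, "N", m, n, nrhs,
                                  get_dptr(X), ldX, get_dptr(Y), ldY,
                                  get_dptr(work), len(work))
    assert not info
    return Y[:n]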
def sls_qr(X, Y, order="C"):
    m, n, nrhs, ldX, ldY, la_order = make_common_args(X, Y, order)
    info = lap.LAPACKE_dgels(la_order,
                             "N",  # do not transpose
                             m, n, nrhs,
                             get_dptr(X), ldX,
                             get_dptr(Y), ldY)
    assert not info
    # dgels overwrites Y; the solution sits in its first n rows
    return Y[:n]
def sls_svd(X, Y, order="C"):
    m, n, nrhs, ldX, ldY, la_order = make_common_args(X, Y, order)
    singular_values = np.empty(min(m, n))
    rank = np.zeros(1, dtype=np.intc)  # a C int, to match get_iptr's "int *" cast
    info = lap.LAPACKE_dgelsd(la_order,
                              m, n, nrhs,
                              get_dptr(X), ldX,
                              get_dptr(Y), ldY,
                              get_dptr(singular_values),
                              -1,  # rcond < 0: use machine precision as the cutoff
                              get_iptr(rank))
    assert not info
    # dgelsd overwrites Y; the solution sits in its first n rows
    return Y[:n]
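Before the full side-by-side comparison below, here’s a small cross-check against NumPy’s own least-squares driver (nla.lstsq, which also sits on top of dgelsd). Illustrative sizes only, and we copy first because our routine overwrites its inputs:
_Xs, _Ys = create_data(64, 8)
betas_np = nla.lstsq(_Xs, _Ys)[0]
Xs, Ys = np.copy(_Xs), np.copy(_Ys)            # sls_svd clobbers X and Y
print np.allclose(betas_np, sls_svd(Xs, Ys))   # True (up to floating-point fuzz)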
Same Results?
So, let’s run each of these routines with row- and column-major data and verify we get the same output.
#
# we have to copy the data so we can use the same data in the next call:
# lapack/e overwrites its inputs with the results (decomps and betas)
#
def copy_and_do(func, _X, _Y, order="C"):
    X, Y = np.copy(_X, order), np.copy(_Y, order)
    return func(X, Y, order)
_X,_Y = create_data(n_obs, n_ftrs)
betas_qr_c = copy_and_do(sls_qr,_X,_Y)
betas_qr_f = copy_and_do(sls_qr,_X,_Y, "F")
betas_svd_c = copy_and_do(sls_svd,_X,_Y)
betas_svd_f = copy_and_do(sls_svd,_X,_Y, "F")
betas = [betas_qr_c, betas_qr_f, betas_svd_c, betas_svd_f]
print "All same betas?:", all(np.allclose(b1, b2) for b1 in betas[:-1] for b2 in betas[1:])
Memory Usage
# X,Y = create_data(n_obs, n_ftrs)
# %memit sls_qr( X,Y) # peak memory: 1084.82 MiB, increment: 513.27 MiB
# %memit sls_svd(X,Y) # peak memory: 1600.71 MiB, increment: 514.94 MiB
# X,Y = create_data(n_obs, n_ftrs, order='F')
# %memit sls_qr( X,Y, order='F') # peak memory: 574.21 MiB, increment: 3.11 MiB
# %memit sls_svd(X,Y, order='F') # peak memory: 574.00 MiB, increment: 2.90 MiB
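Those increments are no accident. A transpose-copy of X is \(2^{16} \times 2^{10}\) doubles at 8 bytes each, and copying Y adds another \(2^{16}\) doubles:
print (n_obs * n_ftrs * 8) / 2.0**20   # 512.0 MiB for X's transpose-copy
print (n_obs * 8) / 2.0**20            # ...plus ~0.5 MiB for Y's
That 512.5 MiB lines up almost exactly with the ~513 MiB increments in the row-major calls, and it’s what vanishes in the column-major ones.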
And now, it’s time for a little happy dance: we avoided copies!
Timing
_X,_Y = create_data(n_obs, n_ftrs)
# forcing copies since all of these calls place the results in the input (overwrite our data)
X,Y = np.copy(_X), np.copy(_Y)
%timeit sls_qr(X,Y)
X,Y = np.copy(_X), np.copy(_Y)
%timeit sls_svd(X,Y)
# another way to get 'F' ordered data
X,Y = np.copy(_X, "F"), np.copy(_Y, "F")
%timeit sls_qr(X,Y, "F")
X,Y = np.copy(_X, "F"), np.copy(_Y, "F")
%timeit sls_svd(X,Y, "F")
Summary
So, at long last, we finally got rid of the (large) copy operation and are simply doing the work of the algorithm. Both QR and SVD still take "a lot" of time (especially compared to Cholesky, which was about 1.5s in the last post). But Cholesky is far less stable than QR and SVD. SVD is generally overkill for solving least-squares, but sometimes rounding (numerical) error can result in a singular (non-invertible) \(R\), and then SVD is necessary. An analog of those small errors appears in SVD calculations too, but buried inside the SVD algorithm is a parameter that determines when "small" becomes "zero" (in our sls_svd, that’s the rcond argument we passed as -1, which tells dgelsd to use machine precision). Our next step is to look at ways of dealing with the dreaded copy without leaving C-land (or at least only leaving it when absolutely necessary). Then, we will look at the underlying algorithms themselves (oooooo, spooky!).
Additional Resources
You can grab a copy of this notebook.
Even better, you can view it using nbviewer.
License
Unless otherwise noted, the contents of this notebook are under the following license. The code in the notebook should be considered part of the text (i.e., licensed and treated as follows).
DrsFenner.org Blog And Notebooks by Mark and Barbara Fenner is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Permissions beyond the scope of this license may be available at drsfenner.org/blog/about-and-contacts.