Into the Weeds I – LAPACK, LAPACKE, and Three+ Paths

In the last post, we looked at three ways of solving the least-squares regression problem (Cholesky, QR, and SVD decompositions) using LAPACK routines in a round-about fashion. The different decompositions (i.e., factorizations) were performed by (1) calling a NumPy np.linalg routine, which in turn called (2) a LAPACK driver routine for a "bigger" problem like linear-equation solving or least-squares. The driver routine performed one of the decompositions internally.

LAPACK Routines

Since the QR and SVD solutions both used a copy of our \(X\) matrix (or at least the memory usage hinted at that), I want to dive into the lower level of LAPACK (avoiding NumPy) and call the driver routines directly. We'll try to avoid doing the decompositions by hand, if we can! The most immediate issue in this series of experiments is: how do we call the LAPACK routines? Here is a table of the LAPACK routines used by the solver/least-squares routines in NumPy, SciPy, and cvxopt.

Python Library Routine      | Underlying LAPACK Routine | Matrix Type                        | Algorithm
sla.solve (w/ sym_pos=True) | posv                      | symmetric positive definite matrix | solve eqn using Cholesky
nla.solve, sla.solve        | gesv                      | square matrix                      | solve eqn using LU
cla.solve                   | gels                      | general matrix (full rank)         | least-squares via QR
sla.lstsq                   | gelss                     | general matrix                     | least-squares via SVD
nla.lstsq                   | gelsd                     | general matrix                     | least-squares via Divide-and-Conquer SVD

The different libraries make use of different LAPACK routines and, in general, each library only exposes the LAPACK routines that it needs. For example, scipy.linalg does not provide dgels (as of scipy-0.16.1); it is not listed in the scipy.linalg.lapack function list. It is almost possible to pick and choose from each library to get the desired underlying LAPACK routines, but I want to do something slightly different.
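If you want to see what your own install exposes (the exact set varies by SciPy version), a quick sanity check is to poke at scipy.linalg.lapack directly. A minimal sketch, using the double-precision routine names from the table above:

import scipy.linalg.lapack as sll

# which of the table's double-precision routines does this SciPy wrap?
for name in ["dposv", "dgesv", "dgels", "dgelss", "dgelsd"]:
    print name, hasattr(sll, name)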

LAPACK: Fortran to C (Sharks/Jets? Montague/Capulet?)

One of the biggest obstacles to using LAPACK in Python is the fact that Python (at least CPython) and NumPy (and friends) are written in C, while LAPACK is written in Fortran. This isn't a showstopper. There are a number of interoperability options out there: (1) converting the Fortran source to C (the f2c program) and (2) linking Fortran binaries into a C program. The first gives you what amounts to a fancy pure-C program. The second generally requires declaring a C function prototype for the Fortran routine you want to call (with something like extern "C" void dgels_(arg1, arg2, etc.), typically with a trailing underscore in the symbol name) and then simply calling the function. Here's a quick overview.

Declaring your own prototypes is a real drag, and it can also leave some holes in the conversion between C-land and Fortran-land. The biggest hole is the fact that Fortran stores data in column-major order, while C stores data in row-major order (wikipedia:Row-major). There are other issues as well, and having a well-defined interface between the two layers is a really good idea. BLAS (lower-level routines than LAPACK) has had one for years, called CBLAS. However, it took significantly longer for LAPACK to get a standard C interface: LAPACKE. A quick side note: CLAPACK is an f2c'd version of LAPACK, so it provides similar functionality in a different way.
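A small NumPy illustration of that storage-order difference (the values are identical; only the order of the underlying 1-D buffer changes):

import numpy as np

A_c = np.arange(6.0).reshape(2, 3)   # C order: rows are contiguous in memory
A_f = np.asfortranarray(A_c)         # same values, column-major storage

print A_c.flags['C_CONTIGUOUS'], A_f.flags['F_CONTIGUOUS']   # True True
print A_c.strides, A_f.strides                               # (24, 8) vs (8, 16)
# the underlying 1-D buffers differ: row-by-row vs column-by-column
print A_c.ravel(order='K')           # [ 0.  1.  2.  3.  4.  5.]
print A_f.ravel(order='K')           # [ 0.  3.  1.  4.  2.  5.]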

LAPACKE and Python

After all that lead-in, my goal is to use LAPACKE to call LAPACK routines, from Python. It turns out that this is really easy. py-liblapack is a tiny shim built using a Python CFFI library called cffi. Here, CFFI stands for C foreign function interface, and that acronym shows up all over the place. One example is Lisp's CFFI, where the C stands for Common. Anyway, cffi lets us do some really cool things (a minimal standalone example follows the list):

  1. Define and allocate C-level objects (variables). For example, workspace = ffi.new("int[]", n) creates a Python reference to a length-n int array
  2. Read in declarations from a .h file: header_file = open("liblapack.h"); ffi.cdef(header_file.read())
  3. Access functions defined in a C library. For example, liblapack = ffi.dlopen("libreflapacke.so") gives us dot-access (via Python attributes) to functions in libreflapacke. We can use dgels by coding: liblapack.LAPACKE_dgels.
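Here is that minimal standalone sketch, exercising all three capabilities against the C math library. The library name "libm.so.6" assumes Linux; adjust it for your platform:

from cffi import FFI

ffi = FFI()
ffi.cdef("double cos(double x);")    # (2) declare the prototype, as if read from a .h file
libm = ffi.dlopen("libm.so.6")       # (3) open a shared library (name assumes Linux)
print libm.cos(0.0)                  # dot-access to the C function -> 1.0

xs = ffi.new("double[]", 3)          # (1) allocate a C array of 3 doubles, zero-initialized
xs[1] = 2.5
print xs[0], xs[1], xs[2]            # 0.0 2.5 0.0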

These three capabilities let mikefc write a ~10-line script that exposes LAPACKE for us in Python. Granted, more work went into making a cffi-compatible set of .h files for the LAPACKE routines, but that was done by another set of Python scripts. py-liblapack exposes all of the single- and double-precision (s and d) routines from LAPACKE; it would be straightforward to also expose the complex and double-precision complex (c and z) routines.

To use the py-liblapack code, all you need are two files from the source repository: liblapack.py and liblapack.h. Then, make sure the header path and shared-library name in liblapack.py are right for your system:

import os
from cffi import FFI
ffi = FFI()

this_dir = os.path.dirname(os.path.abspath(__file__)) + "/"
with open(this_dir + "liblapack.h") as header_file:
    ffi.cdef(header_file.read())
liblapack = ffi.dlopen("libreflapacke.so")

In the examples below, I'll assume that liblapack.py lives in the same directory as my notebook, and I'll be importing it shortly.

Generic Setup and Imports

In [1]:
import operator as op

import numpy as np
import numpy.linalg as nla
import scipy.linalg as sla

from cvxopt import matrix as cvx_matrix
import cvxopt.lapack as cla

%load_ext memory_profiler
In [2]:
def create_data(n=5000, p=100):
    ''' 
        n is number cases/observations/examples
        p is number of features/attributes/variables
    '''
    X      = np.random.uniform(0,10, (n,p))
    coeffs = np.random.uniform(0,10, p)
    def f(X): return X.dot(coeffs)

    noise = np.random.normal(0.0, 2.0, n)
    Y = f(X) + noise
    Y = Y.reshape((n,1))
    
    return X,Y
In [3]:
n_obs = 2**16
n_ftrs = 2**10
X,Y = create_data(n_obs,n_ftrs)
print "about", 10**int(np.log10(n_obs * n_ftrs)), "entries"
print "about", (X.itemsize * X.size) / float(1024**2), "MB"
about 10000000 entries
about 512.0 MB

liblapack and Utilities

In [4]:
from liblapack import liblapack as lap
from cffi import FFI
ffi = FFI()

#
# numpy array to underlying C array
#
def get_dptr(arr):
    return ffi.cast("double *", arr.__array_interface__['data'][0])

def get_iptr(arr):
    return ffi.cast("int *", arr.__array_interface__['data'][0])

Cholesky via dposv

In [5]:
# posv
# http://www.netlib.org/lapack/explore-html/dc/d7f/lapacke__dposv_8c.html
# https://software.intel.com/en-us/node/520982
# 
# lapack_int LAPACKE_dposv(int matrix_layout,
#                          char uplo,
#                          lapack_int n, lapack_int nrhs,
#                          double *a, lapack_int lda,
#                          double *b, lapack_int ldb)
def sls_cholesky_orig(X, Y):
    XtX = X.T.dot(X)
    XtY = X.T.dot(Y)

    m,n = XtX.shape
    nrhs = XtY.shape[1] if XtY.ndim == 2 else 1

    info = lap.LAPACKE_dposv(lap.LAPACK_ROW_MAJOR,
                             "L", # prefer L.dot(L.T) factorization
                             n, nrhs,
                             get_dptr(XtX), n,    # ldXtX = n
                             get_dptr(XtY), nrhs) # ldXtY = nrhs
    assert not info
    return XtY[:X.shape[1]]
In [6]:
# %memit sls_cholesky_orig(X,Y)
# peak memory: 581.86 MiB, increment: 11.87 MiB
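The small increment makes sense: the normal-equations objects XtX and XtY are only p-by-p and p-by-1, tiny compared to X. A quick back-of-the-envelope for our shapes, which lines up with the ~12 MiB increment:

p = n_ftrs
print "XtX:", (p * p * 8) / float(1024**2), "MB"   # 8.0 MB
print "XtY:", (p * 8) / float(1024), "KB"          # 8.0 KB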

LU via dgesv

In [7]:
# gesv
# http://www.netlib.org/lapack/explore-html/d6/de8/lapacke__dgesv_8c.html
# https://software.intel.com/en-us/node/520973
#
#lapack_int LAPACKE_dgesv(int matrix_layout,
#                         lapack_int n, lapack_int nrhs,
#                         double *a, lapack_int lda,
#                         lapack_int *ipiv,
#                         double *b, lapack_int ldb)
def sls_lu_orig(X,Y):
    # note: gesv expects a *square* system; as called here it only sees the
    # leading n x n block of X (and the first n entries of Y), so this is an
    # exact solve of a square subsystem, not a least-squares fit over all m
    # rows (which is why it sits out the correctness check below)
    m,n = X.shape
    nrhs = Y.shape[1] if Y.ndim == 2 else 1

    workspace = ffi.new("int[]", n) # pivot index array (ipiv) filled by dgesv

    info = lap.LAPACKE_dgesv(lap.LAPACK_ROW_MAJOR,
                             n, nrhs,
                             get_dptr(X), n,    # ldX = n
                             workspace,
                             get_dptr(Y), nrhs) # ldY = nrhs

    assert not info
    return Y[:n]
In [8]:
# %memit sls_lu_orig(X,Y)
# peak memory: 576.93 MiB, increment: 6.71 MiB

QR via dgels

In [9]:
# gels
# http://www.netlib.org/lapack/explore-html/d9/da2/lapacke__dgels_8c.html
# https://software.intel.com/en-us/node/521112
#
#
#lapack_int LAPACKE_dgels(int matrix_layout,
#                         char trans,
#                         lapack_int m, lapack_int n, lapack_int nrhs,
#                         double* a, lapack_int lda,
#                         double* b, lapack_int ldb);
def sls_qr_orig(X,Y):
    m,n  = X.shape
    nrhs = Y.shape[1] if Y.ndim == 2 else 1
    info = lap.LAPACKE_dgels(lap.LAPACK_ROW_MAJOR,
                             "N", # do not transpose
                             m, n, nrhs,
                             get_dptr(X), n,
                             get_dptr(Y), nrhs)

    assert not info
    return Y[:n]
In [10]:
# %memit sls_qr_orig(X,Y)
# peak memory: 1083.94 MiB, increment: 514.06 MiB
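Note that the ~514 MiB increment is almost exactly the size of X itself (512 MiB, computed earlier), which echoes the suspicion from the intro that a full copy of X is being made somewhere along the way:

print (X.itemsize * X.size) / float(1024**2), "MB"   # 512.0 MB, roughly the whole increment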

QR via dgels: Take Two

In [11]:
def sls_qr_minwork(X,Y):
    m,n  = X.shape
    nrhs = Y.shape[1] if Y.ndim == 2 else 1
    

    # http://www.netlib.org/lapack/explore-html/d8/dde/dgels_8f_source.html
    # min LWORK >= max( 1, MN + max( MN, NRHS ) ), where MN = min(M, N)
    mn = min(m, n)
    work_size = max(1, mn + max(mn, nrhs))
    work = np.zeros(work_size)
    
    info = lap.LAPACKE_dgels_work(lap.LAPACK_ROW_MAJOR,
                                 "N", # do not transpose
                                 m, n, nrhs,
                                 get_dptr(X), n,
                                 get_dptr(Y), nrhs,
                                 get_dptr(work), work_size)

    assert not info
    return Y[:n]
In [12]:
# %memit sls_qr_minwork(X,Y)
#peak memory: 1084.91 MiB, increment: 517.84 MiB
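For the record, the documented minimum workspace is tiny for our shapes, so the workspace itself can't account for the large memory increment. A quick calculation, using the LWORK formula quoted in the code:

mn = min(n_obs, n_ftrs)                # MN = min(M, N) = 1024
min_lwork = max(1, mn + max(mn, 1))    # NRHS = 1  ->  2048 doubles
print min_lwork, "doubles =", (min_lwork * 8) / 1024.0, "KB"   # 16.0 KB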

SVD via dgelss

In [13]:
# gelss
#
#
# http://www.netlib.org/lapack/explore-html/db/d77/lapacke__dgelss_8c.html
# https://software.intel.com/en-us/node/521114
#
#
#lapack_int LAPACKE_dgelss(int matrix_layout, 
#                          lapack_int m, lapack_int n, lapack_int nrhs, 
#                          double* a, lapack_int lda, 
#                          double* b, lapack_int ldb, 
#                          double* s, 
#                          double rcond, 
#                          lapack_int* rank )
#
def sls_svd_dgelss(X,Y):
    m,n = X.shape
    nrhs = Y.shape[1] if Y.ndim == 2 else 1

    singular_values = np.empty(min(m,n))
    rank = np.zeros(1, dtype=np.intc)  # lapack_int is a plain C int here

    info = lap.LAPACKE_dgelss(lap.LAPACK_ROW_MAJOR,
                              m,n,nrhs,
                              get_dptr(X), n,
                              get_dptr(Y), nrhs,
                              get_dptr(singular_values),
                              -1,  # rcond < 0: use machine precision for the rank cutoff
                              get_iptr(rank))
    assert not info
    return Y[:n]
In [14]:
# %memit sls_svd_dgelss(X,Y)
# peak memory: 1083.97 MiB, increment: 513.99 MiB

SVD via dgelsd

In [15]:
# gelsd
#
# http://www.netlib.org/lapack/explore-html/df/de3/lapacke__dgelsd_8c.html
# https://software.intel.com/en-us/node/521115
#
#
#lapack_int LAPACKE_dgelsd(int matrix_layout,
#                          lapack_int m, lapack_int n, lapack_int nrhs,
#                          double* a, lapack_int lda,
#                          double* b, lapack_int ldb,
#                          double* s,
#                          double rcond,
#                          lapack_int* rank );
#
def sls_svd_orig(X,Y):
    m,n = X.shape
    nrhs = Y.shape[1] if Y.ndim == 2 else 1

    singular_values = np.empty(min(m,n))
    rank = np.zeros(1, dtype=np.intc)  # lapack_int is a plain C int here

    info = lap.LAPACKE_dgelsd(lap.LAPACK_ROW_MAJOR,
                              m,n,nrhs,
                              get_dptr(X), n,
                              get_dptr(Y), nrhs,
                              get_dptr(singular_values),
                              -1,  # rcond < 0: use machine precision for the rank cutoff
                              get_iptr(rank))
    assert not info
    return Y[:n]
In [16]:
# %memit sls_svd_orig(X,Y)
#peak memory: 1084.62 MiB, increment: 514.18 MiB

Check for Correctness

Here, we test the routines to make sure they all give the same \(\beta\)s. Note that we make copies because most of these routines modify X and Y in place. This differs a bit from the semantics of earlier posts, but it isn't terribly important because, within this post, we are still comparing apples to apples. When we go back to the outside world, we may still need to make copies, depending on our use cases for X and Y.

In [17]:
_X,_Y = create_data(n_obs,n_ftrs)

X, Y = np.copy(_X), np.copy(_Y)
betas_chol = sls_cholesky_orig(X,Y)

X, Y = np.copy(_X), np.copy(_Y)
betas_qr = sls_qr_orig(X,Y)

X, Y = np.copy(_X), np.copy(_Y)
betas_svd = sls_svd_orig(X,Y)

betas = [betas_chol, betas_qr, betas_svd]
print "All same betas?:", all(np.allclose(b1, b2) for b1 in betas[:-1] for b2 in betas[1:])
All same betas?: True

And Do a Quick Timing Run

In [18]:
%timeit sls_cholesky_orig(X,Y)
%timeit sls_lu_orig(X,Y)
%timeit sls_qr_orig(X,Y)
%timeit sls_qr_minwork(X,Y)
%timeit sls_svd_dgelss(X,Y)
%timeit sls_svd_orig(X,Y)
1 loops, best of 3: 1.54 s per loop
10 loops, best of 3: 32.6 ms per loop
1 loops, best of 3: 8.99 s per loop
1 loops, best of 3: 8.77 s per loop
1 loops, best of 3: 9.17 s per loop
1 loops, best of 3: 9.19 s per loop

Summary Results

Algorithm      | Called Via | Memory Increment (MB) | ~Time (s)
Cholesky       | dposv      | 12                    | 1.5
LU             | dgesv      | 7                     | >0.0
QR             | dgels      | 514                   | 9
QR (min LWORK) | dgels      | 517                   | 9
SVD            | dgelss     | 514                   | 9
SVD            | dgelsd     | 514                   | 9

As you can see, things are still not "right". There are some hints scattered through the code and discussion above as to what the remaining issue is. I won't spill the beans, but stay tuned for next time.

Additional Resources

You can grab a copy of this notebook.

Even better, you can view it using nbviewer.

License

Unless otherwise noted, the contents of this notebook are under the following license. The code in the notebook should be considered part of the text (i.e., licensed and treated as follows).

Creative Commons License
DrsFenner.org Blog And Notebooks by Mark and Barbara Fenner is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Permissions beyond the scope of this license may be available at drsfenner.org/blog/about-and-contacts.