SVD Computation Capstone

At long last, I’ve gotten to my original goal: writing a relatively easy-to-understand version of your plain, vanilla SVD computation. I’m going to spend just a few minutes glossing (literally) over the theory behind the code and then dive into an implementation that builds on the previous posts. Thanks for reading!

Words of Thanks and Caution

I had really hoped to minimize the hand-waving that I’m about to do. To really get into this, you need to start by reading pages 390 to 450 of Golub & Van Loan (3rd ed.) – definitely sections 8.1-8.3 before starting on 8.6. And, even if you do that, you’ll probably be at a loss as to what the heck it all means – if you don’t bail for beer-thirty, first. Atkinson, Intro. to Numerical Analysis (2nd ed.), pages 587-633, was also very helpful in understanding the theory behind the code (and it was a trip down memory lane to my numerical analysis course, fifteen+ years ago).

I also found a few resources online that smoothed out some of (my?) rough edges.

I’m saying nothing about the applications (or a general description) of SVD in this post. There are hundreds (or more) of descriptions out there. Application-wise, I’m most interested in SVD for regression problems, but there are many other uses. This post is interested in calculating the SVD. And that requires a bit of theory.

I wanted to actually demonstrate (you know, with some running programs) the similarities between a "long winded" (non-specialized QR – see below) and the "short and sweet" SVD calculation. I may do that in a later post. But for now, we’ll have to be satisfied with a verbal explanation (aka, hand-waving). A warning: the following paragraphs summarize some 40-50 pages of mathematical textbook presentation. I probably made a couple of mistakes and left an important concept by the wayside. However, the code that follows should be reasonably clear and is (somewhat) tested. I just wanted to present the mathematical theory with a minimum of, ahem, theory.

Overview of QR Based SVD Calculations

Here goes:

It is fairly easy (ha!) to solve the symmetric eigenvalue problem. Given a symmetric matrix, \(A=A^T\), we can diagonalize it with an orthogonal similarity transformation \(Q^T_DAQ_D=D\) with \(D\) diagonal. This is usually done in two steps: (1) \(A\rightarrow Q^T_{Tridi}AQ_{Tridi}=T\) with \(T\) symmetric and tridiagonal, followed by (2) applying the \(QR\)-iteration to \(T\). Step (1) is done with a series of Householder reflections. What about step (2)?

The \(QR\)-iteration works like this. Using a QR decomposition, let \(T=T_1 = Q_{q_1} R_{r_1}\) and then construct \(T_2=R_{r_1} Q_{q_1}\) by multiplying in the reverse order. We can eliminate \(R_{r_1}\) from that definition of \(T_2\): since \(R_{r_1}=Q^T_{q_1}T_1\), we get \(T_2 = Q^T_{q_1} T_1 Q_{q_1}\). Even better, that is a similarity transformation from \(T_1 \rightarrow T_2\), which means they share eigenvalues. If we repeat this process many times, \(T_i\) will (under mild conditions) converge to a diagonal matrix, since \(A\) was symmetric.
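To make that concrete, here is a minimal, unshifted sketch of the iteration (an illustration only; the real algorithm below uses shifts and deflation and runs on the reduced, tridiagonal/bidiagonal form):

import numpy as np

def naive_qr_iteration(T, steps=200):
    # repeatedly factor T = QR and re-multiply as RQ;
    # each step is an orthogonal similarity transform, so eigenvalues are preserved
    for _ in range(steps):
        Q, R = np.linalg.qr(T)
        T = R.dot(Q)
    return T

A = np.array([[4.0, 1, 0],
              [1, 3, 1],
              [0, 1, 2]])   # symmetric (and already tridiagonal)
T_final = naive_qr_iteration(A)
print(np.allclose(sorted(np.diag(T_final)), sorted(np.linalg.eigvalsh(A))))  # True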

A side note: in general, for a not-necessarily-symmetric \(A\), the initial reduction is to a Hessenberg matrix (zero below the first subdiagonal; i.e., an upper triangular matrix extended to include the first subdiagonal) and the \(QR\)-iteration converges to an upper triangular matrix.

The summary of this is that if \(Q_{qr}=\prod_i Q_{q_i}\), then \(Q^T_{qr}Q^T_{Tridi} A Q_{Tridi} Q_{qr} = D\) is diagonal. So, the original \(Q\) we sought is \(Q_{Tridi}Q_{qr}\). Oh yeah, since this is all one big similarity transformation, the eigenvalues of \(A\) and \(D\) are the same: \(\mathrm{eig}(A)=\mathrm{eig}(D)\). So, diagonalization of a symmetric matrix gets us the eigenvalues of the matrix.

One other note: the tridiagonal matrix that is input to the \(QR\)-iteration needs to have non-zero sub-diagonal (the diagonal below the main diagonal) elements. If any of them are zero, the \(QR\)-iteration will not necessarily converge. However, those zeros allow us to split the problem into sub-problems above and below the row with the zero off-diagonal entry. In that case, the eigenvalues of the whole matrix are the union of the eigenvalues of the sub-problems. Easy enough.
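A quick, made-up illustration of that decoupling:

import numpy as np

# a symmetric tridiagonal matrix with a zero off-diagonal at (1,2)/(2,1)
T = np.diag([5.0, 4, 3, 2]) + np.diag([1.0, 0, 1], 1) + np.diag([1.0, 0, 1], -1)
upper, lower = T[:2, :2], T[2:, 2:]      # the two decoupled sub-problems
combined = np.concatenate([np.linalg.eigvalsh(upper), np.linalg.eigvalsh(lower)])
print(np.allclose(sorted(np.linalg.eigvalsh(T)), sorted(combined)))  # True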

Wait. Why Do We Care?

Well, for a matrix \(A\): \(\mathrm{singular\_values}(A)^2=\mathrm{eig}(A^TA)=\mathrm{eig}(AA^T)\) (the two eigenvalue lists agree on the non-zero eigenvalues; the larger product simply has extra zeros when \(A\) is rectangular). If we have the SVD of \(A\), \(A = UDV^T\), we can see that \(A^TA=(UDV^T)^T UDV^T= V D^T U^T U D V^T = V D^T D V^T = V D^2 V^T\). And we have the squares of \(D\)'s values in the last equality.
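A quick numerical check of that relationship (any rectangular matrix will do; this one is arbitrary):

import numpy as np

rng = np.random.RandomState(1)
A = rng.rand(5, 3)                                       # an arbitrary rectangular matrix
sing_vals = np.linalg.svd(A, compute_uv=False)
evals = np.linalg.eigvalsh(A.T.dot(A))                   # eigenvalues of A^T A
print(np.allclose(sorted(sing_vals**2), sorted(evals)))  # True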

Fortunately for us, \(A^TA\) is symmetric. Unfortunately, we can’t just apply the above eigenvalue computation to \(A^TA\), because we run into the same condition number (numerical stability) and memory use problems we would hit when solving the normal equations directly. So … what do we do?
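Here is a tiny illustration of the conditioning half of that problem (a made-up matrix; the exact numbers don't matter, only that the second condition number is roughly the square of the first):

import numpy as np

A = np.vander(np.linspace(0.0, 1.0, 8), 6)   # a moderately ill-conditioned 8x6 matrix
print(np.linalg.cond(A))                     # kappa(A)
print(np.linalg.cond(A.T.dot(A)))            # roughly kappa(A)**2 -- much worse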

Let’s break up the symmetric eigenvalue problem into these steps. With an input matrix \(A\):

  1. Reduce \(A\) to tridiagonal \(T\),
  2. Perform the first step of the \(QR\) iteration to find and apply \(Q_{q_1}\),
  3. Perform the remainder of \(QR\) iteration until convergence, finding \(Q_{q_i}\) as in step 2.

The total transformations on \(A\) come from the reduction in step 1 followed by repeating the \(QR\) step over and over. Let \(Q = Q_{Tridi}\prod_i Q_{q_i}\). Then \(Q^T A Q=D\).
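Here is a sketch of those three steps and the accumulated \(Q\) (an illustration only; it leans on scipy.linalg.hessenberg for the reduction, since the Hessenberg form of a symmetric matrix is exactly the tridiagonal form, and it uses the naive unshifted iteration from above rather than the real, shifted one):

import numpy as np
from scipy.linalg import hessenberg

A = np.array([[4.0, 1, 2],
              [1, 3, 1],
              [2, 1, 2]])                 # symmetric input

# step 1: reduce to tridiagonal T, with A = Q_tri T Q_tri^T
T, Q_tri = hessenberg(A, calc_q=True)

# steps 2-3: QR-iterate on T, accumulating each step's Q
Q_acc = np.eye(3)
for _ in range(200):
    Q, R = np.linalg.qr(T)
    T = R.dot(Q)
    Q_acc = Q_acc.dot(Q)

Q_total = Q_tri.dot(Q_acc)                # Q = Q_Tridi * prod_i(Q_qi)
D = np.diag(np.diag(T))                   # T has (numerically) converged to diagonal
print(np.allclose(Q_total.T.dot(A).dot(Q_total), D))  # True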

Here are the equivalent steps for the SVD of an input matrix \(A\) (we need \(\mathrm{eig}(A^TA)\)):

  1. Reduce \(A\) to bidiagonal \(B\) (note, \(B^TB\) is tridiagonal) using \(Q_{Bidiag}\),
  2. Do some fancy footwork to compute \(Q_{q_1}\) as above (we only need parts of \(B^TB\), not the whole thing) and then apply \(Q_{q_1}\) to \(B\),
  3. \(B Q_{q_1}\) is no longer bidiagonal. Use bounce_blemish_down_diagonal (from the Givens/Blemish post) to restore it to blissful bidiagonality. Store or accumulate the products of the left \(G_{L_i}\) and right \(G_{R_i}\) Givens transformations that bouncing uses. You can see in bounce_blemish_down_diagonal that these come in (left, right) pairs followed by a last left,
  4. Perform the remainder of this iteration until convergence, finding \(Q_{q_i}\), \(G_{L_i}\), and \(G_{R_i}\) as in steps 2 and 3.

The total transformations on \(B\) come from repeating steps 2 and 3 over and over. Let \(L = \prod_i G_{L_i}\) and let \(R\) collect, for each iteration, that iteration’s \(Q_{q_i}\) followed by its right Givens rotations: \(R = \prod_i \big(Q_{q_i} \prod_j G_{R_{i,j}}\big)\). Then, more or less, \(L^T B R = D\). Depending on your specific accounting/accumulating and the return values you need, you might want different transposes in there. We often care about the form \(A=L_{all} D R^T_{all}\). These are the values returned by, for example, LAPACK’s SVD routines (and hence, NumPy, SciPy, MATLAB, etc.). We get that by wrapping up (multiplying) the transformations that gave us \(A \rightarrow B\) with the accumulated \(QR\)-step transformations to form \(L_{all}\) and \(R_{all}\). For example, \(L_{all}=Q_{Bidiag}\,L\), where \(Q_{Bidiag}\) is the left factor from the bidiagonalization.

By Hook or Crook: Essentially Similar

By a wonderful theorem (the Implicit Q Theorem), since we started out with \(Q_{q_1}\) in both cases (and \(B\) and \(T\) were "full"/unreduced, and both processes end up working with a tridiagonal – see G&VL for details), the entire sequences of transformations (generated using \(A^TA\) in the QR process and using \(A\) plus fancy footwork in the SVD process) will be "essentially similar," meaning they differ only by some signs on the diagonals.

If we take \(T=B^TB\) at the start, then an update to \(T_{i+1}=Q^T_i T_i Q_i\) is related to \(B_i\) by applying the same \(Q_i\) to \(B_i\) and then "fixing up" \(B_i\) with an orthogonal \(U_i\) to maintain its bidiagonal form. So, \(B_{i+1}=U^T_i B_i Q_i\).

\[T_{i+1}=Q^T_i T_i Q_i = Q^T_i B^T_i B_i Q_i = Q^T_i B^T_i U_i U^T_i B_i Q_i
= (U^T_i B_i Q_i)^T (U^T_i B_i Q_i) = B^T_{i+1} B_{i+1}\]
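A quick numerical check of that relationship, with arbitrary orthogonal stand-ins for \(U_i\) and \(Q_i\):

import numpy as np

rng = np.random.RandomState(0)
B = np.triu(rng.rand(4, 4)) * (np.eye(4) + np.eye(4, k=1))  # a random upper bidiagonal B_i
U, _ = np.linalg.qr(rng.rand(4, 4))                         # any orthogonal U_i
Q, _ = np.linalg.qr(rng.rand(4, 4))                         # any orthogonal Q_i

B_next = U.T.dot(B).dot(Q)                                  # B_{i+1} = U_i^T B_i Q_i
T_next = Q.T.dot(B.T.dot(B)).dot(Q)                         # T_{i+1} = Q_i^T T_i Q_i
print(np.allclose(B_next.T.dot(B_next), T_next))            # True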

It’s pretty clear that the pairs \(L,R\) (constructed by the algorithm above) and \(Q,U\) (which come from a "proof of possibility," i.e., an existence claim) are not directly related on a step-by-step basis. The existence claim is demonstrated by a series of equalities, and it simply says that some \(U,Q\) can maintain that equality between \(B^T_iB_i\) and \(T_i\). The ones we find with our QR and SVD iterations are different. Most importantly, our "fixing up" is not all in \(U\) (the orthogonal matrix on the interior of the expression); fixing up happens in both transformations, \(L\) and \(R\). Regardless, once we gather them all up, they (\(L\) and \(Q\)) become essentially similar and, importantly, they lead to the correct \(D\).

Roll Call

We are going to make use of many of the friends that we developed over the past several posts. For simplicity, I placed them (and an occasional modification or two) in ghbs_utils.py. That file is available at: http://drsfenner.org/public/notebooks/additional/ghbs_utils.py. Because of this import, I don’t believe this notebook will run (unmodified) on nbviewer.ipython.org. But you can still download it and run it; see the link at the bottom of this post.

In [1]:
import numpy as np
import numpy.linalg as nla
from pprint import pprint

np.set_printoptions(suppress=True)
In [2]:
# ghbs_utils.py available at:  
# http://drsfenner.org/public/notebooks/additional/ghbs_utils.py
from ghbs_utils import (bounce_blemish_down_diagonal, decouple_or_reduce,
                        zeroing_givens_coeffs, left_givensT, right_givens, 
                        apply_givens_lefts_on_right, apply_givens_rights_on_left,
                        make_upper_bidiag, extract_upper_bidiag,
                        house_bidiag, extract_packed_house_bidiag)

SVD Helpers I: Finding Last Non-Zero Runs in a Sequence

When we apply the \(QR\)-iteration above, we need a "full" (all non-zero) super-diagonal. We can find one by looking for a run of non-zeros, working backwards from the end of the super-diagonal. The following code does that.

In [3]:
def find_first_run_boolean(vec, front=True):
    ''' use front=False if you will want a run from the end'''
    lbl = xrange(len(vec))
    if not front:
        vec = reversed(vec)
        lbl = reversed(lbl)
    
    foundStart = False
    for idx, itm in zip(lbl, vec):
        if foundStart:
            if itm:
                end = idx
            else:
                break
        else:
            if itm:
                start = end = idx
                foundStart = True
    if not foundStart:
        return slice(0,0) # equiv to slice(None, 0) [empty]
    if not front:
        start, end = end, start # if we counted from end, swap
    return slice(start, end+1)  # allow for slicing use
In [4]:
testcases = [(np.array([0,0,1,2,3,4,5,19,20]), slice(2,9,None)),
             (np.zeros(5), slice(0,0,None)),
             (np.array([1,2,3,0,0,4,5]), slice(0,3,None)),
             (np.array([0,0,0,0,10]), slice(4,5,None)),
            ]

print all(t_out == find_first_run_boolean(t_in, front=True) for t_in, t_out in testcases)
#for t_input, t_output in testcases:
#    sli = find_first_run_boolean(t_input, front=True)
#    print sli == t_output
True
In [5]:
testcases = [(np.array([0,0,1,2,3,4,5,19,20]), slice(2,9,None)),
             (np.zeros(5), slice(0,0,None)),
             (np.array([1,2,3,0,0,4,5]), slice(5,7,None)),
             (np.array([0,0,0,0,10]), slice(4,5,None)),
            ]
print all(t_out == find_first_run_boolean(t_in, front=False) for t_in, t_out in testcases)
#for t_input, t_output in testcases:
#    sli = find_first_run_boolean(t_input, front=False)
#    print sli == t_output
True

SVD Helpers II: Trimming Tiny Values

There are several possibilities for determining when a value is numerically small enough to be considered "practically" zero. G&VL do so by comparing a super-diagonal element to its diagonal "friend" (the diagonal element in the same row) and the diagonal element in the row below it. I also use this opportunity to remove small values on the diagonal itself (even though G&VL don’t mention that step in their algorithm).

One trick that this code uses is the ndarray.flat iterator. This is a view of the entire array as a flat (one-dimensional) entity. By starting at index 1 and taking steps of size equal to the number of elements in a row plus one, we get each super-diagonal entry. There is a NumPy function that does something similar for extraction (np.diag), but it can’t be used as a target for an assignment. So, I use ndarray.flat several times to give a feel for what it is doing. A side note: there is also np.diag_indices, which I really dislike, but your mileage may vary!
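For example, here is the flat-index trick on a small square matrix (a toy illustration, separate from the code below):

import numpy as np

M = np.arange(16.0).reshape(4, 4)
n = M.shape[1]
print(M.flat[::n+1])    # the main diagonal: 0, 5, 10, 15
print(M.flat[1::n+1])   # the super-diagonal: 1, 6, 11
M.flat[1::n+1] = 0.0    # and, unlike np.diag, this works as an assignment target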

In [6]:
# this is a factored-out function for the 
# first loop step in G&VL Algo. 8.6.2
def remove_small(B, eps):
    m,n = B.shape
    diag          = B.flat[ ::n+1] # np.diag is fine, but want to explain B.flat below
    abs_diag      = np.abs(diag)
    B.flat[::n+1] = np.where(abs_diag < eps, 0.0, diag)

    super_diag     = B.flat[1::n+1]
    
    # superdiag -> 0 if superdiag <= eps * (this row diag + next row diag)
    # where eps is small multiple of unit roundoff
    small_val = eps * (abs_diag[:-1] + abs_diag[1:]) # or just = eps
    B.flat[1::n+1] = np.where(np.abs(super_diag) < small_val, 0.0, super_diag)

SVD Helpers III: Cleaning and Partitioning

Here we have the code that is responsible for finding the largest unsolved subproblem towards the lower-right corner of our input matrix. Everything further to the lower-right has already been reduced to a pure diagonal; hence, it already has its singular values computed. Everything to the upper-left still needs to be examined.

In [7]:
# this is a factored-out function for the 
# first two steps in G&VL Algo. 8.6.2
def clean_and_partition(B, eps=.00001):
    '''return both the sub-matrix and the index it starts at'''
    remove_small(B, eps=eps)
    
    m,n = B.shape
    super_diag = B.flat[1::n+1]
    super_slice = find_first_run_boolean(super_diag, front=False)
    if not super_slice.stop:
        return np.empty(0), 0
        
    # based on the super diagonal, we want 
    full_slice = slice(super_slice.start, super_slice.stop + 1)
    return B[full_slice, full_slice], super_slice.start
In [8]:
ubd = make_upper_bidiag(np.arange(8)*2 + 10, np.arange(7)*2+1)
ubd[3,4] = 0
print ubd

B,k = clean_and_partition(ubd, .01)
print B
print k
[[10  1  0  0  0  0  0  0]
 [ 0 12  3  0  0  0  0  0]
 [ 0  0 14  5  0  0  0  0]
 [ 0  0  0 16  0  0  0  0]
 [ 0  0  0  0 18  9  0  0]
 [ 0  0  0  0  0 20 11  0]
 [ 0  0  0  0  0  0 22 13]
 [ 0  0  0  0  0  0  0 24]]
[[18  9  0  0]
 [ 0 20 11  0]
 [ 0  0 22 13]
 [ 0  0  0 24]]
4

Pushing Towards Diagonality

One thing I completely avoided discussing above is the notion of the "Wilkinson shift". Essentially, this is a technique that speeds up the convergence of the \(QR\)-iteration; see G&VL, 3rd, pp. 418-420 for details. All it does is find the eigenvalue of the trailing \(2\times2\) block (of \(B^TB\)) that is closer to the bottom-right element and (conceptually) subtract it from the diagonal entries; the implicit version below never actually forms the shifted matrix.
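As a quick sanity check on that shift formula (a toy \(2\times2\) standing in for the trailing block of \(B^TB\)):

import numpy as np

T = np.array([[5.0, 2.0],
              [2.0, 3.0]])          # stand-in for the trailing 2x2 of B^T B
d = (T[0, 0] - T[1, 1]) / 2.0
mu = T[1, 1] - T[0, 1]**2 / (d + np.sign(d) * np.sqrt(d**2 + T[0, 1]**2))

evals = np.linalg.eigvalsh(T)
closest = evals[np.argmin(np.abs(evals - T[1, 1]))]   # eigenvalue nearest the bottom-right entry
print(np.allclose(mu, closest))                       # True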

In [9]:
# G&VL algorithm 8.6.1 on page ~ 454 .. 
def step_full_bidiagonal_towards_diagonal(B):
    # let T <- trailing 2x2 submatrix of B.T.dot(B)
    # T = B.T.dot(B)[-2:,-2:]  .... row by col rule means we
    #     need last two rows of B.T (which are last two cols of B)
    #     and last two cols of B (which means we just need ....)
    lastCols = B[:, -2:]
    T = lastCols.T.dot(lastCols)
    
    # compute wilkinson shift value ("mu")
    d  = (T[-2,-2] - T[-1,-1]) / 2.0
    shift = T[-1,-1] - T[-1,-2]**2 / (d + np.sign(d)*np.sqrt(d**2+T[-1,-2]**2))
    
    # special case givens_zero_right(B) on shift
    # the only explict part of T we need in this reduction step
    T_00 = np.sum(B.T[0,:]**2)
    T_01 = np.dot(B.T[0,:], B[:,1]) #B^T B --> B.T[0,:], B[:,1] 

    c,s = zeroing_givens_coeffs(T_00-shift, T_01)
    local_right = (c,s,0,1) # store for later
    right_givens(B, local_right)
    
    lefts, rights = bounce_blemish_down_diagonal(B)
    rights = [local_right] + rights # BONUS:  remove the copy (deque or reverse order)
    return lefts, rights

Note the order we apply lefts and rights. Since we are accumulating them "outside-in", we need to apply them "looking out" to the accumulating U and V. If we applied them to A, we could do them in the expected order, but we wouldn’t have separate left and right transformations. apply_givens_lefts_on_right and its friend take care of these details. They are simply an accumulator loop starting with an identity matrix and multiplying in Givens rotations.
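To give a flavor of what such an accumulator might look like, here is a hypothetical sketch (not the actual ghbs_utils source; it assumes each rotation is a (c, s, i, j) tuple, as with local_right above, and the sign placement inside G is an assumption that would need to match zeroing_givens_coeffs):

import numpy as np

def accumulate_givens_on_right(M, rotations, shift_dex=0):
    # right-multiply M, in place, by each Givens rotation in turn
    for c, s, i, j in rotations:
        i, j = i + shift_dex, j + shift_dex
        G = np.eye(M.shape[1])
        G[i, i] = G[j, j] = c
        G[i, j], G[j, i] = s, -s     # sign convention assumed; see note above
        M[:] = M.dot(G)

U = np.eye(4)
accumulate_givens_on_right(U, [(0.6, 0.8, 0, 1), (1.0, 0.0, 1, 2)])
print(np.allclose(U.dot(U.T), np.eye(4)))   # a product of rotations is still orthogonal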

In [10]:
A = make_upper_bidiag(np.arange(4.0)+1, np.ones(4))
A_orig = A.copy()

L,  R  = step_full_bidiagonal_towards_diagonal(A)
L2, R2 = step_full_bidiagonal_towards_diagonal(A)

# gather rotations into U and V
L.extend(L2)
R.extend(R2)
U,V = np.eye(4), np.eye(4)
apply_givens_lefts_on_right(U, L, 0)
apply_givens_rights_on_left(R, V, 0)

# seems to be good enough
print np.allclose(A_orig, U.dot(A).dot(V))
True

And Full SVD

With all the pieces in order, the final SVD algorithm isn’t too long. Having names for the sub-pieces also gives us a mental break when trying to interpret what is happening at each step. I left a few commented asserts in the code. They are a nice mental reminder that we are maintaining the equality with the original input throughout this process. Salud!

In [11]:
# GvL    page 454 - 456  algo 8.6.2 (full svd)
def clear_svd(A, eps=0.001): # eps=np.finfo(np.double).eps):
    m,n = A.shape
    
    # for safety check:  A_orig = A.copy()
    both_betas = house_bidiag(A) # betas for U and Vt
    
    # this sends back V; not worrying about it
    U, B, V = extract_packed_house_bidiag(A, *both_betas)
    # check:  assert np.allclose(A_orig, U.dot(B).dot(V.T))    
    Vt = V.T 
    
    # B22 is the bottom-right-most "full bidiagonal" square block matrix of B
    # it is a view into B, so when B22 is updated, so is B
    B22, k = clean_and_partition(B,eps=eps)
    while B22.size:
        zero_idxs = np.where(np.abs(np.diag(B22)) < eps)[0]
        if zero_idxs.size:
            lefts, rights = decouple_or_reduce(B22, zero_idxs)
        else:
            lefts, rights = step_full_bidiagonal_towards_diagonal(B22)
        
        # this feels weird, but since U and Vt are accumulating the effects on
        # B22, we develop them on one side (on B22) and apply them on the respective
        # (outer) transformation (U or Vt)
        apply_givens_lefts_on_right(U, lefts,   shift_dex=k)
        apply_givens_rights_on_left(rights, Vt, shift_dex=k)

        # check:  assert np.allclose(A_orig, U.dot(B).dot(Vt))
        
        B22, k = clean_and_partition(B,eps=eps)
    return U, B, Vt

Some Tests

In [12]:
def test_check(M):
    expected_S = nla.svd(M)[1]
    
    orig_M = M.copy()
    U, S, Vt = clear_svd(M)
    
    return (np.allclose(sorted(expected_S), sorted(np.diag(S))) and
            np.allclose(orig_M, U.dot(S).dot(Vt)))
In [13]:
test_matrices = [np.diag(np.arange(5.0)),
                 np.arange(16.0).reshape(4,4),

                 np.array([[0, -2.0],    # caused grief because of error in BBM/Givens
                           [0,    0]]),

                 np.array([[0,0,0,0],    # comes from applying 2 givens rotations to diag
                           [0,0,2.12, -2.12],
                           [0,0,1.414, 1.414],
                           [0,0,0,0]])]

for A in test_matrices:
    print test_check(A)
True
True
True
True

Additional Resources

You can grab a copy of this notebook.

Even better, you can view it using nbviewer.

License

Unless otherwise noted, the contents of this notebook are under the following license. The code in the notebook should be considered part of the text (i.e., licensed and treated as follows).

Creative Commons License
DrsFenner.org Blog And Notebooks by Mark and Barbara Fenner is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Permissions beyond the scope of this license may be available at drsfenner.org/blog/about-and-contacts.