LinearRegression And Notation

The precision of mathematical notation is a two-edged sword. On one side, it should clearly and concisely represent formal statements of mathematics. On the other side, sloppy notation, abbreviations, and time/space-saving shortcuts (mainly in the writing of mathematics) can lead to ambiguity — which is a time sink for the reader! Even acceptable mathematical notation can hide important aspects of the mathematics. Notation can leave underlying assumptions hidden from view, and these assumptions may not be revealed until a problem solution, working implementation, or model application reveal the deficiency. Since I want to write a few posts on linear regression, I’m going to look at an example of the downstream effects that notation can cause …

Since I happen to have Hastie, et al.’s learning book (The Elements of Statistical Learning) open and in front of me, I’m going to follow along with their easy walkthrough of linear regression (namely, multiple linear regression with least squares). Immediately, they start out with a problem representation and then they quickly change the representation on you. And, I recall this being a hang up for me (even in grad. school). Maybe I’m admitting I’m not the quickest mathematician out there (I’m not — you can figure out whether that’s a statement about my admission or my mathematical skills). But I’m hoping to add a pedagogical and computational flavor to the normal presentations of mathematical machine learning (which has a high degree of overlap with computational statistics).

The Case of the Disappearing Coefficient

So, here goes. In the first formulation, we have (2.1):
\[\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^p X_j \hat{\beta}_j\]

In the second formulation (2.2), we have \[\hat{Y} = \sum_{j=0}^p X_j \hat{\beta}_j\]

It might be fairly clear (how’s that for equivocation) that the \(\hat{\beta}_0\) has been rolled into the summation. However, what is less clear, and possibly lost in the text, is:

Often it is convenient to include the constant variable 1 in \(X\) and include \(\hat{\beta}_0\) in the vector of coefficients \(\hat{\beta}\)

Nod your head, act like you know what’s going on (nothing to see here), and move along. Maybe you (advanced reader) feel like I’m overstating the confusion such changes can instigate. I’m not. Having taught undergraduates (strong, weak, technical, atechnical, and everyone in-between) for 15 years, I’m not overstating. Less tongue-in-cheek, this modification of \(X\) can also be looked at as a transition from the data matrix to a design matrix.

There are also deeper (modeling) issues that occur with the presence or absence of \(\hat{\beta}_0\) (note, both formulas above have it present — just in different forms). \(\hat{\beta}_0\) is called the constant or intercept term. Briefly (at the risk of being too brief and misleading), if (1) \({\beta}_0=0\) (you have background information that the actual value is zero), (2) you are using a different type of linear model – anova/categorical data, or (3) using standardized data, you can get away without \(\hat{\beta}_0\). Otherwise, you want to have the 0-th term because it will give unbiased \(\hat{\beta}\)s and zero-mean residuals. (Double check those if you have a dissertation defense coming up — or are working on a medical system.)

Pulling Back the Covers

So, let’s figure this out. I’m going use HTF’s notation, unless otherwise stated. Here’s a simple example to get us started. Here is a single case or observation \((X,Y)\) with 3 input (independent) variables or attributes \(X=(X_1,X_2,X_3)\) and an output (dependent) variable \(Y\). Our goal will be to take measurement \(X\) and predict an output \(\hat{Y}\) for cases where we haven’t already seen \(Y\).

Obs \(X_1\) \(X_2\) \(X_3\) \(Y\)
1 7.01 3.46 4.21 25.22

More generally, \(X\) can have \(p\)-dimensions (our example above has \(p=3\)). With p-dimensions, \(X=(X_1, X_2, \dots, X_p)\). We started our indices at 1; we’re going to monkey with that in a second. You might see this general setup as “a problem with a \(p\)-dimensional input space”.

What about the other terms \(\hat{\beta}_i\)? These are the coefficients we are trying to estimate. When we fill them in with estimated values (e.g., \(\hat{\beta}_1=\hat{\beta}_2=1.25\) and \(\hat{\beta}_0=\hat{\beta}_3=0.5\)), we have an equation that is linear in the \(X_i\) — we’re drawing a line (or a higher dimensional analogue). With those \(\hat{\beta}_i\), we have: \[\hat{Y} = \frac{1}{2} + 1.25X_1 + 1.25X_2 + \frac{1}{2}X_3\]

When we fill in the \(X_i\) with values from a particular observation, we get out one predicted value for \(Y\). For the case above, we would predict:

In [1]:
Y_hat = .5 + 1.25*7.01 + 1.25*3.46 + .5*4.21
print(Y_hat)
15.6925

Note that this is not a perfect prediction — I pulled the \(\hat{\beta}_i\) out of a hat (hah!). Usually, when we try to fit a line to the data, we don’t do a perfect job, but we do tend to improve over random guessing.

Sleuthing

Hopefully, it is obvious that \(\hat{\beta}_0=1\hat{\beta}_0\). So, \(X_0=1\) (i.e., \(X=(X_0,X_1,…,X_p)=(1,X_1,…,X_p)\)).

So, we have: \[
\begin{eqnarray}
\hat{Y} &=& \hat{\beta}_0 + \sum_{j=1}^p X_j \hat{\beta}_j \\
&=& 1\hat{\beta}_0 + \sum_{j=1}^p X_j \hat{\beta}_j \\
&=& X_0\hat{\beta}_0 + \sum_{j=1}^p X_j \hat{\beta}_j \\
&=& \sum_{j=0}^p X_j \hat{\beta}_j \\
\hat{Y} &=& X{\beta}^T \\
\end{eqnarray}
\]

Whenever we have a sum of products of values, we can rewrite the summation and multiplcation as a dot-product. We do that in the last line above. \(X\) becomes a \(p+1\) dimensional row-vector; likewise, \(\beta\) is a \(p+1\) dimensional row-vector (transposed to a column-vector). If you recall your linear algebra, we utilize a “row-against-column” multiplication rule and \((1,p+1)\times(p+1,1)\) yields a \((1,1)\) dimensional result.

More Than One Observation

Now, if you’re like me (computer scientist first, mathematician — lovingly, but — second) you’re interested in what this means from a computational perspective. In this case, I’m not necessarily concerned about what the different equations mean in terms of time-complexity (we’ve got \(O(p)\) additions and multiplications in both equations), but what about space-complexity and code expression?

Now, you’ll notice that we expressed the relationship between the output and the input for one pair of input-output values. What if we want to do that for \(N\) input values? Then we have (HTF, 2.3):

\[RSS(\beta)=\sum_{i=1}^N (y_i-x_{i}\beta^T)^2\]

I’m going to diverge ever so slightly from HTF’s notation and make \(x_i\) a row-vector. So, we drop their \(\cdot^T\) on \(x_i\).

Now, recall what I just said: if we have a sum of products, we can rewrite the calculation as a dot-product. Here, the product is “hidden” in the square (i.e., the \({\cdot}^2\)). So, we can dot-product between \(y_i-x_{i}^T\beta\) and itself … except, we need the subscripted items (\(y\) and \(x\)) to expand in structure to handle the different values that each carries across the summation (i.e., in general, \(x_1 \neq x_2\)). HTF comes up with (2.4):

\[\min RSS(\beta)=(\bf{y}-\bf{X}\beta)^T(\bf{y}-\bf{X}\beta)\]

Now, if you’re like me, you might be a little be grumpy. Even if you like a dynamically typed language like Python, I still like strong typing. Thus, \(X\) going from a \((p+1,1)\) vector \(\rightarrow (p+1,N)\) matrix, sort of annoys me. But, if you look very, very carefully you will notice that \(X \rightarrow \bf{X}\). Yes, \(X\) became bold-faced. Such utterly subtle changes in presentation with large changes in meaning are a hallmark of mathematical chickanery (I mean beauty). It’s the same reason that \(i\) and \(j\) are used for subscripts and that \(r\) and \(n\) are frequently used in binomical coeffients (i.e., \(\binom{n}{r}\)). If you’re confused, just imagine writing those letters repeatedly on the chalkboard. And yes, I’ve seen folks getting soft and writing \(\binom{n}{k}\), but I digress … And please don’t be offended — computer scientists pull the same sort of crap in different ways.

They don’t say it, but the use of matrices enables us to say: \[\bf{\hat{Y}} = \bf{X}\hat{\beta}^T\] Here we have some mathematical beauty. It says, very concisely, that the predicted outputs are a linear function of the inputs and the estimated coefficients. Although, it hides whether or not there is a \(\hat{\beta}_0\) term.

Computational Implications of Notation: Implementation

We have some mathematical ideas sketched out. Let’s take a brief pause (from the mathematics) and turn to some code. We could work with a standard dataset (e.g., a UCI dataset like iris). But, for now, I’m going to go with a simple synthetic example that couples an actual linear relationship with some noise. We’ll start with a single \(X\) (err, \(x_0\)) and then we’ll build up to \(\bf{X}\). To be clear, in the Python code I’m going to use \(\mathtt{x}\) for a single example and \(\mathtt{X}\) for a set of examples. Also, we’ll use the following model: \(Y=f(X)=1+\sum_{i=1}^{3}2^i X_i=1+2 X_1 + 4 X_2 + 8 X_3\) and we’ll restrict the \(X_i\) to come from the range \([0,10)\).

In [2]:
import numpy as np
x=np.random.uniform(0,10,3)
print x
[ 8.64408799  7.97358871  5.18087896]
In [3]:
noise = np.random.normal(0.0, 2.0, 1)
print noise
[ 3.73810917]
In [4]:
def f(x):  
    return 1 + np.dot(x,2**np.arange(1,4))
print f(x)
91.6295624877
In [5]:
y = f(x) + noise
print y
[ 95.36767166]

If you haven’t seen it before, you may be a little bit confused at this point. Why are we creating data out of thin air? And why did we add noise? We’re using synthetic data because it is easy to create, easy to control, and when we want to understand the relationship between the data and the model, we can do some sleuthing relatively easily. As to noise, the idea is that in the “real world” we typically have less than perfect knowledge of the output (math: \(Y\), python \(\mathtt{y}\)} values. We account for this by explicitly adding (controlled) noise. In the real world, the noise is there implicitly.

One last issue. This has to do with the \(\beta_0\) \(X_0\) issue. Mathematically, it is convenient to roll the coefficent into the summation and expand the \(X_0=1\) into the data. We can certainly do that:

In [6]:
x = np.concatenate([np.ones(1), x])
print x
[ 1.          8.64408799  7.97358871  5.18087896]

Voila. Please take a second to look closely at the assignment. We just added three steps to (meet) the mathematical convenience. We (1) created a new array (np.ones), (2) we created a new python list (content inside of []), and (3) we created a new numpy array. That third bit could be disasterous: for the benefit of adding one element, we just made a second allocation of \(O(p)\) memory elements and copies to fill them. Yes, premature optimization is the root of all evil. And yes, \(2p\) is still \(O(p)\). But! We care immensely about going from using half of system memory to using all of system memory. And, I’m showing you the optimization issue here for pedagogical purposes: the same principle applies when we must use the optimization. There’s a moral to this story. We want to do one (and only one) memory allocation if we can help it. There’s a second moral coming in a moment, but we’ll take them one at a time. We can do a single allocation in one of two ways:

In [7]:
x1=np.empty((4,))
x1[0] = 1.0
x1[1:]=np.random.uniform(0,10,3)
print x1
[ 1.          0.48824435  8.03740956  8.83808266]
In [8]:
x2    = np.random.uniform(0,10,4)
x2[0] = 1.0
print x2
[ 1.          9.85313908  5.0465084   0.38967288]

I’ll leave it as an exercise to the reader to discuss the benefits of each of these. As a take home, try not to use concatenation after you’ve moved past the point of sketching out ideas. A side note: np.concatenate sometimes has use when you are working with partial arrays as you build incremental solutions. Then you don’t pay for allocating over your whole data rows or columns (p or N) — particularly when you don’t know those final values. If we’re reading data from a file, this also introduces complexity because we need to account for the “padded” 1.0s in each row as we do the input. Also, remember that we’re going to have a second moral, related to the \(0\)-index coefficient parameter and input value, in a little bit. For now, let’s make a lot of data (i.e., let’s go from x to X in Python).

In [9]:
X = np.random.uniform(0,10,4*1000)
X = X.reshape((1000,4))
X[:,0] = 1.0

You might say, “Ack!” An extra 4000 uniform number generations? Isn’t that expensive? Possibly. So, let’s do:

In [10]:
X       = np.empty((1000,4))
X[:,0]  = 1.0
X[:,1:] = np.random.uniform(0,10,(1000,3))

Well, now, I’m wondering, “Why the heck do we have an extra 1000 \(1.0\) values running around?” They are doing nothing but trying to smooth out our mathematical convenience. If have a gig (10^9 or 2^30, depending on your math/comp preferences: either way, a billion or so) of examples, we’re now taking up a good bit of excess memory:

In [11]:
10**9 * np.array([1.0]).nbytes
Out[11]:
8000000000

Not surprisingly, at 8 bytes per 64-bit float: for a billion of them, it’s 8 billion bytes or 8GB and near 8GiB. Point is, as the data grows, we’re wasting a good bit of space. And, we have to pay to get the 1.0 values there. So, while the explicit 1.0 is useful for the mathematical formalism, it seems like the cost to directly mimic the mathematics in code is not worth the code and time complexity (code complexity means cost in read-, write-, and maintainability).

Calculating RSS

Let’s return to the point of computing the residual sum-of-squares \(RSS\) for a dataset. I’m going to take the advice from above and not put in any explicit \(1.0\)s in the X (Python) matrix.

In [12]:
X     = np.random.uniform(0,10,(1000,3))
noise = np.random.normal(0.0, 2.0, 1000)
Y     = f(X) + noise

print X[0], f(X[0]), Y[0]
[ 6.6557709   6.35524414  6.67009511] 93.0932792752 91.449699275

Remember, this is synthetic data. We know the underlying model (and thus, the \(\beta\) coefficients) used to generate the outputs from the inputs. But in the real-world we would have the inputs and outputs without those coefficients. We would go through some process (that we’ll discuss soon!) to recover some “best guesses” — really, it’s more than that — at the coefficients. For sake of argument, let’s say we came up with two different sets of \(\beta\)‘s: \(\hat{\beta}_a=\) beta_hat_a and \(\hat{\beta}_b=\) beta_hat_b. I’m using a and b, to reduce confusion with the mathematical \(\beta\) coefficients. Incidentally, hat (i.e., \(\hat{\cdot}\)) is used for values that are inferred (functions of) the dataset. “hat”, I guess, sounds better and looks more formal than “best guess given the data”.

Moving along:

In [13]:
beta_hat_a = np.ones(4)               # \betas = 1,1,1,1
beta_hat_b = (np.arange(4)+1.0) * 2   # \betas = 2,4,6,8, force float typing
beta_hat_c = 2**np.arange(4)          # \betas = 1,2,4,8, force float typing
def RSS(x,y,beta):  
    errors = beta[0] + np.dot(x,beta[1:]) - y
    return np.dot(errors, errors)

print "beta_hat_a: %10s RSS: %11.3f"  % (beta_hat_a, RSS(X,Y,beta_hat_a))
print "beta_hat_b: %10s RSS: % 11.3f" % (beta_hat_b, RSS(X,Y,beta_hat_b))
print "beta_hat_c: %10s RSS: % 11.3f" % (beta_hat_b, RSS(X,Y,beta_hat_c))
beta_hat_a: [ 1.  1.  1.  1.] RSS: 3537775.852
beta_hat_b: [ 2.  4.  6.  8.] RSS:  506959.425
beta_hat_c: [ 2.  4.  6.  8.] RSS:    3931.634

Look at what happened: even though I kept the 1.0s out of X, I still included them in my model. I included them by including a beta[0] term when I put together the predicted y value: beta[0] + np.dot(x,beta[1:]). This is the second moral with a few parts: (1) factoring out common data is a powerful technique that can save memory, (2) mathematical beauty and computation simplicity/efficiency can sometimes be at odds, and (3) just because you know for form of the data X doesn’t mean you know the model being applied — you also need to know whether a beta[0] term is showing up in the calculations.

A Note on np.dot and Broadcasting

The guess of all 1.0 coefficients has has quite a bit more error (residue) than the multiples of two. How can we get a “really good” guess at the coefficients? Ah, young grasshoper, that is for another lesson. But, I can give you one other tip. Be careful with np.dot(x,beta).

If beta is a row-vector (a 1-D NumPy array with shape (p,)), than np.dot(beta, x) will work when x is a 1-D array of shape (p,) but it will fail when x is a 2-D array of shape (n,p). However, np.dot(x,beta) will work for both the multiple data [n x p] dot [p] and for the single data [p] dot [p]. The reason is tied up in NumPy’s rules for broadcasting, which we won’t dive into now.

In [14]:
x = np.arange(1.0,5.0)
xs = np.row_stack([x,x])
beta = 1.0 / np.arange(1.0,5.0)

print "with dot(<data>, beta)"
print  x.shape, beta.shape, np.dot( x, beta)
print xs.shape, beta.shape, np.dot(xs, beta)

print "with dot(beta, <data>)"
print beta.shape,  x.shape, np.dot(beta,  x)
try:
    print beta.shape, xs.shape, np.dot(beta, xs)
except ValueError:
    print "failed"
with dot(<data>, beta)
(4,) (4,) 4.0
(2, 4) (4,) [ 4.  4.]
with dot(beta, <data>)
(4,) (4,) 4.0
(4,) (2, 4) failed

Final Words

You might be wondering why I spent so much time working through the mathematical and computational representations, only to point out the one that has some pretty clear advantages (the implicit \(0\)-index constant). The reason is that many (most? all?) issues in expressing mathematics as computations (and translating from mathematics to computation) have similar issues: efficiency and representation trade-offs.

Additional Resources

You can grab a copy of this notebook.

Even better, you can view it using nbviewer.

License

Unless otherwise noted, the contents of this notebook are under the following license. The code in the notebook should be considered part of the text (i.e., licensed and treated as as follows).

Creative Commons License
DrsFenner.org Blog And Notebooks by Mark and Barbara Fenner is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Permissions beyond the scope of this license may be available at drsfenner.org/blog/about-and-contacts.