Posts Tagged ‘numerical analysis’

Using Stata’s random-number generators, part 4, details

For those interested in how pseudo random number generators work, I just wrote something on Statalist which you can see in the Statalist archives by clicking the link even if you do not subscribe:

To remind you, I’ve been writing about how to use random-number generators in parts 1, 2, and 3, and I still have one more posting I want to write on the subject. What I just wrote on Statalist, however, is about how random-number generators work, and I think you will find it interesting.

To find out more about Statalist, see


How to successfully ask a question on Statalist

Using Stata’s random-number generators, part 1

I want to start a series on using Stata’s random-number function. Stata in fact has ten random-number functions:

  1. runiform() generates rectangularly (uniformly) distributed random number over [0,1).
  2. rbeta(a, b) generates beta-distribution beta(a, b) random numbers.
  3. rbinomial(n, p) generates binomial(n, p) random numbers, where n is the number of trials and p the probability of a success.
  4. rchi2(df) generates χ2 with df degrees of freedom random numbers.
  5. rgamma(a, b) generates Γ(a, b) random numbers, where a is the shape parameter and b, the scale parameter.
  6. rhypergeometric(N, K, n) generates hypergeometric random numbers, where N is the population size, K is the number of in the population having the attribute of interest, and n is the sample size.
  7. rnbinomial(n, p) generates negative binomial — the number of failures before the nth success — random numbers, where p is the probability of a success. (n can also be noninteger.)
  8. rnormal(μ, σ) generates Gaussian normal random numbers.
  9. rpoisson(m) generates Poisson(m) random numbers.
  10. rt(df) generates Student’s t(df) random numbers.

You already know that these random-number generators do not really produce random numbers; they produce pseudo-random numbers. This series is not about that, so we’ll be relaxed about calling them random-number generators.

You should already know that you can set the random-number seed before using the generators. That is not required but it is recommended. You set the seed not to obtain better random numbers, but to obtain reproducible random numbers. In fact, setting the seed too often can actually reduce the quality of the random numbers! If you don’t know that, then read help set seed in Stata. I should probably pull out the part about setting the seed too often, expand it, and turn it into a blog entry. Anyway, this series is not about that either.

This series is about the use of random-number generators to solve problems, just as most users usually use them. The series will provide practical advice. I’ll stay away from describing how they work internally, although long-time readers know that I won’t keep the promise. At least I’ll try to make sure that any technical details are things you really need to know. As a result, I probably won’t even get to write once that if this is the kind of thing that interests you, StataCorp would be delighted to have you join our development staff.


runiform(), generating uniformly distributed random numbers

Mostly I’m going to write about runiform() because runiform() can solve such a variety of problems. runiform() can be used to solve,

  • shuffling data (putting observations in random order),
  • drawing random samples without replacement (there’s a minor detail we’ll have to discuss because runiform() itself produces values drawn with replacement),
  • drawing random samples with replacement (which is easier to do than most people realize),
  • drawing stratified random samples (with or without replacement),
  • manufacturing fictional data (something teachers, textbook authors, manual writers, and blog writers often need to do).

runiform() generates uniformly, a.k.a. rectangularly distributed, random numbers over the interval, I quote from the manual, “0 to nearly 1”.

Nearly 1? “Why not all the way to 1?” you should be asking. “And what exactly do you mean by nearly 1?”

The answer is that the generator is more useful if it omits 1 from the interval, and so we shaved just a little off. runiform() produces random numbers over [0, 0.999999999767169356].

Here are two useful formulas you should commit to memory.

  1. If you want to generate continuous random numbers between a and b, use

    generate double u = (ba)*runiform() + a

    The random numbers will not actually be between a and b, they will be between a and nearly b, but the top will be so close to b, namely 0.999999999767169356*b, that it will not matter.

    Remember to store continuous random values as doubles.

  2. If you want to generate integer random numbers between a and b, use

    generate ui = floor((ba+1)*runiform() + a)

    In particular, do not even consider using the formula for continuous values but rounded to integers, which is to say, round(u) = round((ba)*runiform() + a). If you use that formula, and if ba>1, then a and b will be under represented by 50% each in the samples you generate!

    I stored ui as a default float, so I am assuming that -16,777,216 ≤ a < b ≤ 16,777,216. If you have integers outside of that range, however, store as a long or double.

I’m going to spend the rest of this blog entry explaining the above.

First, I want to show you how I got the two formulas and why you must use the second formula for generating integer uniform deviates.

Then I want explain why we shaved a little from the top of runiform(), namely (1) while it wouldn’t matter for formula 1, it made formula 2 a little easier, (2) the code would run more quickly, (3) we could more easily prove that we had implemented the random-number generator correctly, and (4) anyone digging deeper into our random numbers would not be misled into thinking they had more than 32 bits of resolution. That last point will be important in a future blog entry.


Continuous uniforms over [a, b)

runiform() produces random numbers over [0, 1). It therefore obviously follows that (ba)*runiform()+a produces number over [a, b). Substitute 0 for runiform() and the lower limit is obtained. Substitute 1 for runiform() and the upper limit is obtained.

I can tell you that in fact, runiform() produces random numbers over [0, (232-1)/232].

Thus (ba)*runiform()+a produces random numbers over [a, ((232-1)/232)*b].

(232-1)/232) approximately equals 0.999999999767169356 and exactly equals 1.fffffffeX-01 if you will allow me to use %21x format, which Stata understands and which you can understand if you see my previous blog posting on precision.

Thus, if you are concerned about results being in the interval [a, b) rather than [a, b], you can use the formula

generate double u = ((ba)*runiform() + a) / 1.fffffffeX-01

There are seven f’s followed by e in the hexadecimal constant. Alternatively, you could type

generate double u = ((ba)*runiform() + a) * ((2^32-1)/2^32)

but multiplying by 1.fffffffeX-01 is less typing so I’d type that. Actually I wouldn’t type either one; the small difference between values lying in [a, b) or [a, b] is unimportant.


Integer uniforms over [a, b]

Whether we produce real, continuous random numbers over [a, b) or [a, b] may be unimportant, but if we want to draw random integers, the distinction is important.

runiform() produces continuous results over [0, 1).

(ba)*runiform()+a produces continuous results over [a, b).

To produce integer results, we might round continuous results over segments of the number line:

           a    a+.5  a+1  a+1.5  a+2  a+2.5       b-1.5  b-1  b-.5    b
real line  +-----+-----+-----+-----+-----+-----------+-----+-----+-----+
int  line  |<-a->|<---a+1--->|<---a+2--->|           |<---b-1--->|<-b->|

In the diagram above, think of the numbers being produced by the continuous formula u=(ba)*runiform()+a as being arrayed along the real line. Then imagine rounding those values, say by using Stata’s round(u) function. If you rounded in that way, then

  • Values of u between a and a+0.5 will be rounded to a.
  • Values of u between a+0.5 and a+1.5 will be rounded to a+1.
  • Values of u between a+1.5 and a+2.5 will be rounded to a+2.
  • Values of u between b-1.5 and b-0.5 will be rounded to b-1.
  • Values of u between b-0.5 and b-1 will be rounded to b.

Note that the width of the first and last intervals is half that of the other intervals. Given that u follows the rectangular distribution, we thus expect half as many values rounded to a and to b as to a+1 or a+2 or … or b-1.

And indeed, that is exactly what we would see:

. set obs 100000
obs was 0, now 100000

. gen double u = (5-1)*runiform() + 1

. gen i = round(u)

. summarize u i 

    Variable |       Obs        Mean    Std. Dev.       Min        Max
           u |    100000    3.005933    1.156486   1.000012   4.999983
           i |    100000     3.00489    1.225757          1          5

. tabulate i

          i |      Freq.     Percent        Cum.
          1 |     12,525       12.53       12.53
          2 |     24,785       24.79       37.31
          3 |     24,886       24.89       62.20
          4 |     25,284       25.28       87.48
          5 |     12,520       12.52      100.00
      Total |    100,000      100.00

To avoid the problem we need to make the widths of all the intervals equal, and that is what the formula floor((ba+1)*runiform() + a) does.

           a          a+1         a+2                     b-1          b          b+1
real line  +-----+-----+-----+-----+-----------------------+-----+-----+-----+-----+
int  line  |<--- a --->|<-- a+1 -->|                       |<-- b-1 -->|<--- b --->)

Our intervals are of equal width and thus we expect to see roughly the same number of observations in each:

. gen better = floor((5-1+1)*runiform() + 1)

. tabulate better

     better |      Freq.     Percent        Cum.
          1 |     19,808       19.81       19.81
          2 |     20,025       20.02       39.83
          3 |     19,963       19.96       59.80
          4 |     20,051       20.05       79.85
          5 |     20,153       20.15      100.00
      Total |    100,000      100.00

So now you know why we shaved a little off the top when we implemented runiform(); it made the formula

floor((ba+1)*runiform() + a):

easier. Our integer [a, b] formula did not have to concern itself that runiform() would sometimes — rarely — return 1. If runiform() did return the occasional 1, the simple formula above would produce the (correspondingly occasional) b+1.


How Stata calculates continuous random numbers

I’ve said that we shaved a little off the top, but the fact was that it was easier for us to do the shaving than not.

runiform() is based on the KISS random number generator. KISS produces 32-bit integers, meaning integers the range [0, 232-1], or [0, 4,294,967,295]. You might wonder how we converted that range to being continuous over [0, 1).

Start by thinking of the number KISS produces in its binary form:


The corresponding integer is b31*231 + b31*230 + … + b0*20. All we did was insert a binary point out front:

. b31b30b29b28b27b26b25b24b23b22b21b20b19b18b17b16b15b14b13b12b11b10b9b8b7b6b5b4b3b2b1b0

making the real value b31*2-1 + b30*2-2 + … + b0*2-32. Doing that is equivalent to dividing by 2-32, except insertion of the binary point is faster. Nonetheless, if we had wanted runiform() to produce numbers over [0, 1], we could have divided by 232-1.

Anyway, if the KISS random number generator produced 3190625931, which in binary is


we converted that to


which equals 0.74287549 in base 10.

The largest number the KISS random number generator can produce is, of course,


and 0.11111111111111111111111111111111 equals 0.999999999767169356 in base 10. Thus, the runiform() implementation of KISS generates random numbers in the range [0, 0.999999999767169356].

I could have presented all of this mathematically in base 10: KISS produces integers in the range [0, 232-1], and in runiform() we divide by 232 to thus produce continuous numbers over the range [0, (232-1)/232]. I could have said that, but it loses the flavor and intuition of my longer explanation, and it would gloss over the fact that we just inserted the binary point. If I asked you, a base-10 user, to divide 232 by 10, you wouldn’t actually divide in the same way that they would divide by, say 9. Dividing by 9 is work. Dividing by 10 merely requires shifting the decimal point. 232 divided by 10 is obviously 23.2. You may not have realized that modern digital computers, when programmed by “advanced” programmers, follow similar procedures.

Oh gosh, I do get to say it! If this sort of thing interests you, consider a career at StataCorp. We’d love to have you.


Is it important that runiform() values be stored as doubles?

Sometimes it is important. It’s obviously not important when you are generating random integers using floor((ba+1)*runiform() + a) and -16,777,216 ≤ a < b ≤ 16,777,216. Integers in that range fit into a float without rounding.

When creating continuous values, remember that runiform() produces 32 bits. floats store 23 bits and doubles store 52, so if you store the result of runiform() as a float, it will be rounded. Sometimes the rounding matters, and sometimes it does not. Next time, we will discuss drawing random samples without replacement. In that case, the rounding matters. In most other cases, including drawing random samples with replacement — something else for later — the rounding
does not matter. Rather than thinking hard about the issue, I store all my non-integer
random values as doubles.


Tune in for the next episode

Yes, please do tune in for the next episode of everything you need to know about using random-number generators. As I already mentioned, we’ll discuss drawing random samples without replacement. In the third installment, I’m pretty sure we’ll discuss random samples with replacement. After that, I’m a little unsure about the ordering, but I want to discuss oversampling of some groups relative to others and, separately, discuss the manufacturing of fictional data.

Am I forgetting something?

Understanding matrices intuitively, part 2, eigenvalues and eigenvectors

Last time, I showed you a way to graph and to think about matrices. This time, I want to apply the technique to eigenvalues and eigenvectors. The point is to give you a picture that will guide your intuition, just as it was previously.

Before I go on, several people asked after reading part 1 for the code I used to generate the graphs. Here it is, both for part 1 and part 2:

The eigenvectors and eigenvalues of matrix A are defined to be the nonzero x and λ values that solve

Ax = λx

I wrote a lot about Ax in the last post. Just as previously, x is a point in the original, untransformed space and Ax is its transformed value. λ on the right-hand side is a scalar.

Multiplying a point by a scalar moves the point along a line that passes through the origin and the point:

The figure above illustrates y=λx when λ>1. If λ were less than 1, the point would move toward the origin and if λ were also less than 0, the point would pass right by the origin to land on the other side. For any point x, y=λx will be somewhere on the line passing through the origin and x.

Thus Ax = λx means the transformed value Ax lies on a line passing through the origin and the original x. Points that meet that restriction are eigenvectors (or more correctly, as we will see, eigenpoints, a term I just coined), and the corresponding eigenvalues are the λ‘s that record how far the points move along the line.

Actually, if x is a solution to Ax = λx, then so is every other point on the line through 0 and x. That’s easy to see. Assume x is a solution to Ax = λx and substitute cx for x: Acx = λcx. Thus x is not the eigenvector but is merely a point along the eigenvector.

And with that prelude, we are now in a position to interpret Ax = λx fully. Ax = λx finds the lines such that every point on the line, say, x, transformed by Ax moves to being another point on the same line. These lines are thus the natural axes of the transform defined by A.

The equation Ax = λx and the instructions “solve for nonzero x and λ” are deceptive. A more honest way to present the problem would be to transform the equation to polar coordinates. We would have said to find θ and λ such that any point on the line (r, θ) is transformed to (λr, θ). Nonetheless, Ax = λx is how the problem is commonly written.

However we state the problem, here is the picture and solution for A = (2, 1 \ 1, 2)

I used Mata’s eigensystem() function to obtain the eigenvectors and eigenvalues. In the graph, the black and green lines are the eigenvectors.

The first eigenvector is plotted in black. The “eigenvector” I got back from Mata was (0.707 \ 0.707), but that’s just one point on the eigenvector line, the slope of which is 0.707/0.707 = 1, so I graphed the line y = x. The eigenvalue reported by Mata was 3. Thus every point x along the black line moves to three times its distance from the origin when transformed by Ax. I suppressed the origin in the figure, but you can spot it because it is where the black and green lines intersect.

The second eigenvector is plotted in green. The second “eigenvector” I got back from Mata was (-0.707 \ 0.707), so the slope of the eigenvector line is 0.707/(-0.707) = -1. I plotted the line y = –x. The eigenvalue is 1, so the points along the green line do not move at all when transformed by Ax; y=λx and λ=1.

Here’s another example, this time for the matrix A = (1.1, 2 \ 3, 1):

The first “eigenvector” and eigenvalue Mata reported were… Wait! I’m getting tired of quoting the word eigenvector. I’m quoting it because computer software and the mathematical literature call it the eigenvector even though it is just a point along the eigenvector. Actually, what’s being described is not even a vector. A better word would be eigenaxis. Since this posting is pedagogical, I’m going to refer to the computer-reported eigenvector as an eigenpoint along the eigenaxis. When you return to the real world, remember to use the word eigenvector.

The first eigenpoint and eigenvalue that Mata reported were (0.640 \ 0.768) and λ = 3.45. Thus the slope of the eigenaxis is 0.768/0.640 = 1.2, and points along that line — the green line — move to 3.45 times their distance from the origin.

The second eigenpoint and eigenvalue Mata reported were (-0.625 \ 0.781) and λ = -1.4. Thus the slope is -0.781/0.625 = -1.25, and points along that line move to -1.4 times their distance from the origin, which is to say they flip sides and then move out, too. We saw this flipping in my previous posting. You may remember that I put a small circle and triangle at the bottom left and bottom right of the original grid and then let the symbols be transformed by A along with the rest of space. We saw an example like this one, where the triangle moved from the top-left of the original space to the bottom-right of the transformed space. The space was flipped in one of its dimensions. Eigenvalues save us from having to look at pictures with circles and triangles; when a dimension of the space flips, the corresponding eigenvalue is negative.

We examined near singularity last time. Let’s look again, and this time add the eigenaxes:

The blue blob going from bottom-left to top-right is both the compressed space and the first eigenaxis. The second eigenaxis is shown in green.

Mata reported the first eigenpoint as (0.789 \ 0.614) and the second as (-0.460 \ 0.888). Corresponding eigenvalues were reported as 2.78 and 0.07. I should mention that zero eigenvalues indicate singular matrices and small eigenvalues indicate nearly singular matrices. Actually, eigenvalues also reflect the scale of the matrix. A matrix that compresses the space will have all of its eigenvalues be small, and that is not an indication of near singularity. To detect near singularity, one should look at the ratio of the largest to the smallest eigenvalue, which in this case is 0.07/2.78 = 0.03.

Despite appearances, computers do not find 0.03 to be small and thus do not think of this matrix as being nearly singular. This matrix gives computers no problem; Mata can calculate the inverse of this without losing even one binary digit. I mention this and show you the picture so that you will have a better appreciation of just how squished the space can become before computers start complaining.

When do well-programmed computers complain? Say you have a matrix A and make the above graph, but you make it really big — 3 miles by 3 miles. Lay your graph out on the ground and hike out to the middle of it. Now get down on your knees and get out your ruler. Measure the spread of the compressed space at its widest part. Is it an inch? That’s not a problem. One inch is roughly 5*10-6 of the original space (that is, 1 inch by 3 miles wide). If that were a problem, users would complain. It is not problematic until we get around 10-8 of the original area. Figure about 0.002 inches.

There’s more I could say about eigenvalues and eigenvectors. I could mention that rotation matrices have no eigenvectors and eigenvalues, or at least no real ones. A rotation matrix rotates the space, and thus there are no transformed points that are along their original line through the origin. I could mention that one can rebuild the original matrix from its eigenvectors and eigenvalues, and from that, one can generalize powers to matrix powers. It turns out that A-1 has the same eigenvectors as A; its eigenvalues are λ-1 of the original’s. Matrix AA also has the same eigenvectors as A; its eigenvalues are λ2. Ergo, Ap can be formed by transforming the eigenvalues, and it turns out that, indeed, A½ really does, when multiplied by itself, produce A.

Understanding matrices intuitively, part 1

I want to show you a way of picturing and thinking about matrices. The topic for today is the square matrix, which we will call A. I’m going to show you a way of graphing square matrices, although we will have to limit ourselves to the 2 x 2 case. That will be, as they say, without loss of generality. The technique I’m about to show you could be used with 3 x 3 matrices if you had a better 3-dimensional monitor, and as will be revealed, it could be used on 3 x 2 and 2 x 3 matrices, too. If you had more imagination, we could use the technique on 4 x 4, 5 x 5, and even higher-dimensional matrices.

But we will limit ourselves to 2 x 2. A might be

delim{[}{matrix{2}{2}{2 1 1.5 2}}{]}

From now on, I’ll write matrices as

A = (2, 1 \ 1.5, 2)

where commas are used to separate elements on the same row and backslashes are used to separate the rows.

To graph A, I want you to think about

y = Ax


y: 2 x 1,

A: 2 x 2, and

x: 2 x 1.

That is, we are going to think about A in terms of its effect in transforming points in space from x to y. For instance, if we had the point

x = (0.75 \ 0.25)


y = (1.75 \ 1.625)

because by the rules of matrix multiplication y[1] = 0.75*2 + 0.25*1 = 1.75 and y[2] = 0.75*1.5 + 0.25*2 = 1.625. The matrix A transforms the point (0.75 \ 0.25) to (1.75 \ 1.625). We could graph that:

To get a better understanding of how A transforms the space, we could graph additional points:

I do not want you to get lost among the individual points which A could transform, however. To focus better on A, we are going to graph y = Ax for all x. To do that, I’m first going to take a grid,

One at a time, I’m going to take every point on the grid, call the point x, and run it through the transform y = Ax. Then I’m going to graph the transformed points:

Finally, I’m going to superimpose the two graphs:

In this way, I can now see exactly what A = (2, 1 \ 1.5, 2) does. It stretches the space, and skews it.

I want you to think about transforms like A as transforms of the space, not of the individual points. I used a grid above, but I could just as well have used a picture of the Eiffel tower and, pixel by pixel, transformed it by using y = Ax. The result would be a distorted version of the original image, just as the the grid above is a distorted version of the original grid. The distorted image might not be helpful in understanding the Eiffel Tower, but it is helpful in understanding the properties of A. So it is with the grids.

Notice that in the above image there are two small triangles and two small circles. I put a triangle and circle at the bottom left and top left of the original grid, and then again at the corresponding points on the transformed grid. They are there to help you orient the transformed grid relative to the original. They wouldn’t be necessary had I transformed a picture of the Eiffel tower.

I’ve suppressed the scale information in the graph, but the axes make it obvious that we are looking at the first quadrant in the graph above. I could just as well have transformed a wider area.

Regardless of the region graphed, you are supposed to imagine two infinite planes. I will graph the region that makes it easiest to see the point I wish to make, but you must remember that whatever I’m showing you applies to the entire space.

We need first to become familiar with pictures like this, so let’s see some examples. Pure stretching looks like this:

Pure compression looks like this:

Pay attention to the color of the grids. The original grid, I’m showing in red; the transformed grid is shown in blue.

A pure rotation (and stretching) looks like this:

Note the location of the triangle; this space was rotated around the origin.

Here’s an interesting matrix that produces a surprising result: A = (1, 2 \ 3, 1).

This matrix flips the space! Notice the little triangles. In the original grid, the triangle is located at the top left. In the transformed space, the corresponding triangle ends up at the bottom right! A = (1, 2 \ 3, 1) appears to be an innocuous matrix — it does not even have a negative number in it — and yet somehow, it twisted the space horribly.

So now you know what 2 x 2 matrices do. They skew,stretch, compress, rotate, and even flip 2-space. In a like manner, 3 x 3 matrices do the same to 3-space; 4 x 4 matrices, to 4-space; and so on.

Well, you are no doubt thinking, this is all very entertaining. Not really useful, but entertaining.

Okay, tell me what it means for a matrix to be singular. Better yet, I’ll tell you. It means this:

A singular matrix A compresses the space so much that the poor space is squished until it is nothing more than a line. It is because the space is so squished after transformation by y = Ax that one cannot take the resulting y and get back the original x. Several different x values get squished into that same value of y. Actually, an infinite number do, and we don’t know which you started with.

A = (2, 3 \ 2, 3) squished the space down to a line. The matrix A = (0, 0 \ 0, 0) would squish the space down to a point, namely (0 0). In higher dimensions, say, k, singular matrices can squish space into k-1, k-2, …, or 0 dimensions. The number of dimensions is called the rank of the matrix.

Singular matrices are an extreme case of nearly singular matrices, which are the bane of my existence here at StataCorp. Here is what it means for a matrix to be nearly singular:

Nearly singular matrices result in spaces that are heavily but not fully compressed. In nearly singular matrices, the mapping from x to y is still one-to-one, but x‘s that are far away from each other can end up having nearly equal y values. Nearly singular matrices cause finite-precision computers difficulty. Calculating y = Ax is easy enough, but to calculate the reverse transform x = A-1y means taking small differences and blowing them back up, which can be a numeric disaster in the making.

So much for the pictures illustrating that matrices transform and distort space; the message is that they do. This way of thinking can provide intuition and even deep insights. Here’s one:

In the above graph of the fully singular matrix, I chose a matrix that not only squished the space but also skewed the space some. I didn’t have to include the skew. Had I chosen matrix A = (1, 0 \ 0, 0), I could have compressed the space down onto the horizontal axis. And with that, we have a picture of nonsquare matrices. I didn’t really need a 2 x 2 matrix to map 2-space onto one of its axes; a 2 x 1 vector would have been sufficient. The implication is that, in a very deep sense, nonsquare matrices are identical to square matrices with zero rows or columns added to make them square. You might remember that; it will serve you well.

Here’s another insight:

In the linear regression formula b = (XX)-1Xy, (XX)-1 is a square matrix, so we can think of it as transforming space. Let’s try to understand it that way.

Begin by imagining a case where it just turns out that (XX)-1 = I. In such a case, (XX)-1 would have off-diagonal elements equal to zero, and diagonal elements all equal to one. The off-diagonal elements being equal to 0 means that the variables in the data are uncorrelated; the diagonal elements all being equal to 1 means that the sum of each squared variable would equal 1. That would be true if the variables each had mean 0 and variance 1/N. Such data may not be common, but I can imagine them.

If I had data like that, my formula for calculating b would be b = (XX)-1Xy = IXy = Xy. When I first realized that, it surprised me because I would have expected the formula to be something like b = X-1y. I expected that because we are finding a solution to y = Xb, and b = X-1y is an obvious solution. In fact, that’s just what we got, because it turns out that X-1y = Xy when (XX)-1 = I. They are equal because (XX)-1 = I means that XX = I, which means that X‘ = X-1. For this math to work out, we need a suitable definition of inverse for nonsquare matrices. But they do exist, and in fact, everything you need to work it out is right there in front of you.

Anyway, when correlations are zero and variables are appropriately normalized, the linear regression calculation formula reduces to b = Xy. That makes sense to me (now) and yet, it is still a very neat formula. It takes something that is N x k — the data — and makes k coefficients out of it. Xy is the heart of the linear regression formula.

Let’s call b = Xy the naive formula because it is justified only under the assumption that (XX)-1 = I, and real XX inverses are not equal to I. (XX)-1 is a square matrix and, as we have seen, that means it can be interpreted as compressing, expanding, and rotating space. (And even flipping space, although it turns out the positive-definite restriction on XX rules out the flip.) In the formula (XX)-1Xy, (XX)-1 is compressing, expanding, and skewing Xy, the naive regression coefficients. Thus (XX)-1 is the corrective lens that translates the naive coefficients into the coefficient we seek. And that means XX is the distortion caused by scale of the data and correlations of variables.

Thus I am entitled to describe linear regression as follows: I have data (y, X) to which I want to fit y = Xb. The naive calculation is b = Xy, which ignores the scale and correlations of the variables. The distortion caused by the scale and correlations of the variables is XX. To correct for the distortion, I map the naive coefficients through (XX)-1.

Intuition, like beauty, is in the eye of the beholder. When I learned that the variance matrix of the estimated coefficients was equal to s2(XX)-1, I immediately thought: s2 — there’s the statistics. That single statistical value is then parceled out through the corrective lens that accounts for scale and correlation. If I had data that didn’t need correcting, then the standard errors of all the coefficients would be the same and would be identical to the variance of the residuals.

If you go through the derivation of s2(XX)-1, there’s a temptation to think that s2 is merely something factored out from the variance matrix, probably to emphasize the connection between the variance of the residuals and standard errors. One easily loses sight of the fact that s2 is the heart of the matter, just as Xy is the heart of (XX)-1Xy. Obviously, one needs to view both s2 and Xy though the same corrective lens.

I have more to say about this way of thinking about matrices. Look for part 2 in the near future. Update: part 2 of this posting, “Understanding matrices intuitively, part 2, eigenvalues and eigenvectors”, may now be found at