Home > Linear Algebra > Understanding matrices intuitively, part 1

Understanding matrices intuitively, part 1

I want to show you a way of picturing and thinking about matrices. The topic for today is the square matrix, which we will call A. I’m going to show you a way of graphing square matrices, although we will have to limit ourselves to the 2 x 2 case. That will be, as they say, without loss of generality. The technique I’m about to show you could be used with 3 x 3 matrices if you had a better 3-dimensional monitor, and as will be revealed, it could be used on 3 x 2 and 2 x 3 matrices, too. If you had more imagination, we could use the technique on 4 x 4, 5 x 5, and even higher-dimensional matrices.

But we will limit ourselves to 2 x 2. A might be

    [ 2    1 ]
    [ 1.5  2 ]

From now on, I’ll write matrices as

A = (2, 1 \ 1.5, 2)

where commas are used to separate elements on the same row and backslashes are used to separate the rows.

To graph A, I want you to think about

y = Ax

where

y: 2 x 1,

A: 2 x 2, and

x: 2 x 1.

That is, we are going to think about A in terms of its effect in transforming points in space from x to y. For instance, if we had the point

x = (0.75 \ 0.25)

then

y = (1.75 \ 1.625)

because by the rules of matrix multiplication y[1] = 0.75*2 + 0.25*1 = 1.75 and y[2] = 0.75*1.5 + 0.25*2 = 1.625. The matrix A transforms the point (0.75 \ 0.25) to (1.75 \ 1.625). We could graph that:
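The arithmetic above is easy to check by machine. The post's graphs were presumably made in Stata; here, purely as a sketch, is the same single-point transform in numpy:

```python
import numpy as np

# The matrix and point from the text.
A = np.array([[2.0, 1.0],
              [1.5, 2.0]])
x = np.array([0.75, 0.25])

# y = Ax: each element of y is a row of A dotted with x.
y = A @ x
print(y)  # 1.75 and 1.625, matching the hand calculation
```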

To get a better understanding of how A transforms the space, we could graph additional points:

I do not want you to get lost among the individual points which A could transform, however. To focus better on A, we are going to graph y = Ax for all x. To do that, I’m first going to take a grid,

One at a time, I’m going to take every point on the grid, call the point x, and run it through the transform y = Ax. Then I’m going to graph the transformed points:
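In code, "every point on the grid" is just a stack of rows, and one matrix product transforms them all at once. A numpy sketch (plotting omitted; the grid spacing here is my own choice, not the post's):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.5, 2.0]])

# A 5 x 5 grid of points covering the unit square in the first quadrant.
u = np.linspace(0.0, 1.0, 5)
grid = np.array([[x1, x2] for x1 in u for x2 in u])  # shape (25, 2)

# Transform every point at once: each row x becomes y' = (Ax)' = x'A'.
transformed = grid @ A.T

# The corner (1, 1) lands at (2*1 + 1*1, 1.5*1 + 2*1) = (3, 3.5).
print(transformed[-1])
```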

Finally, I’m going to superimpose the two graphs:

In this way, I can now see exactly what A = (2, 1 \ 1.5, 2) does. It stretches the space, and skews it.

I want you to think about transforms like A as transforms of the space, not of the individual points. I used a grid above, but I could just as well have used a picture of the Eiffel tower and, pixel by pixel, transformed it by using y = Ax. The result would be a distorted version of the original image, just as the grid above is a distorted version of the original grid. The distorted image might not be helpful in understanding the Eiffel Tower, but it is helpful in understanding the properties of A. So it is with the grids.

Notice that in the above image there are two small triangles and two small circles. I put a triangle and circle at the bottom left and top left of the original grid, and then again at the corresponding points on the transformed grid. They are there to help you orient the transformed grid relative to the original. They wouldn’t be necessary had I transformed a picture of the Eiffel tower.

I’ve suppressed the scale information in the graph, but the axes make it obvious that we are looking at the first quadrant in the graph above. I could just as well have transformed a wider area.

Regardless of the region graphed, you are supposed to imagine two infinite planes. I will graph the region that makes it easiest to see the point I wish to make, but you must remember that whatever I’m showing you applies to the entire space.

We need first to become familiar with pictures like this, so let’s see some examples. Pure stretching looks like this:

Pure compression looks like this:

Pay attention to the color of the grids. The original grid, I’m showing in red; the transformed grid is shown in blue.

A rotation (and stretching) looks like this:

Note the location of the triangle; this space was rotated around the origin.

Here’s an interesting matrix that produces a surprising result: A = (1, 2 \ 3, 1).

This matrix flips the space! Notice the little triangles. In the original grid, the triangle is located at the top left. In the transformed space, the corresponding triangle ends up at the bottom right! A = (1, 2 \ 3, 1) appears to be an innocuous matrix — it does not even have a negative number in it — and yet somehow, it twisted the space horribly.
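The post identifies the flip visually, but there is also a one-number test, not mentioned above: a 2 x 2 matrix reverses the orientation of the plane exactly when its determinant is negative. A quick numpy check:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 1.0]])

# det(A) = 1*1 - 2*3 = -5. The negative sign is the flip.
print(np.linalg.det(A))
```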

So now you know what 2 x 2 matrices do. They skew, stretch, compress, rotate, and even flip 2-space. In a like manner, 3 x 3 matrices do the same to 3-space; 4 x 4 matrices, to 4-space; and so on.

Well, you are no doubt thinking, this is all very entertaining. Not really useful, but entertaining.

Okay, tell me what it means for a matrix to be singular. Better yet, I’ll tell you. It means this:

A singular matrix A compresses the space so much that the poor space is squished until it is nothing more than a line. It is because the space is so squished after transformation by y = Ax that one cannot take the resulting y and get back the original x. Several different x values get squished into that same value of y. Actually, an infinite number do, and we don’t know which you started with.

A = (2, 3 \ 2, 3) squished the space down to a line. The matrix A = (0, 0 \ 0, 0) would squish the space down to a point, namely (0 \ 0). In higher dimensions, say, k, singular matrices can squish space into k-1, k-2, …, or 0 dimensions. The number of dimensions is called the rank of the matrix.
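These claims are easy to confirm numerically. A sketch using numpy's rank routine (the two x's below are my own choices, picked to exhibit the squish):

```python
import numpy as np

# A singular matrix maps the plane onto a line: rank 1.
A = np.array([[2.0, 3.0],
              [2.0, 3.0]])
print(np.linalg.matrix_rank(A))                 # 1

# The zero matrix squishes the plane to the point (0 \ 0): rank 0.
print(np.linalg.matrix_rank(np.zeros((2, 2))))  # 0

# Two different x's land on the same y, so the transform cannot be undone.
x1 = np.array([1.0, 0.0])
x2 = np.array([-0.5, 1.0])
print(A @ x1, A @ x2)                           # both give (2, 2)
```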

Singular matrices are an extreme case of nearly singular matrices, which are the bane of my existence here at StataCorp. Here is what it means for a matrix to be nearly singular:

Nearly singular matrices result in spaces that are heavily but not fully compressed. In nearly singular matrices, the mapping from x to y is still one-to-one, but x's that are far away from each other can end up having nearly equal y values. Nearly singular matrices cause finite-precision computers difficulty. Calculating y = Ax is easy enough, but to calculate the reverse transform x = A^-1 y means taking small differences and blowing them back up, which can be a numeric disaster in the making.
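A standard way to quantify "nearly singular" (not used in the post, but in the same spirit) is the condition number: the factor by which the reverse transform must blow small differences back up. A numpy sketch with a made-up near-singular matrix:

```python
import numpy as np

# Columns almost parallel: nearly singular.
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])

# Two x's that are far apart...
x1 = np.array([0.0, 0.0])
x2 = np.array([1.0, -1.0])

# ...map to nearly identical y's.
print(A @ x1)   # (0, 0)
print(A @ x2)   # (0, -0.0001)

# The condition number measures how badly A^-1 must magnify such
# tiny differences; here it is on the order of 10^4.
print(np.linalg.cond(A))
```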

So much for the pictures illustrating that matrices transform and distort space; the message is that they do. This way of thinking can provide intuition and even deep insights. Here’s one:

In the above graph of the fully singular matrix, I chose a matrix that not only squished the space but also skewed the space some. I didn’t have to include the skew. Had I chosen matrix A = (1, 0 \ 0, 0), I could have compressed the space down onto the horizontal axis. And with that, we have a picture of nonsquare matrices. I didn’t really need a 2 x 2 matrix to map 2-space onto one of its axes; a 2 x 1 vector would have been sufficient. The implication is that, in a very deep sense, nonsquare matrices are identical to square matrices with zero rows or columns added to make them square. You might remember that; it will serve you well.
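The zero-row idea can be seen directly in code. Below, the square matrix (1, 0 \ 0, 0) and the 1 x 2 matrix (1, 0) do the same work; the square version merely carries along a dead coordinate (a numpy sketch of my own, not from the post):

```python
import numpy as np

# Projects the plane onto the horizontal axis, keeping 2 coordinates.
A = np.array([[1.0, 0.0],
              [0.0, 0.0]])

# Same mapping, but the result lives in 1-space.
B = np.array([[1.0, 0.0]])

x = np.array([0.7, 0.3])
print(A @ x)   # (0.7, 0)
print(B @ x)   # (0.7,)
```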

Here’s another insight:

In the linear regression formula b = (X'X)^-1 X'y, (X'X)^-1 is a square matrix, so we can think of it as transforming space. Let's try to understand it that way.

Begin by imagining a case where it just turns out that (X'X)^-1 = I. In such a case, X'X would have off-diagonal elements equal to zero, and diagonal elements all equal to one. The off-diagonal elements being equal to 0 means that the variables in the data are uncorrelated; the diagonal elements all being equal to 1 means that the sum of each squared variable would equal 1. That would be true if the variables each had mean 0 and variance 1/N. Such data may not be common, but I can imagine them.

If I had data like that, my formula for calculating b would be b = (X'X)^-1 X'y = I X'y = X'y. When I first realized that, it surprised me because I would have expected the formula to be something like b = X^-1 y. I expected that because we are finding a solution to y = Xb, and b = X^-1 y is an obvious solution. In fact, that's just what we got, because it turns out that X^-1 y = X'y when (X'X)^-1 = I. They are equal because (X'X)^-1 = I means that X'X = I, which means that X' = X^-1. For this math to work out, we need a suitable definition of inverse for nonsquare matrices. But they do exist, and in fact, everything you need to work it out is right there in front of you.

Anyway, when correlations are zero and variables are appropriately normalized, the linear regression calculation formula reduces to b = X'y. That makes sense to me (now) and yet, it is still a very neat formula. It takes something that is N x k — the data — and makes k coefficients out of it. X'y is the heart of the linear regression formula.

Let’s call b = X'y the naive formula because it is justified only under the assumption that (X'X)^-1 = I, and real X'X inverses are not equal to I. (X'X)^-1 is a square matrix and, as we have seen, that means it can be interpreted as compressing, expanding, and rotating space. (And even flipping space, although it turns out the positive-definite restriction on X'X rules out the flip.) In the formula (X'X)^-1 X'y, (X'X)^-1 is compressing, expanding, and skewing X'y, the naive regression coefficients. Thus (X'X)^-1 is the corrective lens that translates the naive coefficients into the coefficients we seek. And that means X'X is the distortion caused by the scale of the data and the correlations of the variables.

Thus I am entitled to describe linear regression as follows: I have data (y, X) to which I want to fit y = Xb. The naive calculation is b = X'y, which ignores the scale and correlations of the variables. The distortion caused by the scale and correlations of the variables is X'X. To correct for the distortion, I map the naive coefficients through (X'X)^-1.
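The claim that the full formula collapses to the naive one is easy to verify numerically. A sketch with randomly generated data (not the author's code; np.linalg.qr is used here only as a convenient way to manufacture an X with orthonormal columns, i.e. with X'X = I):

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 100, 3

# Manufacture an N x k data matrix with orthonormal columns, so X'X = I.
X, _ = np.linalg.qr(rng.standard_normal((N, k)))
assert np.allclose(X.T @ X, np.eye(k))

y = rng.standard_normal(N)

b_full = np.linalg.solve(X.T @ X, X.T @ y)  # the usual (X'X)^-1 X'y
b_naive = X.T @ y                           # the naive formula
assert np.allclose(b_full, b_naive)         # identical, as claimed
```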

Intuition, like beauty, is in the eye of the beholder. When I learned that the variance matrix of the estimated coefficients was equal to s^2 (X'X)^-1, I immediately thought: s^2 — there's the statistics. That single statistical value is then parceled out through the corrective lens that accounts for scale and correlation. If I had data that didn't need correcting, then the standard errors of all the coefficients would be the same and would be identical to the variance of the residuals.

If you go through the derivation of s^2 (X'X)^-1, there's a temptation to think that s^2 is merely something factored out from the variance matrix, probably to emphasize the connection between the variance of the residuals and the standard errors. One easily loses sight of the fact that s^2 is the heart of the matter, just as X'y is the heart of (X'X)^-1 X'y. Obviously, one needs to view both s^2 and X'y through the same corrective lens.

I have more to say about this way of thinking about matrices. Look for part 2 in the near future. Update: part 2 of this posting, “Understanding matrices intuitively, part 2, eigenvalues and eigenvectors”, may now be found at http://blog.stata.com/2011/03/09/understanding-matrices-intuitively-part-2/.

  • Anonymous

    Comments are now open.

  • A.J.

    I like how you see the square matrix. Thanks for sharing! I’m happy to comment first on Not Elsewhere Classified.

  • If you’re using Stata to produce the visualizations, will you be posting the code?

  • DSB

    Great stuff. Speaking of invertibility and continuing with your example of A as a 2×2 matrix, it might be useful to demonstrate (in part 2) how we can (geometrically) represent the determinant of a matrix as its volume.

  • Chuck Huber

    Very cool!

  • RH

    Is it me or is something up with the ‘pure rotation’ example?

  • Anonymous

    You are right and I just crossed out the word pure. The result is both rotated and stretched.

  • Anonymous

    Yes, I will do that when Part 2 goes up.

  • wly

    A very interesting post! I was wondering if you could also post your Stata codes that create these graphs. Thanks.

  • Fileas

    great post! keep it coming like that and everybody will be able to develop a comprehensive intuition on what is going on inside stata. thanks

  • Extremely valuable post. Thanks so much for taking the time and creativity to do this.

  • Npandis

    This is great.
    I am wondering if the articles could also be viewed in a print version.
    Thanks

  • OpaqueWaters

    This is fantastic. I’ve been using matrices for a while now and was doing the calculations fine but it was irritating not being able to visualise any of it. Wish they had these kind of explanations in our textbooks, thank you!

  • Megasheep

    This is so much better than a whole semester at uni. Thank you for sharing your insight and please write more. I’ve just become a fan of this blog.

  • sandeep

    wow! Extremely informative and valuable article.

  • unconed

    It blows my mind that this is not how linear algebra is introduced. I realized the same thing doing computer graphics, and now find matrices easy.

    What’s more, you can explain vector-matrix multiplication intuitively this way. All you’re doing is splitting the vector into its coordinates, and mapping them onto the distorted grid, using the rows or columns of the matrix as new basis vectors.

    Matrix-matrix multiplication can then be done by splitting one matrix into its rows or column vectors, and transforming them using this principle.

    The asymmetry of matrix multiplication is explained by doing this process twice, once from the left, using rows as vectors, and once from the right, using columns as vectors. You will find you have written out the same algorithm in two different ways.

    And for the real kicker: you can explain homogeneous coordinates / projective transforms intuitively. Imagine you’re playing an FPS game like Portal and you’re looking around standing at a fixed position. Your brain tells you you are seeing a 3D space rotate around a common origin (i.e. an affine matrix transform). But really, the screen is showing 2D shapes scrolling through your field of view, i.e. translating. So, if you express 2D shapes as 3D shapes floating in front of a camera, you can use 3D matrix transforms to perform 2D translations, and you get perspective projections for free. This principle is applied everywhere in computer graphics, leading to 4D matrices/vectors for 3D engines.

  • Brajesh

    Great article, very valuable, this makes understanding matrices so much easier.

  • I get everything up until the part about linear regressions. I kind of get it but you’re using terminology that I don’t know. What is X`X? Is that matrix multiplication?

  • BillGould

    X’X is a way of writing X-transpose multiplied by X. I’m using prime to indicate transpose.

  • Let X = (a, b \ c, d); then X' = (a, c \ b, d)?

  • BillGould


  • Siddhartha

    I am unable to view the two circles or two triangles that you pointed out in the graph. Can you tell me how to find them?

  • BillGould

    It is hard to see.

    Look at the upper left corner of the red rectangular grid and you’ll see a black triangle.

    Look at the lower left corner of the red rectangular grid and you’ll see a black circle.

    It helps if you increase the size of the image.

  • Siddhartha

    Still very hard to find; I tried the magnifier option.

  • Adrian Druzgalski

    First off, fantastic article. Thank you.

    So I believe there is a tiny error here, but I’m not quite sure…

    In the following comment

    “The off-diagonal elements being equal to 0 means that the variables in the data are uncorrelated; the diagonal elements all being equal to 1 means that the sum of each squared variable would equal 1. That would be true if the variables each had mean 0 and variance 1/N. Such data may not be common, but I can imagine them.”

    So the uncorrelated off-diagonals means that cov(x_i, x_j) = 0 when i != j, which I agree with.

    When i = j we are looking at cov(x_i, x_i) = var(x_i) = 1. This means that the variance is 1 for *each* variable, but the variance of the mean is 1/N from the Bienaymé formula.


    Also, how do you know that the mean is 0? I’m not saying it’s not, I just wasn’t able to convince myself of that at a glance. Thanks!

  • BillGould

    I am not understanding the variance question. Yes, the variance of x is 1. Ergo, the variance of the mean of x in a sample of size N is 1/N. There is no contradiction. The “variance of the mean” would be better called the variance of the average; we are calculating the variance of (x1+x2+…+xN)/N.

    As for the mean being 0, I just assumed that to save myself some math; there’s nothing you can look at in X’X to see that. In my business, it is not uncommon for us to calculate not X’X, but Z’Z, where z_i = x_i – mean(x_i). My business is statistical software and calculating Z’Z is a more numerically accurate way to obtain X’X. One could add back in the mean after making the Z’Z calculation to obtain X’X, but there’s no reason to do that because most formulas you want to calculate are simpler and therefore easier to calculate accurately when you use Z’Z. Anyway, I skipped all that and just said “assume …” so I could make my point, and I just assumed everyone would understand that we could easily calculate a matrix with these properties and thus I had made my point without loss of generality. Sorry.

  • biostata

    x = (0.75 \ 0.25) represents an arrow pointing from the origin, (0,0), to the point (0.75, 0.25). scatteri `x1' `x2' should be scatteri `x2' `x1'

  • Ani

    By far the best description of matrix multiplication i have seen. You should write a book. Very well presented article thanks.

  • Celina

    I also very much like your post. Having been struggling with visualizing the graphs and finally this is the thing I need. Thanks!

  • arazalan

    I have now understood what is the use of matrices.

  • Jon

    This is a great blog post, you’ve really helped me gain insight into both abstract (matrix multiplication) and applied (linear regression) concepts. Much appreciated!