Author Archive

Use poisson rather than regress; tell a friend

Do you ever fit regressions of the form

ln(yj) = b0 + b1x1j + b2x2j + … + bkxkj + εj

by typing

. generate lny = ln(y)

. regress lny x1 x2 … xk

The above is just an ordinary linear regression except that ln(y) appears on the left-hand side in place of y.

The next time you need to fit such a model, rather than fitting a regression on ln(y), consider typing

. poisson y x1 x2 … xk, vce(robust)

which is to say, fit instead a model of the form

yj = exp(b0 + b1x1j + b2x2j + … + bkxkj + εj)

Wait, you are probably thinking. Poisson regression assumes the variance is equal to the mean,

E(yj) = Var(yj) = exp(b0 + b1x1j + b2x2j + … + bkxkj)

whereas linear regression merely assumes E(ln(yj)) = b0 + b1x1j + b2x2j + … + bkxkj and places no constraint on the variance. Actually regression does assume the variance is constant but since we are working the logs, that amounts to assuming that Var(yj) is proportional to yj, which is reasonable in many cases and can be relaxed if you specify vce(robust).

In any case, in a Poisson process, the mean is equal to the variance. If your goal is to fit something like a Mincer earnings model,

ln(incomej) = b0 + b1*educationj + b2*experiencej + b3*experiencej2 + εj

there is simply no reason to think that the the variance of the log of income is equal to its mean. If a person has an expected income of $45,000, there is no reason to think that the variance around that mean is 45,000, which is to say, the standard deviation is$212.13. Indeed, it would be absurd to think one could predict income so accurately based solely on years of schooling and job experience.

Nonetheless, I suggest you fit this model using Poisson regression rather than linear regression. It turns out that the estimated coefficients of the maximum-likelihood Poisson estimator in no way depend on the assumption that E(yj) = Var(yj), so even if the assumption is violated, the estimates of the coefficients b0, b1, …, bk are unaffected. In the maximum-likelihood estimator for Poisson, what does depend on the assumption that E(yj) = Var(yj) are the estimated standard errors of the coefficients b0, b1, …, bk. If the E(yj) = Var(yj) assumption is violated, the reported standard errors are useless. I did not suggest, however, that you type

. poisson y x1 x2 … xk

I suggested that you type

. poisson y x1 x2 … xk, vce(robust)

That is, I suggested that you specify that the variance-covariance matrix of the estimates (of which the standard errors are the square root of the diagonal) be estimated using the Huber/White/Sandwich linearized estimator. That estimator of the variance-covariance matrix does not assume E(yj) = Var(yj), nor does it even require that Var(yj) be constant across j. Thus, Poisson regression with the Huber/White/Sandwich linearized estimator of variance is a permissible alternative to log linear regression — which I am about to show you — and then I’m going to tell you why it’s better.

I have created simulated data in which

yj = exp(8.5172 + 0.06*educj + 0.1*expj – 0.002*expj2 + εj)

where εj is distributed normal with mean 0 and variance 1.083 (standard deviation 1.041). Here’s the result of estimation using regress:


. regress lny educ exp exp2

Source |       SS       df       MS              Number of obs =    5000
-------------+------------------------------           F(  3,  4996) =   44.72
Model |  141.437342     3  47.1457806           Prob > F      =  0.0000
Residual |  5267.33405  4996  1.05431026           R-squared     =  0.0261
-------------+------------------------------           Adj R-squared =  0.0256
Total |  5408.77139  4999  1.08197067           Root MSE      =  1.0268

------------------------------------------------------------------------------
lny |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ |   .0716126   .0099511     7.20   0.000      .052104    .0911212
exp |   .1091811   .0129334     8.44   0.000     .0838261    .1345362
exp2 |  -.0022044   .0002893    -7.62   0.000    -.0027716   -.0016373
_cons |   8.272475   .1855614    44.58   0.000     7.908693    8.636257
------------------------------------------------------------------------------


I intentionally created these data to produce a low R-squared.

We obtained the following results:


truth      est.    S.E.
----------------------------------
educ      0.0600    0.0716  0.0100
exp       0.1000    0.1092  0.0129
exp2     -0.0020   -0.0022  0.0003
-----------------------------------
_cons     8.5172    8.2725  0.1856   <- unadjusted (1)
9.0587    8.7959     ?     <-   adjusted (2)
-----------------------------------
(1) To be used for predicting E(ln(yj))
(2) To be used for predicting E(yj)


Note that the estimated coefficients are quite close to the true values. Ordinarily, we would not know the true values, except I created this artificial dataset and those are the values I used.

For the intercept, I list two values, so I need to explain. We estimated a linear regression of the form,

ln(yj) = b0 + Xjb + εj

As with all linear regressions,


E(ln(yj)) = E(b0 + Xjb + εj)
= b0 + Xjb + E(εj)
= b0 + Xjb


We, however, have no real interest in E(ln(yj)). We fit this log regression as a way of obtaining estimates of our real model, namely

yj = exp(b0 + Xjb + εj)

So rather than taking the expectation of ln(yj), lets take the expectation of yj:


E(yj) = E(exp(b0 + Xjb + εj))
= E(exp(b0 + Xjb) * exp(εj))
= exp(b0 + Xjb) * E(exp(εj))


E(exp(εj)) is not one. E(exp(εj)) for εj distributed N(0, σ2) is exp(σ2/2). We thus obtain

E(yj) = exp(b0 + Xjb) * exp(σ2/2)

People who fit log regressions know about this — or should — and know that to obtain predicted yj values, they must

1. Obtain predicted values for ln(yj) = b0 + Xjb.

2. Exponentiate the predicted log values.

3. Multiply those exponentiated values by exp(σ2/2), where σ2 is the square of the root-mean-square-error (RMSE) of the regression.

They do in this in Stata by typing

. predict yhat

. replace yhat = exp(yhat).

. replace yhat = yhat*exp(e(rmse)^2/2)

In the table I that just showed you,


truth      est.    S.E.
----------------------------------
educ      0.0600    0.0716  0.0100
exp       0.1000    0.1092  0.0129
exp2     -0.0020   -0.0022  0.0003
-----------------------------------
_cons     8.5172    8.2725  0.1856   <- unadjusted (1)
9.0587    8.7959     ?     <-   adjusted (2)
-----------------------------------
(1) To be used for predicting E(ln(yj))
(2) To be used for predicting E(yj)


I’m setting us up to compare these estimates with those produced by poisson. When we estimate using poisson, we will not need to take logs because the Poisson model is stated in terms of yj, not ln(yj). In prepartion for that, I have included two lines for the intercept — 8.5172, which is the intercept reported by regress and is the one appropriate for making predictions of ln(y) — and 9.0587, an intercept appropriate for making predictions of y and equal to 8.5172 plus σ2/2. Poisson regression will estimate the 9.0587 result because Poisson is stated in terms of y rather than ln(y).

I placed a question mark in the column for the standard error of the adjusted intercept because, to calculate that, I would need to know the standard error of the estimated RMSE, and regress does not calculate that.

Let’s now look at the results that poisson with option vce(robust) reports. We must not forget to specify option vce(robust) because otherwise, in this model that violates the Poisson assumption that E(yj) = Var(yj), we would obtain incorrect standard errors.


. poisson y educ exp exp2, vce(robust)
note: you are responsible for interpretation of noncount dep. variable

Iteration 0:   log pseudolikelihood = -1.484e+08
Iteration 1:   log pseudolikelihood = -1.484e+08
Iteration 2:   log pseudolikelihood = -1.484e+08

Poisson regression                                Number of obs   =       5000
Wald chi2(3)    =      67.52
Prob > chi2     =     0.0000
Log pseudolikelihood = -1.484e+08                 Pseudo R2       =     0.0183

------------------------------------------------------------------------------
|               Robust
y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ |   .0575636   .0127996     4.50   0.000     .0324769    .0826504
exp |   .1074603   .0163766     6.56   0.000     .0753628    .1395578
exp2 |  -.0022204   .0003604    -6.16   0.000    -.0029267   -.0015141
_cons |   9.016428   .2359002    38.22   0.000     8.554072    9.478784
------------------------------------------------------------------------------


So now we can fill in the rest of our table:


regress            poisson
truth      est.    S.E.      est.     S.E.
-----------------------------------------------------
educ      0.0600    0.0716  0.0100     0.0576  0.1280
exp       0.1000    0.1092  0.0129     0.1075  0.0164
exp2     -0.0020   -0.0022  0.0003    -0.0022  0.0003
------------------------------------------------------
_cons     8.5172    8.2725  0.1856          ?       ?   <- (1)
9.0587    8.7959       ?     9.0164  0.2359   <- (2)
------------------------------------------------------
(1) To be used for predicting E(ln(yj))
(2) To be used for predicting E(yj)


I told you that Poisson works, and in this case, it works well. I’ll now tell you that in all cases it works well, and it works better than log regression. You want to think about Poisson regression with the vce(robust) option as a better alternative to log regression.

How is Poisson better?

First off, Poisson handles outcomes that are zero. Log regression does not because ln(0) is -∞. You want to be careful about what it means to handle zeros, however. Poisson handles zeros that arise in correspondence to the model. In the Poisson model, everybody participates in the yj = exp(b0 + Xjb + εj) process. Poisson regression does not handle cases where some participate and others do not, and among those who do not, had they participated, would likely produce an outcome greater than zero. I would never suggest using Poisson regression to handle zeros in an earned income model because those that earned zero simply didn’t participate in the labor force. Had they participated, their earnings might have been low, but certainly they would have been greater than zero. Log linear regression does not handle that problem, either.

Natural zeros do arise in other situations, however, and a popular question on Statalist is whether one should recode those natural zeros as 0.01, 0.0001, or 0.0000001 to avoid the missing values when using log linear regression. The answer is that you should not recode at all; you should use Poisson regression with vce(robust).

Secondly, small nonzero values, however they arise, can be influential in log-linear regressions. 0.01, 0.0001, 0.0000001, and 0 may be close to each other, but in the logs they are -4.61, -9.21, -16.12, and -∞ and thus not close at all. Pretending that the values are close would be the same as pretending that that exp(4.61)=100, exp(9.21)=9,997, exp(16.12)=10,019,062, and exp(∞)=∞ are close to each other. Poisson regression understands that 0.01, 0.0001, 0.0000001, and 0 are indeed nearly equal.

Thirdly, when estimating with Poisson, you do not have to remember to apply the exp(σ2/2) multiplicative adjustment to transform results from ln(y) to y. I wrote earlier that people who fit log regressions of course remember to apply the adjustment, but the sad fact is that they do not.

Finally, I would like to tell you that everyone who estimates log models knows about the Poisson-regression alternative and it is only you who have been out to lunch. You, however, are in esteemed company. At the recent Stata Conference in Chicago, I asked a group of knowledgeable researchers a loaded question, to which the right answer was Poisson regression with option vce(robust), but they mostly got it wrong.

I said to them, “I have a process for which it is perfectly reasonable to assume that the mean of yj is given by exp(b0 + Xjb), but I have no reason to believe that E(yj) = Var(yj), which is to say, no reason to suspect that the process is Poisson. How would you suggest I estimate the model?” Certainly not using Poisson, they replied. Social scientists suggested I use log regression. Biostatisticians and health researchers suggested I use negative binomial regression even when I objected that the process was not the gamma mixture of Poissons that negative binomial regression assumes. “What else can you do?” they said and shrugged their collective shoulders. And of course, they just assumed over dispersion.

Based on those answers, I was ready to write this blog entry, but it turned out differently than I expected. I was going to slam negative binomial regression. Negative binomial regression makes assumptions about the variance, assumptions different from that made by Poisson, but assumptions nonetheless, and unlike the assumption made in Poisson, those assumptions do appear in the first-order conditions that determine the fitted coefficients that negative binomial regression reports. Not only would negative binomial’s standard errors be wrong — which vce(robust) could fix — but the coefficients would be biased, too, and vce(robust) would not fix that. I planned to run simulations showing this.

When I ran the simulations, I was surprised by the results. The negative binomial estimator (Stata’s nbreg) was remarkably robust to violations in variance assumptions as long as the data were overdispersed. In fact, negative binomial regression did about as well as Poisson regression. I did not run enough simulations to make generalizations, and theory tells me those generalizations have to favor Poisson, but the simulations suggested that if Poisson does do better, it’s not in the first four decimal places. I was impressed. And disappointed. It would have been a dynamite blog entry.

So you’ll have to content yourself with this one.

Others have preceeded me in the knowledge that Poisson regression with vce(robust) is a better alternative to log-linear regression. I direct you to Jeffery Wooldridge, Econometric Analysis of Cross Section and Panel Data, 2nd ed., chapter 18. Or see A. Colin Cameron and Pravin K. Trivedi, Microeconomics Using Stata, revised edition, chapter 17.3.2.

I first learned about this from a talk given by Austin Nichols, Regression for nonnegative skewed dependent variables, given in 2010 at the Stata Conference in Boston. That talk goes far beyond what I have presented here, and I heartily recommend it.

Categories: Statistics Tags:

Precision (yet again), Part II

In part I, I wrote about precision issues in English. If you enjoyed that, you may want to stop reading now, because I’m about to go into the technical details. Actually, these details are pretty interesting.

For instance, I offered the following formula for calculating error due to float precision:

maximum_error = 2-24 X

I later mentioned that the formula is an approximation, and said that the true formula is,

maximum_error = 2-24 2floor(log2 X)

I didn’t explain how I got either formula.

I need to be more precise today than I was in my previous posting. For instance, I previously used x for two concepts, the true value and the rounded-after-storage value. Today I need to distinguish those concepts.

X is the true value.

x is the value after rounding due to storage.

The issue is the difference between x and X when X is stored in 24-binary-digit float precision.

Base 10

Although I harp on the value of learning to think in binary and hexadecimal, I admit that I, too, find it easier to think in base 10. So let’s start that way.

Say we record numbers to two digits of accuracy, which I will call d=2. Examples of d=2 numbers include

 1.0
1.6
12
47
52*10^1 (i.e, 520, but with only two significant digits)


To say that we record numbers to two digits of accuracy is to say that, coming upon the recorded number 1, we know only that the number lies between 0.95 and 1.05; or coming upon 12, that the true number lies between 11.5 and 12.5, and so on. I assume that numbers are rounded efficiently, which is to say, stored values record midpoints of intervals.

Before we get into the math, let me note that most us would be willing to say that numbers recorded this way are accurate to 1 part in 10 or, if d=3, to 1 part in 100. If numbers are accurate to 1 part in 10^(d-1), then couldn’t we must multiply the number by 1/(10^(d-1)) to obtain the width of the interval? Let’s try:

Assume X=520 and d=2. Then 520/(10^(2-1)) = 52. The true interval, however, is (515, 525] and it has width 10. So the simple formula does not work.

The simple formula does not work yet I presented its base-2 equivalent in Part 1 and I even recommended its use! We will get to that. It turns out the smaller the base, the more accurately the simple formula approximates the true formula, but before I can show that, I need the true formula.

Let’s start by thinking about d=1.

1. The recorded number 0 will contain all numbers between [-0.5, 0.5). The recorded number 1 will contain all numbers between [0.5, 1.5), and so on. For 0, 1, …, 9, the width of the intervals is 1.

2. The recorded number 10 will contain all numbers between [5, 15). The recorded number 20 will contain all numbers between [15, 25), and so on, For 10, 20, …, 90, the width of the intervals is 10.

The derivation for the width of interval goes like this:

1. If we recorded the value of X to one decimal digit, the recorded digit will will be b, the recorded value will be x = b*10p, and the power of ten will be p = floor(log10X). More importantly, W1 = 10p will be the width of the interval containing X.

2. It therefore follows that if we recorded the value of X to two decimal digits, the interval length would be W2 = W1/10. What ever the width with one digit, adding another must reduce width by one-tenth.

3. If we recorded the value of X to three decimal digits, the interval length would be W3 = W2/10.

4. Thus, if d is the number of digits to which numbers are recorded, the width of the interval is 10p where p = floor(log10X) – (d-1).

The above formula is exact.

Base 2

Converting the formula

interval_width = 10floor(log10X)-(d-1)

from base 10 to base 2 is easy enough:

interval_width = 2floor(log2X)-(d-1)

In Part 1, I presented this formula for d=24 as

maximum_error = 2floor(log2X)-24 = 2 -24 2floor(log2 X)

In interval_width, it is d-1 and not d that appears in the formula. You might think I made an error and should have put -23 where I put -24 in the maximum_error formula. There is no mistake. In Part 1, the maximum error was defined as a plus-or-minus quantity and is thus half the width of the overall interval. So I divided by 2, and in effect, I did put -23 into the maximum_error formula, at least before I subracted one more from it, making it -24 again.

I started out this posting by considering and dismissing the base-10 approximation formula

interval_width = 10-(d-1) X

which in maximum-error units is

maximum_error = 10-d X

and yet in Part 1, I presented — and even recommended — its base-2, d=24 equivalent,

maximum_error = 2-24 X

It turns out that the approximation formula is not as inaccurate in base 2 and it would be in base 10. The correct formula,

maximum_error = 2floor(log2X)-d

can be written

maximum_error = 2-d 2floor(log2X

so the question becomes about the accuracy of substituting X for 2^floor(log2X). We know by examination that X ≥ 2^floor(log2X), so making the substitution will overstate the error and, in that sense, is a safe thing to do. The question becomes how much the error is overstated.

X can be written 2^(log2X) and thus we need to compare 2^(log2X) with 2^floor(log2X). The floor() function cannot reduce its argument by more than 1, and thus 2^(log2X) cannot differ from 2^floor(log2X) by more than a factor of 2. Under the circumstances, this seems a reasonable approximation.

In the case of base 10, the the floor() function reducing its argument by up to 1 results in a decrease of up to a factor of 10. That, it seems to me, is not a reasonable amount of error.

Categories: Numerical Analysis Tags:

Precision (yet again), Part I

I wrote about precision here and here, but they were pretty technical.

“Great,” coworkers inside StataCorp said to me, “but couldn’t you explain these issues in a way that doesn’t get lost in the details of how computers store binary and maybe, just maybe, write about floats and doubles from a user’s perspective instead of programmer’s perspective?”

“Mmmm,” I said clearly.

Later, when I tried, I liked the result. It contains new material, too. What follows is what I now wish I had written first. I’d would have still written the other two postings, but as technical appendices.

In Part 2 (forthcoming), I provide the mathematical derivations underlying what follows. There are a few interesting issues underlying what follows.

Please excuse the manualish style of what follows, but I suspect that what follows will eventually work its way into Stata’s help files or manuals, so I wrote it that way.

Syntax

Problem:

. generate x = 1.1

. list
(Stata displays output showing x is 1.1 in all observations)

. count if x==1.1
0


Solution 1:

. count if x==float(1.1)
100


Solution 2:

. generate double x = 1.1

. count if x==1.1
100


Solution 3:

. set type double

. generate x = 1.1

. count if x==1.1
100


Description

Stata works in binary. Stata stores data in float precision by default. Stata preforms all calculations in double precision. Sometimes the combination results in surprises until you think more carefully about what happened.

Remarks

Remarks are presented under the headings

Summary
Why count==1.1 produces 0
How count==float(1.1) solves the problem
How storing data as double appears to solve the problem (and does)
Float is plenty accurate to store most data
Why don’t I have the problems using Excel?

Summary

Justifications for all statements made appear in the sections below. In summary,

1. It sometimes appears that Stata is inaccurate. That is not true and, in fact, the appearance of inaccuracy happens in part because Stata is so accurate.

2. You can cover up this appearance of inaccuracy by storing all your data in double precision. This will double or more the size of your dataset, and so I do not recommend the double-precision solution unless your dataset is small relative to the amount of memory on your computer. In that case, there is nothing wrong with storing all your data in double precision.

The easiest way to implement the double-precision solution is by typing set type double. After that, Stata will default to to creating all new variables as doubles, at least for the remainder of the session. If all your datasets are small relative to the amount of memory on your computer, you can set type double, permanently.

3. The double-precision solution is needlessly wasteful of memory. It is difficult to imagine data that are accurate to more than float precision. Regardless of how your data are stored, Stata does all calculations in double precision, and sometimes in quad precision.

4. The issue of 1.1 not being equal to 1.1 arises only with “nice” decimal numbers. You just have to remember to use Stata’s float() function when dealing with such numbers.

Why count x==1.1 produces 0

Let’s trace through what happens when you type the commands

. generate x = 1.1

. count if x==1.1
0


Here is how it works:

1. Some numbers have no exact finite-digit binary representation just as some numbers have no exact finite-digit decimal representation. One-third, 0.3333… (base 10), is an example of a number with no exact finite-digit decimal representation. In base 12, one-third does have an exact finite-digit representation, namely 0.4 (base 12). In base 2 (binary), base 10 numbers such as 0.1, 0.2, 0.3, 0.4, 0.6, … have no exact finite-digit representation.

2. Computers store numbers with a finite number of binary digits. In float precision, numbers have 24 binary digits. In double precision, they have 53 binary digits.

The decimal number 1.1 in binary is 1.000110011001… (base 2). The 1001 on the end repeats forever. Thus, 1.1 (base 10) is stored by a computer as

 1.00011001100110011001101


in float, or as

 1.0001100110011001100110011001100110011001100110011010


in double. There are 24 and 53 digits in the numbers above.

3. Typing generate x = 1.1 results in 1.1 being interpreted as the longer binary number Stata performs all calculations in double precision. New variable x is created as a float by default. When the more precise number is stored in x, it is rounded to the shorter number.

4. Thus when you count if x==1.1 the result is 0 because 1.1 is again interpreted as the longer binary number and the longer number is compared to shorter number stored in x, and they are not equal.

How count x==float(1.1) solves the problem

One way to fix the problem is to change count if x==1.1 to read count if x==float(1.1):

. generate x = 1.1

. count if x==float(1.1)
100


Function float() rounds results to float precision. When you type float(1.1), the 1.1 is converted to binary, double precision, namely,

 1.0001100110011001100110011001100110011001100110011010 (base 2)


and float() then rounds that long binary number to

 1.00011001100110011001101 (base 2)


or more correctly, to

 1.0001100110011001100110000000000000000000000000000000 (base 2)


because the number is still stored in double precision. Regardless, this new value is equal to the value stored in x, and so count reports that 100 observations contain float(1.1).

As an aside, when you typed generate x = 1.1, Stata acted as if you typed generate x = float(1.1). Whenever you type generate x = … and x is a float, Stata acts if if you typed generate x = float(…).

How storing data as double appears to solve the problem (and does)

When you type

. generate double x = 1.1

. count if x==1.1
100


it should be pretty obvious how the problem was solved. Stata stores

1.0001100110011001100110011001100110011001100110011010 (base 2)


in x, and then compares the stored result to

1.0001100110011001100110011001100110011001100110011010 (base 2)


and of course they are equal.

In the Summary above, I referred to this as a cover up. It is a cover up because 1.1 (base 10) is not what is stored in x. What is stored in x is the binary number just shown, and to be equal to 1.1 (base 10), the binary number needs to suffixed with 1001, and then another 1001, and then another, and so on without end.

Stata tells you that x is equal to 1.1 because Stata converted the 1.1 in count to the same inexact binary representation as Stata previously stored in x, and those two values are equal, but neither is equal to 1.1 (base 10). This leads to an important property of digital computers:

If storage and calculation are done to the same precision, it will appear to the user as if all numbers that the user types are stored without error.

That is, it appears to you as if there is no inaccuracy in storing 1.1 in x when x is a double because Stata performs calculations in double. And it is equally true that it would appear to you as if there were no accuracy issues storing 1.1 when x is stored in float precision if Stata, observing that x is float, performed calculations involving x in float. The fact is that there are accuracy issues in both cases.

“Wait,” you are probably thinking. “I understand your argument, but I’ve always heard that float is inaccurate and double is accurate. I understand from your argument that it is only a matter of degree but, in this case, those two degrees are on opposite sides of an important line.”

“No,” I respond.

What you have heard is right with respect to calculation. What you have heard might apply to data storage too, but that is unlikely. It turns out that float provides plenty of precision to store most real measurements.

Float is plenty accurate to store most data

The misconception that float precision is inaccurate comes from the true statement that float precision is not accurate enough when it comes to making calculations with stored values. Whether float precision is accurate enough for storing values depends solely on the accuracy with which the values are measured.

Float precision provides 24 base-2 (binary) digits, and thus values stored in float precision have a maximum relative error error of plus-or-minus 2^(-24) = 5.96e-08, or less than +/-1 part in 15 million.

1. The U.S. deficit in 2011 is projected to be $1.5 trillion. Stored as a float, the number has a (maximum) error of 2^(-24) * 1.5e+12 =$89,407. That is, if the true number is 1.5 trillion, the number recorded in float precision is guaranteed to be somewhere in the range [(1.5e+12)-89,407, (1.5e+14)+89,407]. The projected U.S. deficit is not known to an accuracy of +/-$89,407. 2. People in the US work about 40 hours per week, or roughly 0.238 of the hours in the week. 2^(-24) * 0.238 = 1.419e-09 of a week, or 0.1 milliseconds. Time worked in a week is not known to an accuracy of +/-0.1 milliseconds. 3. A cancer survivor might live 350 days. 2^(-24) * 350 = .00002086, or 1.8 seconds. Time of death is rarely recorded to an accuracy of +/-1.8 seconds. Time of diagnosis is never recorded to such accuracy, nor could it be. 4. The moon is said to be 384,401 kilometers from the Earth. 2^(-24) * 348,401 = 0.023 kilometers, or 23 meters. At its closest and farthest, the moon is 356,400 and 406,700 kilometers from Earth. 5. Most fundamental constants of the universe are known to a few parts in a million, which is to say, less than 1 part in 15 million, the accuracy float precision can provide. An exception is the speed of light, measured to be 299,793.458 kilometers per second. Record that as a float and you will be off by 0.01 km/s. In all the examples except the last, quoted are worst-case scenarios. The actual errors depend on the exact number and is a more tedious calculation (not shown): 1. For the U.S. deficit, the exact error for 1.5 trillion is -$26,624, which is within the plus or minus $89,407 quoted. 2. For fraction of the week, at 0.238 the error is -0.04 milliseconds, which is within the +/-0.1 milliseconds quoted. 3. For cancer survival time, at 350 days the actual error is 0, which is within the +/-1.8 seconds quoted. 4. For the distance between the Earth and moon, the actual error is 0, which is within within the +/-23 meters quoted. The actual errors may be interesting, but the maximum errors are more useful. Remember the multiplier 2^(-24). All you have to do is multiply a measurement by 2^(-24) and compare the result with the inherent error in the measurement. If 2^(-24) multiplied by the measurement is less than the inherent error, you can use float precision to store your data. Otherwise, you need to use double. By the way, the formula maximum_error = 2^(-24) * x is an approximation. The true formula is maximum_error = 2^(-24) * 2^(floor(log2(x))) It can be readily proven that x ≥ 2^(floor(log2(x))) and thus the approximation formula overstates the maximum error. The approximation formula can overstate the maximum error by as much as a factor of 2. Float precision is adequate for most data. There is one kind of data, however, where float precision may not be adequate, and that is financial data such as sales data, general ledgers, and the like. People working with dollar-and-cent data, or Euro-and-Eurocent data, or Pound Stirling-and-penny data, or any other currency data, usually find it best to use doubles. To avoid rounding issues, it is preferable to store the data as pennies. Float precision binary cannot store 0.01, 0.02, and the like, exactly. Integer values, however, can be stored exactly, at least up to certain 16,777,215. Floats can store up to 16,777,215 exactly. If stored your data in pennies, that would correspond to$167,772.15.

Doubles can store up to 9,007,199,254,740,991 exactly. If you stored your data in pennies, the would correspond to $90,071,992,547,409.91, or just over$90 trillion.

Why don’t I have these problems using Excel?

You do not have these problems when you use Excel because Excel stores numeric values in double precision. As I explained in How float(1.1) solves the problem above,

If storage and calculation are done to the same precision, it will appear to the user as if all numbers that the user types are stored without error.

You can adopt the Excel solution in Stata by typing

. set type double, permanently


You will double (or more) the amount of memory Stata uses to store your data, but if that is not of concern to you, there are no other disadvantages to adopting this solution. If you adopt this solution and later wish to change your mind, type

. set type float, permanently


That’s all for today

If you enjoyed the above, you may want to see Part II (forthcoming). As I said, There are a few technical issues underlying what is written above that may interest those interested in computer science as it applies to statistical computing.

Categories: Numerical Analysis Tags:

Merging data, part 2: Multiple-key merges

Multiple-key merges arise when more than one variable is required to uniquely identify the observations in your data. In Merging data, part 1, I discussed single-key merges such as

        . merge 1:1 personid using ...


In that discussion, each observation in the dataset could be uniquely identified on the basis of a single variable. In panel or longitudinal datasets, there are multiple observations on each person or thing and to uniquely identify the observations, we need at least two key variables, such as

        . merge 1:1 personid date using ...


In this dataset we have repeated observations on persons and, within person, the observations can be uniquely identified by the calendar date.

Just to fix ideas, let’s assume I have two datasets. The first, sample.dta, is the one of analytic interest to me. I have data on 737 persons. For each person, I have data recorded on the first and fifteenth of every month, from year 2000 to 2009. Overall, my dataset has 176,880 observations.

The second dataset contains additional information (variables) on the sample of interest. It has over 3,000 people in it and it covers a longer stretch of time. I’ve been told that most of my 737 people should be in this second dataset, but I’ve been warned that, due to data collection or data processing errors over the years, a small fraction will not be found.

“How many?” I asked Bob from data processing during a fictional conversation, my paranoia kicking in.

“I don’t know. Most. Ninety-nine percent. It’s just random stuff,” he replied, knowing how well I respond to the word random.

Let’s call this second set of data the payroll data, although if I can imagine fictional conversations, you can imagine the data are something else. They might be records from follow-up visits of a medical experiment.

In any case, I receive the data, and here is what happened when I merged the data with my sample:

        . use sample, clear

. merge 1:1 personid date using payroll, keep(master match)

Result                           # of obs.
-----------------------------------------
not matched                         2,352
from master                     2,352  (_merge==1)
from using                          0  (_merge==2)

matched                           174,528  (_merge==3)
-----------------------------------------


In my sample data, I have 174,520 + 2,352 = 176,872 observations. Of those, 174,528 matched, which is 98.7 percent. (The reason that the number of records from the using (payroll) data that were not matched is zero is because I specified option keep(master match), meaning I discarded the unmatched payroll records. Had I not, the number would have been in the low millions.)

For many in this situation, the story would stop right here. Not for me. I want to show you how to tear into multiple-key merges to reassure yourself that things really are as they appear. You realize, of course, that I manufactured this fictional data for this blog entry and I buried a little something that once we find it, would scare you if this were a real story. So I’ll tell you now, this story is loosely based on a real story.

Step 1: Following my own advice

In Merging data, part 1 I recommended that you merge on all common variables, not just the identification variables. This blog entry is not going to rehash the previous blog entry, but I want to emphasize that everything I said in the previous entry about single-key merges applies equally to multiple-key merges. These two datasets share a variable recording the division in which the employee works, so I am included it among the match variables:

        . use sample, clear

. merge 1:1 personid date division using payroll, keep(master match)

Result                           # of obs.
-----------------------------------------
not matched                         2,352
from master                     2,352  (_merge==1)
from using                          0  (_merge==2)

matched                           174,528  (_merge==3)
-----------------------------------------


The output above matches the output when I merged date and division alone, so I do not appear to have a merge-gone-bad problem. These merged data are looking better and better.

Step 2: Merge on each key variable in isolation

Let’s imagine what could go wrong. Imagine that all the data for certain persons were missing, or that all the data for certain dates were missing. That might not be a problem, but it would certainly raise questions. Depending on the answers, it may be worth a footnote or concerning enough to return the data and ask for a refund.

Finding persons or dates that are entirely unmatched is a lot of work unless you know the following trick: Merge on one key variable at a time.

Let’s start with personid:

        . use sample, clear

. sort personid

. by personid: keep if _n==1           // <- don't skip this step
(176143 observations deleted)

. merge 1:m personid using payroll, keep(master match)

Result                           # of obs.
-----------------------------------------
not matched                             0
matched                           174,528  (_merge==3)
-----------------------------------------


The output above proves that payroll.dta contains data on every person that appears in sample.dta.

Let me explain. I began by using my sample data and keeping just one observation for every value of personid. I don't care which observation I keep, I just need to keep one and only one. Then I merged on personid, keeping (1) the records that match and (2) the records from the master that do not match. I have no interest in the resulting dataset; I just wanted to see the table merge would report. merge reports that 174,528 personids matched, and 0 did not. Ergo, every value of personid that appears in sample.dta also appears in payroll.dta.

Had merge reported "not matched" = 2, that would mean there would have been two values of personid appearing in sample.dta that do not appear in payroll.dta. It would not have been an indictment of the data if two persons were not matched in their entirety, but I would certainly have looked into the issue. With the merged result in memory, I would have typed

        . list personid if _merge==1
(output would appear)


I would have written down the two personids list reported. Then I would have returned to my sample data and looked at the data I had on those two people:

        . use sample, clear

. list if personid==735527
(output would appear)

. list if personid==29887
(output would appear)


It might be that 735527 was with the company for only a short time and thus the missing payroll record a believable random event. If 735527 had been with the company all ten years, however, I would be back on the phone seeking an explanation. (If these were medical data, certainly you would want to know how a person who never reported for a follow-up visit is known to still be alive after ten years.)

So much for personid. Let's do the same for date:

        . use sample, clear
. sort date
. by date: keep if _n==1
(176640 observations deleted)
. merge 1:m date using payroll, keep(master match)
Result                           # of obs.
-----------------------------------------
not matched                             0
matched                           236,832  (_merge==3)
-----------------------------------------


Every date that appears in sample.dta also appears in payroll.dta.

Finally, let's look at division:

        . use sample, clear

. sort division date

. by division date: keep if _n==1
(175200 observations deleted)

. merge 1:m division date using payroll, keep(master match)

Result                           # of obs.
-----------------------------------------
not matched                            24
from master                        24  (_merge==1)
from using                          0  (_merge==2)

matched                           236,832  (_merge==3)
-----------------------------------------


Every division that appears in sample.dta appears in payroll.dta

These data are looking better and better.

If we had only two key variables, we would be done. We, however, are performing the full merge on three variables, namely personid, date, and division, and so there is one more set of comparisons we should examine.

Step 3: Merge on every pair of key variables

With three key variables, the possible pairs are (personid, date), (personid, division), and (division, date). We have already looked at (personid, date), so that just leaves (personid, division) and (division, date).

The method is the same as in Step 2 except that we type two variables where we previously typed one:

        . use sample, clear

. sort personid division

. by personid division: keep if _n==1
(176143 observations deleted)

. merge 1:m personid division using payroll, keep(master match)

Result                           # of obs.
-----------------------------------------
not matched                             0
matched                           174,528  (_merge==3)
-----------------------------------------


We discover that every personid-division combination that appears in sample.dta also appears in payroll.dta.

Last is (division, date):

        . use sample, clear

. sort division date

. by division date: keep if _n==1
(175200 observations deleted)

. merge 1:m division date using payroll, keep(master match)

Result                           # of obs.
-----------------------------------------
not matched                            24
from master                        24  (_merge==1)
from using                          0  (_merge==2)

matched                           236,832  (_merge==3)
-----------------------------------------


Surprise! Well, you're not surprised because I told you earlier we were going to find something, but if this were real life, you would be surprised after all these reassuring results.

We discover that there are 24 division-date combinations appearing in sample.dta that do not appear in payroll.dta. Let's look at the 24 missing combinations:

        . list division date if _merge==1

+----------------------+
| division        date |
|----------------------|
1129. |        5   01jan2007 |
1130. |        5   15jan2007 |
1131. |        5   01feb2007 |
1132. |        5   15feb2007 |
1133. |        5   01mar2007 |
|----------------------|
1134. |        5   15mar2007 |
1135. |        5   01apr2007 |
1136. |        5   15apr2007 |
1137. |        5   01may2007 |
1138. |        5   15may2007 |
|----------------------|
1139. |        5   01jun2007 |
1140. |        5   15jun2007 |
1141. |        5   01jul2007 |
1142. |        5   15jul2007 |
1143. |        5   01aug2007 |
|----------------------|
1144. |        5   15aug2007 |
1145. |        5   01sep2007 |
1146. |        5   15sep2007 |
1147. |        5   01oct2007 |
1148. |        5   15oct2007 |
|----------------------|
1149. |        5   01nov2007 |
1150. |        5   15nov2007 |
1151. |        5   01dec2007 |
1152. |        5   15dec2007 |
+----------------------+


If you look closely, you will notice that every payroll date in 2007 is listed. So what happened to the payroll records for division 5 in 2007? This may indeed be exactly the kind of random event that Bob had in mind during our fictional conversation. Somehow the company lost a little cluster of payroll records. The loss may mean mean nothing. Or it might be of critical importance. Imagine there's been an allegation that the company treats older workers poorly and imagine that division 5 has the highest average age. Not random. Not random at all.

Step 4: Merge on every triplet of key variables

So much for the fictional story.

If we had four or more key variables, we would now need to merge on every triplet of key variables, and if we had five or more key variables, we then need to merge on every quadruplet of key variables, and if ...

Forget the story. Or remember it if it scares you. Data processing and paranoia make an excellent pairing. What's important is how easy it is to take complicated, multiple-key merges apart. I've never met anyone yet who knew this trick.

Categories: Data Management Tags:

Merging data, part 1: Merges gone bad

Merging concerns combining datasets on the same observations to produce a result with more variables. We will call the datasets one.dta and two.dta.

When it comes to combining datasets, the alternative to merging is appending, which is combining datasets on the same variables to produce a result with more observations. Appending datasets is not the subject for today. But just to fix ideas, appending looks like this:

              +-------------------+
| var1  var2  var3  |      one.dta
+-------------------+
1. | one.dta           |
2. |                   |
. |                   |
. |                   |
+-------------------+

+

+-------------------+
| var1  var2  var3  |      two.dta
+-------------------+
1. | two.dta           |
2. |                   |
. |                   |
+-------------------+

=

+-------------------+
| var1  var2  var3  |
+-------------------+
1. |                   |    one.dta
2. |                   |
. |                   |
. |                   |
+                   +      +
N1+1. |                   |    two.dta   appended
N2+2. |                   |
. |                   |
+-------------------+



Merging looks like this:


+-------------------+           +-----------+
| var1  var2  var3  |           | var4 var5 |
+-------------------+           +-----------+
1. |                   |        1. |           |
2. |                   |    +   2. |           |     =
. |                   |         . |           |
. |                   |         . |           |
+-------------------+           +-----------+
one.dta                         two.dta

+-------------------+-----------+
| var1  var2  var3    var4 var5 |
+-------------------------------+
1. |                               |
2. |                               |
. |                               |
. |                               |
+-------------------+-----------+
one.dta           + two.dta    merged



The matching of the two datasets — deciding which observations in one.dta are combined with which observations in two.dta — could be done simply on the observation numbers: Match one.dta observation 1 with two.dta observation 1, match one.dta observation 2 with two.dta observation 2, and so on. In Stata, you could obtain that result by typing

. use one, clear

. merge 1:1 using two


Never do this because it is too dangerous. You are merely assuming that observation 1 matches with observation 1, observation 2 matches with observation 2, and so on. What if you are wrong? If observation 2 in one.dta is Bob and observation 2 in two.dta is Mary, you will mistakenly combine the observations for Bob and Mary and, perhaps, never notice the mistake.

The better solution is to match the observations on equal values of an identification variable. This way, the observation with id=”Mary” is matched with the observation with id=”Mary”, id=”Bob” with id=”Bob”, id=”United States” with id=”United States”, and id=4934934193 with id=4934934193. In Stata, you do this by typing

. use one, clear

. merge 1:1 id using two


Things can still go wrong. For instance, id=”Bob” will not match id=”Bob ” (with the trailing blank), but if you expected all the observations to match, you will ultimately notice the mistake. Mistakenly unmatched observations tend to get noticed because of all the missing values they cause in subsequent calculations.

It is the mistakenly combined observations that can go unnoticed.

And that is the topic for today, mistakenly matched observations, or merges gone bad.

Observations are mistakenly combined more often than many researchers realize. I’ve seen it happen. I’ve seen it happen, be discovered later, and necessitate withdrawn results. You seriously need to consider the possibility that this could happen to you. Only three things are certain in this world: death, taxes, and merges gone bad.

I am going to assume that you are familiar with merging datasets both conceptually and practically; that you already know what 1:1, m:1, 1:m, and m:n mean; and that you know the role played by “key” variables such as ID. I am going to assume you are familiar with Stata’s merge command. If any of this is untrue, read [D] merge. Type help merge in Stata and click on [D] merge at the top to take you to the full PDF manuals. We are going to pick up where the discussion in [D] merge leaves off.

Detecting when merges go bad

As I said, the topic for today is merges gone bad, by which I mean producing a merged result with the wrong records combined. It is difficult to imagine that typing

. use one, clear

. merge 1:1 id using two


could produce such a result because, to be matched, the observations had to have equal values of the ID. Bob matched with Bob, Mary matched with Mary, and so on.

Right you are. There is no problem assuming the values in the id variable are correct and consistent between datasets. But what if id==4713 means Bob in one dataset and Mary in the other? That can happen if the id variable is simply wrong from the outset or if the id variable became corrupted in prior processing.

1. Use theory to check IDs if they are numeric

One way the id variable can become corrupted is if it is not stored properly or if it is read improperly. This can happen to both string and numeric variables, but right now, we are going to emphasize the numeric case.

Say the identification variable is Social Security number, an example of which is 888-88-8888. Social Security numbers are invariably stored in computers as 888888888, which is to say that they are run together and look a lot like the number 888,888,888. Sometimes they are even stored numerically. Say you have a raw data file containing perfectly valid Social Security numbers recorded in just this manner. Say you read the number as a float. Then 888888888 becomes 888888896, and so does every Social Security number between 888888865 and 888888927, some 63 in total. If Bob has Social Security number 888888869 and Mary has 888888921, and Bob appears in dataset one and Mary in dataset two, then Bob and Mary will be combined because they share the same rounded Social Security number.

Always be suspicious of numeric ID variables stored numerically, not just those stored as floats.

When I read raw data and store the ID variables as numeric, I worry whether I have specified a storage type sufficient to avoid rounding. When I obtain data from other sources that contain numeric ID variables, I assume that the other source improperly stored the values until proven otherwise.

Perhaps you remember that 16,775,215 is the largest integer that can be stored precisely as a float and 9,007,199,254,740,991 is the largest that can be stored precisely as a double. I never do.

Instead, I ask Stata to show me the largest theoretical ID number in hexadecimal. For Social Security numbers, the largest is 999-99-9999, so I type

. inbase 16 999999999
3b9ac9ff


Stata’s inbase command converts decimal numbers to different bases. I learn that 999999999 base-10 is 3b9ac9ff base-16, but I don’t care about the details; I just want to know the number of base-16 digits required. 3b9ac9ff has 8 digits. It takes 8 base-16 digits to record 999999999. As you learned in How to read the %21x format, part 2, I do remember that doubles can record 13 base-16 digits and floats can record 5.75 digits (the 0.75 part being because the last digit must be even). If I didn’t remember those numbers, I would just display a number in %21x format and count the digits to the right of the binary point. Anyway, Social Security numbers can be stored in doubles because 8<13, the number of digits double provides, but not in floats because 8 is not < 5.75, the number of digits float provides.

If Social Security numbers contained 12 digits rather than 9, the largest would be

. inbase 16 999999999999
38d4a50fff


which has 10 base-16 digits, and because 10<13, it would still fit into a double.

Anyway, if I discover that the storage type is insufficient to store the ID number, I know the ID numbers must be rounded.

2. Check uniqueness of IDs

I said that when I obtain data from other sources, I assume that the other source improperly stored the ID variables until proven otherwise. I should have said, until evidence accumulates to the contrary. Even if the storage type used is sufficient, I do not know what happened in previous processing of the data.

Here’s one way using datasets one.dta and two.dta to accumulate some of that evidence:

. use one, clear              // test 1
. sort id
. by id: assert _N==1

. use two, clear              // test 2
. sort id . by id: assert _N==1


In these tests, I am verifying that the IDs really are unique in the two datasets that I have. Tests 1 and 2 are unnecessary when I plan later to merge 1:1 because the 1:1 part will cause Stata itself to check that the IDs are unique. Nevertheless, I run the tests. I do this because the datasets I merge are often subsets of the original data, and I want to use all the evidence I have to invalidate the claim that the ID variables really are unique.Sometimes I receive datasets where it takes two variables to make sure I am calling a unique ID. Perhaps I receive data on persons over time, along with the claim that the ID variable is name. The documentation also notes that variable date records when the observation was made. Thus, to uniquely identify each of the observations requires both name and date, and I type

. sort name date
. by name date: assert _N==1


I am not suspicious of only datasets I receive. I run this same test on datasets I create.

3. Merge on all common variables

At this point, I know the ID variable(s) are unique in each dataset. Now I consider the idea that the ID variables are inconsistent across datasets, which is to say that Bob in one dataset, however he is identified, means Mary in the other. Detecting such problems is always problematic, but not nearly as problematic as you might guess.

It is rare that the datasets I need to merge have no variables in common except the ID variable. If the datasets are on persons, perhaps both datasets contain each person’s sex. In that case, I could merge the two datasets and verify that the sex is the same in both. Actually, I can do something easier than that: I can add variable sex to the key variables of the merge:

. use one, clear
. merge 1:1 id sex using two


Assume I have a valid ID variable. Then adding variable sex does not affect the outcome of the merge because sex is constant within id. I obtain the same results as typing merge 1:1 id using two.

Now assume the id variable is invalid. Compared with the results of merge 1:1 id using two, Bob will no longer match with Mary even if they have the same ID. Instead I will obtain separate, unmatched observations for Bob and Mary in the merged data. Thus to complete the test that there are no such mismatches, I must verify that the id variable is unique in the merged result. The complete code reads

. use one, clear
. merge 1:1 id sex using two
. sort id
. by id: assert _N==1


And now you know why in test 2 I checked the uniqueness of ID within dataset by hand rather than depending on merge 1:1. The 1:1 merge I just performed is on id and sex, and thus merge does not check the uniqueness of ID in each dataset. I checked by hand the uniqueness of ID in each dataset and then checked the uniqueness of the result by hand, too.

Passing the above test does not prove that that the ID variable is consistent and thus the merge is correct, but if the assertion is false, I know with certainty either that I have an invalid ID variable or that sex is miscoded in one of the datasets. If my data has roughly equal number of males and females, then the test has a 50 percent chance of detecting a mismatched pair of observations, such as Bob and Mary. If I have just 10 mismatched observations, I have a 1-0.910 = 0.9990 probability of detecting the problem.

I should warn you that if you want to keep just the matched observations, do not perform the merge by coding merge 1:1 id sex using two, keep(matched). You must keep the unmatched observations to perform the final part of the test, namely, that the ID numbers are unique. Then you can drop the unmatched observations.

. use one, clear
. merge 1:1 id sex using two
. sort id
. by id: assert _N==1
. keep if _merge==3


There may be more than one variable that you expect to be the same in combined observations. A convenient feature of this test is that you can add as many expected-to-be-constant variables to merge‘s keylist as you wish:

. use one, clear
. merge 1:1 id sex hiredate groupnumber using two
. sort id
. by id: assert _N==1
. keep if _merge==3


It is rare that there is not at least one variable other than the ID variable that is expected to be equal, but it does happen. Even if you have expected-to-be-constant variables, they may not work as well in detecting problems as variable sex in the example above. The distribution of the variable matters. If your data are of people known to be alive in 1980 and the known-to-be-constant variable is whether born after 1900, even mismatched observations would be likely to have the same value of the variable because most people alive in 1980 were born after 1900.

4. Look at a random sample

This test is weak, but you should do it anyway, if only because it’s so easy. List some of the combined observations and look at them.

. list in 1/5


Do the combined results look like they go together?

By the way, the right way to do this is

. gen u = uniform()
. sort u
. list in 1/5
. drop u


You do not want to look at the first observations because, having small values of ID, they are probably not representative. However IDs are assigned, the process is unlikely to be randomized. Persons with low values of ID will be younger, or older; or healthier, or sicker; or ….

5. Look at a nonrandom sample

You just merged two datasets, so obviously you did that because you needed the variables and those variables are somehow related to the existing variables. Perhaps your data is on persons, and you combined the 2009 data with the 2010 data. Perhaps your data is on countries, and you added export data to your import data. Whatever you just added, it is not random. If it were, you could have saved yourself time by simply generating the new variables containing random numbers.

So generate an index that measures a new variable in terms of an old one, such as

. gen diff = income2010 - income2009


or

. gen diff = exports - imports


Then sort on the variable and look at the observations containing the most outlandish values of your index:

. sort diff
. list in  1/5
. list in -5/l


These are the observations most likely to be mistakenly combined. Do you believe those observations were combined correctly?

Conclusion

I admit I am not suspicious of every merge I perform. I have built up trust over time in datasets that I have worked with previously. Even so, my ability to make errors is equal to yours, and even with trustworthy datasets, I can introduce problems long before I get to the merge. You need to carefully consider the consequences of a mistake. I do not know anyone who performs merges who has not performed a merge gone bad. The question is whether he or she detected it. I hope so.

Categories: Data Management Tags:

Multiprocessor (core) software (think Stata/MP) and percent parallelization

When most people first think about software designed to run on multiple cores such as Stata/MP, they think to themselves, two cores, twice as fast; four cores, four times as fast. They appreciate that reality will somehow intrude so that two cores won’t really be twice as fast as one, but they imagine the intrusion is something like friction and nothing that an intelligently placed drop of oil can’t improve.

In fact, something inherent intrudes. In any process to accomplish something — even physical processes — some parts may be able to to be performed in parallel, but there are invariably parts that just have to be performed one after the other. Anyone who cooks knows that you sometimes add some ingredients, cook a bit, and then add others, and cook some more. So it is, too, with calculating xt = f(xt-1) for t=1 to 100 and t0=1. Depending on the form of f(), sometimes there’s no alternative to calculating x1 = f(x0), then calculating x2 = f(x1), and so on.

In any calculation, some proportion p of the calculation can be parallelized and the remainder, 1-p, cannot. Consider a calculation that takes T hours if it were performed sequentially on a single core. If we had an infinite number of cores and the best possible implementation of the code in parallelized form, the execution time would fall to (1-p)T hours. The part that could be parallelized, which ordinarily would run in pT hours, would run in literally no time at all once split across an infinite number of cores, and that would still leave (1-p)T hours to go. This is known as Amdahl’s Law.

We can generalize this formula to computers with a finite number of cores, say n of them. The parallelizable part of the calculation, the part that would ordinarily run in pT hours, will run in pT/n. The unparallelizable part will still take (1-p)T hours, so we have

Tn = pT/n + (1-p)T

As n goes to infinity, Tn goes to (1-pT).

Stata/MP is pretty impressively parallelized. We achieve p of 0.8 or 0.9 in many cases. We do not claim to have hit the limits of what is possible, but in most cases, we believe we are very close to those limits. Most estimation commands have p above 0.9, and linear regression is actually above 0.99! This is explained in more detail along with percentage parallelization details for all Stata commands in the Stata/MP Performance Report.

Let’s figure out the value of having more cores. Consider a calculation that would ordinarily require T = 1 hour. With p=0.8 and 2 cores, run times would fall to 0.6 hours; With p=0.9, 0.55 hours. That is very close to what would be achieved even with p=1, which is not possible. For 4 cores, run times would fall to 0.4 (p=0.8) and 0.325 (p=0.9). That’s good, but no where near the hoped for 0.25 that we would observe if p were 1.

In fact, to get to 0.25, we need about 16 cores. With 16 cores, run times fall to 0.25 (p=0.8) and 0.15625 (p=0.9). Going to 32 cores improves run times just a little, to 0.225 (p=0.8) and 0.128125 (p=0.9). Going to 64 cores, we would get 0.2125 (p=0.8) and 0.11384615 (p=0.9). There’s little gain at all because all the cores in the world combined, and more, cannot reduce run times to below 0.2 (p=0.8) and 0.1 (p=0.9).

Stata/MP supports up to 64 cores. We could make a version that supports 128 cores, but it would be a lot of work even though we would not have to write even one line of code. The work would be in running the experiments to set the tuning parameters.

It turns out there are yet other ways in which reality intrudes. In addition to some calculations such as xt = f(xt-1) not being parallelizable at all, it’s an oversimplification to say any calculation is parallelizable because there are issues of granularity and of diseconomies of scale, two related, but different, problems.

Let’s start with granularity. Consider making the calculation xt = f(zt) for t = 1 to 100, and let’s do that by splitting on the subscript t. If we have n=2 cores, we’ll assign the calculation for t = 1 to 50 to one core, and for t=51 to 100 to another. If we have four cores, we’ll split t into four parts. Granularity concerns what happens when we move from n=100 to n=101 cores. This problem can be split into only 100 parallelizable parts and the minimum run time is therefore max(T/n, T/100) and not T/n, as we previously assumed.

All problems suffer from granularity. Diseconomies of scale is a related issue, and it strikes sooner than granularity. Many, but not all problems suffer from diseconomies of scale. Rather than calculating f(zt) for t = 1 to 100, let’s consider calculating the sum of f(zt) for t = 1 to 100. We’ll make this calculation in parallel in the same way as we made the previous calculation, by splitting on t. This time, however, each subprocess will report back to us the sum over the subrange. To obtain the overall sum, we will have to add sub-sums. So if we have n=2 cores, core 1 will calculate the sum over t = 1 to 50, core 2 will calculate the sum for t = 51 to 100, and then, the calculation having come back together, the master core will have to calculate the sum of two numbers. Adding two numbers can be done in a blink of an eye.

But what if we split the problem across 100 cores? We would get back 100 numbers which we would then have to sum. Moreover, what if the calculation of f(zt) is trivial? In that case, splitting the calculation among all 100 cores might result in run times that are nearly equal to what we would observe performing the calculation on just one core, even though splitting the calculation between two cores would nearly halve the execution time, and splitting among four would nearly quarter it!

So what’s the maximum number of cores over which we should split this problem? It depends on the relative execution times of f(zt) and the the combination operator to be performed on those results (addition in this case).

It is the diseconomies of scale problem that bit us in the early versions of Stata/MP, at least in beta testing. We did not adequately deal with the problem of splitting calculations among fewer cores than were available. Fixing that problem was a lot of work and, for your information, we are still working on it as hardware becomes available with more and more cores. The right way to address the issue is to have calculation-by-calculation tuning parameters, which we do. But it takes a lot of experimental work to determine the values of those tuning parameters, and the greater the number of cores, the more accurately the values need to be measured. We have the tuning parameters determined accurately enough for up to 64 cores, although there are one or two which we suspect we could improve even more. We would need to do a lot of experimentation, however, to ensure we have values adequate for 128 cores. The irony is that we would be doing that to make sure we don’t use them all except when problems are large enough!

In any case, I have seen articles predicting and in some cases, announcing, computers with hundreds of cores. For applications with p approaching 1, those are exciting announcements. In the world of statistical software, however, these announcements are exciting only for those running with immense datasets.

Categories: Multiprocessing Tags:

Graphs, maps, and geocoding

Jim Hufford, Esq. had his first Stata lesson: “This is going to be awesome when I understand what all those little letters and things mean.”

Along those lines — awesome — Jim may want to see these nice Stata scatterplots from the “wannabe economists of the Graduate Institute of International and Development Studies in Geneva” at Rigotnomics.

If you want to graph data onto maps using Stata — and see another awesome graph — see Mitch Abdon’s “Fun with maps in Stata” over at the Stata Daily.

And if you’re interested in geocoding to obtain latitudes and longitudes from human-readable addresses or locations, see Adam Ozimek’s “Computers are taking our jobs: Stata nerds only edition” over at Modeled Behavior and see the related Stata Journal article “Stata utilities for geocoding and generating travel time and travel distance information” by Adam Ozimek and Daniel Miles.

Categories: Tags:

Pi is (still) wrong

See this video, by Vi Hart:

This link was passed on to me by my friend Marcello. I’ve been bold enough to make up words such as eigenaxis and eigenpoint, but it takes real courage to suggest redefining π, even when you’re right!

After seeing the video, you can go here and here to learn more about what is being proposed.

Don’t click on comments until you’ve seen the video. Ms. Hart does a better job presenting the proposal than any of us can.

Categories: Mathematics Tags:

Understanding matrices intuitively, part 2, eigenvalues and eigenvectors

Last time, I showed you a way to graph and to think about matrices. This time, I want to apply the technique to eigenvalues and eigenvectors. The point is to give you a picture that will guide your intuition, just as it was previously.

Before I go on, several people asked after reading part 1 for the code I used to generate the graphs. Here it is, both for part 1 and part 2: matrixcode.zip.

The eigenvectors and eigenvalues of matrix A are defined to be the nonzero x and λ values that solve

Ax = λx

I wrote a lot about Ax in the last post. Just as previously, x is a point in the original, untransformed space and Ax is its transformed value. λ on the right-hand side is a scalar.

Multiplying a point by a scalar moves the point along a line that passes through the origin and the point:

The figure above illustrates y=λx when λ>1. If λ were less than 1, the point would move toward the origin and if λ were also less than 0, the point would pass right by the origin to land on the other side. For any point x, y=λx will be somewhere on the line passing through the origin and x.

Thus Ax = λx means the transformed value Ax lies on a line passing through the origin and the original x. Points that meet that restriction are eigenvectors (or more correctly, as we will see, eigenpoints, a term I just coined), and the corresponding eigenvalues are the λ‘s that record how far the points move along the line.

Actually, if x is a solution to Ax = λx, then so is every other point on the line through 0 and x. That’s easy to see. Assume x is a solution to Ax = λx and substitute cx for x: Acx = λcx. Thus x is not the eigenvector but is merely a point along the eigenvector.

And with that prelude, we are now in a position to interpret Ax = λx fully. Ax = λx finds the lines such that every point on the line, say, x, transformed by Ax moves to being another point on the same line. These lines are thus the natural axes of the transform defined by A.

The equation Ax = λx and the instructions “solve for nonzero x and λ” are deceptive. A more honest way to present the problem would be to transform the equation to polar coordinates. We would have said to find θ and λ such that any point on the line (r, θ) is transformed to (λr, θ). Nonetheless, Ax = λx is how the problem is commonly written.

However we state the problem, here is the picture and solution for A = (2, 1 \ 1, 2)

I used Mata’s eigensystem() function to obtain the eigenvectors and eigenvalues. In the graph, the black and green lines are the eigenvectors.

The first eigenvector is plotted in black. The “eigenvector” I got back from Mata was (0.707 \ 0.707), but that’s just one point on the eigenvector line, the slope of which is 0.707/0.707 = 1, so I graphed the line y = x. The eigenvalue reported by Mata was 3. Thus every point x along the black line moves to three times its distance from the origin when transformed by Ax. I suppressed the origin in the figure, but you can spot it because it is where the black and green lines intersect.

The second eigenvector is plotted in green. The second “eigenvector” I got back from Mata was (-0.707 \ 0.707), so the slope of the eigenvector line is 0.707/(-0.707) = -1. I plotted the line y = –x. The eigenvalue is 1, so the points along the green line do not move at all when transformed by Ax; y=λx and λ=1.

Here’s another example, this time for the matrix A = (1.1, 2 \ 3, 1):

The first “eigenvector” and eigenvalue Mata reported were… Wait! I’m getting tired of quoting the word eigenvector. I’m quoting it because computer software and the mathematical literature call it the eigenvector even though it is just a point along the eigenvector. Actually, what’s being described is not even a vector. A better word would be eigenaxis. Since this posting is pedagogical, I’m going to refer to the computer-reported eigenvector as an eigenpoint along the eigenaxis. When you return to the real world, remember to use the word eigenvector.

The first eigenpoint and eigenvalue that Mata reported were (0.640 \ 0.768) and λ = 3.45. Thus the slope of the eigenaxis is 0.768/0.640 = 1.2, and points along that line — the green line — move to 3.45 times their distance from the origin.

The second eigenpoint and eigenvalue Mata reported were (-0.625 \ 0.781) and λ = -1.4. Thus the slope is -0.781/0.625 = -1.25, and points along that line move to -1.4 times their distance from the origin, which is to say they flip sides and then move out, too. We saw this flipping in my previous posting. You may remember that I put a small circle and triangle at the bottom left and bottom right of the original grid and then let the symbols be transformed by A along with the rest of space. We saw an example like this one, where the triangle moved from the top-left of the original space to the bottom-right of the transformed space. The space was flipped in one of its dimensions. Eigenvalues save us from having to look at pictures with circles and triangles; when a dimension of the space flips, the corresponding eigenvalue is negative.

We examined near singularity last time. Let’s look again, and this time add the eigenaxes:

The blue blob going from bottom-left to top-right is both the compressed space and the first eigenaxis. The second eigenaxis is shown in green.

Mata reported the first eigenpoint as (0.789 \ 0.614) and the second as (-0.460 \ 0.888). Corresponding eigenvalues were reported as 2.78 and 0.07. I should mention that zero eigenvalues indicate singular matrices and small eigenvalues indicate nearly singular matrices. Actually, eigenvalues also reflect the scale of the matrix. A matrix that compresses the space will have all of its eigenvalues be small, and that is not an indication of near singularity. To detect near singularity, one should look at the ratio of the largest to the smallest eigenvalue, which in this case is 0.07/2.78 = 0.03.

Despite appearances, computers do not find 0.03 to be small and thus do not think of this matrix as being nearly singular. This matrix gives computers no problem; Mata can calculate the inverse of this without losing even one binary digit. I mention this and show you the picture so that you will have a better appreciation of just how squished the space can become before computers start complaining.

When do well-programmed computers complain? Say you have a matrix A and make the above graph, but you make it really big — 3 miles by 3 miles. Lay your graph out on the ground and hike out to the middle of it. Now get down on your knees and get out your ruler. Measure the spread of the compressed space at its widest part. Is it an inch? That’s not a problem. One inch is roughly 5*10-6 of the original space (that is, 1 inch by 3 miles wide). If that were a problem, users would complain. It is not problematic until we get around 10-8 of the original area. Figure about 0.002 inches.

There’s more I could say about eigenvalues and eigenvectors. I could mention that rotation matrices have no eigenvectors and eigenvalues, or at least no real ones. A rotation matrix rotates the space, and thus there are no transformed points that are along their original line through the origin. I could mention that one can rebuild the original matrix from its eigenvectors and eigenvalues, and from that, one can generalize powers to matrix powers. It turns out that A-1 has the same eigenvectors as A; its eigenvalues are λ-1 of the original’s. Matrix AA also has the same eigenvectors as A; its eigenvalues are λ2. Ergo, Ap can be formed by transforming the eigenvalues, and it turns out that, indeed, A½ really does, when multiplied by itself, produce A.

Categories: Linear Algebra Tags:

Understanding matrices intuitively, part 1

I want to show you a way of picturing and thinking about matrices. The topic for today is the square matrix, which we will call A. I’m going to show you a way of graphing square matrices, although we will have to limit ourselves to the 2 x 2 case. That will be, as they say, without loss of generality. The technique I’m about to show you could be used with 3 x 3 matrices if you had a better 3-dimensional monitor, and as will be revealed, it could be used on 3 x 2 and 2 x 3 matrices, too. If you had more imagination, we could use the technique on 4 x 4, 5 x 5, and even higher-dimensional matrices.

But we will limit ourselves to 2 x 2. A might be

From now on, I’ll write matrices as

A = (2, 1 \ 1.5, 2)

where commas are used to separate elements on the same row and backslashes are used to separate the rows.

To graph A, I want you to think about

y = Ax

where

y: 2 x 1,

A: 2 x 2, and

x: 2 x 1.

That is, we are going to think about A in terms of its effect in transforming points in space from x to y. For instance, if we had the point

x = (0.75 \ 0.25)

then

y = (1.75 \ 1.625)

because by the rules of matrix multiplication y[1] = 0.75*2 + 0.25*1 = 1.75 and y[2] = 0.75*1.5 + 0.25*2 = 1.625. The matrix A transforms the point (0.75 \ 0.25) to (1.75 \ 1.625). We could graph that:

To get a better understanding of how A transforms the space, we could graph additional points:

I do not want you to get lost among the individual points which A could transform, however. To focus better on A, we are going to graph y = Ax for all x. To do that, I’m first going to take a grid,

One at a time, I’m going to take every point on the grid, call the point x, and run it through the transform y = Ax. Then I’m going to graph the transformed points:

Finally, I’m going to superimpose the two graphs:

In this way, I can now see exactly what A = (2, 1 \ 1.5, 2) does. It stretches the space, and skews it.

I want you to think about transforms like A as transforms of the space, not of the individual points. I used a grid above, but I could just as well have used a picture of the Eiffel tower and, pixel by pixel, transformed it by using y = Ax. The result would be a distorted version of the original image, just as the the grid above is a distorted version of the original grid. The distorted image might not be helpful in understanding the Eiffel Tower, but it is helpful in understanding the properties of A. So it is with the grids.

Notice that in the above image there are two small triangles and two small circles. I put a triangle and circle at the bottom left and top left of the original grid, and then again at the corresponding points on the transformed grid. They are there to help you orient the transformed grid relative to the original. They wouldn’t be necessary had I transformed a picture of the Eiffel tower.

I’ve suppressed the scale information in the graph, but the axes make it obvious that we are looking at the first quadrant in the graph above. I could just as well have transformed a wider area.

Regardless of the region graphed, you are supposed to imagine two infinite planes. I will graph the region that makes it easiest to see the point I wish to make, but you must remember that whatever I’m showing you applies to the entire space.

We need first to become familiar with pictures like this, so let’s see some examples. Pure stretching looks like this:

Pure compression looks like this:

Pay attention to the color of the grids. The original grid, I’m showing in red; the transformed grid is shown in blue.

A pure rotation (and stretching) looks like this:

Note the location of the triangle; this space was rotated around the origin.

Here’s an interesting matrix that produces a surprising result: A = (1, 2 \ 3, 1).

This matrix flips the space! Notice the little triangles. In the original grid, the triangle is located at the top left. In the transformed space, the corresponding triangle ends up at the bottom right! A = (1, 2 \ 3, 1) appears to be an innocuous matrix — it does not even have a negative number in it — and yet somehow, it twisted the space horribly.

So now you know what 2 x 2 matrices do. They skew,stretch, compress, rotate, and even flip 2-space. In a like manner, 3 x 3 matrices do the same to 3-space; 4 x 4 matrices, to 4-space; and so on.

Well, you are no doubt thinking, this is all very entertaining. Not really useful, but entertaining.

Okay, tell me what it means for a matrix to be singular. Better yet, I’ll tell you. It means this:

A singular matrix A compresses the space so much that the poor space is squished until it is nothing more than a line. It is because the space is so squished after transformation by y = Ax that one cannot take the resulting y and get back the original x. Several different x values get squished into that same value of y. Actually, an infinite number do, and we don’t know which you started with.

A = (2, 3 \ 2, 3) squished the space down to a line. The matrix A = (0, 0 \ 0, 0) would squish the space down to a point, namely (0 0). In higher dimensions, say, k, singular matrices can squish space into k-1, k-2, …, or 0 dimensions. The number of dimensions is called the rank of the matrix.

Singular matrices are an extreme case of nearly singular matrices, which are the bane of my existence here at StataCorp. Here is what it means for a matrix to be nearly singular:

Nearly singular matrices result in spaces that are heavily but not fully compressed. In nearly singular matrices, the mapping from x to y is still one-to-one, but x‘s that are far away from each other can end up having nearly equal y values. Nearly singular matrices cause finite-precision computers difficulty. Calculating y = Ax is easy enough, but to calculate the reverse transform x = A-1y means taking small differences and blowing them back up, which can be a numeric disaster in the making.

So much for the pictures illustrating that matrices transform and distort space; the message is that they do. This way of thinking can provide intuition and even deep insights. Here’s one:

In the above graph of the fully singular matrix, I chose a matrix that not only squished the space but also skewed the space some. I didn’t have to include the skew. Had I chosen matrix A = (1, 0 \ 0, 0), I could have compressed the space down onto the horizontal axis. And with that, we have a picture of nonsquare matrices. I didn’t really need a 2 x 2 matrix to map 2-space onto one of its axes; a 2 x 1 vector would have been sufficient. The implication is that, in a very deep sense, nonsquare matrices are identical to square matrices with zero rows or columns added to make them square. You might remember that; it will serve you well.

Here’s another insight:

In the linear regression formula b = (XX)-1Xy, (XX)-1 is a square matrix, so we can think of it as transforming space. Let’s try to understand it that way.

Begin by imagining a case where it just turns out that (XX)-1 = I. In such a case, (XX)-1 would have off-diagonal elements equal to zero, and diagonal elements all equal to one. The off-diagonal elements being equal to 0 means that the variables in the data are uncorrelated; the diagonal elements all being equal to 1 means that the sum of each squared variable would equal 1. That would be true if the variables each had mean 0 and variance 1/N. Such data may not be common, but I can imagine them.

If I had data like that, my formula for calculating b would be b = (XX)-1Xy = IXy = Xy. When I first realized that, it surprised me because I would have expected the formula to be something like b = X-1y. I expected that because we are finding a solution to y = Xb, and b = X-1y is an obvious solution. In fact, that’s just what we got, because it turns out that X-1y = Xy when (XX)-1 = I. They are equal because (XX)-1 = I means that XX = I, which means that X‘ = X-1. For this math to work out, we need a suitable definition of inverse for nonsquare matrices. But they do exist, and in fact, everything you need to work it out is right there in front of you.

Anyway, when correlations are zero and variables are appropriately normalized, the linear regression calculation formula reduces to b = Xy. That makes sense to me (now) and yet, it is still a very neat formula. It takes something that is N x k — the data — and makes k coefficients out of it. Xy is the heart of the linear regression formula.

Let’s call b = Xy the naive formula because it is justified only under the assumption that (XX)-1 = I, and real XX inverses are not equal to I. (XX)-1 is a square matrix and, as we have seen, that means it can be interpreted as compressing, expanding, and rotating space. (And even flipping space, although it turns out the positive-definite restriction on XX rules out the flip.) In the formula (XX)-1Xy, (XX)-1 is compressing, expanding, and skewing Xy, the naive regression coefficients. Thus (XX)-1 is the corrective lens that translates the naive coefficients into the coefficient we seek. And that means XX is the distortion caused by scale of the data and correlations of variables.

Thus I am entitled to describe linear regression as follows: I have data (y, X) to which I want to fit y = Xb. The naive calculation is b = Xy, which ignores the scale and correlations of the variables. The distortion caused by the scale and correlations of the variables is XX. To correct for the distortion, I map the naive coefficients through (XX)-1.

Intuition, like beauty, is in the eye of the beholder. When I learned that the variance matrix of the estimated coefficients was equal to s2(XX)-1, I immediately thought: s2 — there’s the statistics. That single statistical value is then parceled out through the corrective lens that accounts for scale and correlation. If I had data that didn’t need correcting, then the standard errors of all the coefficients would be the same and would be identical to the variance of the residuals.

If you go through the derivation of s2(XX)-1, there’s a temptation to think that s2 is merely something factored out from the variance matrix, probably to emphasize the connection between the variance of the residuals and standard errors. One easily loses sight of the fact that s2 is the heart of the matter, just as Xy is the heart of (XX)-1Xy. Obviously, one needs to view both s2 and Xy though the same corrective lens.

I have more to say about this way of thinking about matrices. Look for part 2 in the near future. Update: part 2 of this posting, “Understanding matrices intuitively, part 2, eigenvalues and eigenvectors”, may now be found at http://blog.stata.com/2011/03/09/understanding-matrices-intuitively-part-2/.

Categories: Linear Algebra Tags: