Stata/MP — having fun with millions
I was reviewing some timings from the Stata/MP Performance Report this morning. (For those who don’t know, Stata/MP is the version of Stata that has been programmed to take advantage of multiprocessor and multicore computers. It is functionally equivalent to the largest version of Stata, Stata/SE, and it is faster on multicore computers.)
What was unusual this morning is that I was running Stata/MP interactively. We usually run MP for large batch jobs that run thousands of timings on large datasets — either to tune performance or to produce reports like the Performance Report. That is the type of work Stata/MP was designed for — big jobs on big datasets.
I will admit right now that I mostly run Stata interactively using the auto dataset, which has 74 observations. I run Stata/MP using all 4 cores of my quad-core computer, but I am mostly wasting 3 of them — there is no speeding up the computations on 74 observations. This morning I was running Stata/MP interactively on a 24-core computer using a somewhat larger dataset.
After a while, I was struck by the fact that I wasn’t noticing any annoying delays waiting for commands to run. It felt almost as though I were running on the auto dataset. But I wasn’t. I was running commands using 50 covariates on 1 million observations! Regressions, summary statistics, etc.; this was fun. I had never played interactively with a million-observation dataset before.
Out of curiousity, I turned off multicore support. The change was dramatic. Commands that were taking less than a second were now taking longer, too long. My coffee cup was full, but I contemplated fetching a snack. Running on only one processor was not so much fun.
For your information, I set rmsg on and ran a few timings:
Timing (seconds) | ||
Analysis | 24 cores | 1 core |
generate a new variable | .03 | .33 |
summarize 50 variables | .88 | 19.55 |
twoway tabulation | .45 | .45 |
linear regression | .65 | 11.48 |
logistic regression | 7.19 | 59.27 |
All timings are on a 1 million observation dataset. The two regressions included 50 covariates. |
OK, the timings with 24 cores are not quite the same as with the auto dataset, but well within comfortable interactive use.
Careful readers will have noticed that the 24-core and 1-core timings for twoway tabulation are the same. We have not rewritten the code for tabulate to support multiple cores, partly because tabulate is already very fast, and partly because the code for tabulate is isolated, so changing it will not improve the performance of other commands. Thus, parallelizing tabulate is on our long-run, not short-run, list of additions to Stata/MP. We have rewritten about 250 sections of Stata’s internal code to support Symmetric Multi Processing (SMP). Each rewritten section typically improves the performance of many commands.
I switched back to using all 24 cores and returned to my original work — stress testing changes in the number of covariates and observations. My fun was quelled when I started running some timings of Cox proportional hazards regressions. With my 50 covariates and 1 million observations, a Cox regression took just over two minutes. Most estimators in Stata are parallelized, including the estimators for parametric survival models. The Cox proportional hazards estimator is not. It is not parallelized because it uses a clever algorithm that requires sequential computations. When I say sequential I mean that some computations are wholly dependent on previous computations so that they simply cannot be performed simultaneously, in parallel. There are other algorithms for fitting the Cox model, but they are orders of magnitude slower. Even parallelized, they would not be faster than our current sequential algorithm unless run on 20 or more processors. When more computers start shipping with dozens of cores, we will evaluate adding a parallelized algorithm for the Cox estimator.
The computer I was running on is about a year old. There have been a spate of new and faster server-grade processors from Intel and AMD in the past year. You can get reasonably close to the performance of my 24-core computer using just 8-cores and the newer chips. That means that with a newer 32-core computer, I could increase my threshold for interactive analysis to about 4 million observations.
There are four speed comparisons above. To see 450 more, including graphs and a discussion of SMP and its implementation in Stata, see the Stata/MP white paper, a.k.a. the Stata/MP Performance Report.