Posts Tagged ‘performance’

Multiprocessor (core) software (think Stata/MP) and percent parallelization

When most people first think about software designed to run on multiple cores such as Stata/MP, they think to themselves, two cores, twice as fast; four cores, four times as fast. They appreciate that reality will somehow intrude so that two cores won’t really be twice as fast as one, but they imagine the intrusion is something like friction and nothing that an intelligently placed drop of oil can’t improve.

In fact, something inherent intrudes. In any process to accomplish something — even physical processes — some parts may be able to to be performed in parallel, but there are invariably parts that just have to be performed one after the other. Anyone who cooks knows that you sometimes add some ingredients, cook a bit, and then add others, and cook some more. So it is, too, with calculating xt = f(xt-1) for t=1 to 100 and t0=1. Depending on the form of f(), sometimes there’s no alternative to calculating x1 = f(x0), then calculating x2 = f(x1), and so on.

In any calculation, some proportion p of the calculation can be parallelized and the remainder, 1-p, cannot. Consider a calculation that takes T hours if it were performed sequentially on a single core. If we had an infinite number of cores and the best possible implementation of the code in parallelized form, the execution time would fall to (1-p)T hours. The part that could be parallelized, which ordinarily would run in pT hours, would run in literally no time at all once split across an infinite number of cores, and that would still leave (1-p)T hours to go. This is known as Amdahl’s Law.

We can generalize this formula to computers with a finite number of cores, say n of them. The parallelizable part of the calculation, the part that would ordinarily run in pT hours, will run in pT/n. The unparallelizable part will still take (1-p)T hours, so we have

Tn = pT/n + (1-p)T

As n goes to infinity, Tn goes to (1-pT).

Stata/MP is pretty impressively parallelized. We achieve p of 0.8 or 0.9 in many cases. We do not claim to have hit the limits of what is possible, but in most cases, we believe we are very close to those limits. Most estimation commands have p above 0.9, and linear regression is actually above 0.99! This is explained in more detail along with percentage parallelization details for all Stata commands in the Stata/MP Performance Report.

Let’s figure out the value of having more cores. Consider a calculation that would ordinarily require T = 1 hour. With p=0.8 and 2 cores, run times would fall to 0.6 hours; With p=0.9, 0.55 hours. That is very close to what would be achieved even with p=1, which is not possible. For 4 cores, run times would fall to 0.4 (p=0.8) and 0.325 (p=0.9). That’s good, but no where near the hoped for 0.25 that we would observe if p were 1.

In fact, to get to 0.25, we need about 16 cores. With 16 cores, run times fall to 0.25 (p=0.8) and 0.15625 (p=0.9). Going to 32 cores improves run times just a little, to 0.225 (p=0.8) and 0.128125 (p=0.9). Going to 64 cores, we would get 0.2125 (p=0.8) and 0.11384615 (p=0.9). There’s little gain at all because all the cores in the world combined, and more, cannot reduce run times to below 0.2 (p=0.8) and 0.1 (p=0.9).

Stata/MP supports up to 64 cores. We could make a version that supports 128 cores, but it would be a lot of work even though we would not have to write even one line of code. The work would be in running the experiments to set the tuning parameters.

It turns out there are yet other ways in which reality intrudes. In addition to some calculations such as xt = f(xt-1) not being parallelizable at all, it’s an oversimplification to say any calculation is parallelizable because there are issues of granularity and of diseconomies of scale, two related, but different, problems.

Let’s start with granularity. Consider making the calculation xt = f(zt) for t = 1 to 100, and let’s do that by splitting on the subscript t. If we have n=2 cores, we’ll assign the calculation for t = 1 to 50 to one core, and for t=51 to 100 to another. If we have four cores, we’ll split t into four parts. Granularity concerns what happens when we move from n=100 to n=101 cores. This problem can be split into only 100 parallelizable parts and the minimum run time is therefore max(T/n, T/100) and not T/n, as we previously assumed.

All problems suffer from granularity. Diseconomies of scale is a related issue, and it strikes sooner than granularity. Many, but not all problems suffer from diseconomies of scale. Rather than calculating f(zt) for t = 1 to 100, let’s consider calculating the sum of f(zt) for t = 1 to 100. We’ll make this calculation in parallel in the same way as we made the previous calculation, by splitting on t. This time, however, each subprocess will report back to us the sum over the subrange. To obtain the overall sum, we will have to add sub-sums. So if we have n=2 cores, core 1 will calculate the sum over t = 1 to 50, core 2 will calculate the sum for t = 51 to 100, and then, the calculation having come back together, the master core will have to calculate the sum of two numbers. Adding two numbers can be done in a blink of an eye.

But what if we split the problem across 100 cores? We would get back 100 numbers which we would then have to sum. Moreover, what if the calculation of f(zt) is trivial? In that case, splitting the calculation among all 100 cores might result in run times that are nearly equal to what we would observe performing the calculation on just one core, even though splitting the calculation between two cores would nearly halve the execution time, and splitting among four would nearly quarter it!

So what’s the maximum number of cores over which we should split this problem? It depends on the relative execution times of f(zt) and the the combination operator to be performed on those results (addition in this case).

It is the diseconomies of scale problem that bit us in the early versions of Stata/MP, at least in beta testing. We did not adequately deal with the problem of splitting calculations among fewer cores than were available. Fixing that problem was a lot of work and, for your information, we are still working on it as hardware becomes available with more and more cores. The right way to address the issue is to have calculation-by-calculation tuning parameters, which we do. But it takes a lot of experimental work to determine the values of those tuning parameters, and the greater the number of cores, the more accurately the values need to be measured. We have the tuning parameters determined accurately enough for up to 64 cores, although there are one or two which we suspect we could improve even more. We would need to do a lot of experimentation, however, to ensure we have values adequate for 128 cores. The irony is that we would be doing that to make sure we don’t use them all except when problems are large enough!

In any case, I have seen articles predicting and in some cases, announcing, computers with hundreds of cores. For applications with p approaching 1, those are exciting announcements. In the world of statistical software, however, these announcements are exciting only for those running with immense datasets.

Stata/MP — having fun with millions

I was reviewing some timings from the Stata/MP Performance Report this morning. (For those who don’t know, Stata/MP is the version of Stata that has been programmed to take advantage of multiprocessor and multicore computers. It is functionally equivalent to the largest version of Stata, Stata/SE, and it is faster on multicore computers.)

What was unusual this morning is that I was running Stata/MP interactively. We usually run MP for large batch jobs that run thousands of timings on large datasets — either to tune performance or to produce reports like the Performance Report. That is the type of work Stata/MP was designed for — big jobs on big datasets.

I will admit right now that I mostly run Stata interactively using the auto dataset, which has 74 observations. I run Stata/MP using all 4 cores of my quad-core computer, but I am mostly wasting 3 of them — there is no speeding up the computations on 74 observations. This morning I was running Stata/MP interactively on a 24-core computer using a somewhat larger dataset.

After a while, I was struck by the fact that I wasn’t noticing any annoying delays waiting for commands to run. It felt almost as though I were running on the auto dataset. But I wasn’t. I was running commands using 50 covariates on 1 million observations! Regressions, summary statistics, etc.; this was fun. I had never played interactively with a million-observation dataset before.

Out of curiousity, I turned off multicore support. The change was dramatic. Commands that were taking less than a second were now taking longer, too long. My coffee cup was full, but I contemplated fetching a snack. Running on only one processor was not so much fun.

For your information, I set rmsg on and ran a few timings:

Timing (seconds)
Analysis 24 cores 1 core
generate a new variable .03 .33
summarize 50 variables .88 19.55
twoway tabulation .45 .45
linear regression .65 11.48
logistic regression 7.19 59.27
All timings are on a 1 million observation dataset.
The two regressions included 50 covariates.

OK, the timings with 24 cores are not quite the same as with the auto dataset, but well within comfortable interactive use.

Careful readers will have noticed that the 24-core and 1-core timings for twoway tabulation are the same. We have not rewritten the code for tabulate to support multiple cores, partly because tabulate is already very fast, and partly because the code for tabulate is isolated, so changing it will not improve the performance of other commands. Thus, parallelizing tabulate is on our long-run, not short-run, list of additions to Stata/MP. We have rewritten about 250 sections of Stata’s internal code to support Symmetric Multi Processing (SMP). Each rewritten section typically improves the performance of many commands.

I switched back to using all 24 cores and returned to my original work — stress testing changes in the number of covariates and observations. My fun was quelled when I started running some timings of Cox proportional hazards regressions. With my 50 covariates and 1 million observations, a Cox regression took just over two minutes. Most estimators in Stata are parallelized, including the estimators for parametric survival models. The Cox proportional hazards estimator is not. It is not parallelized because it uses a clever algorithm that requires sequential computations. When I say sequential I mean that some computations are wholly dependent on previous computations so that they simply cannot be performed simultaneously, in parallel. There are other algorithms for fitting the Cox model, but they are orders of magnitude slower. Even parallelized, they would not be faster than our current sequential algorithm unless run on 20 or more processors. When more computers start shipping with dozens of cores, we will evaluate adding a parallelized algorithm for the Cox estimator.

The computer I was running on is about a year old. There have been a spate of new and faster server-grade processors from Intel and AMD in the past year. You can get reasonably close to the performance of my 24-core computer using just 8-cores and the newer chips. That means that with a newer 32-core computer, I could increase my threshold for interactive analysis to about 4 million observations.

There are four speed comparisons above. To see 450 more, including graphs and a discussion of SMP and its implementation in Stata, see the Stata/MP white paper, a.k.a. the Stata/MP Performance Report.