## Multiprocessor (core) software (think Stata/MP) and percent parallelization

When most people first think about software designed to run on multiple cores such as Stata/MP, they think to themselves, two cores, twice as fast; four cores, four times as fast. They appreciate that reality will somehow intrude so that two cores won’t really be twice as fast as one, but they imagine the intrusion is something like friction and nothing that an intelligently placed drop of oil can’t improve.

In fact, something inherent intrudes. In any process to accomplish something — even physical processes — some parts may be able to to be performed in parallel, but there are invariably parts that just have to be performed one after the other. Anyone who cooks knows that you sometimes add some ingredients, cook a bit, and then add others, and cook some more. So it is, too, with calculating *x _{t}* = f(

*x*) for

_{t-1}*t*=1 to 100 and

*t*

_{0}=1. Depending on the form of f(), sometimes there’s no alternative to calculating

*x*

_{1}= f(

*x*

_{0}), then calculating

*x*

_{2}= f(

*x*

_{1}), and so on.

In any calculation, some proportion *p* of the calculation can be parallelized and the remainder, 1-*p*, cannot. Consider a calculation that takes T hours if it were performed sequentially on a single core. If we had an infinite number of cores and the best possible implementation of the code in parallelized form, the execution time would fall to (1-*p*)T hours. The part that could be parallelized, which ordinarily would run in *p*T hours, would run in literally no time at all once split across an infinite number of cores, and that would still leave (1-*p*)T hours to go. This is known as Amdahl’s Law.

We can generalize this formula to computers with a finite number of cores, say *n* of them. The parallelizable part of the calculation, the part that would ordinarily run in *p*T hours, will run in *p*T/*n*. The unparallelizable part will still take (1-*p*)T hours, so we have

T_{n} = *p*T/*n* + (1-*p*)T

As *n* goes to infinity, T_{n} goes to (1-*p*T).

Stata/MP is pretty impressively parallelized. We achieve *p* of 0.8 or 0.9 in many cases. We do not claim to have hit the limits of what is possible, but in most cases, we believe we are very close to those limits. Most estimation commands have *p* above 0.9, and linear regression is actually above 0.99! This is explained in more detail along with percentage parallelization details for all Stata commands in the *Stata/MP Performance Report*.

Let’s figure out the value of having more cores. Consider a calculation that would ordinarily require T = 1 hour. With *p*=0.8 and 2 cores, run times would fall to 0.6 hours; With *p*=0.9, 0.55 hours. That is very close to what would be achieved even with *p*=1, which is not possible. For 4 cores, run times would fall to 0.4 (*p*=0.8) and 0.325 (*p*=0.9). That’s good, but no where near the hoped for 0.25 that we would observe if *p* were 1.

In fact, to get to 0.25, we need about 16 cores. With 16 cores, run times fall to 0.25 (*p*=0.8) and 0.15625 (*p*=0.9). Going to 32 cores improves run times just a little, to 0.225 (*p*=0.8) and 0.128125 (*p*=0.9). Going to 64 cores, we would get 0.2125 (*p*=0.8) and 0.11384615 (*p*=0.9). There’s little gain at all because all the cores in the world combined, and more, cannot reduce run times to below 0.2 (*p*=0.8) and 0.1 (*p*=0.9).

Stata/MP supports up to 64 cores. We could make a version that supports 128 cores, but it would be a lot of work even though we would not have to write even one line of code. The work would be in running the experiments to set the tuning parameters.

It turns out there are yet other ways in which reality intrudes. In addition to some calculations such as *x _{t}* = f(

*x*) not being parallelizable at all, it’s an oversimplification to say any calculation is parallelizable because there are issues of granularity and of diseconomies of scale, two related, but different, problems.

_{t-1}Let’s start with granularity. Consider making the calculation *x _{t}* = f(

*z*) for

_{t}*t*= 1 to 100, and let’s do that by splitting on the subscript

*t*. If we have

*n*=2 cores, we’ll assign the calculation for

*t*= 1 to 50 to one core, and for

*t*=51 to 100 to another. If we have four cores, we’ll split

*t*into four parts. Granularity concerns what happens when we move from

*n*=100 to

*n*=101 cores. This problem can be split into only 100 parallelizable parts and the minimum run time is therefore max(T/

*n*, T/100) and not T/

*n*, as we previously assumed.

All problems suffer from granularity. Diseconomies of scale is a related issue, and it strikes sooner than granularity. Many, but not all problems suffer from diseconomies of scale. Rather than calculating f(*z _{t}*) for

*t*= 1 to 100, let’s consider calculating the sum of f(

*z*) for

_{t}*t*= 1 to 100. We’ll make this calculation in parallel in the same way as we made the previous calculation, by splitting on

*t*. This time, however, each subprocess will report back to us the sum over the subrange. To obtain the overall sum, we will have to add sub-sums. So if we have

*n*=2 cores, core 1 will calculate the sum over

*t*= 1 to 50, core 2 will calculate the sum for

*t*= 51 to 100, and then, the calculation having come back together, the master core will have to calculate the sum of two numbers. Adding two numbers can be done in a blink of an eye.

But what if we split the problem across 100 cores? We would get back 100 numbers which we would then have to sum. Moreover, what if the calculation of f(*z _{t}*) is trivial? In that case, splitting the calculation among all 100 cores might result in run times that are nearly equal to what we would observe performing the calculation on just one core, even though splitting the calculation between two cores would nearly halve the execution time, and splitting among four would nearly quarter it!

So what’s the maximum number of cores over which we should split this problem? It depends on the relative execution times of f(*z _{t}*) and the the combination operator to be performed on those results (addition in this case).

It is the diseconomies of scale problem that bit us in the early versions of Stata/MP, at least in beta testing. We did not adequately deal with the problem of splitting calculations among fewer cores than were available. Fixing that problem was a lot of work and, for your information, we are still working on it as hardware becomes available with more and more cores. The right way to address the issue is to have calculation-by-calculation tuning parameters, which we do. But it takes a lot of experimental work to determine the values of those tuning parameters, and the greater the number of cores, the more accurately the values need to be measured. We have the tuning parameters determined accurately enough for up to 64 cores, although there are one or two which we suspect we could improve even more. We would need to do a lot of experimentation, however, to ensure we have values adequate for 128 cores. The irony is that we would be doing that to make sure we don’t use them all except when problems are large enough!

In any case, I have seen articles predicting and in some cases, announcing, computers with hundreds of cores. For applications with *p* approaching 1, those are exciting announcements. In the world of statistical software, however, these announcements are exciting only for those running with immense datasets.