Home > Multiprocessing > Multiprocessor (core) software (think Stata/MP) and percent parallelization

Multiprocessor (core) software (think Stata/MP) and percent parallelization

When most people first think about software designed to run on multiple cores such as Stata/MP, they think to themselves, two cores, twice as fast; four cores, four times as fast. They appreciate that reality will somehow intrude so that two cores won’t really be twice as fast as one, but they imagine the intrusion is something like friction and nothing that an intelligently placed drop of oil can’t improve.

In fact, something inherent intrudes. In any process to accomplish something—even physical processes—some parts may be able to to be performed in parallel, but there are invariably parts that just have to be performed one after the other. Anyone who cooks knows that you sometimes add some ingredients, cook a bit, and then add others, and cook some more. So it is, too, with calculating xt = f(xt-1) for t=1 to 100 and t0=1. Depending on the form of f(), sometimes there’s no alternative to calculating x1 = f(x0), then calculating x2 = f(x1), and so on.

In any calculation, some proportion p of the calculation can be parallelized and the remainder, 1-p, cannot. Consider a calculation that takes T hours if it were performed sequentially on a single core. If we had an infinite number of cores and the best possible implementation of the code in parallelized form, the execution time would fall to (1-p)T hours. The part that could be parallelized, which ordinarily would run in pT hours, would run in literally no time at all once split across an infinite number of cores, and that would still leave (1-p)T hours to go. This is known as Amdahl’s Law.

We can generalize this formula to computers with a finite number of cores, say n of them. The parallelizable part of the calculation, the part that would ordinarily run in pT hours, will run in pT/n. The unparallelizable part will still take (1-p)T hours, so we have

Tn = pT/n + (1-p)T

As n goes to infinity, Tn goes to (1-pT).

Stata/MP is pretty impressively parallelized. We achieve p of 0.8 or 0.9 in many cases. We do not claim to have hit the limits of what is possible, but in most cases, we believe we are very close to those limits. Most estimation commands have p above 0.9, and linear regression is actually above 0.99! This is explained in more detail along with percentage parallelization details for all Stata commands in the Stata/MP Performance Report.

Let’s figure out the value of having more cores. Consider a calculation that would ordinarily require T = 1 hour. With p=0.8 and 2 cores, run times would fall to 0.6 hours; With p=0.9, 0.55 hours. That is very close to what would be achieved even with p=1, which is not possible. For 4 cores, run times would fall to 0.4 (p=0.8) and 0.325 (p=0.9). That’s good, but no where near the hoped for 0.25 that we would observe if p were 1.

In fact, to get to 0.25, we need about 16 cores. With 16 cores, run times fall to 0.25 (p=0.8) and 0.15625 (p=0.9). Going to 32 cores improves run times just a little, to 0.225 (p=0.8) and 0.128125 (p=0.9). Going to 64 cores, we would get 0.2125 (p=0.8) and 0.11384615 (p=0.9). There’s little gain at all because all the cores in the world combined, and more, cannot reduce run times to below 0.2 (p=0.8) and 0.1 (p=0.9).

Stata/MP supports up to 64 cores. We could make a version that supports 128 cores, but it would be a lot of work even though we would not have to write even one line of code. The work would be in running the experiments to set the tuning parameters.

It turns out there are yet other ways in which reality intrudes. In addition to some calculations such as xt = f(xt-1) not being parallelizable at all, it’s an oversimplification to say any calculation is parallelizable because there are issues of granularity and of diseconomies of scale, two related, but different, problems.

Let’s start with granularity. Consider making the calculation xt = f(zt) for t = 1 to 100, and let’s do that by splitting on the subscript t. If we have n=2 cores, we’ll assign the calculation for t = 1 to 50 to one core, and for t=51 to 100 to another. If we have four cores, we’ll split t into four parts. Granularity concerns what happens when we move from n=100 to n=101 cores. This problem can be split into only 100 parallelizable parts and the minimum run time is therefore max(T/n, T/100) and not T/n, as we previously assumed.

All problems suffer from granularity. Diseconomies of scale is a related issue, and it strikes sooner than granularity. Many, but not all problems suffer from diseconomies of scale. Rather than calculating f(zt) for t = 1 to 100, let’s consider calculating the sum of f(zt) for t = 1 to 100. We’ll make this calculation in parallel in the same way as we made the previous calculation, by splitting on t. This time, however, each subprocess will report back to us the sum over the subrange. To obtain the overall sum, we will have to add sub-sums. So if we have n=2 cores, core 1 will calculate the sum over t = 1 to 50, core 2 will calculate the sum for t = 51 to 100, and then, the calculation having come back together, the master core will have to calculate the sum of two numbers. Adding two numbers can be done in a blink of an eye.

But what if we split the problem across 100 cores? We would get back 100 numbers which we would then have to sum. Moreover, what if the calculation of f(zt) is trivial? In that case, splitting the calculation among all 100 cores might result in run times that are nearly equal to what we would observe performing the calculation on just one core, even though splitting the calculation between two cores would nearly halve the execution time, and splitting among four would nearly quarter it!

So what’s the maximum number of cores over which we should split this problem? It depends on the relative execution times of f(zt) and the the combination operator to be performed on those results (addition in this case).

It is the diseconomies of scale problem that bit us in the early versions of Stata/MP, at least in beta testing. We did not adequately deal with the problem of splitting calculations among fewer cores than were available. Fixing that problem was a lot of work and, for your information, we are still working on it as hardware becomes available with more and more cores. The right way to address the issue is to have calculation-by-calculation tuning parameters, which we do. But it takes a lot of experimental work to determine the values of those tuning parameters, and the greater the number of cores, the more accurately the values need to be measured. We have the tuning parameters determined accurately enough for up to 64 cores, although there are one or two which we suspect we could improve even more. We would need to do a lot of experimentation, however, to ensure we have values adequate for 128 cores. The irony is that we would be doing that to make sure we don’t use them all except when problems are large enough!

In any case, I have seen articles predicting and in some cases, announcing, computers with hundreds of cores. For applications with p approaching 1, those are exciting announcements. In the world of statistical software, however, these announcements are exciting only for those running with immense datasets.

  • Nice use of numerical examples in this post. When understanding the behavior of such a function I also like to look at graphs next to such numerical examples. In this case I used the graph below:nnlocal T = 1.0nlocal p = 0.8nntwoway function Tn= `p’*`T’/x + (1-`p’)*`T’, ///n range(1 64) ylab(0(.2)1) ///n yline(`=(1-`p’)*`T”) ///n note(“T = `T’, p = `p'”) ///n xtitle(“number of cores”) /// n ytitle(“time”) n

  • Laszlo

    Does the 64-core limit (and the whole diseconomies of scale problem itself) mean that you will not utilize GPUs anytime soon? (I know you had problems with OpenCL on single-precision CPUs — but recent GPUs get you double precision…) I hoped that you had been mum about this only to embargo the big news about Stata 12’s killer feature.

  • Anonymous

    Use of multiple cores and use of General Purpose GPUs (GPGPUs) are very different things, which is to say, the 64-core limit is unrelated to use of GPGPUs. I emphasize again the current 64-core limit is due only to lack of testing to set tuning parameters. Making Stata work with more cores is no work at all; we just change a limit. Making Stata work well with with more cores is substantial work even though, as I said, no new code needs to be written.nnUse of GPGPUs can result in performance enhancements, and sometimes whopping performance enhancements, but they usually do not provide the performance enhancements provided by multiple cores. I want to be careful about generalizing. GPGPUs serve the specialized purpose of making vectorized calculations and some matrix calculations, and it is a restriction that the values on which the calculations are being made really are stored as vectors and matrices, which is to say, the values must be adjacent to one another. Multiple cores, on the other hand, can be used with any data structure and in any situation. Thus, whether multiple cores yield a benefit depends solely on the conceptual nature of of the problem. Whether GPGPUs yield a benefit depends on the conceptual nature of the problem, that it be a problem within the GPGPU’s repertoire, and that the data already be arranged in a way the GPGPU requires. Thus, even though a problem might lend itself to parallelization, it might not lend itself to GPGPU implementation.

  • tpy

    Are those different MP versions downward compatible? That is, if I use MP12 on a quad-core machine, will it be just as fast as using MP4 on the same quad-core machine?

  • Yes — Stata/MP automatically downscales itself to fit the number of cores on a given computer. There is no difference in performance between Stata/MP (4-core) on a 4-core computer and Stata/MP (12-core) on that same 4-core computer. Stata/MP (12-core) will of course perform faster given more than 4 cores.

  • flashrandom

    Fantastic explanation, thank you very much.

  • ashleyr

    I have written my own ado file, which spends most of its time in a forvalues loop. The loop uses either reg or sureg (which you folks have parallelized), but seems to run at much the same speed as in Stata 11. I’m guessing that the routine must be spending much of its compute time on drawnorm or correlate commands. Are these high on the list for parallelization? Also, is there any prospect of commands which will allow us to parallelize a forvalues loop?