
Stata in the Cloud

As more organizations move their IT, data management, and data analysis needs to the Cloud, I often have to answer these questions:

  1. Can Stata run in the Cloud?
  2. Am I allowed to run my copy of Stata in the Cloud?
  3. What is the best setup for Stata in the Cloud?
  4. How does Stata perform in the Cloud?

Before I answer these questions, let’s define cloud computing. Wikipedia defines it as follows:

“Cloud computing is the on-demand availability of computer system resources, especially data storage and computing power, without direct active management by the user. The term is generally used to describe data centers available to many users over the Internet.”

The main reason I see our users use cloud computing is so they can easily add more computing resources (memory and cores) to projects they are working on to speed up development and analytics. What’s nice about cloud services is that they provide an easy way to add resources on demand. Basically, you pay for hardware resources only when you need them, which saves time and money and allows you to scale different projects accordingly.

Now let’s talk cloud platforms. The two main platforms I see our users using are Amazon Web Services and Microsoft Azure. There are other platforms, but these are the ones I hear questions about.

So, can Stata run on the Cloud? Yes, Stata can. Most cloud computers are virtual machines running Linux or Windows operating systems, and Stata runs on both. Now, which flavor of Stata should you use, IC, SE, or MP? I definitely recommend using Stata/MP on the Cloud if you are working with large datasets and the Stata commands you wish to use are highly parallelized. To see a list of all commands that have been sped up and by how much, see the Stata/MP Performance Report.

Users often ask if they are allowed to use their Stata license on the Cloud. The answer is absolutely. We draw no distinction between a workstation or server on-premises, a virtual machine on-premises, and an equivalent virtual machine on the Cloud. Your Stata license is yours to use on any computer you wish—real, virtual, or virtual on the Cloud.

Question three is a little harder to answer. The best setup largely depends on your specific needs. Some questions you will need to answer are these:

  1. What operating system are you or your users comfortable using?
  2. What is the typical size of data your organization will be working with?
  3. How many cores and how much memory are you going to allocate in the Cloud?
  4. How many users will be accessing this Cloud virtual machine at the same time?

Note that these questions aren’t Cloud specific and really apply to any setup, Cloud or on-premises, where resources are shared between users. The last question is an important one. Once your Cloud (or on-premises) machine has multiple users using Stata simultaneously, you must make sure you have a big enough machine with enough memory and cores for all the users. For example, if you have a Stata/MP 4-core 2-user license, you will want to have a Cloud machine with at least 8 cores allocated to it, 4 cores for each Stata user. Or you will want to spin up multiple cloud instances, giving users their own virtual machines.
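As a rough sketch of that sizing rule, the arithmetic can be done in Stata itself. The user count and cores per user below are simply the values from the example license, and the c() values are ones Stata/MP reports about the machine it is running on. Treat the output as a sanity check, not a sizing guarantee.

```stata
* Sizing sketch for the example above: a Stata/MP 4-core, 2-user license.
* Both numbers are taken from the example; change them to match your license.
local users 2
local cores_per_user 4

* Cores the Cloud machine should have so that sessions do not compete.
display "Cores to allocate: " `users' * `cores_per_user'

* On the running machine, compare what Stata/MP sees with what it is using.
display "Cores on this machine:    " c(processors_mach)
display "Cores Stata is using now: " c(processors)
```

If the machine reports fewer cores than the first figure, either allocate a larger instance or give each user a separate virtual machine.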

The next consideration is memory. If the two users are each working with a 5 GB Stata dataset, you will need at least 16 GB of RAM allocated to the Cloud machine: 10 GB for the data in memory, plus headroom for the operating system and for the analysis itself. Or you could allocate two Cloud machines with 8 GB of RAM each.
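The same back-of-the-envelope check works for memory. This is a minimal sketch using the figures from the example (two users, 5 GB each, a 16 GB machine). How much headroom is enough depends on the operating system and the commands being run, so the remainder is only a starting point.

```stata
* Memory sketch for the example above; all figures are in GB.
local users 2
local data_gb 5
local machine_gb 16

* RAM needed just to hold every user's dataset at once.
local data_total = `users' * `data_gb'
display "RAM for data:       " `data_total'
display "Headroom remaining: " `machine_gb' - `data_total'

* With a dataset already loaded, report Stata's current memory usage.
memory

* Store variables in their smallest sufficient types to shrink the data.
compress
```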

The most frequent issue I hear about from people using Stata on the Cloud is that users compete for RAM because several of them are trying to load large datasets on the same computer at the same time. The easiest way around this is to use the Cloud the way it was designed: spin up multiple virtual computers to scale the load. It is also easy to train your Stata users to use memory efficiently. Get them to load into Stata’s memory only the variables they need to analyze rather than blindly bringing in the entire dataset. For example, let’s say your user is working with a U.S. Census dataset that contains 20,000 variables, but the user needs to analyze only 100 of them. With the use command, Stata can load just the variables you need from a Stata dataset: list them before the using keyword, as in use varlist using filename.
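Here is a minimal sketch of loading a subset of variables. The file census.dta and the variable names are hypothetical placeholders, so substitute your own.

```stata
* census.dta and the variable names are hypothetical placeholders.
* Load only the variables needed for the analysis.
use age income state using census.dta, clear

* Observations can be restricted at load time too.
use age income state if age >= 18 using census.dta, clear
```

Only the listed variables, and only the observations that satisfy the condition, are read into memory, so the footprint scales with what you analyze rather than with the size of the file on disk.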

If you are unsure of which variables to load or need to search for the exact variables to load, you can use the use dialog box in Stata 16 to easily search for variables. The video accompanying this post shows how.
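If you prefer the command line, one possible approach is to inspect the file before loading it. As before, census.dta and the search term are hypothetical placeholders.

```stata
* List what the file contains without reading the data into memory.
describe using census.dta, short
describe using census.dta, simple

* Load a single observation so that lookfor can search
* variable names and labels cheaply.
use in 1 using census.dta, clear
lookfor income
```

Loading even one observation brings in every variable, so for a very wide file you may first need to raise the variable limit with set maxvar.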

Once you have the exact use command, copy the command to a do-file, and save it for future data loading.

The answer to the final question, how well Stata performs in the Cloud, depends on the same issues discussed above, and it is no different from asking how Stata performs on an on-premises computer.

What is the typical size of datasets your organization will be working with? What type of Cloud virtual machine are you using, and how many cores and how much memory are you going to allocate to it? How many users will be accessing this Cloud virtual machine at the same time? What Stata commands and models are you using? The Cloud providers publish the specifications of the virtual machine instances you can use, and Stata will perform on them just as it would on equivalent physical machines.

The size of data, the resources allocated, and the number of people using the resources simultaneously are going to be the main issues to consider when building your environment.

If you have any questions on this subject, feel free to post in the comments or ask me on Twitter.