Archive for the ‘Programming’ Category

Stata/Python integration part 7: Machine learning with support vector machines

Machine learning, deep learning, and artificial intelligence are a collection of algorithms used to identify patterns in data. These algorithms have exotic-sounding names like “random forests”, “neural networks”, and “spectral clustering”. In this post, I will show you how to use one of these algorithms called a “support vector machines” (SVM). I don’t have space to explain an SVM in detail, but I will provide some references for further reading at the end. I am going to give you a brief introduction and show you how to implement an SVM with Python.

Our goal is to use an SVM to differentiate between people who are likely to have diabetes and those who are not. We will use age and HbA1c level to differentiate between people with and without diabetes. Age is measured in years, and HbA1c is a blood test that measures glucose control. The graph below displays diabetics with red dots and nondiabetics with blue dots. An SVM model predicts that older people with higher levels of HbA1c in the red-shaded area of the graph are more likely to have diabetes. Younger people with lower HbA1c levels in the blue-shaded area are less likely to have diabetes. Read more…

Stata/Python integration part 6: Working with APIs and JSON data

Data are everywhere. Many government agencies, financial institutions, universities, and social media platforms provide access to their data through an application programming interface (API). APIs often return the requested data in a JavaScript Object Notation (JSON) file. In this post, I will show you how to use Python to request data with API calls and how to work with the resulting JSON data. Read more…

Categories: Programming Tags: , , , ,

Stata/Python integration part 5: Three-dimensional surface plots of marginal predictions

In my first four posts about Stata and Python, I showed you how to set up Stata to use Python, three ways to use Python in Stata, how to install Python packages, and how to use Python packages. It might be helpful to read those posts before you continue with this post if you are not familiar with Python. Now, I’d like to shift our focus to some practical uses of Python within Stata. This post will demonstrate how to use Stata to estimate marginal predictions from a logistic regression model and use Python to create a three-dimensional surface plot of those predictions.

Read more…

Stata/Python integration part 4: How to use Python packages

In my last post, I showed you how to use pip to install four popular packages for Python. Today I want to show you the basics of how to import and use Python packages. We will learn some important Python concepts and jargon along the way. I will be using the pandas package in the examples below, but the ideas and syntax are the same for other Python packages. Read more…

Categories: Programming Tags: ,

Stata/Python integration part 3: How to install Python packages

In my last post, I showed you three ways to use Python within Stata. The examples were simple but they allowed us to start using Python. At this point, you could write your own Python programs within Stata. But the real power of Python lies in the thousands of freely available packages. Today, I want to show you how to download and install Python packages. Read more…

Categories: Programming Tags: ,

Stata/Python integration part 2: Three ways to use Python in Stata

In my last post, I showed you how to install Python and set up Stata to use Python. Now, we’re ready to use Python. There are three ways to use Python within Stata: calling Python interactively, including Python code in do-files and ado-files, and executing Python script files. Each is useful in different circumstances, so I will demonstrate all three. The examples are intentionally simple and somewhat silly. I’ll show you some more complex examples in future posts, but I want to keep things simple in this post. Read more…

Categories: Programming Tags: ,

Stata/Python integration part 1: Setting up Stata to use Python

Python integration is one of the most exciting features in Stata 16. There are thousands of free Python packages that you can use to access and process data from the Internet, visualize data, explore data using machine-learning algorithms, and much more. You can use these Python packages interactively within Stata or incorporate Python code into your do-files. And there are a growing number of community-contributed commands that have familiar, Stata-style syntax that use Python packages as the computational engine. But there are a few things that we must do before we can use Python in Stata. This blog post will show you how to set up Stata to use Python. Read more…

Categories: Programming Tags: ,

Revealed preference: Stata for reproducible research

I care about reproducible research. Anyone who has ever been a research assistant or tried to follow the path set by other researchers also cares. Sometimes, reproducing others’ results is a frustrating task; sometimes, it is outright impossible. Yet sometimes, it is satisfyingly simple. In my experience, reproducing results is easy when it involves a Stata do-file. I believe this is true even beyond my personal bias (I work for Stata and used the software regularly before that). A recent article published by the American Economic Association (AEA), Vilhuber, Turrito, and Welch (2020), shows that Stata is the preferred package among economists, and I believe reproducibility is a big reason why. Read more…

Compatibility and reproducibility

I saw a tweet the other day where someone claimed that StataCorp ensures that the dataset format in Stata X is always different from Stata X-1.

This reminded me of an email I wrote a few years ago to a user who had questions about backward compatibility and reproducibility. I’m going to use large parts of that email in this blog post to share my thoughts on those topics.

I understand the frustration of incompatibilities between software versions. While it may not ease the inevitable difficulties that arise, I would like to explain our efforts in this regard. Read more…

How to automate common tasks

Automating common tasks is crucial to effective data analysis. Automation saves you lots of time from repeating the same sets of operations, and it reduces errors by reducing what you have to repeat.

Let’s automate something using Stata. The task we are automating doesn’t much matter. What matters is that we get comfortable with how to automate tasks.

We will automate the simple task of normalizing a variable. That is to say, subtracting the variable’s mean and dividing by its standard deviation.

Just so you know, there are already community-contributed commands to do this and to do it more flexibly than we will. Type search normalize variable in Stata, and you will see one of those commands. (You will see things about other types of normalization that have nothing to do with normalizing a variable, but the command of interest is easy to pick out.) You can also normalize a single variable using Stata’s egen command, but we are going to do more than that.

As with all the articles in this series, I assume the reader is new to automating tasks in Stata. So, if you are already an expert, these articles may hold little interest for you. Or perhaps you will still find something novel. Read more…