Stata/Python integration part 8: Using the Stata Function Interface to copy data from Stata to Python
In my previous posts, I used the read_stata() method to read Stata datasets into pandas data frames. This works well when you want to read an entire Stata dataset into Python. But sometimes we wish to read a subset of the variables or observations, or both, from a Stata dataset into Python. In this post, I will introduce you to the Stata Function Interface (SFI) module and show you how to use it to read partial datasets into a pandas data frame.
If you are not familiar with Python, it may be helpful to read the first four posts in my Stata/Python Integration series before you read further.
- Setting up Stata to use Python
- Three ways to use Python in Stata
- How to install Python packages
- How to use Python packages
Using the SFI module to move data from Stata to Python
The SFI is a Python module that allows you to pass information back and forth between Stata and Python. You can copy whole or partial datasets, data frames, local and global macros, scalars and matrices, and even global Mata matrices. There are far too many features to show you in one blog post. So today I am going to show you a feature that you are likely to use: reading partial Stata datasets into Python. We’ll explore more SFI features in future posts.
Let’s begin the code block below by using the auto dataset. Next, let’s enter the Python environment and import the Data class from the SFI module. Then, we will use the get() method in the Data class to copy the variable foreign into a Python list object named dataraw. The first argument of the get() method is a list of Stata variables placed in single quotes.
sysuse auto python from sfi import Data dataraw = Data.get('foreign') dataraw end
The Python output shows us that the list object dataraw contains the data for the Stata variable foreign.
. python ----------------------------------------------- python (type end to exit) ------ >>> from sfi import Data >>> dataraw = Data.get('foreign') >>> dataraw [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 > , 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] >>> end --------------------------------------------------------------------------------
Specify a range of observations
The second argument of the get() method allows us to specify a range of observations. I have used the range() function in the code block below to specify observations 46 through 56. Note that I have also added mpg and rep78 to the list of variables.
python from sfi import Data dataraw = Data.get('foreign mpg rep78', range(46,56)) dataraw end
The Python output shows the contents of the list object dataraw. The list contains sublists that each include three values. Each sublist is an observation from the Stata dataset and contains data for the variables foreign, mpg, and rep78. The number 8.98846567431158e+307 in the fifth observation is a missing value, and we will learn how to handle it below.
. python ----------------------------------------------- python (type end to exit) ------ >>> from sfi import Data >>> dataraw = Data.get('foreign mpg rep78', ... range(46,56)) >>> dataraw [[0, 18, 4], [0, 18, 1], [0, 19, 3], [0, 19, 3], [0, 19, 8.98846567431158e+307], > [0, 24, 2], [1, 17, 5], [1, 23, 3], [1, 25, 4], [1, 23, 4]] >>> end --------------------------------------------------------------------------------
Specify observations using an indicator variable
The third argument of the get() method allows us to further restrict our data based on an indicator variable. In the example below, I have generated a new variable named touse that equals 1 if mpg is less than 20 and 0 otherwise. Then, I have specified “touse” as the third argument in get().
generate touse = mpg<20 python from sfi import Data dataraw = Data.get('foreign mpg rep78', range(46,56), "touse") dataraw end
The Python output shows that dataraw contains only observations where mpg is less than 20.
. python ----------------------------------------------- python (type end to exit) ------ >>> from sfi import Data >>> dataraw = Data.get('foreign mpg rep78', ... range(46,56), ... "touse") >>> dataraw [[0, 18, 4], [0, 18, 1], [0, 19, 3], [0, 19, 3], [0, 19, 8.98846567431158e+307], > [1, 17, 5]] >>> end --------------------------------------------------------------------------------
Get value labels rather than numbers
The values of the Stata variable foreign are labeled with “Domestic” for 0 and “Foreign” for 1.
. list foreign in 50/54 +----------+ | foreign | |----------| 50. | Domestic | 51. | Domestic | 52. | Domestic | 53. | Foreign | 54. | Foreign | +----------+
Our Python list object dataraw stores only the underlying numeric values 0 and 1, but we may prefer to work with the labels. The fourth argument of get() allows us to pass the value labels of a Stata variable to Python rather than the numbers. I have specified valuelabel=True in the code block below to pass the value labels to Python.
python from sfi import Data dataraw = Data.get('foreign mpg rep78', range(46,56), "touse", valuelabel=True) dataraw end
The Python output below shows us that dataraw now contains the words “Domestic” and “Foreign”. Note that these are strings rather than labeled numeric values.
. python ----------------------------------------------- python (type end to exit) ------ >>> from sfi import Data >>> dataraw = Data.get('foreign mpg rep78', ... range(46,56), ... "touse", ... valuelabel=True,) >>> dataraw [['Domestic', 18, 4], ['Domestic', 18, 1], ['Domestic', 19, 3], ['Domestic', 19, > 3], ['Domestic', 19, 8.98846567431158e+307], ['Foreign', 17, 5]] >>> end --------------------------------------------------------------------------------
Specify a number for missing values
The fifth argument of get() allows us to specify a value for missing data. Recall that Stata stores missing values as the largest possible value for a numeric storage type. The Stata variable rep78 is stored as a double-precision numeric variable that has a maximum value of 8.98846567431158e+307. Floating-point numeric variables have a maximum value of 1.70141173319e+38, long variables have a maximum value of 2,147,483,620, int variables have a maximum value of 32,740, and byte variables have a maximum value of 100. Thus, the precise value of a missing value depends on the storage type of the variable.
Python does not recognize these numbers as missing values. Python interprets 8.98846567431158e+307 as a number. Missing numeric values in Python are often represented with Numpy’s special floating-point value “nan”, which was first defined by the Institute of Electrical and Electronics Engineers in the IEEE 754-1985 Standards. We can tell Python that 8.98846567431158e+307 is “not a number” (nan) by specifying missingval=np.nan for the fifth argument of get().
python from sfi import Data import numpy as np dataraw = Data.get('foreign mpg rep78', range(46,56), "touse", valuelabel=True, missingval=np.nan) dataraw end
The Python output below shows that the number 8.98846567431158e+307 in dataraw has been replaced with Numpy’s special floating-point value “nan”.
. python ----------------------------------------------- python (type end to exit) ------ >>> from sfi import Data >>> import numpy as np >>> dataraw = Data.get('foreign mpg rep78', ... range(46,56), ... "touse", ... valuelabel=True, ... missingval=np.nan) >>> dataraw [['Domestic', 18, 4], ['Domestic', 18, 1], ['Domestic', 19, 3], ['Domestic', 19, > 3], ['Domestic', 19, nan], ['Foreign', 17, 5]] >>> end --------------------------------------------------------------------------------
Convert the list object to a pandas data frame
We have used get() to copy part of our Stata dataset into a Python list object named dataraw. Next, let’s convert our list object to a pandas data frame.
We begin by importing pandas using the alias pd. Then, we can create a data frame by typing dataframe = pd.DataFrame(dataraw).
python from sfi import Data import numpy as np import pandas as pd dataraw = Data.get('foreign mpg rep78', range(46,56), "touse", valuelabel=True, missingval=np.nan) dataframe = pd.DataFrame(dataraw) dataframe end
The Python output below displays the data frame dataframe. The columns labeled 0, 1, and 2 are the variables foreign, mpg, and rep78, respectively. The unlabeled column on the left is an index that pandas created to uniquely identify each row.
. python ----------------------------------------------- python (type end to exit) ------ >>> from sfi import Data >>> import numpy as np >>> import pandas as pd >>> dataraw = Data.get('foreign mpg rep78', ... range(46,56), ... "touse", ... valuelabel=True, ... missingval=np.nan) >>> dataframe = pd.DataFrame(dataraw) >>> dataframe 0 1 2 0 Domestic 18 4 1 Domestic 18 1 2 Domestic 19 3 3 Domestic 19 3 4 Domestic 19 NaN 5 Foreign 17 5 >>> end --------------------------------------------------------------------------------
Label the columns of a data frame
We can label the columns of our data frame using the columns option in the DataFrame() method. The list of column names must be enclosed in square brackets, and each column name must be enclosed in single quotes and separated by commas.
python from sfi import Data import numpy as np import pandas as pd dataraw = Data.get('foreign mpg rep78', range(46,56), "touse", valuelabel=True, missingval=np.nan) dataframe = pd.DataFrame(dataraw, columns=['foreign', 'mpg', 'rep78']) dataframe end
The Python output below shows that the second, third, and fourth columns in the data frame are now named foreign, mpg, and rep78, respectively.
. python ----------------------------------------------- python (type end to exit) ------ >>> from sfi import Data >>> import numpy as np >>> import pandas as pd >>> dataraw = Data.get('foreign mpg rep78', ... range(46,56), ... "touse", ... valuelabel=True, ... missingval=np.nan) >>> dataframe = pd.DataFrame(dataraw, ... columns=['foreign', 'mpg', 'rep78']) >>> dataframe foreign mpg rep78 0 Domestic 18 4 1 Domestic 18 1 2 Domestic 19 3 3 Domestic 19 3 4 Domestic 19 NaN 5 Foreign 17 5 >>> end --------------------------------------------------------------------------------
Begin the data frame index at 1
Python uses zero-based array indexing, which means that row and column counts begin with 0 rather than 1. So pandas automatically created a row index that begins with 0. You can skip to the next section if you are comfortable with the index beginning at zero or you do not plan to use the index. Or you can change the index to begin with 1 using the index option in the DataFrame() method.
We will specify the index using the arange() method in the Numpy module. The first argument is the first element of the row index, which is 1. The second argument is the last element of the row index. We could simply type 6 because there are 6 rows in our data frame. But this number could change the next time we run our code. We can use the len() method to calculate the length of the list object dataraw. And we must add 1 to the length of dataraw because Python begins counting at 0.
python from sfi import Data import numpy as np import pandas as pd dataraw = Data.get('foreign mpg rep78', range(46,56), "touse", valuelabel=True, missingval=np.nan) dataframe = pd.DataFrame(dataraw, columns=['foreign', 'mpg', 'rep78'], index=[np.arange(1, len(dataraw)+1)]) dataframe end
The Python output below shows us that the index for dataframe now begins at 1 and ends with 6.
. python ----------------------------------------------- python (type end to exit) ------ >>> from sfi import Data >>> import pandas as pd >>> import numpy as np >>> dataraw = Data.get('foreign mpg rep78', ... range(46,56), ... "touse", ... valuelabel=True, ... missingval=np.nan) >>> dataframe = pd.DataFrame(dataraw, ... columns=['foreign', 'mpg', 'rep78'], ... index=[np.arange(1, len(dataraw)+1)]) >>> dataframe foreign mpg rep78 1 Domestic 18 4 2 Domestic 18 1 3 Domestic 19 3 4 Domestic 19 3 5 Domestic 19 NaN 6 Foreign 17 5 >>> end --------------------------------------------------------------------------------
Using getAsDict()
You can also use getAsDict() to copy Stata data to a Python dictionary. The arguments are the same as get(), and the resulting dictionary contains the names of the Stata variables. This means that we don’t have to name the columns when we convert the dictionary to a data frame. Creating a data-frame index that begins with 1 is different because the length of a dictionary is not the number of Stata observations. In the code block below, I defined obs as the length of a list of the values in the dictionary dataraw. I used the next() and iter() functions to loop over the values in the dictionary dataraw. And I again added 1 because Python begins counting at 0.
python from sfi import Data import pandas as pd import numpy as np dataraw = Data.getAsDict('foreign mpg rep78', range(46,56), "touse", valuelabel=True, missingval=np.nan) dataraw obs = len(next(iter(dataraw.values()))) + 1 dataframe = pd.DataFrame(dataraw, index=[np.arange(1, obs)]) dataframe end
The Python output below shows that the resulting data frame looks much like the data frame we created using get().
. python ----------------------------------------------- python (type end to exit) ------ >>> from sfi import Data >>> import pandas as pd >>> import numpy as np >>> dataraw = Data.getAsDict('foreign mpg rep78', ... range(46,56), ... "touse", ... valuelabel=True, ... missingval=np.nan) >>> dataraw {'foreign': ['Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Foreig > n'], 'mpg': [18, 18, 19, 19, 19, 17], 'rep78': [4, 1, 3, 3, nan, 5]} >>> obs = len(next(iter(dataraw.values()))) + 1 >>> dataframe = pd.DataFrame(dataraw, ... index=[np.arange(1, obs)]) >>> dataframe foreign mpg rep78 1 Domestic 18 4.0 2 Domestic 18 1.0 3 Domestic 19 3.0 4 Domestic 19 3.0 5 Domestic 19 NaN 6 Foreign 17 5.0 >>> end --------------------------------------------------------------------------------
Just the basics
Perhaps you don’t wish to restrict your sample, and you don’t mind zero-based indexing. You just want to copy a collection of variables to a pandas data frame in Python. The code block below will do that and convert Stata missing values to Python missing values.
python from sfi import Data import pandas as pd import numpy as np dataraw = Data.getAsDict('foreign mpg rep78', None, None, valuelabel=False, missingval=np.nan) dataframe = pd.DataFrame(dataraw) dataframe end
The output below displays the data frame dataframe, which is ready for graphing or data analysis.
>>> dataframe foreign mpg rep78 0 0 22 3.0 1 0 17 3.0 2 0 22 NaN 3 0 20 3.0 4 0 15 4.0 .. ... ... ... 69 1 23 4.0 70 1 41 5.0 71 1 25 4.0 72 1 25 4.0 73 1 17 5.0 [74 rows x 3 columns]
Conclusion
We did it! We used the get() and getAsDict() methods in the Data class of the SFI module to copy part of a Stata dataset to a Python data frame. We even accounted for missing data. And it’s easy to use get() and getAsDict() in our do-files, ado-files, and Python scripts anytime we want to incorporate Python into our data management, analysis, or reporting. Next time, I’ll show you how to use the SFI to copy data from Python into a Stata dataset.
Further reading
Di Russo, J. 2019. Navigating The Hell of NaNs in Python, Towards Data Science (blog), October 23, 2019.
Fung, K. 2020. Array Indexing: 0-based or 1-based?, Analytics Vidhya, January 26, 2020.