## Stata/Python integration part 8: Using the Stata Function Interface to copy data from Stata to Python

In my previous posts, I used the **read_stata()** method to read Stata datasets into pandas data frames. This works well when you want to read an entire Stata dataset into Python. But sometimes we wish to read a subset of the variables or observations, or both, from a Stata dataset into Python. In this post, I will introduce you to the Stata Function Interface (SFI) module and show you how to use it to read partial datasets into a pandas data frame.

If you are not familiar with Python, it may be helpful to read the first four posts in my Stata/Python Integration series before you read further.

- Setting up Stata to use Python
- Three ways to use Python in Stata
- How to install Python packages
- How to use Python packages

__Using the SFI module to move data from Stata to Python__

The SFI is a Python module that allows you to pass information back and forth between Stata and Python. You can copy whole or partial datasets, data frames, local and global macros, scalars and matrices, and even global Mata matrices. There are far too many features to show you in one blog post. So today I am going to show you a feature that you are likely to use: reading partial Stata datasets into Python. We’ll explore more SFI features in future posts.

Let’s begin the code block below by using the auto dataset. Next, let’s enter the Python environment and import the **Data** class from the SFI module. Then, we will use the **get()** method in the **Data** class to copy the variable **foreign** into a Python list object named **dataraw**. The first argument of the **get()** method is a list of Stata variables placed in single quotes.

sysuse auto python from sfi import Data dataraw = Data.get('foreign') dataraw end

The Python output shows us that the list object **dataraw** contains the data for the Stata variable **foreign**.

. python ----------------------------------------------- python (type end to exit) ------ >>> from sfi import Data >>> dataraw = Data.get('foreign') >>> dataraw [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 > , 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] >>> end --------------------------------------------------------------------------------

__Specify a range of observations__

The second argument of the **get()** method allows us to specify a range of observations. I have used the **range()** function in the code block below to specify observations 46 through 56. Note that I have also added **mpg** and **rep78** to the list of variables.

python from sfi import Data dataraw = Data.get('foreign mpg rep78', range(46,56)) dataraw end

The Python output shows the contents of the list object **dataraw**. The list contains sublists that each include three values. Each sublist is an observation from the Stata dataset and contains data for the variables **foreign**, **mpg**, and **rep78**. The number **8.98846567431158e+307** in the fifth observation is a missing value, and we will learn how to handle it below.

. python ----------------------------------------------- python (type end to exit) ------ >>> from sfi import Data >>> dataraw = Data.get('foreign mpg rep78', ... range(46,56)) >>> dataraw [[0, 18, 4], [0, 18, 1], [0, 19, 3], [0, 19, 3], [0, 19, 8.98846567431158e+307], > [0, 24, 2], [1, 17, 5], [1, 23, 3], [1, 25, 4], [1, 23, 4]] >>> end --------------------------------------------------------------------------------

__Specify observations using an indicator variable__

The third argument of the **get()** method allows us to further restrict our data based on an indicator variable. In the example below, I have **generate**d a new variable named **touse** that equals 1 if **mpg** is less than 20 and 0 otherwise. Then, I have specified **“touse”** as the third argument in **get()**.

generate touse = mpg<20 python from sfi import Data dataraw = Data.get('foreign mpg rep78', range(46,56), "touse") dataraw end

The Python output shows that **dataraw** contains only observations where **mpg** is less than 20.

. python ----------------------------------------------- python (type end to exit) ------ >>> from sfi import Data >>> dataraw = Data.get('foreign mpg rep78', ... range(46,56), ... "touse") >>> dataraw [[0, 18, 4], [0, 18, 1], [0, 19, 3], [0, 19, 3], [0, 19, 8.98846567431158e+307], > [1, 17, 5]] >>> end --------------------------------------------------------------------------------

__Get value labels rather than numbers__

The values of the Stata variable **foreign** are labeled with “Domestic” for 0 and “Foreign” for 1.

. list foreign in 50/54 +----------+ | foreign | |----------| 50. | Domestic | 51. | Domestic | 52. | Domestic | 53. | Foreign | 54. | Foreign | +----------+

Our Python list object **dataraw** stores only the underlying numeric values 0 and 1, but we may prefer to work with the labels. The fourth argument of **get()** allows us to pass the value labels of a Stata variable to Python rather than the numbers. I have specified **valuelabel=True** in the code block below to pass the value labels to Python.

python from sfi import Data dataraw = Data.get('foreign mpg rep78', range(46,56), "touse", valuelabel=True) dataraw end

The Python output below shows us that **dataraw** now contains the words “Domestic” and “Foreign”. Note that these are strings rather than labeled numeric values.

. python ----------------------------------------------- python (type end to exit) ------ >>> from sfi import Data >>> dataraw = Data.get('foreign mpg rep78', ... range(46,56), ... "touse", ... valuelabel=True,) >>> dataraw [['Domestic', 18, 4], ['Domestic', 18, 1], ['Domestic', 19, 3], ['Domestic', 19, > 3], ['Domestic', 19, 8.98846567431158e+307], ['Foreign', 17, 5]] >>> end --------------------------------------------------------------------------------

__Specify a number for missing values__

The fifth argument of **get()** allows us to specify a value for missing data. Recall that Stata stores missing values as the largest possible value for a numeric storage type. The Stata variable **rep78** is stored as a double-precision numeric variable that has a maximum value of 8.98846567431158e+307. Floating-point numeric variables have a maximum value of 1.70141173319e+38, long variables have a maximum value of 2,147,483,620, int variables have a maximum value of 32,740, and byte variables have a maximum value of 100. Thus, the precise value of a missing value depends on the storage type of the variable.

Python does not recognize these numbers as missing values. Python interprets 8.98846567431158e+307 as a number. Missing numeric values in Python are often represented with Numpy’s special floating-point value “nan”, which was first defined by the Institute of Electrical and Electronics Engineers in the IEEE 754-1985 Standards. We can tell Python that 8.98846567431158e+307 is “not a number” (nan) by specifying **missingval=np.nan** for the fifth argument of **get()**.

python from sfi import Data import numpy as np dataraw = Data.get('foreign mpg rep78', range(46,56), "touse", valuelabel=True, missingval=np.nan) dataraw end

The Python output below shows that the number 8.98846567431158e+307 in **dataraw** has been replaced with Numpy’s special floating-point value “nan”.

. python ----------------------------------------------- python (type end to exit) ------ >>> from sfi import Data >>> import numpy as np >>> dataraw = Data.get('foreign mpg rep78', ... range(46,56), ... "touse", ... valuelabel=True, ... missingval=np.nan) >>> dataraw [['Domestic', 18, 4], ['Domestic', 18, 1], ['Domestic', 19, 3], ['Domestic', 19, > 3], ['Domestic', 19, nan], ['Foreign', 17, 5]] >>> end --------------------------------------------------------------------------------

__Convert the list object to a pandas data frame__

We have used **get()** to copy part of our Stata dataset into a Python list object named **dataraw**. Next, let’s convert our list object to a pandas data frame.

We begin by importing pandas using the alias pd. Then, we can create a data frame by typing **dataframe = pd.DataFrame(dataraw)**.

python from sfi import Data import numpy as np import pandas as pd dataraw = Data.get('foreign mpg rep78', range(46,56), "touse", valuelabel=True, missingval=np.nan) dataframe = pd.DataFrame(dataraw) dataframe end

The Python output below displays the data frame **dataframe**. The columns labeled 0, 1, and 2 are the variables **foreign**, **mpg**, and **rep78**, respectively. The unlabeled column on the left is an index that pandas created to uniquely identify each row.

. python ----------------------------------------------- python (type end to exit) ------ >>> from sfi import Data >>> import numpy as np >>> import pandas as pd >>> dataraw = Data.get('foreign mpg rep78', ... range(46,56), ... "touse", ... valuelabel=True, ... missingval=np.nan) >>> dataframe = pd.DataFrame(dataraw) >>> dataframe 0 1 2 0 Domestic 18 4 1 Domestic 18 1 2 Domestic 19 3 3 Domestic 19 3 4 Domestic 19 NaN 5 Foreign 17 5 >>> end --------------------------------------------------------------------------------

__Label the columns of a data frame__

We can label the columns of our data frame using the **columns** option in the **DataFrame()** method. The list of column names must be enclosed in square brackets, and each column name must be enclosed in single quotes and separated by commas.

python from sfi import Data import numpy as np import pandas as pd dataraw = Data.get('foreign mpg rep78', range(46,56), "touse", valuelabel=True, missingval=np.nan) dataframe = pd.DataFrame(dataraw, columns=['foreign', 'mpg', 'rep78']) dataframe end

The Python output below shows that the second, third, and fourth columns in the data frame are now named **foreign**, **mpg**, and **rep78**, respectively.

. python ----------------------------------------------- python (type end to exit) ------ >>> from sfi import Data >>> import numpy as np >>> import pandas as pd >>> dataraw = Data.get('foreign mpg rep78', ... range(46,56), ... "touse", ... valuelabel=True, ... missingval=np.nan) >>> dataframe = pd.DataFrame(dataraw, ... columns=['foreign', 'mpg', 'rep78']) >>> dataframe foreign mpg rep78 0 Domestic 18 4 1 Domestic 18 1 2 Domestic 19 3 3 Domestic 19 3 4 Domestic 19 NaN 5 Foreign 17 5 >>> end --------------------------------------------------------------------------------

__Begin the data frame index at 1__

Python uses zero-based array indexing, which means that row and column counts begin with 0 rather than 1. So pandas automatically created a row index that begins with 0. You can skip to the next section if you are comfortable with the index beginning at zero or you do not plan to use the index. Or you can change the index to begin with 1 using the **index** option in the **DataFrame()** method.

We will specify the index using the **arange()** method in the Numpy module. The first argument is the first element of the row index, which is 1. The second argument is the last element of the row index. We could simply type 6 because there are 6 rows in our data frame. But this number could change the next time we run our code. We can use the **len()** method to calculate the length of the list object **dataraw**. And we must add 1 to the length of **dataraw** because Python begins counting at 0.

python from sfi import Data import numpy as np import pandas as pd dataraw = Data.get('foreign mpg rep78', range(46,56), "touse", valuelabel=True, missingval=np.nan) dataframe = pd.DataFrame(dataraw, columns=['foreign', 'mpg', 'rep78'], index=[np.arange(1, len(dataraw)+1)]) dataframe end

The Python output below shows us that the index for **dataframe** now begins at 1 and ends with 6.

. python ----------------------------------------------- python (type end to exit) ------ >>> from sfi import Data >>> import pandas as pd >>> import numpy as np >>> dataraw = Data.get('foreign mpg rep78', ... range(46,56), ... "touse", ... valuelabel=True, ... missingval=np.nan) >>> dataframe = pd.DataFrame(dataraw, ... columns=['foreign', 'mpg', 'rep78'], ... index=[np.arange(1, len(dataraw)+1)]) >>> dataframe foreign mpg rep78 1 Domestic 18 4 2 Domestic 18 1 3 Domestic 19 3 4 Domestic 19 3 5 Domestic 19 NaN 6 Foreign 17 5 >>> end --------------------------------------------------------------------------------

__Using getAsDict()__

You can also use **getAsDict()** to copy Stata data to a Python dictionary. The arguments are the same as **get()**, and the resulting dictionary contains the names of the Stata variables. This means that we don’t have to name the columns when we convert the dictionary to a data frame. Creating a data-frame index that begins with 1 is different because the length of a dictionary is not the number of Stata observations. In the code block below, I defined **obs** as the length of a list of the values in the dictionary **dataraw**. I used the **next()** and **iter()** functions to loop over the values in the dictionary **dataraw**. And I again added 1 because Python begins counting at 0.

python from sfi import Data import pandas as pd import numpy as np dataraw = Data.getAsDict('foreign mpg rep78', range(46,56), "touse", valuelabel=True, missingval=np.nan) dataraw obs = len(next(iter(dataraw.values()))) + 1 dataframe = pd.DataFrame(dataraw, index=[np.arange(1, obs)]) dataframe end

The Python output below shows that the resulting data frame looks much like the data frame we created using **get()**.

. python ----------------------------------------------- python (type end to exit) ------ >>> from sfi import Data >>> import pandas as pd >>> import numpy as np >>> dataraw = Data.getAsDict('foreign mpg rep78', ... range(46,56), ... "touse", ... valuelabel=True, ... missingval=np.nan) >>> dataraw {'foreign': ['Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Foreig > n'], 'mpg': [18, 18, 19, 19, 19, 17], 'rep78': [4, 1, 3, 3, nan, 5]} >>> obs = len(next(iter(dataraw.values()))) + 1 >>> dataframe = pd.DataFrame(dataraw, ... index=[np.arange(1, obs)]) >>> dataframe foreign mpg rep78 1 Domestic 18 4.0 2 Domestic 18 1.0 3 Domestic 19 3.0 4 Domestic 19 3.0 5 Domestic 19 NaN 6 Foreign 17 5.0 >>> end --------------------------------------------------------------------------------

__Just the basics__

Perhaps you don’t wish to restrict your sample, and you don’t mind zero-based indexing. You just want to copy a collection of variables to a pandas data frame in Python. The code block below will do that and convert Stata missing values to Python missing values.

python from sfi import Data import pandas as pd import numpy as np dataraw = Data.getAsDict('foreign mpg rep78', None, None, valuelabel=False, missingval=np.nan) dataframe = pd.DataFrame(dataraw) dataframe end

The output below displays the data frame **dataframe**, which is ready for graphing or data analysis.

>>> dataframe foreign mpg rep78 0 0 22 3.0 1 0 17 3.0 2 0 22 NaN 3 0 20 3.0 4 0 15 4.0 .. ... ... ... 69 1 23 4.0 70 1 41 5.0 71 1 25 4.0 72 1 25 4.0 73 1 17 5.0 [74 rows x 3 columns]

__Conclusion__

We did it! We used the **get()** and **getAsDict()** methods in the **Data** class of the SFI module to copy part of a Stata dataset to a Python data frame. We even accounted for missing data. And it’s easy to use **get()** and **getAsDict()** in our do-files, ado-files, and Python scripts anytime we want to incorporate Python into our data management, analysis, or reporting. Next time, I’ll show you how to use the SFI to copy data from Python into a Stata dataset.

__Further reading__

Di Russo, J. 2019. Navigating The Hell of NaNs in Python, *Towards Data Science* (blog), October 23, 2019.

Fung, K. 2020. Array Indexing: 0-based or 1-based?, Analytics Vidhya, January 26, 2020.