Stata/Python integration part 8: Using the Stata Function Interface to copy data from Stata to Python

Home > Programming > Stata/Python integration part 8: Using the Stata Function Interface to copy data from Stata to Python

Stata/Python integration part 8: Using the Stata Function Interface to copy data from Stata to Python

5 November 2020 Chuck Huber, Director of Statistical Outreach Go to comments

In my previous posts, I used the read_stata() method to read Stata datasets into pandas data frames. This works well when you want to read an entire Stata dataset into Python. But sometimes we wish to read a subset of the variables or observations, or both, from a Stata dataset into Python. In this post, I will introduce you to the Stata Function Interface (SFI) module and show you how to use it to read partial datasets into a pandas data frame.

If you are not familiar with Python, it may be helpful to read the first four posts in my Stata/Python Integration series before you read further.

Using the SFI module to move data from Stata to Python

The SFI is a Python module that allows you to pass information back and forth between Stata and Python. You can copy whole or partial datasets, data frames, local and global macros, scalars and matrices, and even global Mata matrices. There are far too many features to show you in one blog post. So today I am going to show you a feature that you are likely to use: reading partial Stata datasets into Python. We’ll explore more SFI features in future posts.

Let’s begin the code block below by using the auto dataset. Next, let’s enter the Python environment and import the Data class from the SFI module. Then, we will use the get() method in the Data class to copy the variable foreign into a Python list object named dataraw. The first argument of the get() method is a list of Stata variables placed in single quotes.

sysuse auto
python
from sfi import Data
dataraw = Data.get('foreign')
dataraw
end

The Python output shows us that the list object dataraw contains the data for the Stata variable foreign.

. python
----------------------------------------------- python (type end to exit) ------
>>> from sfi import Data
>>> dataraw = Data.get('foreign')
>>> dataraw
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1
> , 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> end
--------------------------------------------------------------------------------

Specify a range of observations

The second argument of the get() method allows us to specify a range of observations. I have used the range() function in the code block below to specify observations 46 through 56. Note that I have also added mpg and rep78 to the list of variables.

python
from sfi import Data
dataraw   = Data.get('foreign mpg rep78',
                     range(46,56))
dataraw
end

The Python output shows the contents of the list object dataraw. The list contains sublists that each include three values. Each sublist is an observation from the Stata dataset and contains data for the variables foreign, mpg, and rep78. The number 8.98846567431158e+307 in the fifth observation is a missing value, and we will learn how to handle it below.

. python
----------------------------------------------- python (type end to exit) ------
>>> from sfi import Data
>>> dataraw   = Data.get('foreign mpg rep78',
...                      range(46,56))
>>> dataraw
[[0, 18, 4], [0, 18, 1], [0, 19, 3], [0, 19, 3], [0, 19, 8.98846567431158e+307],
>  [0, 24, 2], [1, 17, 5], [1, 23, 3], [1, 25, 4], [1, 23, 4]]
>>> end
--------------------------------------------------------------------------------

Specify observations using an indicator variable

The third argument of the get() method allows us to further restrict our data based on an indicator variable. In the example below, I have generated a new variable named touse that equals 1 if mpg is less than 20 and 0 otherwise. Then, I have specified “touse” as the third argument in get().

generate touse = mpg<20
python
from sfi import Data
dataraw   = Data.get('foreign mpg rep78',
                     range(46,56),
                     "touse")
dataraw
end

The Python output shows that dataraw contains only observations where mpg is less than 20.

. python
----------------------------------------------- python (type end to exit) ------
>>> from sfi import Data
>>> dataraw   = Data.get('foreign mpg rep78',
...                      range(46,56),
...                      "touse")
>>> dataraw
[[0, 18, 4], [0, 18, 1], [0, 19, 3], [0, 19, 3], [0, 19, 8.98846567431158e+307],
>  [1, 17, 5]]
>>> end
--------------------------------------------------------------------------------

Get value labels rather than numbers

The values of the Stata variable foreign are labeled with “Domestic” for 0 and “Foreign” for 1.

. list foreign in 50/54

     +----------+
     |  foreign |
     |----------|
 50. | Domestic |
 51. | Domestic |
 52. | Domestic |
 53. |  Foreign |
 54. |  Foreign |
     +----------+

Our Python list object dataraw stores only the underlying numeric values 0 and 1, but we may prefer to work with the labels. The fourth argument of get() allows us to pass the value labels of a Stata variable to Python rather than the numbers. I have specified valuelabel=True in the code block below to pass the value labels to Python.

python
from sfi import Data
dataraw   = Data.get('foreign mpg rep78',
                     range(46,56),
                     "touse",
                     valuelabel=True)
dataraw
end

The Python output below shows us that dataraw now contains the words “Domestic” and “Foreign”. Note that these are strings rather than labeled numeric values.

. python
----------------------------------------------- python (type end to exit) ------
>>> from sfi import Data
>>> dataraw   = Data.get('foreign mpg rep78',
...                      range(46,56),
...                      "touse",
...                      valuelabel=True,)
>>> dataraw
[['Domestic', 18, 4], ['Domestic', 18, 1], ['Domestic', 19, 3], ['Domestic', 19,
>  3], ['Domestic', 19, 8.98846567431158e+307], ['Foreign', 17, 5]]
>>> end
--------------------------------------------------------------------------------

Specify a number for missing values

The fifth argument of get() allows us to specify a value for missing data. Recall that Stata stores missing values as the largest possible value for a numeric storage type. The Stata variable rep78 is stored as a double-precision numeric variable that has a maximum value of 8.98846567431158e+307. Floating-point numeric variables have a maximum value of 1.70141173319e+38, long variables have a maximum value of 2,147,483,620, int variables have a maximum value of 32,740, and byte variables have a maximum value of 100. Thus, the precise value of a missing value depends on the storage type of the variable.

Python does not recognize these numbers as missing values. Python interprets 8.98846567431158e+307 as a number. Missing numeric values in Python are often represented with Numpy’s special floating-point value “nan”, which was first defined by the Institute of Electrical and Electronics Engineers in the IEEE 754-1985 Standards. We can tell Python that 8.98846567431158e+307 is “not a number” (nan) by specifying missingval=np.nan for the fifth argument of get().

python
from sfi import Data
import numpy as np
dataraw   = Data.get('foreign mpg rep78',
                     range(46,56),
                     "touse",
                     valuelabel=True,
                     missingval=np.nan)
dataraw
end

The Python output below shows that the number 8.98846567431158e+307 in dataraw has been replaced with Numpy’s special floating-point value “nan”.

. python
----------------------------------------------- python (type end to exit) ------
>>> from sfi import Data
>>> import numpy as np
>>> dataraw   = Data.get('foreign mpg rep78',
...                      range(46,56),
...                      "touse",
...                      valuelabel=True,
...                      missingval=np.nan)
>>> dataraw
[['Domestic', 18, 4], ['Domestic', 18, 1], ['Domestic', 19, 3], ['Domestic', 19,
>  3], ['Domestic', 19, nan], ['Foreign', 17, 5]]
>>> end
--------------------------------------------------------------------------------

Convert the list object to a pandas data frame

We have used get() to copy part of our Stata dataset into a Python list object named dataraw. Next, let’s convert our list object to a pandas data frame.

We begin by importing pandas using the alias pd. Then, we can create a data frame by typing dataframe = pd.DataFrame(dataraw).

python
from sfi import Data
import numpy as np
import pandas as pd
dataraw   = Data.get('foreign mpg rep78',
                     range(46,56),
                     "touse",
                     valuelabel=True,
                     missingval=np.nan)
dataframe = pd.DataFrame(dataraw)
dataframe
end

The Python output below displays the data frame dataframe. The columns labeled 0, 1, and 2 are the variables foreign, mpg, and rep78, respectively. The unlabeled column on the left is an index that pandas created to uniquely identify each row.

. python
----------------------------------------------- python (type end to exit) ------
>>> from sfi import Data
>>> import numpy as np
>>> import pandas as pd
>>> dataraw   = Data.get('foreign mpg rep78',
...                      range(46,56),
...                      "touse",
...                      valuelabel=True,
...                      missingval=np.nan)
>>> dataframe = pd.DataFrame(dataraw)
>>> dataframe
          0   1   2
0  Domestic  18   4
1  Domestic  18   1
2  Domestic  19   3
3  Domestic  19   3
4  Domestic  19 NaN
5   Foreign  17   5
>>> end
--------------------------------------------------------------------------------

Label the columns of a data frame

We can label the columns of our data frame using the columns option in the DataFrame() method. The list of column names must be enclosed in square brackets, and each column name must be enclosed in single quotes and separated by commas.

python
from sfi import Data
import numpy as np
import pandas as pd
dataraw   = Data.get('foreign mpg rep78',
                     range(46,56),
                     "touse",
                     valuelabel=True,
                     missingval=np.nan)
dataframe = pd.DataFrame(dataraw,
                         columns=['foreign', 'mpg', 'rep78'])
dataframe
end

The Python output below shows that the second, third, and fourth columns in the data frame are now named foreign, mpg, and rep78, respectively.

. python
----------------------------------------------- python (type end to exit) ------
>>> from sfi import Data
>>> import numpy as np
>>> import pandas as pd
>>> dataraw   = Data.get('foreign mpg rep78',
...                      range(46,56),
...                      "touse",
...                      valuelabel=True,
...                      missingval=np.nan)
>>> dataframe = pd.DataFrame(dataraw,
...                          columns=['foreign', 'mpg', 'rep78'])
>>> dataframe
    foreign  mpg  rep78
0  Domestic   18      4
1  Domestic   18      1
2  Domestic   19      3
3  Domestic   19      3
4  Domestic   19    NaN
5   Foreign   17      5
>>> end
--------------------------------------------------------------------------------

Begin the data frame index at 1

Python uses zero-based array indexing, which means that row and column counts begin with 0 rather than 1. So pandas automatically created a row index that begins with 0. You can skip to the next section if you are comfortable with the index beginning at zero or you do not plan to use the index. Or you can change the index to begin with 1 using the index option in the DataFrame() method.

We will specify the index using the arange() method in the Numpy module. The first argument is the first element of the row index, which is 1. The second argument is the last element of the row index. We could simply type 6 because there are 6 rows in our data frame. But this number could change the next time we run our code. We can use the len() method to calculate the length of the list object dataraw. And we must add 1 to the length of dataraw because Python begins counting at 0.

python
from sfi import Data
import numpy as np
import pandas as pd
dataraw   = Data.get('foreign mpg rep78',
                     range(46,56),
                     "touse",
                     valuelabel=True,
                     missingval=np.nan)
dataframe = pd.DataFrame(dataraw,
                         columns=['foreign', 'mpg', 'rep78'],
                         index=[np.arange(1, len(dataraw)+1)])
dataframe
end

The Python output below shows us that the index for dataframe now begins at 1 and ends with 6.

. python
----------------------------------------------- python (type end to exit) ------
>>> from sfi import Data
>>> import pandas as pd
>>> import numpy as np
>>> dataraw   = Data.get('foreign mpg rep78',
...                      range(46,56),
...                      "touse",
...                      valuelabel=True,
...                      missingval=np.nan)
>>> dataframe = pd.DataFrame(dataraw,
...                          columns=['foreign', 'mpg', 'rep78'],
...                          index=[np.arange(1, len(dataraw)+1)])
>>> dataframe
    foreign  mpg  rep78
1  Domestic   18      4
2  Domestic   18      1
3  Domestic   19      3
4  Domestic   19      3
5  Domestic   19    NaN
6   Foreign   17      5
>>> end
--------------------------------------------------------------------------------

Using getAsDict()

You can also use getAsDict() to copy Stata data to a Python dictionary. The arguments are the same as get(), and the resulting dictionary contains the names of the Stata variables. This means that we don’t have to name the columns when we convert the dictionary to a data frame. Creating a data-frame index that begins with 1 is different because the length of a dictionary is not the number of Stata observations. In the code block below, I defined obs as the length of a list of the values in the dictionary dataraw. I used the next() and iter() functions to loop over the values in the dictionary dataraw. And I again added 1 because Python begins counting at 0.

python
from sfi import Data
import pandas as pd
import numpy as np
dataraw = Data.getAsDict('foreign mpg rep78',
                          range(46,56),
                          "touse",
                          valuelabel=True,
                          missingval=np.nan)
dataraw
obs = len(next(iter(dataraw.values()))) + 1
dataframe = pd.DataFrame(dataraw,
                         index=[np.arange(1, obs)])
dataframe
end

The Python output below shows that the resulting data frame looks much like the data frame we created using get().

. python
----------------------------------------------- python (type end to exit) ------
>>> from sfi import Data
>>> import pandas as pd
>>> import numpy as np
>>> dataraw = Data.getAsDict('foreign mpg rep78',
...                      range(46,56),
...                      "touse",
...                      valuelabel=True,
...                      missingval=np.nan)

>>> dataraw
{'foreign': ['Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Foreig
> n'], 'mpg': [18, 18, 19, 19, 19, 17], 'rep78': [4, 1, 3, 3, nan, 5]}

>>> obs = len(next(iter(dataraw.values()))) + 1
>>> dataframe = pd.DataFrame(dataraw,
...                          index=[np.arange(1, obs)])
>>> dataframe
    foreign  mpg  rep78
1  Domestic   18    4.0
2  Domestic   18    1.0
3  Domestic   19    3.0
4  Domestic   19    3.0
5  Domestic   19    NaN
6   Foreign   17    5.0
>>> end
--------------------------------------------------------------------------------

Just the basics

Perhaps you don’t wish to restrict your sample, and you don’t mind zero-based indexing. You just want to copy a collection of variables to a pandas data frame in Python. The code block below will do that and convert Stata missing values to Python missing values.

python
from sfi import Data
import pandas as pd
import numpy as np
dataraw = Data.getAsDict('foreign mpg rep78',
                         None,
                         None,
                         valuelabel=False,
                         missingval=np.nan)
dataframe = pd.DataFrame(dataraw)
dataframe
end

The output below displays the data frame dataframe, which is ready for graphing or data analysis.

>>> dataframe
    foreign  mpg  rep78
0         0   22    3.0
1         0   17    3.0
2         0   22    NaN
3         0   20    3.0
4         0   15    4.0
..      ...  ...    ...
69        1   23    4.0
70        1   41    5.0
71        1   25    4.0
72        1   25    4.0
73        1   17    5.0

[74 rows x 3 columns]

Conclusion

We did it! We used the get() and getAsDict() methods in the Data class of the SFI module to copy part of a Stata dataset to a Python data frame. We even accounted for missing data. And it’s easy to use get() and getAsDict() in our do-files, ado-files, and Python scripts anytime we want to incorporate Python into our data management, analysis, or reporting. Next time, I’ll show you how to use the SFI to copy data from Python into a Stata dataset.

Further reading

Di Russo, J. 2019. Navigating The Hell of NaNs in Python, Towards Data Science (blog), October 23, 2019.

Fung, K. 2020. Array Indexing: 0-based or 1-based?, Analytics Vidhya, January 26, 2020.

Categories: Programming Tags: Python, stocks, yahoo finance

Stata for Mac with Apple Silicon Stata/Python integration part 7: Machine learning with support vector machines

Stata/Python integration part 8: Using the Stata Function Interface to copy data from Stata to Python

Subscribe to the Stata Blog

Recent articles

Archives

Categories

Links