Home > Programming > Stata/Python integration part 9: Using the Stata Function Interface to copy data from Python to Stata

Stata/Python integration part 9: Using the Stata Function Interface to copy data from Python to Stata

In my previous post, we learned how to use the Stata Function Interface (SFI) module to copy data from Stata to Python. In this post, I will show you how to use the SFI module to copy data from Python to Stata. We will be using the yfinance module to download financial data from the Yahoo! finance website. You can install this module in your Python environment by typing pip install yfinance. Our goal is to use Python to download historical data for the Dow Jones Industrial Average (DJIA) and use Stata to create the following graph.

graph1

If you are not familiar with Python, it may be helpful to read the first four posts in my Stata/Python Integration series before you read further.

  1. Setting up Stata to use Python
  2. Three ways to use Python in Stata
  3. How to install Python packages
  4. How to use Python packages

Using the yfinance module to download Yahoo! finance data

Let’s begin by importing the yfinance module using the alias pd. Then, we will use the download() method to download data for the DJIA from the Yahoo! finance website. The download() method accepts three arguments. The first argument is the stock symbol for the DJIA. The second argument is the start date for the stock data, and the third argument is the end date.

python:
import yfinance as yf
dowjones = yf.download("^DJI", start="2010-01-01", end="2019-12-31")
dowjones
end

The download() method returns a Pandas data frame with 2,515 rows and 6 columns indexed by date.

. python:
----------------------------------------------- python (type end to exit) ------
>>> import yfinance as yf
>>> dowjones = yf.download("^DJI", start="2010-01-01", end="2019-12-31")
[*********************100%***********************]  1 of 1 completed
>>> dowjones
                    Open          High  ...     Adj Close     Volume
Date                                    ...
2010-01-04  10430.690430  10604.969727  ...  10583.959961  179780000
2010-01-05  10584.559570  10584.559570  ...  10572.019531  188540000
2010-01-06  10564.719727  10594.990234  ...  10573.679688  186040000
2010-01-07  10571.110352  10612.370117  ...  10606.860352  217390000
2010-01-08  10606.400391  10619.400391  ...  10618.190430  172710000
...                  ...           ...  ...           ...        ...
2019-12-23  28491.779297  28582.490234  ...  28551.529297  223530000
2019-12-24  28572.570312  28576.800781  ...  28515.449219   86150000
2019-12-26  28539.460938  28624.099609  ...  28621.390625  155970000
2019-12-27  28675.339844  28701.660156  ...  28645.259766  182280000
2019-12-30  28654.759766  28664.689453  ...  28462.140625  181600000

[2515 rows x 6 columns]
>>> end
--------------------------------------------------------------------------------

We can view the details of the index by typing dowjones.index in Python. The output below shows us that the index column is named ‘Date’, which is stored as the data type datetime64[ns].

. python:
----------------------------------------------- python (type end to exit) ------
>>> dowjones.index
DatetimeIndex(['2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07',
               '2010-01-08', '2010-01-11', '2010-01-12', '2010-01-13',
               '2010-01-14', '2010-01-15',
               ...
               '2019-12-16', '2019-12-17', '2019-12-18', '2019-12-19',
               '2019-12-20', '2019-12-23', '2019-12-24', '2019-12-26',
               '2019-12-27', '2019-12-30'],
              dtype='datetime64[ns]', name='Date', length=2515, freq=None)
>>> end
--------------------------------------------------------------------------------

We must convert the index to a string variable before we copy it to a Stata variable. We can use the astype() method to create a new column named dowdate that contains the index stored as a string.

. python:
----------------------------------------------- python (type end to exit) ------
>>> dowjones['dowdate'] = dowjones.index.astype(str)
>>> dowjones['dowdate']
Date
2010-01-04    2010-01-04
2010-01-05    2010-01-05
2010-01-06    2010-01-06
2010-01-07    2010-01-07
2010-01-08    2010-01-08
                 ...
2019-12-23    2019-12-23
2019-12-24    2019-12-24
2019-12-26    2019-12-26
2019-12-27    2019-12-27
2019-12-30    2019-12-30
Name: dowdate, Length: 2515, dtype: object
>>> end
--------------------------------------------------------------------------------

Use the SFI module to copy the Python data to Stata

Now, we’re ready to use the SFI module to copy the dowjones data frame to Stata. We begin by importing the Data class from the SFI module. Then, we can use the setObsTotal() method to add observations to the current Stata dataset. We must add enough observations to the Stata dataset to store all the rows of the dowjones data frame. We can count the number of rows using the
len() method.

. python:
----------------------------------------------- python (type end to exit) ------
... from sfi import Data
>>> Data.setObsTotal(len(dowjones))
>>> end
--------------------------------------------------------------------------------

Next, we will add three variables to the Stata dataset. The addVarStr() method adds a 10-character string variable named “dowdate”. The addVarDouble() method adds a double-precision variable named “dowclose”. And the addVarInt() method adds an int variable named “dowvolume”.

. python:
----------------------------------------------- python (type end to exit) ------
>>> Data.addVarStr("dowdate",10)
>>> Data.addVarDouble("dowclose")
>>> Data.addVarInt("dowvolume")
>>> end
--------------------------------------------------------------------------------

Now, we can use the store() method to copy columns from the Pandas data frame to variables in the Stata dataset. store() accepts four arguments. The first argument is the name of the Stata variable to which the Pandas column is to be copied. The second argument can be used to specify a subset of the data frame to be copied. For example, we could use the range() function to specify a range of rows to copy. I wish to copy all the observations so I have specified “None”. The third argument is the Python data that is to be copied to Stata. The fourth argument is the name of an indicator variable that specifies which rows in the Python data frame are to be copied to the Stata variable. This argument is optional. When it is not specified,
all the rows in the Python data frame are copied.

The first statement in the output below copies column dowdate from the Python data frame dowjones to the Stata variable dowdate. The second statement copies column Adj Close from the Python data frame dowjones to the Stata variable dowclose. The third statement in the output below copies column Volume from the Python data frame dowjones to the Stata variable dowvolume. Note that I have specified “None” for the second and fourth arguments because I wish to copy all the rows from the Pandas data frame.

. python:
----------------------------------------------- python (type end to exit) ------
>>> Data.store("dowdate", None, dowjones['dowdate'], None)
>>> Data.store("dowclose", None, dowjones['Adj Close'], None)
>>> Data.store("dowvolume", None, dowjones['Volume'], None)
>>> end
--------------------------------------------------------------------------------

Clean the Stata dataset

Let’s list the first five observations in our Stata dataset to check our work.

. list in 1/5, abbreviate(9)

     +-----------------------------------+
     |    dowdate   dowclose   dowvolume |
     |-----------------------------------|
  1. | 2010-01-04   10583.96     1.8e+08 |
  2. | 2010-01-05   10572.02     1.9e+08 |
  3. | 2010-01-06   10573.68     1.9e+08 |
  4. | 2010-01-07   10606.86     2.2e+08 |
  5. | 2010-01-08   10618.19     1.7e+08 |
     +-----------------------------------+

Our data look pretty good, but we have a few data management tasks to complete before we can graph the data. First, recall that dowdate is stored as a string. Let’s use Stata’s date() function to generate a new variable named date. The first argument is the string variable that is to be converted to a date. The second argument is the order of the year, month, and day in the first argument. I have specified “YMD” because the data in dowdate are stored with the year first, the month second, and the day third.

. generate date = date(dowdate,"YMD")

. list in 1/5, abbreviate(9)

     +-------------------------------------------+
     |    dowdate   dowclose   dowvolume    date |
     |-------------------------------------------|
  1. | 2010-01-04   10583.96   1.798e+08   18266 |
  2. | 2010-01-05   10572.02   1.885e+08   18267 |
  3. | 2010-01-06   10573.68   1.860e+08   18268 |
  4. | 2010-01-07   10606.86   2.174e+08   18269 |
  5. | 2010-01-08   10618.19   1.727e+08   18270 |
     +-------------------------------------------+

The data in date do not look like dates to you and me. The date() function returns the number of days since January 1, 1960. We can use format to display the data in a familiar date format.

. format %tdCCYY-NN-DD date

. list in 1/5, abbreviate(9)

     +------------------------------------------------+
     |    dowdate   dowclose   dowvolume         date |
     |------------------------------------------------|
  1. | 2010-01-04   10583.96   1.798e+08   2010-01-04 |
  2. | 2010-01-05   10572.02   1.885e+08   2010-01-05 |
  3. | 2010-01-06   10573.68   1.860e+08   2010-01-06 |
  4. | 2010-01-07   10606.86   2.174e+08   2010-01-07 |
  5. | 2010-01-08   10618.19   1.727e+08   2010-01-08 |
     +------------------------------------------------+

Next, dowvolume is displayed in scientific notation. We can again use format to display the numbers with commas in the thousands place.

. format %16.0fc dowvolume

. list in 1/5, abbreviate(9)

     +--------------------------------------------------+
     |    dowdate   dowclose     dowvolume         date |
     |--------------------------------------------------|
  1. | 2010-01-04   10583.96   179,780,000   2010-01-04 |
  2. | 2010-01-05   10572.02   188,540,000   2010-01-05 |
  3. | 2010-01-06   10573.68   186,040,000   2010-01-06 |
  4. | 2010-01-07   10606.86   217,390,000   2010-01-07 |
  5. | 2010-01-08   10618.19   172,710,000   2010-01-08 |
     +--------------------------------------------------+

It looks like dowvolume is rounded to the nearest 10,000. So let’s divide dowvolume by 1,000,000 and use format to display the data with two decimal places. Let’s also label dowvolume so that we don’t forget that the data are stored as “millions of shares”.

. replace dowvolume = dowvolume/1000000
(2,515 real changes made)

. format %10.2fc dowvolume

. label variable dowvolume "DJIA Volume (Millions of Shares)"

. list in 1/5, abbreviate(9)

     +------------------------------------------------+
     |    dowdate   dowclose   dowvolume         date |
     |------------------------------------------------|
  1. | 2010-01-04   10583.96      179.78   2010-01-04 |
  2. | 2010-01-05   10572.02      188.54   2010-01-05 |
  3. | 2010-01-06   10573.68      186.04   2010-01-06 |
  4. | 2010-01-07   10606.86      217.39   2010-01-07 |
  5. | 2010-01-08   10618.19      172.71   2010-01-08 |
     +------------------------------------------------+

Our data are formatted and ready to graph.

Graph the data

I’d like to create a graph that includes a line graph for dowclose and a bar chart for dowvolume. This is easy with graph twoway. We can use graph twoway line to create a line plot for dowclose and graph twoway bar for dowvolume. You may be familiar with many of the graphics options in the code block below. I have used title() to add a title and used xtitle(“”) and ytitle(“”) to remove the titles from the x and y axes, respectively. I have used xlabel and ylabel() to label the axes. And I have used legend() to format the legend.

The options yaxis(2) and axis(2) may be unfamiliar to you. The yaxis(2) option creates a second y axis on the right side of the graph for the bar chart. The axis(2) option in the ytitle() and ylabel() specify that those options apply to the second y axis. Note that I have plotted dowclose with a green line and labeled the left y axis with a green font to emphasize that the left axis describes the line graph. Similarly, I have I have plotted dowvolume with blue bars and labeled the right y axis with a blue font to emphasize that the right axis describes the bar graph.

twoway (line dowclose date, lcolor(green) lwidth(medium))         ///
       (bar dowvolume date, fcolor(blue) lcolor(blue) yaxis(2)),  ///
       title("Dow Jones Industrial Average (2010 - 2019)")        ///
       xtitle("") ytitle("") ytitle("", axis(2))                  ///
       xlabel(, labsize(small) angle(horizontal))                 ///
       ylabel(5000(5000)30000,                                    ///
              labsize(small) labcolor(green)                      ///
              angle(horizontal) format(%9.0fc))                   ///
       ylabel(0(500)3000,                                         ///
              labsize(small) labcolor(blue)                       ///
              angle(horizontal) axis(2))                          ///
       legend(order(1 "Closing Price" 2 "Volume (millions)")      ///
              cols(1) position(10) ring(0))

The code block above creates the following graph.
graph1

Conclusion

We did it! We used the Yahoo! finance module in Python to download data for the Dow Jones Industrial Average and import it to a Pandas data frame. We used the Stata Function Interface (SFI) module to create new variables in Stata and copy specific columns from the Pandas data frame to Stata variables. And we used graph twoway to create a graph with Stata. I have collected the examples below.

Next time, I will show you how to use the Stata Function Interface (SFI) module to copy scalars, macros, and matrices back and forth between Stata and Python.

example.do

python:
import yfinance as yf
from sfi import Data, Macro

# Use the yfinance module to download the DJIA data
dowjones = yf.download("^DJI", start="2010-01-01", end="2019-12-31")

# Display the column names
dowjones.columns      
# Display the name of the index
dowjones.index.name    
# Display the index data                          
dowjones.index   
# Copy the index to a string column                                
dowjones['dowdate'] = dowjones.index.astype(str) 
dowjones[['dowdate','Adj Close', 'Volume']]

# Set the number of observations in the Stata dataset
Data.setObsTotal(len(dowjones))  # Add observations to the current dataset

# Create the variables in the Stata dataset
Data.addVarStr("dowdate",10)
Data.addVarDouble("dowclose")
Data.addVarInt("dowvolume")

# Write the data from the Pandas data frame to the Stata variables
Data.store("dowdate", None, dowjones['dowdate'], None)
# This works too
#Data.store("dowdate", None, dowjones.index.astype(str), None) 
Data.store("dowclose", None, dowjones['Adj Close'], None)
Data.store("dowvolume", None, dowjones['Volume'], None)
end

// Use the date() function to create the variable date
generate date = date(dowdate,"YMD")
format %tdCCYY-NN-DD date
label var date "Date"

// Format the variable dowclose
format %10.2f dowclose
label var dowclose "DJIA Closing Price"

// Format the variable dowvolume
replace dowvolume = dowvolume/1000000
format %10.2fc dowvolume
label var dowvolume "DJIA Volume (Millions of Shares)"

list in 1/5, abbrev(9)

// Create the graph
twoway (line dowclose date, lcolor(green) lwidth(medium))         ///
       (bar dowvolume date, fcolor(blue) lcolor(blue) yaxis(2)),  ///
       title("Dow Jones Industrial Average (2010 - 2019)")        ///
       xtitle("") ytitle("") ytitle("", axis(2))                  ///
       xlabel(, labsize(small) angle(horizontal))                 ///
       ylabel(5000(5000)30000,                                    ///
              labsize(small) labcolor(green)                      ///
              angle(horizontal) format(%9.0fc))                   ///
       ylabel(0(500)3000,                                         ///
              labsize(small) labcolor(blue)                       ///
              angle(horizontal) axis(2))                          ///
       legend(order(1 "Closing Price" 2 "Volume (millions)")      ///
              cols(1) position(10) ring(0))
Categories: Programming Tags: , ,