Stata/Python integration part 6: Working with APIs and JSON data
Data are everywhere. Many government agencies, financial institutions, universities, and social media platforms provide access to their data through an application programming interface (API). APIs often return the requested data in a JavaScript Object Notation (JSON) file. In this post, I will show you how to use Python to request data with API calls and how to work with the resulting JSON data.
If you are not familiar with Python, it may be helpful to read the first four posts in my Stata/Python Integration series before you read further.
- Setting up Stata to use Python
- Three ways to use Python in Stata
- How to install Python packages
- How to use Python packages
APIs and JSON data
An API is a software application that can be used to request data from another computing system. There are many different kinds of APIs, and the syntax is often unique to each API. But a typical API consists of a URL followed by query options. For example, the URL below uses the openFDA API to request data about adverse drug events from the United States Food and Drug Administration.
https://api.fda.gov/drug/event.json?
We can add options to our API call to narrow our data request. For example, the URL below requests the number of adverse events in the United States that involved Fentanyl from January 1, 2018, through January 5, 2018.
https://api.fda.gov/drug/event.json?search=receivedate [20180101+TO+20180105]+AND+occurcountry:"US" +AND+patient.drug.openfda.brand_name:"Fentanyl"&count=receivedate
We can type the URL for this API call in the address bar of a web browser, and the browser will display the resulting data as a JSON file.
JSON is a popular data file format that consists of a collection of key:value pairs. A “key” is similar to a variable in a Stata dataset, and a “value” is the data. The image above includes the key:value pair time:“20180105” near the bottom. The key is time and the value is “20180105”.
JSON data are often nested. For example, time:“20180105” is nested within the 4: key. In the image above, the time: key is nested within the 4: key, and the 4: key is nested within the results: key. There are two keys at the top of the JSON nesting structure: meta: and results:.
Our goal is to use the openFDA API to request data about adverse drug events and convert the nested JSON data to a Stata dataset. We will be using the requests and pandas packages, so you should check that they are installed before we begin.
Define the URL for an API call
Let’s begin by defining a string named URL within a Python code block. The string URL contains the URL to request adverse event data using the openFDA API.
python: URL = 'https://api.fda.gov/drug/event.json' URL end
The statement URL in the third line of the code block above displays the contents of the string URL.
. python: ---------------------------------------- python (type end to exit) ------------- >>> URL = 'https://api.fda.gov/drug/event.json' >>> URL 'https://api.fda.gov/drug/event.json' >>> end --------------------------------------------------------------------------------
We can customize our data request by adding search fields and values to our API call. You can see the syntax, a list of searchable fields, and examples on the openFDA website.
Let’s restrict our search to adverse events reported from January 1, 2018, through January 5, 2018, by adding the ?search option to our API call in the code block below.
python: URL = 'https://api.fda.gov/drug/event.json > ?search=receivedate:[20180101+TO+20180105]' URL end
The syntax works, but we have added only one search field, and the URL for the API call is already difficult to read. Let’s split the URL into two strings: API and date. The string API contains the URL for the basic API call, and the string date narrows our search from January 1, 2018, through January 5, 2018. Then, we can combine the strings by typing URL = API + date.
python: API = 'https://api.fda.gov/drug/event.json?search=' date = 'receivedate:[20180101+TO+20180105]' URL = API + date URL end
Our Python code block is easier to read, and the URL for the API call is still the same.
. python: -------------------------------------------- python (type end to exit) -------- >>> API = 'https://api.fda.gov/drug/event.json?search=' >>> date = 'receivedate:[20180101+TO+20180105]' >>> URL = API + date >>> URL 'https://api.fda.gov/drug/event.json?search=receivedate:[20180101+TO+20180105]' >>> end -------------------------------------------------------------------------------
Let’s further restrict our query to adverse events that occurred in the United States from January 1, 2018, through January 5, 2018. In the code block below, the string country contains syntax that restricts our query to the United States. Then, we can combine the strings API, date, and country to specify the complete API call stored in the string URL. Note that we must include “+AND+” between date and country when we define URL.
python: API = 'https://api.fda.gov/drug/event.json?search=' date = 'receivedate:[20180101+TO+20180105]' country = 'occurcountry:"US"' URL = API + date + "+AND+" + country URL end
Our code remains easy to read even as the URL for our API call becomes more complex.
. python: -------------------------------------------- python (type end to exit) -------- >>> API = 'https://api.fda.gov/drug/event.json?search=' >>> date = 'receivedate:[20180101+TO+20180105]' >>> country = 'occurcountry:"US"' >>> URL = API + date + "+AND+" + country >>> URL 'https://api.fda.gov/drug/event.json?search=receivedate:[20180101+TO+20180105] > +AND+occurcountry:"US"' >>> end -------------------------------------------------------------------------------
We can use a similar strategy to further restrict our query to adverse events that involved the drug Fentanyl. The string drug in the code block below includes the syntax that specifies that the adverse event involved Fentanyl.
python: API = 'https://api.fda.gov/drug/event.json?search=' date = 'receivedate:[20180101+TO+20180105]' country = 'occurcountry:"US"' drug = 'patient.drug.openfda.brand_name:"Fentanyl"' URL = API + date + "+AND+" + country + "+AND+" + drug URL end
Finally, let’s specify that our results contain data for the number of adverse events that occurred on each day. The string data in the code block below contains the specific syntax and must be added to the end of the URL for the API call. Note that the string data must be preceded by & rather than +AND+.
python: API = 'https://api.fda.gov/drug/event.json?search=' date = 'receivedate:[20180101+TO+20180105]' country = 'occurcountry:"US"' drug = 'patient.drug.openfda.brand_name:"Fentanyl"' data = 'count=receivedate' URL = API + date + "+AND+" + country + "+AND+" + drug + "&" + data URL end
Our code block is still easy to read even though the URL for our API call has become quite complex.
. python: -------------------------------------------- python (type end to exit) -------- >>> API = 'https://api.fda.gov/drug/event.json?search=' >>> date = 'receivedate:[20180101+TO+20180105]' >>> country = 'occurcountry:"US"' >>> drug = 'patient.drug.openfda.brand_name:"Fentanyl"' >>> data = 'count=receivedate' >>> URL = API + date + "+AND+" + country + "+AND+" + drug + "&" + data >>> URL 'https://api.fda.gov/drug/event.json?search=receivedate:[20180101+TO+20180105] > +AND+occurcountry:"US"+AND+patient.drug.openfda.brand_name:"Fentanyl" > &count=receivedate' >>> end -------------------------------------------------------------------------------
Requesting data using an API call
Now, we are ready to submit our API call to the openFDA data server. Let’s begin by importing the requests package. We can use the get() method to submit the URL for our API call. Then, we will store the resulting JSON data in a dictionary object named data.
python: import requests API = 'https://api.fda.gov/drug/event.json?search=' date = 'receivedate:[20180101+TO+20180105]' country = 'occurcountry:"US"' drug = 'patient.drug.openfda.brand_name:"Fentanyl"' data = 'count=receivedate' URL = API + date + "+AND+" + country + "+AND+" + drug + "&" + data data = requests.get(URL).json() data end
We can view the contents of the data object by typing data as in the code block above. The data displayed in the output below are difficult to read because they have not been formatted for display.
. python: -------------------------------------------- python (type end to exit) -------- >>> import requests >>> API = 'https://api.fda.gov/drug/event.json?search=' >>> date = 'receivedate:[20180101+TO+20180105]' >>> country = 'occurcountry:"US"' >>> drug = 'patient.drug.openfda.brand_name:"Fentanyl"' >>> data = 'count=receivedate' >>> URL = API + date + "+AND+" + country + "+AND+" + drug + "&" + data >>> data = requests.get(URL).json() >>> data {'meta': {'disclaimer': 'Do not rely on openFDA to make decisions regarding > medical care. While we make every effort to ensure that data is accurate, you > should assume all results are unvalidated. We may limit or otherwise restrict > your access to the API in line with our Terms of Service.', 'terms': > 'https://open.fda.gov/terms/', 'license': 'https://open.fda.gov/license/', > 'last_updated': '2020-09-09'}, 'results': [{'time': '20180101', 'count': 1}, > {'time': '20180102', 'count': 16}, {'time': '20180103', 'count': 20}, > {'time': '20180104', 'count': 25}, {'time': '20180105', 'count': 24}]} >>> end -------------------------------------------------------------------------------
We can use the json module to display the data in a more readable format. Let’s begin by importing the json module in the code block below. Then, we can use the dumps() method to encode the JSON data. The indent=4 option displays the data with indentions for each level of nesting. The sort_keys=True option sorts the data. And print() tells Python to display the results of the dumps() method.
python: import requests import json API = 'https://api.fda.gov/drug/event.json?search=' date = 'receivedate:[20180101+TO+20180105]' country = 'occurcountry:"US"' drug = 'patient.drug.openfda.brand_name:"Fentanyl"' data = 'count=receivedate' URL = API + date + "+AND+" + country + "+AND+" + drug + "&" + data data = requests.get(URL).json() print(json.dumps(data, indent=4, sort_keys=True)) end
The data in the output below are easier to read. We can now see that the data are nested in the keys meta and results. The meta key contains a disclaimer, the date the data were last updated, a URL for the license, and the terms of use. This is useful information, but I do not want to include it in my dataset. I want to use only the data stored in the results key.
. python: -------------------------------------------- python (type end to exit) -------- >>> import requests >>> import json >>> API = 'https://api.fda.gov/drug/event.json?search=' >>> date = 'receivedate:[20180101+TO+20180105]' >>> country = 'occurcountry:"US"' >>> drug = 'patient.drug.openfda.brand_name:"Fentanyl"' >>> data = 'count=receivedate' >>> URL = API + date + "+AND+" + country + "+AND+" + drug + "&" + data >>> data = requests.get(URL).json() >>> print(json.dumps(data, indent=4, sort_keys=True)) { "meta": { "disclaimer": "Do not rely on openFDA to make decisions regarding > medical care. While we make every effort to ensure that data is accurate, > you should assume all results are unvalidated. We may limit or otherwise > restrict your access to the API in line with our Terms of Service.", "last_updated": "2020-09-09", "license": "https://open.fda.gov/license/", "terms": "https://open.fda.gov/terms/" }, "results": [ { "count": 1, "time": "20180101" }, { "count": 16, "time": "20180102" }, { "count": 20, "time": "20180103" }, { "count": 25, "time": "20180104" }, { "count": 24, "time": "20180105" } ] } >>> end -------------------------------------------------------------------------------
Convert the JSON data to a Stata dataset
We can use the get() method to extract the results portion of the data object and place it in a list object named fdadata.
python: import requests import json API = 'https://api.fda.gov/drug/event.json?search=' date = 'receivedate:[20180101+TO+20180105]' country = 'occurcountry:"US"' drug = 'patient.drug.openfda.brand_name:"Fentanyl"' data = 'count=receivedate' URL = API + date + "+AND+" + country + "+AND+" + drug + "&" + data data = requests.get(URL).json() fdadata = data.get('results', []) print(json.dumps(fdadata, indent=4, sort_keys=True)) end
We can verify that we extracted the data successfully by viewing the output below.
. python: -------------------------------------------- python (type end to exit) -------- >>> import requests >>> import json >>> API = 'https://api.fda.gov/drug/event.json?search=' >>> date = 'receivedate:[20180101+TO+20180105]' >>> country = 'occurcountry:"US"' >>> drug = 'patient.drug.openfda.brand_name:"Fentanyl"' >>> data = 'count=receivedate' >>> URL = API + date + "+AND+" + country + "+AND+" + drug + "&" + data >>> data = requests.get(URL).json() >>> fdadata = data.get('results', []) >>> print(json.dumps(fdadata, indent=4, sort_keys=True)) [ { "count": 1, "time": "20180101" }, { "count": 16, "time": "20180102" }, { "count": 20, "time": "20180103" }, { "count": 25, "time": "20180104" }, { "count": 24, "time": "20180105" } ] >>> end -------------------------------------------------------------------------------
The data in the fdadata list object are still in “key:value“, format and I would like to convert them to the “rows-and-columns” format of a pandas data frame. Let’s begin by importing the pandas module using the alias pd. Then, we can use the read_json() method to read the fdadata list object into a pandas data frame named fda_df.
python: import requests import json import pandas as pd API = 'https://api.fda.gov/drug/event.json?search=' date = 'receivedate:[20180101+TO+20180105]' country = 'occurcountry:"US"' drug = 'patient.drug.openfda.brand_name:"Fentanyl"' data = 'count=receivedate' URL = API + date + "+AND+" + country + "+AND+" + drug + "&" + data data = requests.get(URL).json() fdadata = data.get('results', []) fda_df = pd.read_json(json.dumps(fdadata)) fda_df end
The data frame fda_df displayed in the output below contains five rows and three columns. The first column is the index for the data frame. The second column, named “time”, contains the date of each observation. And the third column, named “count”, contains the number of adverse events in the United States that involved Fentanyl on that date.
. python: -------------------------------------------- python (type end to exit) -------- >>> import requests >>> import json >>> import pandas as pd >>> API = 'https://api.fda.gov/drug/event.json?search=' >>> date = 'receivedate:[20180101+TO+20180105]' >>> country = 'occurcountry:"US"' >>> drug = 'patient.drug.openfda.brand_name:"Fentanyl"' >>> data = 'count=receivedate' >>> URL = API + date + "+AND+" + country + "+AND+" + drug + "&" + data >>> data = requests.get(URL).json() >>> fdadata = data.get('results', []) >>> fda_df = pd.read_json(json.dumps(fdadata)) >>> fda_df time count 0 20180101 1 1 20180102 16 2 20180103 20 3 20180104 25 4 20180105 24 >>> end -------------------------------------------------------------------------------
Now, we can use the to_stata() method to save the pandas data frame fda_df to a Stata dataset named fentanyl.dta. The version=118 option specifies that the data will be stored in a Stata 16 data file.
python: import requests import json import pandas as pd API = 'https://api.fda.gov/drug/event.json?search=' date = 'receivedate:[20180101+TO+20180105]' country = 'occurcountry:"US"' drug = 'patient.drug.openfda.brand_name:"Fentanyl"' data = 'count=receivedate' URL = API + date + "+AND+" + country + "+AND+" + drug + "&" + data data = requests.get(URL).json() fdadata = data.get('results', []) fda_df = pd.read_json(json.dumps(fdadata)) fda_df.to_stata('fentanyl.dta', version=118) end
We can list the contents of the Stata data file fentanyl.dta to verify that the data were saved correctly.
. use fentanyl.dta, clear . list +--------------------------+ | index time count | |--------------------------| 1. | 0 20180101 1 | 2. | 1 20180102 16 | 3. | 2 20180103 20 | 4. | 3 20180104 25 | 5. | 4 20180105 24 | +--------------------------+
At this point, it would be easy to expand the range of the dates in our API call from January 1, 2010, through January 1, 2020, and graph the resulting data (see the code block below).
Conclusion
We did it! We successfully submitted an API call to openFDA, processed the resulting JSON data, and converted the JSON data to a Stata dataset. You may not be interested in adverse drug events reported to the FDA. But you can use similar steps to download and process all kinds of data that are useful to you. Just type “popular api data” in your search engine, and prepare to be amazed. Each API will have its own unique search fields and syntax, so you will need to read the documentation. But your patience and persistence will be rewarded with a world full of data.
I have collected the code below and added comments to remind you of the purpose of each collection of Python statements.
example.do
python: # Import packages import requests import json import pandas as pd # Construct the URL for the API call API = 'https://api.fda.gov/drug/event.json?search=' date = 'receivedate:[20100101+TO+20200101]' country = 'occurcountry:"US"' drug = 'patient.drug.openfda.brand_name:"Fentanyl"' data = 'count=receivedate' URL = API + date + "+AND+" + country + "+AND+" + drug + "&" + data # Submit the API data request data = requests.get(URL).json() # Extract the 'results' part of the JSON data fdadata = data.get('results', []) # Convert the JSON data to a pandas data frame fda_df = pd.read_json(json.dumps(fdadata)) # Use pandas to write the data frame to a Stata 16 dataset fda_df.to_stata('fentanyl.dta', version=118) end use fentanyl.dta, clear drop index generate date = mofd(date(string(time, "%8.0f"),"YMD")) format date %tm collapse (sum) count, by(date) tsset date, monthly twoway (line count date, lcolor(blue) lwidth(medthick)), /// ytitle("Adverse Events Reported to the FDA") /// ylabel(0(2000)8000, angle(horizontal) grid) /// xtitle("") /// title("Fentanyl Adverse Events Reported to the FDA") /// caption(Data Source: openFDA, size(small)) /// scheme(s1color)