Update to Import COVID-19 post
In my last post, I mentioned that I did not want to distribute my covid19.ado file because “it could be rendered useless if or when Johns Hopkins changes its data”. I wrote that on March 19, 2020, and the data changed on March 23, 2020. This will likely happen again (and again, and again …). I may post updates in the future as the data change, but you may need to adapt sooner than I can post. So let’s see how we can update our code to adapt to the changing data.
Let’s begin by running the code from my last blog post.
* Step 1: download each daily report and save it as a Stata dataset
local URL = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"
forvalues month = 1/12 {
    forvalues day = 1/31 {
        * Build the file name in MM-DD-YYYY form
        local month = string(`month', "%02.0f")
        local day = string(`day', "%02.0f")
        local year = "2020"
        local today = "`month'-`day'-`year'"
        local FileName = "`URL'`today'.csv"
        clear
        * capture skips dates that have no file (or do not exist)
        capture import delimited "`FileName'"
        * Some files import the first variable with a stray leading character
        capture confirm variable ïprovincestate
        if _rc == 0 {
            rename ïprovincestate provincestate
            label variable provincestate "Province/State"
        }
        capture save "`today'", replace
    }
}

* Step 2: append the daily datasets into one
clear
forvalues month = 1/12 {
    forvalues day = 1/31 {
        local month = string(`month', "%02.0f")
        local day = string(`day', "%02.0f")
        local year = "2020"
        local today = "`month'-`day'-`year'"
        capture append using "`today'"
    }
}
Something looks wrong when we describe our data.
. describe

Contains data
  obs:        11,341
 vars:            17
------------------------------------------------------------------------
                storage   display    value
variable name     type    format     label      variable label
------------------------------------------------------------------------
provincestate     str43   %43s                  Province/State
countryregion     str32   %32s                  Country/Region
lastupdate        str19   %19s                  Last Update
confirmed         long    %8.0g                 Confirmed
deaths            int     %8.0g                 Deaths
recovered         long    %8.0g                 Recovered
latitude          float   %9.0g                 Latitude
longitude         float   %9.0g                 Longitude
fips              long    %12.0g                FIPS
admin2            str21   %21s                  Admin2
province_state    str28   %28s                  Province_State
country_region    str32   %32s                  Country_Region
last_update       str19   %19s                  Last_Update
lat               float   %9.0g                 Lat
long_             float   %9.0g                 Long_
active            long    %12.0g                Active
combined_key      str44   %44s                  Combined_Key
------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.
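We have variables with similar names, such as provincestate and province_state, countryregion and country_region, and so forth. The variable names have changed in the newer raw files. Because append matches variables by name, the old and new names end up as separate variables instead of stacking into one. We must have the same variable names in every file before we append the data.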
I looked through the most recent raw data files and identified the date on which the data changed. You can do this without opening the files: describe using reports the contents of a dataset saved on your local disk or cloud account without loading it into memory.
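As a rough sketch of that kind of check, the loop below uses describe using with the varlist option to print the variable names stored in each file without loading the data. The two tiny files are hypothetical stand-ins built on the fly, not the Johns Hopkins data. With the real files, you would loop over names such as 03-22-2020 and 03-23-2020 instead.

```stata
* Hypothetical stand-in for a file that uses the old variable names
clear
set obs 1
generate str10 provincestate = "Hubei"
tempfile old
save `old'

* Hypothetical stand-in for a file that uses the new variable names
clear
set obs 1
generate str10 province_state = "Hubei"
generate long fips = .
tempfile new
save `new'

* Compare the variable lists without loading either file
clear
foreach f in old new {
    quietly describe using "``f''", varlist
    display "`f': `r(k)' variables: `r(varlist)'"
}
```

The first file whose list shows the new names marks the date on which the layout changed.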
The raw data from March 22, 2020, use the old variable names.
. describe using 03-22-2020.dta

Contains data
  obs:           309                          24 Mar 2020 11:48
 vars:             8
------------------------------------------------------------------------
                storage   display    value
variable name     type    format     label      variable label
------------------------------------------------------------------------
provincestate     str28   %28s                  Province/State
countryregion     str32   %32s                  Country/Region
lastupdate        str19   %19s                  Last Update
confirmed         long    %12.0g                Confirmed
deaths            int     %8.0g                 Deaths
recovered         long    %12.0g                Recovered
latitude          float   %9.0g                 Latitude
longitude         float   %9.0g                 Longitude
------------------------------------------------------------------------
Sorted by:
The raw data from March 23, 2020, use the new variable names.
. describe using 03-23-2020.dta

Contains data
  obs:         3,415                          24 Mar 2020 11:48
 vars:            12
------------------------------------------------------------------------
                storage   display    value
variable name     type    format     label      variable label
------------------------------------------------------------------------
fips              long    %12.0g                FIPS
admin2            str21   %21s                  Admin2
province_state    str28   %28s                  Province_State
country_region    str32   %32s                  Country_Region
last_update       str19   %19s                  Last_Update
lat               float   %9.0g                 Lat
long_             float   %9.0g                 Long_
confirmed         long    %12.0g                Confirmed
deaths            int     %8.0g                 Deaths
recovered         long    %12.0g                Recovered
active            long    %12.0g                Active
combined_key      str44   %44s                  Combined_Key
------------------------------------------------------------------------
Sorted by:
We could write some clever code to distinguish between files created before and after March 23. But a simple alternative is to use capture rename to change the variable names where necessary in the raw data files.
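Here is a minimal sketch of why capture makes this safe. The one-row dataset is a hypothetical stand-in that contains only the new-style variable lat. The first rename succeeds, the second fails because the variable is absent, and capture stores the return code in _rc instead of stopping the do-file.

```stata
* Hypothetical one-row dataset with a single new-style variable
clear
set obs 1
generate float lat = 30.97

* lat exists, so the rename succeeds and _rc is 0
capture rename lat latitude
display "return code after first rename: " _rc

* last_update does not exist here, so rename fails,
* but capture suppresses the error and execution continues
capture rename last_update lastupdate
display "return code after second rename: " _rc

describe, simple
```

The same lines can therefore run unchanged on old and new files: each rename takes effect only where the new-style name is present.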
Let’s try this on the raw data file for March 23 before we incorporate it into the rest of our code.
. use 03-23-2020.dta

. capture rename province_state provincestate

. capture rename country_region countryregion

. capture rename last_update lastupdate

. capture rename lat latitude

. capture rename long longitude

. describe

Contains data from 03-23-2020.dta
  obs:         3,415
 vars:            12                          24 Mar 2020 11:48
------------------------------------------------------------------------
                storage   display    value
variable name     type    format     label      variable label
------------------------------------------------------------------------
fips              long    %12.0g                FIPS
admin2            str21   %21s                  Admin2
provincestate     str28   %28s                  Province_State
countryregion     str32   %32s                  Country_Region
lastupdate        str19   %19s                  Last_Update
latitude          float   %9.0g                 Lat
longitude         float   %9.0g                 Long_
confirmed         long    %12.0g                Confirmed
deaths            int     %8.0g                 Deaths
recovered         long    %12.0g                Recovered
active            long    %12.0g                Active
combined_key      str44   %44s                  Combined_Key
------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.
The variable names in the new data now match the variable names in the old data. Some variables in the newer data did not appear in the old data. Those new variables will be appended to the final dataset but will not contain any data for dates prior to March 23.
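The following sketch illustrates that behavior of append with two hypothetical one-row datasets. The country and counts are made-up values, not figures from the Johns Hopkins files. The older-style file has no active variable, so its row receives a missing value after appending.

```stata
* Hypothetical stand-in for a file from before March 23 (no active variable)
clear
set obs 1
generate str10 countryregion = "Italy"
generate long confirmed = 100
tempfile before
save `before'

* Hypothetical stand-in for a file in the new layout
clear
set obs 1
generate str10 countryregion = "Italy"
generate long confirmed = 120
generate long active = 90

* Rows from the older file get missing (.) for active
append using `before'
list, noobs
count if missing(active)
```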
The updated code below will import the raw data from the Johns Hopkins GitHub repository as of March 23, 2020. The new commands are the five capture rename lines placed just before capture save.
* Step 1: download each daily report and save it as a Stata dataset
local URL = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"
forvalues month = 1/12 {
    forvalues day = 1/31 {
        * Build the file name in MM-DD-YYYY form
        local month = string(`month', "%02.0f")
        local day = string(`day', "%02.0f")
        local year = "2020"
        local today = "`month'-`day'-`year'"
        local FileName = "`URL'`today'.csv"
        clear
        * capture skips dates that have no file (or do not exist)
        capture import delimited "`FileName'"
        * Some files import the first variable with a stray leading character
        capture confirm variable ïprovincestate
        if _rc == 0 {
            rename ïprovincestate provincestate
            label variable provincestate "Province/State"
        }
        * New: map the March 23 names back to the old names.
        * capture ignores the error when a file already uses the old names.
        capture rename province_state provincestate
        capture rename country_region countryregion
        capture rename last_update lastupdate
        capture rename lat latitude
        * long matches long_ through Stata's variable-name abbreviation
        capture rename long longitude
        capture save "`today'", replace
    }
}

* Step 2: append the daily datasets into one
clear
forvalues month = 1/12 {
    forvalues day = 1/31 {
        local month = string(`month', "%02.0f")
        local day = string(`day', "%02.0f")
        local year = "2020"
        local today = "`month'-`day'-`year'"
        capture append using "`today'"
    }
}
We can verify that this worked by describing the resulting data.
. describe

Contains data
  obs:        11,341
 vars:            12
------------------------------------------------------------------------
                storage   display    value
variable name     type    format     label      variable label
------------------------------------------------------------------------
provincestate     str43   %43s                  Province/State
countryregion     str32   %32s                  Country/Region
lastupdate        str19   %19s                  Last Update
confirmed         long    %8.0g                 Confirmed
deaths            int     %8.0g                 Deaths
recovered         long    %8.0g                 Recovered
latitude          float   %9.0g                 Latitude
longitude         float   %9.0g                 Longitude
fips              long    %12.0g                FIPS
admin2            str21   %21s                  Admin2
active            long    %12.0g                Active
combined_key      str44   %44s                  Combined_Key
------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.
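If you would rather not inspect the listing by eye, a check along these lines could be run on the data in memory. It is only a sketch: it asserts that none of the five new-style names survived the renaming step. The exact option matters here, because without it confirm variable would accept lat as an abbreviation of latitude.

```stata
* Each new-style name should be absent from the data in memory.
* exact prevents lat from matching latitude by abbreviation.
foreach v in province_state country_region last_update lat long_ {
    capture confirm variable `v', exact
    assert _rc != 0
}
display "no new-style variable names remain"
```

The assert command stops with an error if any of the names is still present, so reaching the final display line means the check passed.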
Let’s save this dataset so we can use it later.
. save covid19_raw
file covid19_raw.dta saved
Please note that we have not checked and cleaned these data. The code above and the resulting data should be used for instructional purposes only.
I will show you how to convert the raw data to time-series data in my next post.