Update to Import COVID-19 post
In my last post, I mentioned that I did not want to distribute my covid19.ado file because “it could be rendered useless if or when Johns Hopkins changes its data”. I wrote that on March 19, 2020, and the data changed on March 23, 2020. This will likely happen again (and again, and again …). I may post updates in the future as the data change, but you may need to adapt sooner than I can post. So let’s see how we can update our code to adapt to the changing data.
Let’s begin by running the code from my last blog post.
local URL = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"
forvalues month = 1/12 {
forvalues day = 1/31 {
local month = string(`month', "%02.0f")
local day = string(`day', "%02.0f")
local year = "2020"
local today = "`month'-`day'-`year'"
local FileName = "`URL'`today'.csv"
clear
capture import delimited "`FileName'"
capture confirm variable ïprovincestate
if _rc == 0 {
rename ïprovincestate provincestate
label variable provincestate "Province/State"
}
capture save "`today'", replace
}
}
clear
forvalues month = 1/12 {
forvalues day = 1/31 {
local month = string(`month', "%02.0f")
local day = string(`day', "%02.0f")
local year = "2020"
local today = "`month'-`day'-`year'"
capture append using "`today'"
}
}
Something looks wrong when we describe our data.
. describe
Contains data
obs: 11,341
vars: 17
------------------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------------------
provincestate str43 %43s Province/State
countryregion str32 %32s Country/Region
lastupdate str19 %19s Last Update
confirmed long %8.0g Confirmed
deaths int %8.0g Deaths
recovered long %8.0g Recovered
latitude float %9.0g Latitude
longitude float %9.0g Longitude
fips long %12.0g FIPS
admin2 str21 %21s Admin2
province_state str28 %28s Province_State
country_region str32 %32s Country_Region
last_update str19 %19s Last_Update
lat float %9.0g Lat
long_ float %9.0g Long_
active long %12.0g Active
combined_key str44 %44s Combined_Key
------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
We have variables with similar names, such as provincestate and province_state, countryregion and country_region, and so forth. The variable names have changed in the newer raw files. But we must have the same variable names when we append the data.
I looked through the most recent raw data files and identified the date on which the data changed. You can do this without opening the files. You can simply describe the data from your local disk or cloud account.
The raw data from March 22, 2020, use the old variable names.
. describe using 03-22-2020.dta
Contains data
obs: 309 24 Mar 2020 11:48
vars: 8
------------------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------------------
provincestate str28 %28s Province/State
countryregion str32 %32s Country/Region
lastupdate str19 %19s Last Update
confirmed long %12.0g Confirmed
deaths int %8.0g Deaths
recovered long %12.0g Recovered
latitude float %9.0g Latitude
longitude float %9.0g Longitude
------------------------------------------------------------------------
Sorted by:
The raw data from March 23, 2020, use the new variable names.
. describe using 03-23-2020.dta
Contains data
obs: 3,415 24 Mar 2020 11:48
vars: 12
------------------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------------------
fips long %12.0g FIPS
admin2 str21 %21s Admin2
province_state str28 %28s Province_State
country_region str32 %32s Country_Region
last_update str19 %19s Last_Update
lat float %9.0g Lat
long_ float %9.0g Long_
confirmed long %12.0g Confirmed
deaths int %8.0g Deaths
recovered long %12.0g Recovered
active long %12.0g Active
combined_key str44 %44s Combined_Key
------------------------------------------------------------------------
Sorted by:
We could write some clever code to distinguish between files created before and after March 23. But a simple alternative is to use capture rename to change the variable names where necessary in the raw data files.
Let’s try this on the raw data file for March 23 before we incorporate it into the rest of our code.
. use 03-23-2020.dta
. capture rename province_state provincestate
. capture rename country_region countryregion
. capture rename last_update lastupdate
. capture rename lat latitude
. capture rename long longitude
. describe
Contains data from 03-23-2020.dta
obs: 3,415
vars: 12 24 Mar 2020 11:48
------------------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------------------
fips long %12.0g FIPS
admin2 str21 %21s Admin2
provincestate str28 %28s Province_State
countryregion str32 %32s Country_Region
lastupdate str19 %19s Last_Update
latitude float %9.0g Lat
longitude float %9.0g Long_
confirmed long %12.0g Confirmed
deaths int %8.0g Deaths
recovered long %12.0g Recovered
active long %12.0g Active
combined_key str44 %44s Combined_Key
------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
The variable names in the new data now match the variable names in the old data. Some variables in the newer data did not appear in the old data. Those new variables will be appended to the final dataset but will not contain any data for dates prior to March 23.
The updated code below will import the raw data from the Johns Hopkins GitHub repository as of March 23, 2020. I have displayed the new commands in red.
local URL = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"
forvalues month = 1/12 {
forvalues day = 1/31 {
local month = string(`month', "%02.0f")
local day = string(`day', "%02.0f")
local year = "2020"
local today = "`month'-`day'-`year'"
local FileName = "`URL'`today'.csv"
clear
capture import delimited "`FileName'"
capture confirm variable ïprovincestate
if _rc == 0 {
rename ïprovincestate provincestate
label variable provincestate "Province/State"
}
capture rename province_state provincestate
capture rename country_region countryregion
capture rename last_update lastupdate
capture rename lat latitude
capture rename long longitude
capture save "`today'", replace
}
}
clear
forvalues month = 1/12 {
forvalues day = 1/31 {
local month = string(`month', "%02.0f")
local day = string(`day', "%02.0f")
local year = "2020"
local today = "`month'-`day'-`year'"
capture append using "`today'"
}
}
We can verify that this worked by describing the resulting data.
. describe
Contains data
obs: 11,341
vars: 12
------------------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------------------
provincestate str43 %43s Province/State
countryregion str32 %32s Country/Region
lastupdate str19 %19s Last Update
confirmed long %8.0g Confirmed
deaths int %8.0g Deaths
recovered long %8.0g Recovered
latitude float %9.0g Latitude
longitude float %9.0g Longitude
fips long %12.0g FIPS
admin2 str21 %21s Admin2
active long %12.0g Active
combined_key str44 %44s Combined_Key
------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
Let’s save this dataset so we can use it later.
. save covid19_raw file covid19_raw.dta saved
Please note that we have not checked and cleaned these data. The code above and the resulting data should be used for instructional purposes only.
I will show you how to convert the raw data to time-series data in my next post.