Using import excel with real world data
Stata 12’s new import excel command can help you easily import real-world Excel files into Stata. Excel files often contain header and footer information in the first few and last few rows of a sheet, and you may not want that information loaded. Also, the column labels used in the sheet are invalid Stata variable names and therefore cannot be loaded. Both of these issues can be easily solved using import excel.
Let’s start by looking at an Excel spreadsheet, metro_gdp.xls, that is downloaded from the Bureau of Economic Analysis website.
As you can see, the first five rows of the Excel file contain a description of the data, and rows 374 through 381 contain footer notes. We don’t want to load these rows into Stata. import excel has a cellrange() option that can help us avoid unwanted information being loaded.
With cellrange(), you specify the upper left cell and the lower right cell (using standard Excel notation) of the area of data you want loaded. In the file metro_gdp.xls, we want all the data from column A row 6 (upper left cell) to column L row 373 (lower right cell) loaded into Stata. To do this, we type
. import excel metro_gdp.xls, cellrange(A6:L373) clear
In Stata, we open the Data Editor to inspect the loaded data.
The first row of the data we loaded contained column labels. Because of these labels, import excel loaded all the data as strings. import excel again has an easy fix. We need to specify the firstrow option to tell import excel that the first row of data contains the variable names.
. import excel metro_gdp.xls, cellrange(A6:L373) firstrow clear
We again open the Data Editor to inspect the data.
The data are now in the correct format, but we are missing the year column labels. Stata does not accept numeric variable names, so import excel has to use the Excel column name (C, D, …) for the variable names instead of 2001, 2002, …. The simple solution is to rename the column headers in Excel to something like y2001, y2002, etc., before loading. You can also use Stata to rename the column headers. import excel saves the values in the first row of data as variable labels so that the information is not lost. If we describe the data, we will see all the column labels from the Excel file saved as variable labels.
. describe Contains data obs: 367 vars: 12 size: 37,067 ------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------- Fips str5 %9s Fips Area str56 %56s Area C long %10.0g 2001 D long %10.0g 2002 E long %10.0g 2003 F long %10.0g 2004 G long %10.0g 2005 H long %10.0g 2006 I long %10.0g 2007 J long %10.0g 2008 K long %10.0g 2009 L long %10.0g 2010 ------------------------------------------------------------------------------- Sorted by: Note: dataset has changed since last saved
We want to grab the variable label for each variable by using the extended macro function :variable label varname, create a valid lowercase variable name from that label by using the strtoname() and lower() functions, and rename the variable to the new name by using rename. We can do this with a foreach loop.
foreach var of varlist _all { local label : variable label `var' local new_name = lower(strtoname("`label'")) rename `var' `new_name' }
Now when we describe our data, they look like this:
. describe Contains data obs: 367 vars: 12 size: 37,067 ------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------- fips str5 %9s Fips area str56 %56s Area _2001 long %10.0g 2001 _2002 long %10.0g 2002 _2003 long %10.0g 2003 _2004 long %10.0g 2004 _2005 long %10.0g 2005 _2006 long %10.0g 2006 _2007 long %10.0g 2007 _2008 long %10.0g 2008 _2009 long %10.0g 2009 _2010 long %10.0g 2010 ------------------------------------------------------------------------------- Sorted by: Note: dataset has changed since last saved
One last thing we might want to do is to rename the year variables from _20## to y20##, which we can easily accomplish with rename:
. rename (_*) (y*) . describe Contains data obs: 367 vars: 12 size: 37,067 ------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------- fips str5 %9s Fips area str56 %56s Area y2001 long %10.0g 2001 y2002 long %10.0g 2002 y2003 long %10.0g 2003 y2004 long %10.0g 2004 y2005 long %10.0g 2005 y2006 long %10.0g 2006 y2007 long %10.0g 2007 y2008 long %10.0g 2008 y2009 long %10.0g 2009 y2010 long %10.0g 2010 ------------------------------------------------------------------------------- Sorted by: Note: dataset has changed since last saved