Home > Data Management > Using import excel with real world data

Using import excel with real world data

Stata 12’s new import excel command can help you easily import real-world Excel files into Stata. Excel files often contain header and footer information in the first few and last few rows of a sheet, and you may not want that information loaded. Also, the column labels used in the sheet are invalid Stata variable names and therefore cannot be loaded. Both of these issues can be easily solved using import excel.

Let’s start by looking at an Excel spreadsheet, metro_gdp.xls, that is downloaded from the Bureau of Economic Analysis website.

Microsoft Excel screenshot

 

As you can see, the first five rows of the Excel file contain a description of the data, and rows 374 through 381 contain footer notes. We don’t want to load these rows into Stata. import excel has a cellrange() option that can help us avoid unwanted information being loaded.

With cellrange(), you specify the upper left cell and the lower right cell (using standard Excel notation) of the area of data you want loaded. In the file metro_gdp.xls, we want all the data from column A row 6 (upper left cell) to column L row 373 (lower right cell) loaded into Stata. To do this, we type

. import excel metro_gdp.xls, cellrange(A6:L373) clear

In Stata, we open the Data Editor to inspect the loaded data.

Stata Data Editor

 

The first row of the data we loaded contained column labels. Because of these labels, import excel loaded all the data as strings. import excel again has an easy fix. We need to specify the firstrow option to tell import excel that the first row of data contains the variable names.

. import excel metro_gdp.xls, cellrange(A6:L373) firstrow clear

We again open the Data Editor to inspect the data.

Stata Data Editor

 

The data are now in the correct format, but we are missing the year column labels. Stata does not accept numeric variable names, so import excel has to use the Excel column name (C, D, …) for the variable names instead of 2001, 2002, …. The simple solution is to rename the column headers in Excel to something like y2001, y2002, etc., before loading. You can also use Stata to rename the column headers. import excel saves the values in the first row of data as variable labels so that the information is not lost. If we describe the data, we will see all the column labels from the Excel file saved as variable labels.

. describe

Contains data
  obs:           367
 vars:            12
 size:        37,067
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
Fips            str5   %9s                    Fips
Area            str56  %56s                   Area
C               long   %10.0g                 2001
D               long   %10.0g                 2002
E               long   %10.0g                 2003
F               long   %10.0g                 2004
G               long   %10.0g                 2005
H               long   %10.0g                 2006
I               long   %10.0g                 2007
J               long   %10.0g                 2008
K               long   %10.0g                 2009
L               long   %10.0g                 2010
-------------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved

We want to grab the variable label for each variable by using the extended macro function :variable label varname, create a valid lowercase variable name from that label by using the strtoname() and lower() functions, and rename the variable to the new name by using rename. We can do this with a foreach loop.

foreach var of varlist _all {
        local label : variable label `var'
        local new_name = lower(strtoname("`label'"))
        rename `var' `new_name'
}

Now when we describe our data, they look like this:

. describe

Contains data
  obs:           367
 vars:            12
 size:        37,067                          
-------------------------------------------------------------------------------
              storage  display     value      
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
fips            str5   %9s                    Fips
area            str56  %56s                   Area
_2001           long   %10.0g                 2001
_2002           long   %10.0g                 2002
_2003           long   %10.0g                 2003
_2004           long   %10.0g                 2004
_2005           long   %10.0g                 2005
_2006           long   %10.0g                 2006
_2007           long   %10.0g                 2007
_2008           long   %10.0g                 2008
_2009           long   %10.0g                 2009
_2010           long   %10.0g                 2010
-------------------------------------------------------------------------------
Sorted by:  
     Note:  dataset has changed since last saved

One last thing we might want to do is to rename the year variables from _20## to y20##, which we can easily accomplish with rename:

. rename (_*) (y*)

. describe

Contains data
  obs:           367
 vars:            12
 size:        37,067                          
-------------------------------------------------------------------------------
              storage  display     value      
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
fips            str5   %9s                    Fips
area            str56  %56s                   Area
y2001           long   %10.0g                 2001
y2002           long   %10.0g                 2002
y2003           long   %10.0g                 2003
y2004           long   %10.0g                 2004
y2005           long   %10.0g                 2005
y2006           long   %10.0g                 2006
y2007           long   %10.0g                 2007
y2008           long   %10.0g                 2008
y2009           long   %10.0g                 2009
y2010           long   %10.0g                 2010
-------------------------------------------------------------------------------
Sorted by:  
     Note:  dataset has changed since last saved
Categories: Data Management Tags: ,
  • Stephanie

    does it also work with xlsx files?

  • http://blog.stata.com/ The Stata Blog Team

    Yes, both .xls and .xlsx files are supported.

  • Dale

    Can you provide a URL to download the Excel file used in the example?

  • abuxton

    Is there function to give a list of names in sequence for importing
    from Excel?

    This program ‘wseries’ gives a list to copy and edit by text editor,
    or typing, the list that goes into the Stata import excel ‘extvarlist’
    argument. This does, of course, allow selection of a subset of needed
    columns to read out of excel.
    So the program is only to produce text facilitate the creation of a
    do file by creating the pattern out the Excel column number names
    as in the following example.
    You change the number list for local variable `s’ to say how many series
    A-Z are needed.

    Is there function to give a list of names in sequence for importing
    from Excel?

    *** top wseries ***
    capture program drop wseries
    program define wseries
    local albt `”A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0″‘
    foreach s of numlist -1 1 2 {
    foreach i of numlist 1/26 {
    local Txt `”=`=word(“`albt'” , `s’)’`=word(“`albt'” , `i’)'”‘
    di `”`=subinstr(`”`Txt'”‘,”0″,””,1)'”‘
    }
    }
    end
    *** end wseries ***

    *** top example wseries result ***
    . wseries
    =A ** `s’==-1, `i’==1
    =B
    =C

    =Z ** `s’==-1, `i’==26
    =AA ** `s’==1, `i’==1
    =AB
    =AC

    =BT ** `s’==2,`i’==20
    =BU
    =BV
    =BW
    =BX
    =BY
    =BZ
    *** end example wseries result ***

    *** top example edited wseries result for inclusion in a do-file – import excel ***
    #delimit;
    *** the contents of `vlistC’ is to be facilitated ;
    local vlistC `”
    matchrec =A
    bm_type =B
    p_id =C

    course =BV
    wl_bmdt =BW
    “‘;
    #delimit cr

    #delimit;
    ***import excel extvarlist using filename [, import_excel_options]*** ;
    import excel `vlistC’ using Z:filenameA.xlsx,
    sheet(“sheetnameA”) cellrange(A2:BW208) clear;
    *des;
    #delimit cr
    *** end example edited wseries result for inclusion in a do-file – import excel ***

  • Kevin Crow

    You can use the undocumented mata function numtobase26() to get the Excel column names. For example,

    mata
    : col = numtobase26(28) ;
    : col
    AB
    : nums = (1,2,3)
    : cols = numtobase26(nums)
    : cols
    1 2 3
    +————-+
    1 | A B C |
    +————-+
    :end

  • Kevin Crow

    The Excel file was generated by the Bureau of Economic Analysis website so there is not a URL I can point you to.

  • Lindsay

    Hello, when running the loop provided (above), I get a syntax error:

    Syntax is
    rename oldname newname [, renumber[(#)] addnumber[(#)] sort …]
    rename (oldnames) (newnames) [, renumber[(#)] addnumber[(#)] sort …]
    rename oldnames , {upper|lower|proper}

    is this a bug?

  • Kevin Crow

    Does one of your variables not have a variable label? If so, that would generate an error in the loop. To debug your code add the following line of code to the loop above the rename command.

    display “var=|`var’| new_name=|`new_name’|”

  • LM

    Is there a way to combine _import excel_ with _webuse_? I need to import multiple excel data sets into stata from an online source and webuse doesn’t appear to take import as an option.

  • Kevin Crow

    import excel works with a URL. For example,

    import excel http://www.stata.com/test.xls, clear

  • Idrees

    I want to import excel data in to stata but it gets imported as strings may some one please help me sort it out. i had not opted for strings during import.