Archive

Archive for the ‘Programming’ Category

Retaining an Excel cell’s format when using putexcel

In a previous blog entry, I talked about the new Stata 13 command putexcel and how we could use putexcel with a Stata command’s stored results to create tables in an Excel file.

After the entry was posted, a few users pointed out two features they wanted added to putexcel:

  1. Retain a cell’s format after writing numeric data to it.
  2. Allow putexcel to format a cell.

In Stata 13.1, we added the new option keepcellformat to putexcel. This option retains a cell’s format after writing numeric data to it. keepcellformat is useful for people who want to automate the updating of a report or paper.

To review, the basic syntax of putexcel is as follows:

putexcel excel_cell=(expression) … using filename[, options]

If you are working with matrices, the syntax is

putexcel excel_cell=matrix(expression) … using filename[, options]

In the previous blog post, we exported a simple table created by the correlate command by using the commands below.

. sysuse auto
(1978 Automobile Data)

. correlate foreign mpg
(obs=74)

             |  foreign      mpg
-------------+------------------
     foreign |   1.0000
         mpg |   0.3934   1.0000

. putexcel A1=matrix(r(C), names) using corr

These commands created the file corr.xlsx, which contained the table below in the first worksheet.

results0

As you can see, this table is not formatted. So, I formatted the table by hand in Excel so that the correlations were rounded to two digits and the column and row headers were bold with a blue background.

results1

putexcel‘s default behavior is to remove the formatting of cells. Thus, if we want to change the correlated variables in our command from foreign and mpg to foreign and weight using the below commands, the new correlations shown in Excel will revert to the default format:

. sysuse auto, clear
(1978 Automobile Data)

. correlate foreign weight
(obs=74)

             |  foreign   weight
-------------+------------------
     foreign |   1.0000
      weight |  -0.5928   1.0000

. putexcel A1=matrix(r(C), names) using corr, modify

results2

As of Stata 13.1, you can now use the keepcellformat option to preserve a numeric cell’s format when writing to it. For example, the command

. putexcel A1=matrix(r(C), names) using corr, modify keepcellformat

will produce

results3

Let’s look at a real-world problem and really see how the keepcellformat option can help us. Suppose we need to export the following tabulate table to a report we wrote in Word.

. webuse auto2, clear
(1978 Automobile Data)

. label variable rep78 "Repair Record"

. tabulate rep78

     Repair |
     Record |      Freq.     Percent        Cum.
------------+-----------------------------------
       Poor |          2        2.90        2.90
       Fair |          8       11.59       14.49
    Average |         30       43.48       57.97
       Good |         18       26.09       84.06
  Excellent |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00

In the previous putexcel blog post, I mentioned my user-written command tab2xl, which exports a one-way tabulation to an Excel file. I have since updated the command so that it uses the new keepcellformat option to preserve cell formatting. You can download the updated tab2xl command by typing the following:

. net install http://www.stata.com/users/kcrow/tab2xl, replace

Using this command, I can now export my tabulate table to Excel by typing

. tab2xl rep78 using tables, row(1) col(1)

Once the table is in Excel, I format it by hand so that it looks like this:

results4

I then link this Excel table to a Word document. When you link an Excel table to a Word document, it

  1. preserves the formatting of the table and
  2. automatically updates the Word document when you update the Excel table.

It is fairly easy to link an Excel table to a Word document or PowerPoint presentation. In Excel/Word 2010, you would do as follows:

  1. Highlight the table/data in Excel.
  2. On the Home tab, click on the Copy button.
  3. Open the Word document and scroll to where you want the table pasted.
  4. On the Home tab of Word, click on the Paste button.
  5. Select Link & Keep Source Formatting, Link, from the Paste icon menu.

My report now looks like this:

report1

With the Excel table linked into Word, any time we update our Excel table using putexcel, we also update our table in Word.

Suppose that after a few weeks, we get more repair record data. We now need to update our report, and our new tabulate table looks like this:

. tabulate rep78

     Repair |
     Record |      Freq.     Percent        Cum.
------------+-----------------------------------
       Poor |          4        2.90        2.90
       Fair |          8        5.80        8.70
    Average |         60       43.48       52.17
       Good |         44       31.88       84.06
  Excellent |         22       15.94      100.00
------------+-----------------------------------
      Total |        138      100.00

To update the report, we simply need to reissue the putexcel command after tabulate.

. tabulate rep78
. tab2xl rep78 using tables, row(1) col(1)

The linked Word report will automatically reflect the changes:

report2

Categories: Programming Tags: , , , , ,

Export tables to Excel

There is a new command in Stata 13, putexcel, that allows you to easily export matrices, expressions, and stored results to an Excel file. Combining putexcel with a Stata command’s stored results allows you to create the table displayed in your Stata Results window in an Excel file.

A stored result is simply a scalar, macro, or matrix stored in memory after you run a Stata command. The two main types of stored results are e-class (for estimation commands) and r-class (for general commands). You can list a command’s stored results after it has been run by typing ereturn list (for estimation commands) and return list (for general commands). Let’s try a simple example by loading the auto dataset and running correlate on the variables foreign and mpg

. sysuse auto
(1978 Automobile Data)

. correlate foreign mpg
(obs=74)

             |  foreign      mpg
-------------+------------------
     foreign |   1.0000
         mpg |   0.3934   1.0000

Because correlate is not an estimation command, use the return list command to see its stored results.

. return list

scalars:
                  r(N) =  74
                r(rho) =  .3933974152205484

matrices:
                  r(C) :  2 x 2

Now we can use putexcel to export these results to Excel. The basic syntax of putexcel is

putexcel excel_cell=(expression) … using filename [, options]

If you are working with matrices, the syntax is

putexcel excel_cell=matrix(expression) … using filename [, options]

It is easy to build the above syntax in the putexcel dialog. There is a helpful video on Youtube about the dialog here. Let’s list the matrix r(C) to see what it contains.

. matrix list r(C)

symmetric r(C)[2,2]
           foreign        mpg
foreign          1
    mpg  .39339742          1

To re-create the table in Excel, we need to export the matrix r(C) with the matrix row and column names. The command to type in your Stata Command window is

putexcel A1=matrix(r(C), names) using corr

Note that to export the matrix row and column names, we used the names option after we specifed the matrix r(C). When I open the file corr.xlsx in Excel, the table below is displayed.

results0

Next let’s try a more involved example. Load the auto dataset, and run a tabulation on the variable foreign. Because tabulate is not an estimation command, use the return list command to see its stored results.

. sysuse auto
(1978 Automobile Data)

. tabulate foreign

   Car type |      Freq.     Percent        Cum.
------------+-----------------------------------
   Domestic |         52       70.27       70.27
    Foreign |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00

. return list

scalars:
                  r(N) =  74
                  r(r) =  2

tabulate is different from most commands in Stata in that it does not automatically save all the results we need into the stored results (we will use scalar r(N)). We need to use the matcell() and matrow() options of tabulate to save the results produced by the command into two Stata matrices.

. tabulate foreign, matcell(freq) matrow(names)

   Car type |      Freq.     Percent        Cum.
------------+-----------------------------------
   Domestic |         52       70.27       70.27
    Foreign |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00

. matrix list freq

freq[2,1]
    c1
r1  52
r2  22

. matrix list names

names[2,1]
    c1
r1   0
r2   1

The putexcel commands used to create a basic tabulation table in Excel column 1 row 1 are

putexcel A1=("Car type") B1=("Freq.") C1=("Percent") using results, replace
putexcel A2=matrix(names) B2=matrix(freq) C2=matrix(freq/r(N)) using results,
     modify

Below is the table produced in Excel by these commands.

results1

Again this is a basic tabulation table. You probably noticed that we did not have the Cum. column or the Total row in the export table. Also our Car type column contains the numeric values (0,1), not the value lables (Domestic, Foreign) of the variable foreign, and our Percent column is not formatted correctly. To get the exact table displayed in the Results window into an Excel file takes a little programming. With a few functions and a forvalues loop, we can easily export any table produced by running the tabulate command on a numeric variable.

There are two extended macro functions, label and display, that can help us. The label function can extract the value labels for each variable, and the display function can correctly format numbers for our numeric columns. Last, we use forvalues to loop over the rows of the returned matrices to produce our final tables. Our do-file to produce the tabulate table in Excel looks like

sysuse auto
tabulate foreign, matcell(freq) matrow(names)

putexcel A1=("Car type") B1=("Freq.") C1=("Percent") D1=("Cum.") using results, replace

local rows = rowsof(names)
local row = 2
local cum_percent = 0

forvalues i = 1/`rows' {

        local val = names[`i',1]
        local val_lab : label (foreign) `val'

        local freq_val = freq[`i',1]

        local percent_val = `freq_val'/`r(N)'*100
        local percent_val : display %9.2f `percent_val'

        local cum_percent : display %9.2f (`cum_percent' + `percent_val')

        putexcel A`row'=("`val_lab'") B`row'=(`freq_val') C`row'=(`percent_val') ///
                D`row'=(`cum_percent') using results, modify
        local row = `row' + 1
}

putexcel A`row'=("Total") B`row'=(r(N)) C`row'=(100.00) using results, modify

The above commands produce this table in Excel:

results2

The solution above works well for this one table, but what if we need to export the tabulation table for 100 variables to the same Excel spreadsheet? It would be very tedious to run the same do-file 100 times, each time changing the cell and row numbers. Now we could easily change our do-file into the Stata command (ado-file) called tab2xl. The syntax for our new command could be

tab2xl varname using filename, row(rownumber) col(colnumber) [replace sheet(name)]

The pseudocode of our program (file tab2xl.ado) looks like

program tab2xl
  /* parse command syntax */

  /* tabulate varname */

  /* get column letters based on starting column number passed in */

  /* write header row to filename in starting row number passed in */

  /* loop over rows of returned matrix and calculate/write values to filename */

  /* write total row to filename */
end

If you would like to download a working version of our tab2xl command, type

net install http://www.stata.com/users/kcrow/tab2xl

in Stata.

Categories: Programming Tags: , , , ,

A tip to debug your nl/nlsur function evaluator program

If you have a bug in your evaluator program, nl will produce, most probably, the following error:

your program returned 198
verify that your program is a function evaluator program
r(198);

The error indicates that your program cannot be evaluated.

The best way to spot any issues in your evaluator program is to run it interactively. You just need to define your sample (usually observations where none of the variables are missing), and a matrix with values for your parameters. Let me show you an example with nlces2. This is the code to fit the CES production function, from the documentation for the nl command:

cscript
program nlces2
version 12
syntax varlist(min=3 max=3) if, at(name)
local logout : word 1 of `varlist'
local capital : word 2 of `varlist'
local labor : word 3 of `varlist'
// Retrieve parameters out of at matrix
tempname b0 rho delta
scalar `b0' = `at'[1, 1]
scalar `rho' = `at'[1, 2]
scalar `delta' = `at'[1, 3]
tempvar kterm lterm
generate double `kterm' = `delta'*`capital'^(-1*`rho') `if'
generate double `lterm' = (1-`delta')*`labor'^(-1*`rho') `if'
// Fill in dependent variable
replace `logout' = `b0' - 1/`rho'*ln(`kterm' + `lterm') `if'
end

webuse production, clear
nl ces2 @ lnoutput capital labor, parameters(b0 rho delta) ///
               initial(b0 0 rho 1 delta 0.5)

Now, let me show you how to run it interactively:

webuse production, clear
*generate a variable to restrict my sample to observations
*with non-missing values in my variables
egen u = rowmiss(lnoutput capital labor)

*generate a matrix with parameters where I will evaluate my function
mat M = (0,1,.5)
gen nloutput_new = 1 
nlces2 nloutput_new capital labor if u==0, at(M)  

This will evaluate the program only once, using the parameters in matrix M. Notice that I generated a new variable to use as my dependent variable. This is because the program nlces2, when run by itself, will modify the dependent variable.
When you run this program by itself, you will obtain a more specific error message. You can add debugging code to this program, and you can also use the trace setting to see how each step is executed. Type help trace to learn about this setting.

Another possible source of error (which will generate error r(480) when run from nl) is when an evaluator function produces missing values for observations in the sample. If this is the case, you will see those missing values in the variable nloutput_new, i.e., in the variable you entered as dependent when running your evaluator by itself. You can then add debugging code, for example, using codebook or summarize to examine the different parts that contribute to the substitution performed in the dependent variable.

For example, after the line that generates `kterm’, I could write

summarize `kterm' if u == 0

to see if this variable contains any missing values in my sample.

This method can also be used to debug your function evaluator programs for nlsur. In order to preserve your dataset, you need to use copies for all the dependent variables in your model.

Categories: Programming Tags: , , ,

Advanced Mata: Pointers

I’m still recycling my talk called “Mata, The Missing Manual” at user meetings, a talk designed to make Mata more approachable. One of the things I say late in the talk is, “Unless you already know what pointers are and know you need them, ignore them. You don’t need them.” And here I am writing about, of all things, pointers. Well, I exaggerated a little in my talk, but just a little.

Before you take my previous advice and stop reading, let me explain: Mata serves a number of purposes and one of them is as the primary langugage we at StataCorp use to implement new features in Stata. I’m not referring to mock ups, toys, and experiments, I’m talking about ready-to-ship code. Stata 12′s Structural Equation Modeling features are written in Mata, so is Multiple Imputation, so is Stata’s optimizer that is used by nearly all estimation commands, and so are most features. Mata has a side to it that is exceedingly serious and intended for use by serious developers, and every one of those features are available to users just as they are to StataCorp developers. This is one of the reasons there are so many user-written commands are available for Stata. Even if you don’t use the serious features, you benefit.

So every so often I need to take time out and address the concerns of these user/developers. I knew I needed to do that now when Kit Baum emailed a question to me that ended with “I’m stumped.” Kit is the author of An Introduction to Stata Programming which has done more to make Mata approachable to professional researchers than anything StataCorp has done, and Kit is not often stumped.

I have a certain reptutation about how I answer most questions. “Why do you want to do that?” I invariably reply, or worse, “You don’t want to do that!” and then go on to give the answer to the question I wished they had asked. When Kit asks a question, however, I just answer it. Kit asked a question about pointers by setting up an artificial example and I have no idea what his real motivation was, so I’m not even going to try to motivate the question for you. The question is interesting in and of itself anyway.

Here is Kit’s artificial example:

real function x2(real scalar x) return(x^2)
 
real function x3(real scalar x) return(x^3) 

void function tryit() 
{
        pointer(real scalar function) scalar fn
        string rowvector                     func
        real scalar                          i

        func = ("x2", "x3")
        for(i=1;i<=length(func);i++) {
                fn = &(func[i])
                (*fn)(4)
        }
}

Kit is working with pointers, and not just pointers to variables, but pointers to functions. A pointer is the memory address, the address where the variable or function is stored. Real compilers translate names into memory addresses which is one of the reasons real compilers produce code that runs fast. Mata is a real compiler. Anyway, pointers are memory addresses, such as 58, 212,770, 427,339,488, except the values are usually written in hexadecimal rather than decimal. In the example, Kit has two functions, x2(x) and x3(x). Kit wants to create a vector of the function addresses and then call each of the functions in the vector. In the artificial example, he's calling each with an argument of 4.

The above code does not work:

: tryit()
         tryit():  3101  matrix found where function required
         <istmt>:     -  function returned error

The error message is from the Mata compiler and it's complaining about the line

        (*fn)(4)

but the real problem is earlier in the tryit() code.

One corrected version of tryit() would read,

void function tryit()
{
        pointer(real scalar function) scalar fn
        pointer(real scalar function) vector func     // <---
        real scalar                          i

        func = (&x2(), &x3())                         // <---
        for(i=1;i<=length(func);i++) {
                fn = func[i]                          // <---
                (*fn)(4)
        }
}

If you make the three changes I marked, tryit() works:

: tryit()
  16
  64

I want to explain this code and alternative ways the code could have been fixed. It will be easier if we just work interactively, so let's start all over again:

: real scalar x2(x) return(x^2)

: real scalar x3(x) return(x^3)

: func = (&x2(), &x3())

Let's take a look at what is in func:

: func
                1            2
    +---------------------------+
  1 |  0x19551ef8   0x19552048  |
    +---------------------------+

Those are memory addresses. When we typed &x2() and &x3() in the line

: func = (&x2(), &x3())

functions x2() and x3() were not called. &x2() and &x3() instead evaluate to the addresses of the functions named x2() and x3(). I can demonstrate this:

: &x2()
  0x19551ef8

0x19551ef8 is the memory address of where the function x2() is stored. 0x19551ef8 may not look like a number, but that is only because it is presented in base 16. 0x19551ef8 is in fact the number 425,008,888, and the compiled code for the function x2() starts at the 425,008,888th byte of memory and continues thereafter.

Let's assign to fn the value of the address of one of the functions, say x2(). I could do that by typing

: fn = func[1]

or by typing

: fn = &x2()

and either way, when I look at fn, it contains a memory address:

: fn
  0x19551ef8

Let's now call the function whose address we have stored in fn:

: (*fn)(2)
  4

When we call a function and want to pass 2 as an argument, we normally code f(2). In this case, we substitute (*fn) for f because we do not want to call the function named f(), we want to call the function whose address is stored in variable fn. The operator * usually means multiplication, but when * is used as a prefix, it means something different, in much the same way the minus operator - can be subtract or negate. The meaning of unary * is "the contents of". When we code *fn, we mean not the value 425,008,888 stored in fn, we mean the contents of the memory address 425,008,888, which happens to be the function x2().

We type (*fn)(2) and not *fn(2) because *fn(2) would be interpreted to mean *(fn(2)). If there were a function named fn(), that function would be called with argument 2, the result obtained, and then the star would take the contents of that memory address, assuming fn(2) returned a memory address. If it didn't, we'd get a type mismatch error.

The syntax can be confusing until you understand the reasoning behind it. Let's start with all new names. Consider something named X. Actually, there could be two different things named X and Mata would not be confused. There could be a variable named X and there could be a function named X(). To Mata, X and X() are different things, or said in the jargon, have different name spaces. In Mata, variables and functions can have the same names. Variables and functions having the same names in C is not allowed -- C has only one name space. So in C, you can type

fn = &x2

to obtain the address of variable x2 or function x2(), but in Mata, the above means the address of the variable x2, and if there is no such variable, that's an error. In Mata, to obtain the address of function x2(), you type

fn = &x2()

The syntax &x2() is a definitional nugget; there is no taking it apart to understand its logic. But we can take apart the logic of the programmer who defined the syntax. & means "address of" and &thing means to take the address of thing. If thing is a name -- &name -- that means to look up name in the variable space and return its address. If thing is name(), that means look up name in the function space and return its address. They way we formally write this grammar is

 &thing, where 

 thing  :=   name
             name()
             exp

There are three possibilities for thing; it's a name or it's a name followed by () or it's an expression. The last is not much used. &2 creates a literal 2 and then tells you the address where the 2 is stored, which might be 0x195525d8. &(2+3) creates 5 and then tells you where the 5 is stored.

But let's get back to Kit's problem. Kit coded,

func = ("x2", "x3")

and I said no, code instead

func = (&x2(), &x3())

You do not use strings to obtain pointers, you use the actual name prefixed by ampersand.

There's a subtle difference in what Kit was trying to code and what I did code, however. In what Kit tried to code, Kit was seeking "run-time binding". I, however, coded "compile-time binding". I'm about to explain the difference and show you how to achieve run-time binding, but before I do, let me tell you that

  1. You probably want compile-time binding.
  2. Compile-time binding is faster.
  3. Run-time binding is sometimes required, but when persons new to pointers think they need run-time binding, they usually do not.

Let me define compile-time and run-time binding:

  1. Binding refers to establishing addresses corresponding to names and names(). The names are said to be bound to the address.

  2. In compile-time binding, the addresses are established at the time the code is compiled.

    More correctly, compile-time binding does not really occur at the time the code is compiled, it occurs when the code is brought together for execution, an act called linking and which happens automatically in Mata. This is a fine and unimportant distiction, but I do not want you to think that all the functions have to be compiled at the same time or that the order in which they are compiled matters.

    In compile-time binding, if any functions are missing when the code is brought together for execution, and error message is issued.

  3. In run-time binding, the addresses are established at the time the code is executed (run), which happens after compilation, and after linking, and is an explicit act performed by you, the programmer.

To obtain the address of a variable or function at run-time, you use built-in function findexternal(). findexternal() takes one argument, a string scalar, containing the name of the object to be found. The function looks up that name and returns the address corresponding to it, or it returns NULL if the object cannot be found. NULL is the word used to mean invalid memory address and is in fact defined as equaling zero.

findexternal() can be used only with globals. The other variables that appear in your program might appear to have names, but those names are used solely by the compiler and, in the compiled code, these "stack-variables" or "local variables" are referred to by their addresses. The names play no other role and are not even preserved, so findexternal() cannot be used to obtain their addresses. There would be no reason you would want findexternal() to find their addresses because, in all such cases, the ampersand prefix is a perfect substitute.

Functions, however, are global, so we can look up functions. Watch:

: findexternal("x2()")
  0x19551ef8

Compare that with

: &x2()
  0x19551ef8

It's the same result, but they were produced differently. In the findexternal() case, the 0x19551ef8 result was produced after the code was compiled and assembled. The value was obtained, in fact, by execution of the findexternal() function.

In the &x2() case, the 0x19551ef8 result was obtained during the compile/assembly process. We can better understand the distinction if we look up a function that does not exist. I have no function named x4(). Let's obtain x4()'s address:

: findexternal("x4()")
  0x0

: &x4()
         <istmt>:  3499  x4() not found

I may have no function named x4(), but that didn't bother findexternal(). It merely returned 0x0, another way of saying NULL.

In the &x4() case, the compiler issued an error. The compiler, faced with evaluating &x4(), could not, and so complained.

Anyway, here is how we could write tryit() with run-time binding using the findexternal() function:

void function tryit() 
{
        pointer(real scalar function) scalar fn
        pointer(real scalar function) vector func
        real scalar                          i

        func = (findexternal("x2()"), findexternal("x3()")

        for(i=1;i<=length(func);i++) {
                fn = func[i]
                (*fn)(4)
        }
}

To obtain run-time rather than compile-time bindings, all I did was change the line

        func = (&x2(), &x3())

to be

        func = (findexternal("x2()"), findexternal("x3()")

Or we could write it this way:

void function tryit() 
{
        pointer(real scalar function) scalar fn
        string vector                        func
        real scalar                          i

        func = ("x2()", "x3()")

        for(i=1;i<=length(func);i++) {
                fn = findexternal(func[i])
                (*fn)(4)
        }
}

In this variation, I put the names in a string vector just as Kit did originally. Then I changed the line that Kit wrote,

        fn = &(func[i])

to read

        fn = findexternal(func[i])

Either way you code it, when performing run-time binding, you the programmer should deal with what is to be done if the function is not found. The loop

for(i=1;i<=length(func);i++) {
        fn = findexternal(func[i])
        (*fn)(4)
}

would better read

for(i=1;i<=length(func);i++) {
        fn = findexternal(func[i])
        if (fn!=NULL) {
                (*fn)(4)
        }
        else {
                ...
        }
}

Unlike C, if you do not include the code for the not-found case, the program will not crash if the function is not found. Mata will give you an "invalid use of NULL pointer" error message and a traceback log.

If you were writing a program in which the user of your program was to pass to you a function you were to use, such as a likelihood function to be maximized, you could write your program with compile-time binding by coding,

function myopt(..., pointer(real scalar function) scalar f, ...)
{
        ...
        ... (*f)(...) ...
        ...
}

and the user would call you program my coding myopt(..., &myfunc(), ...), or you could use run-time binding by coding

function myopt(..., string scalar fname, ...)
{
        pointer(real scalar function) scalar f
        ...

        f = findexternal(fname)
        if (f==NULL) {
                errprintf("function %s() not found\n", fname)
                exit(111)
        }
        ...
        ... (*f)(...) ...
        ...
}

and the user would call your program by coding myopt(..., "myfunc()", ...).

In this case I could be convinced to prefer the run-time binding solution for professional code because, the error being tolerated by Mata, I can write code to give a better, more professional looking error message.

Categories: Mata Tags: , ,

Automating web downloads and file unzipping

Andrew J. Dyck wrote a nice post on his blog on how to Download and unzip data files from Stata. He writes

Recently, I’ve been using Stata’s -shp2dta- command to convert some shapefiles to stata format, grabbing Lat/Lon data and merging into another dataset. There were several compressed shapefiles I wanted to download contained in a directory from the web. I could manually download each file and uncompress each one but that would be time consuming. Also, when the maps are updated, I’d have to do the download/uncompress all over again. I’ve found that the process can be automated from within Stata by using a combination of -shell- and some handy terminal commands. …

You should read the rest of his post. He goes on to show how you can script with Stata to automate shelling out to download and unzip a series of files from a website, and he introduces you to some cool Unix-like utilities for Windows.

We here at StataCorp use Stata for tasks like this all the time. In fact, we have built some tools into Stata to allow you to do much of what Andrew described without ever having to leave or shell out of Stata.

For example, Stata can access files over the Internet. Stata has a copy command. And, as of Stata 11, Stata can directly zip and unzip files and directories.

Putting all of those capabilities to use, you can accomplish Andrew’s goal by writing code directly in Stata such as

copy http://example.com/download1.zip download1.zip
copy http://example.com/download2.zip download2.zip
unzipfile download1.zip
unzipfile download2.zip

If there were a large number of files you wished to download and unzip, and they were all named in a regular manner (say, “download1.zip” through “download100.zip”), you could bring them all down and unzip them directly in Stata with a 4 line loop:

forvalues i = 1/100 {
    copy http://example.com/download`i'.zip download`i'.zip
    unzipfile download`i'.zip
}
Categories: Programming Tags: , , , ,

Mata, the missing manual, available at SSC

I gave a 1.5 hour talk on Mata at the 2010 UK Stata Users Group Meeting in September. The slides are available in pdf form here. The talk was well received, which of course pleased me. If you’re interested in Mata, I predict you will find the slides useful even if you didn’t attend the meeting.

The problem with the Mata Reference Manual is that, even though it tells you all the details, it never tells you how to put it all together, nor does it motivate you. We developers at StataCorp love the manual for just that reason: It gets right to the details that are so easy to forget.

Anyway, in outline, the talk and slides work like this

  1. They start with the mechanics of including Mata code. It begins gently, at the end of Stata’s NetCourse 151, and ends up discussing big — really big — systems.
  2. Next is a section on appropriate and inappropriate use of Mata.
  3. That’s followed by Mata concepts, from basic to advanced.
  4. And the talk includes a section on debugging!

I was nervous about how the talk would be received before I gave it. It’s been on my to-do list to write a book on Mata, but I never really found a way to approach the subject. The problem is that it’s all so obvious to me that I tend to launch immediately into tedious details. I wrote drafts of a few chapters more than once, and even I didn’t want to reread them.

I don’t know why this overview approach didn’t occur to me earlier. My excuse is that it’s a strange (I claim novel) combination of basic and advanced material, but it seems to work. I titled the talk “missing manual” with the implied promise that I would write that book if the talk was well received. It was. Nowadays, I’m not promising when. Real Soon Now.

The materials for all the talks, not just mine, are available at the SSC(*) and on www.stata.com. For the UK 2010, go to http://ideas.repec.org/s/boc/usug10.html or http://www.stata.com/meeting/uk10/abstracts.html. For other User Group Meetings, it’s easiest to start at the Stata page Meeting Proceedings.

If you have questions on the material, the appropriate place to post them is Statalist. I’m a member and am likely to reply, and that way, others who might be interested get to see the exchange, too. Please use “Mata missing manual” as the subject so that it will be easy for nonmembers to search the Statalist Archives and find the thread.

Finally, my “Stata, the missing manual” talk has no connection with the fine Missing-Manual series, “the book that should have been in the box”, created by Pogue Press and O’Reilly Media, whose website is http://missingmanuals.com/.

* The SSC is the Statistical Software Components archive, often called the Boston College Archive, provided by http://www.repec.org/. The SSC has become the premier Stata download site for user-written software on the Internet and also archives proceedings of Stata Users Group meetings and conferences.

Categories: Mata Tags: , , , ,