Archive

Archive for the ‘Programming’ Category

A tip to debug your nl/nlsur function evaluator program

If you have a bug in your evaluator program, nl will produce, most probably, the following error:

your program returned 198
verify that your program is a function evaluator program
r(198);

The error indicates that your program cannot be evaluated.

The best way to spot any issues in your evaluator program is to run it interactively. You just need to define your sample (usually observations where none of the variables are missing), and a matrix with values for your parameters. Let me show you an example with nlces2. This is the code to fit the CES production function, from the documentation for the nl command:

cscript
program nlces2
version 12
syntax varlist(min=3 max=3) if, at(name)
local logout : word 1 of `varlist'
local capital : word 2 of `varlist'
local labor : word 3 of `varlist'
// Retrieve parameters out of at matrix
tempname b0 rho delta
scalar `b0' = `at'[1, 1]
scalar `rho' = `at'[1, 2]
scalar `delta' = `at'[1, 3]
tempvar kterm lterm
generate double `kterm' = `delta'*`capital'^(-1*`rho') `if'
generate double `lterm' = (1-`delta')*`labor'^(-1*`rho') `if'
// Fill in dependent variable
replace `logout' = `b0' - 1/`rho'*ln(`kterm' + `lterm') `if'
end

webuse production, clear
nl ces2 @ lnoutput capital labor, parameters(b0 rho delta) ///
               initial(b0 0 rho 1 delta 0.5)

Now, let me show you how to run it interactively:

webuse production, clear
*generate a variable to restrict my sample to observations
*with non-missing values in my variables
egen u = rowmiss(lnoutput capital labor)

*generate a matrix with parameters where I will evaluate my function
mat M = (0,1,.5)
gen nloutput_new = 1 
nlces2 nloutput_new capital labor if u==0, at(M)  

This will evaluate the program only once, using the parameters in matrix M. Notice that I generated a new variable to use as my dependent variable. This is because the program nlces2, when run by itself, will modify the dependent variable.
When you run this program by itself, you will obtain a more specific error message. You can add debugging code to this program, and you can also use the trace setting to see how each step is executed. Type help trace to learn about this setting.

Another possible source of error (which will generate error r(480) when run from nl) is when an evaluator function produces missing values for observations in the sample. If this is the case, you will see those missing values in the variable nloutput_new, i.e., in the variable you entered as dependent when running your evaluator by itself. You can then add debugging code, for example, using codebook or summarize to examine the different parts that contribute to the substitution performed in the dependent variable.

For example, after the line that generates `kterm’, I could write

summarize `kterm' if u == 0

to see if this variable contains any missing values in my sample.

This method can also be used to debug your function evaluator programs for nlsur. In order to preserve your dataset, you need to use copies for all the dependent variables in your model.

Categories: Programming Tags: , , ,

Advanced Mata: Pointers

I’m still recycling my talk called “Mata, The Missing Manual” at user meetings, a talk designed to make Mata more approachable. One of the things I say late in the talk is, “Unless you already know what pointers are and know you need them, ignore them. You don’t need them.” And here I am writing about, of all things, pointers. Well, I exaggerated a little in my talk, but just a little.

Before you take my previous advice and stop reading, let me explain: Mata serves a number of purposes and one of them is as the primary langugage we at StataCorp use to implement new features in Stata. I’m not referring to mock ups, toys, and experiments, I’m talking about ready-to-ship code. Stata 12′s Structural Equation Modeling features are written in Mata, so is Multiple Imputation, so is Stata’s optimizer that is used by nearly all estimation commands, and so are most features. Mata has a side to it that is exceedingly serious and intended for use by serious developers, and every one of those features are available to users just as they are to StataCorp developers. This is one of the reasons there are so many user-written commands are available for Stata. Even if you don’t use the serious features, you benefit.

So every so often I need to take time out and address the concerns of these user/developers. I knew I needed to do that now when Kit Baum emailed a question to me that ended with “I’m stumped.” Kit is the author of An Introduction to Stata Programming which has done more to make Mata approachable to professional researchers than anything StataCorp has done, and Kit is not often stumped.

I have a certain reptutation about how I answer most questions. “Why do you want to do that?” I invariably reply, or worse, “You don’t want to do that!” and then go on to give the answer to the question I wished they had asked. When Kit asks a question, however, I just answer it. Kit asked a question about pointers by setting up an artificial example and I have no idea what his real motivation was, so I’m not even going to try to motivate the question for you. The question is interesting in and of itself anyway.

Here is Kit’s artificial example:

real function x2(real scalar x) return(x^2)
 
real function x3(real scalar x) return(x^3) 

void function tryit() 
{
        pointer(real scalar function) scalar fn
        string rowvector                     func
        real scalar                          i

        func = ("x2", "x3")
        for(i=1;i<=length(func);i++) {
                fn = &(func[i])
                (*fn)(4)
        }
}

Kit is working with pointers, and not just pointers to variables, but pointers to functions. A pointer is the memory address, the address where the variable or function is stored. Real compilers translate names into memory addresses which is one of the reasons real compilers produce code that runs fast. Mata is a real compiler. Anyway, pointers are memory addresses, such as 58, 212,770, 427,339,488, except the values are usually written in hexadecimal rather than decimal. In the example, Kit has two functions, x2(x) and x3(x). Kit wants to create a vector of the function addresses and then call each of the functions in the vector. In the artificial example, he's calling each with an argument of 4.

The above code does not work:

: tryit()
         tryit():  3101  matrix found where function required
         <istmt>:     -  function returned error

The error message is from the Mata compiler and it's complaining about the line

        (*fn)(4)

but the real problem is earlier in the tryit() code.

One corrected version of tryit() would read,

void function tryit()
{
        pointer(real scalar function) scalar fn
        pointer(real scalar function) vector func     // <---
        real scalar                          i

        func = (&x2(), &x3())                         // <---
        for(i=1;i<=length(func);i++) {
                fn = func[i]                          // <---
                (*fn)(4)
        }
}

If you make the three changes I marked, tryit() works:

: tryit()
  16
  64

I want to explain this code and alternative ways the code could have been fixed. It will be easier if we just work interactively, so let's start all over again:

: real scalar x2(x) return(x^2)

: real scalar x3(x) return(x^3)

: func = (&x2(), &x3())

Let's take a look at what is in func:

: func
                1            2
    +---------------------------+
  1 |  0x19551ef8   0x19552048  |
    +---------------------------+

Those are memory addresses. When we typed &x2() and &x3() in the line

: func = (&x2(), &x3())

functions x2() and x3() were not called. &x2() and &x3() instead evaluate to the addresses of the functions named x2() and x3(). I can demonstrate this:

: &x2()
  0x19551ef8

0x19551ef8 is the memory address of where the function x2() is stored. 0x19551ef8 may not look like a number, but that is only because it is presented in base 16. 0x19551ef8 is in fact the number 425,008,888, and the compiled code for the function x2() starts at the 425,008,888th byte of memory and continues thereafter.

Let's assign to fn the value of the address of one of the functions, say x2(). I could do that by typing

: fn = func[1]

or by typing

: fn = &x2()

and either way, when I look at fn, it contains a memory address:

: fn
  0x19551ef8

Let's now call the function whose address we have stored in fn:

: (*fn)(2)
  4

When we call a function and want to pass 2 as an argument, we normally code f(2). In this case, we substitute (*fn) for f because we do not want to call the function named f(), we want to call the function whose address is stored in variable fn. The operator * usually means multiplication, but when * is used as a prefix, it means something different, in much the same way the minus operator - can be subtract or negate. The meaning of unary * is "the contents of". When we code *fn, we mean not the value 425,008,888 stored in fn, we mean the contents of the memory address 425,008,888, which happens to be the function x2().

We type (*fn)(2) and not *fn(2) because *fn(2) would be interpreted to mean *(fn(2)). If there were a function named fn(), that function would be called with argument 2, the result obtained, and then the star would take the contents of that memory address, assuming fn(2) returned a memory address. If it didn't, we'd get a type mismatch error.

The syntax can be confusing until you understand the reasoning behind it. Let's start with all new names. Consider something named X. Actually, there could be two different things named X and Mata would not be confused. There could be a variable named X and there could be a function named X(). To Mata, X and X() are different things, or said in the jargon, have different name spaces. In Mata, variables and functions can have the same names. Variables and functions having the same names in C is not allowed -- C has only one name space. So in C, you can type

fn = &x2

to obtain the address of variable x2 or function x2(), but in Mata, the above means the address of the variable x2, and if there is no such variable, that's an error. In Mata, to obtain the address of function x2(), you type

fn = &x2()

The syntax &x2() is a definitional nugget; there is no taking it apart to understand its logic. But we can take apart the logic of the programmer who defined the syntax. & means "address of" and &thing means to take the address of thing. If thing is a name -- &name -- that means to look up name in the variable space and return its address. If thing is name(), that means look up name in the function space and return its address. They way we formally write this grammar is

 &thing, where 

 thing  :=   name
             name()
             exp

There are three possibilities for thing; it's a name or it's a name followed by () or it's an expression. The last is not much used. &2 creates a literal 2 and then tells you the address where the 2 is stored, which might be 0x195525d8. &(2+3) creates 5 and then tells you where the 5 is stored.

But let's get back to Kit's problem. Kit coded,

func = ("x2", "x3")

and I said no, code instead

func = (&x2(), &x3())

You do not use strings to obtain pointers, you use the actual name prefixed by ampersand.

There's a subtle difference in what Kit was trying to code and what I did code, however. In what Kit tried to code, Kit was seeking "run-time binding". I, however, coded "compile-time binding". I'm about to explain the difference and show you how to achieve run-time binding, but before I do, let me tell you that

  1. You probably want compile-time binding.
  2. Compile-time binding is faster.
  3. Run-time binding is sometimes required, but when persons new to pointers think they need run-time binding, they usually do not.

Let me define compile-time and run-time binding:

  1. Binding refers to establishing addresses corresponding to names and names(). The names are said to be bound to the address.

  2. In compile-time binding, the addresses are established at the time the code is compiled.

    More correctly, compile-time binding does not really occur at the time the code is compiled, it occurs when the code is brought together for execution, an act called linking and which happens automatically in Mata. This is a fine and unimportant distiction, but I do not want you to think that all the functions have to be compiled at the same time or that the order in which they are compiled matters.

    In compile-time binding, if any functions are missing when the code is brought together for execution, and error message is issued.

  3. In run-time binding, the addresses are established at the time the code is executed (run), which happens after compilation, and after linking, and is an explicit act performed by you, the programmer.

To obtain the address of a variable or function at run-time, you use built-in function findexternal(). findexternal() takes one argument, a string scalar, containing the name of the object to be found. The function looks up that name and returns the address corresponding to it, or it returns NULL if the object cannot be found. NULL is the word used to mean invalid memory address and is in fact defined as equaling zero.

findexternal() can be used only with globals. The other variables that appear in your program might appear to have names, but those names are used solely by the compiler and, in the compiled code, these "stack-variables" or "local variables" are referred to by their addresses. The names play no other role and are not even preserved, so findexternal() cannot be used to obtain their addresses. There would be no reason you would want findexternal() to find their addresses because, in all such cases, the ampersand prefix is a perfect substitute.

Functions, however, are global, so we can look up functions. Watch:

: findexternal("x2()")
  0x19551ef8

Compare that with

: &x2()
  0x19551ef8

It's the same result, but they were produced differently. In the findexternal() case, the 0x19551ef8 result was produced after the code was compiled and assembled. The value was obtained, in fact, by execution of the findexternal() function.

In the &x2() case, the 0x19551ef8 result was obtained during the compile/assembly process. We can better understand the distinction if we look up a function that does not exist. I have no function named x4(). Let's obtain x4()'s address:

: findexternal("x4()")
  0x0

: &x4()
         <istmt>:  3499  x4() not found

I may have no function named x4(), but that didn't bother findexternal(). It merely returned 0x0, another way of saying NULL.

In the &x4() case, the compiler issued an error. The compiler, faced with evaluating &x4(), could not, and so complained.

Anyway, here is how we could write tryit() with run-time binding using the findexternal() function:

void function tryit() 
{
        pointer(real scalar function) scalar fn
        pointer(real scalar function) vector func
        real scalar                          i

        func = (findexternal("x2()"), findexternal("x3()")

        for(i=1;i<=length(func);i++) {
                fn = func[i]
                (*fn)(4)
        }
}

To obtain run-time rather than compile-time bindings, all I did was change the line

        func = (&x2(), &x3())

to be

        func = (findexternal("x2()"), findexternal("x3()")

Or we could write it this way:

void function tryit() 
{
        pointer(real scalar function) scalar fn
        string vector                        func
        real scalar                          i

        func = ("x2()", "x3()")

        for(i=1;i<=length(func);i++) {
                fn = findexternal(func[i])
                (*fn)(4)
        }
}

In this variation, I put the names in a string vector just as Kit did originally. Then I changed the line that Kit wrote,

        fn = &(func[i])

to read

        fn = findexternal(func[i])

Either way you code it, when performing run-time binding, you the programmer should deal with what is to be done if the function is not found. The loop

for(i=1;i<=length(func);i++) {
        fn = findexternal(func[i])
        (*fn)(4)
}

would better read

for(i=1;i<=length(func);i++) {
        fn = findexternal(func[i])
        if (fn!=NULL) {
                (*fn)(4)
        }
        else {
                ...
        }
}

Unlike C, if you do not include the code for the not-found case, the program will not crash if the function is not found. Mata will give you an "invalid use of NULL pointer" error message and a traceback log.

If you were writing a program in which the user of your program was to pass to you a function you were to use, such as a likelihood function to be maximized, you could write your program with compile-time binding by coding,

function myopt(..., pointer(real scalar function) scalar f, ...)
{
        ...
        ... (*f)(...) ...
        ...
}

and the user would call you program my coding myopt(..., &myfunc(), ...), or you could use run-time binding by coding

function myopt(..., string scalar fname, ...)
{
        pointer(real scalar function) scalar f
        ...

        f = findexternal(fname)
        if (f==NULL) {
                errprintf("function %s() not found\n", fname)
                exit(111)
        }
        ...
        ... (*f)(...) ...
        ...
}

and the user would call your program by coding myopt(..., "myfunc()", ...).

In this case I could be convinced to prefer the run-time binding solution for professional code because, the error being tolerated by Mata, I can write code to give a better, more professional looking error message.

Categories: Mata Tags: , ,

Automating web downloads and file unzipping

Andrew J. Dyck wrote a nice post on his blog on how to Download and unzip data files from Stata. He writes

Recently, I’ve been using Stata’s -shp2dta- command to convert some shapefiles to stata format, grabbing Lat/Lon data and merging into another dataset. There were several compressed shapefiles I wanted to download contained in a directory from the web. I could manually download each file and uncompress each one but that would be time consuming. Also, when the maps are updated, I’d have to do the download/uncompress all over again. I’ve found that the process can be automated from within Stata by using a combination of -shell- and some handy terminal commands. …

You should read the rest of his post. He goes on to show how you can script with Stata to automate shelling out to download and unzip a series of files from a website, and he introduces you to some cool Unix-like utilities for Windows.

We here at StataCorp use Stata for tasks like this all the time. In fact, we have built some tools into Stata to allow you to do much of what Andrew described without ever having to leave or shell out of Stata.

For example, Stata can access files over the Internet. Stata has a copy command. And, as of Stata 11, Stata can directly zip and unzip files and directories.

Putting all of those capabilities to use, you can accomplish Andrew’s goal by writing code directly in Stata such as

copy http://example.com/download1.zip download1.zip
copy http://example.com/download2.zip download2.zip
unzipfile download1.zip
unzipfile download2.zip

If there were a large number of files you wished to download and unzip, and they were all named in a regular manner (say, “download1.zip” through “download100.zip”), you could bring them all down and unzip them directly in Stata with a 4 line loop:

forvalues i = 1/100 {
    copy http://example.com/download`i'.zip download`i'.zip
    unzipfile download`i'.zip
}
Categories: Programming Tags: , , , ,

Mata, the missing manual, available at SSC

I gave a 1.5 hour talk on Mata at the 2010 UK Stata Users Group Meeting in September. The slides are available in pdf form here. The talk was well received, which of course pleased me. If you’re interested in Mata, I predict you will find the slides useful even if you didn’t attend the meeting.

The problem with the Mata Reference Manual is that, even though it tells you all the details, it never tells you how to put it all together, nor does it motivate you. We developers at StataCorp love the manual for just that reason: It gets right to the details that are so easy to forget.

Anyway, in outline, the talk and slides work like this

  1. They start with the mechanics of including Mata code. It begins gently, at the end of Stata’s NetCourse 151, and ends up discussing big — really big — systems.
  2. Next is a section on appropriate and inappropriate use of Mata.
  3. That’s followed by Mata concepts, from basic to advanced.
  4. And the talk includes a section on debugging!

I was nervous about how the talk would be received before I gave it. It’s been on my to-do list to write a book on Mata, but I never really found a way to approach the subject. The problem is that it’s all so obvious to me that I tend to launch immediately into tedious details. I wrote drafts of a few chapters more than once, and even I didn’t want to reread them.

I don’t know why this overview approach didn’t occur to me earlier. My excuse is that it’s a strange (I claim novel) combination of basic and advanced material, but it seems to work. I titled the talk “missing manual” with the implied promise that I would write that book if the talk was well received. It was. Nowadays, I’m not promising when. Real Soon Now.

The materials for all the talks, not just mine, are available at the SSC(*) and on www.stata.com. For the UK 2010, go to http://ideas.repec.org/s/boc/usug10.html or http://www.stata.com/meeting/uk10/abstracts.html. For other User Group Meetings, it’s easiest to start at the Stata page Meeting Proceedings.

If you have questions on the material, the appropriate place to post them is Statalist. I’m a member and am likely to reply, and that way, others who might be interested get to see the exchange, too. Please use “Mata missing manual” as the subject so that it will be easy for nonmembers to search the Statalist Archives and find the thread.

Finally, my “Stata, the missing manual” talk has no connection with the fine Missing-Manual series, “the book that should have been in the box”, created by Pogue Press and O’Reilly Media, whose website is http://missingmanuals.com/.

* The SSC is the Statistical Software Components archive, often called the Boston College Archive, provided by http://www.repec.org/. The SSC has become the premier Stata download site for user-written software on the Internet and also archives proceedings of Stata Users Group meetings and conferences.

Categories: Mata Tags: , , , ,