Archive for December 2010

Including covariates in crossed-effects models

The manual entry for xtmixed documents all the official features in the command, and several applications. However, it would be impossible to address all the models that can be fitted with this command in a manual entry. I want to show you how to include covariates in a crossed-effects model.

Let me start by reviewing the crossed-effects notation for xtmixed. I will use the homework dataset from Kreft and de Leeuw (1998) (a subsample from the National Education Longitudinal Study of 1988). You can download the dataset from the webpage for Rabe-Hesketh & Skrondal (2008) (http://www.stata-press.com/data/mlmus2.html), and run all the examples in this entry.

If we want to fit a model with variable math (math grade) as outcome, and two crossed effects: variable region and variable urban, the standard syntax would be:

(1)   xtmixed math || _all: R.region || _all: R.urban

The underlying model for this syntax is

math_ijk = b + u_i + v_j + eps_ijk

where i represents the region, j represents the level of variable urban, and k indexes the observations within each combination of region and urban; u_i are i.i.d., v_j are i.i.d., and eps_ijk are i.i.d., and all of them are independent of each other.

The standard notation for xtmixed assumes that levels are always nested. In order to fit non-nested models, we create an artificial level with only one category consisting of all the observations; in addition, we use the notation R.var, which indicates that we are including dummies for each category of variable var, while constraining the variances to be the same.

That is, if we write

xtmixed math || _all: R.region

we are just fitting the model:

xtmixed math || region:

but we are doing it in a very inefficient way. What we are doing is exactly the following:

generate one = 1
tab region, gen(id_reg)
xtmixed math || one: id_reg*, cov(identity) nocons

That is, instead of estimating one variance parameter, we are estimating four and constraining them to be equal. Therefore, a more efficient way to fit our mixed model (1) would be:

xtmixed math || _all: R.region || urban:

This will work because urban is nested in the artificial level one (that is, _all). Therefore, if we want to include a covariate with a random coefficient (also known as a random slope) in one of the levels, we just need to place that level at the end and use the usual syntax for a random slope, for example:

xtmixed math public || _all:R.region || urban: public
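
One way to convince yourself that the two ways of writing model (1) are the same model is to fit both and compare the log likelihoods that xtmixed leaves in e(ll). The following is a minimal sketch, assuming the homework dataset is already in memory; the scalar names ll_crossed and ll_nested are made up for this illustration.

```stata
* Assumes the homework dataset used in this post is already in memory.

* Model (1), written with two _all: R.var terms
xtmixed math || _all: R.region || _all: R.urban
scalar ll_crossed = e(ll)

* The same model with urban placed at the last level
xtmixed math || _all: R.region || urban:
scalar ll_nested = e(ll)

* If the two formulations are equivalent, the difference should be
* zero up to the tolerance of the optimizer
display "difference in log likelihood: " scalar(ll_crossed) - scalar(ll_nested)
```

The second fit should also run faster, because it estimates a single variance for urban rather than one per category constrained to be equal.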

Now let’s assume that we want to include random coefficients in both levels; how would we do that? The trick is to use the _all notation to include a random coefficient in the model. For example, if we want to fit

(2) xtmixed math meanses || region: meanses

we are assuming that variable meanses (mean SES per school) has a different effect (random slope) for each region. This model can be expressed as

math_ik = x_ik*b + sigma_i + alpha_i*meanses_ik + eps_ik

where sigma_i are i.i.d., alpha_i are i.i.d., eps_ik are i.i.d. residuals, and the sigmas, alphas, and residuals are independent of each other. This model can be fitted by generating all the interactions of meanses with the regions, including a random alpha_i for each interaction, and restricting their variances to be equal. In other words, we can fit model (2) also as follows:

unab idvar: id_reg* 
foreach v of local idvar{
    gen inter`v' = meanses*`v'
}

xtmixed math  meanses ///
  || _all:inter*, cov(identity) nocons ///
  || _all: R.region
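
To see the two versions of model (2) side by side, you can store both sets of results and tabulate them. This is only a sketch: it assumes the homework data are in memory and that the id_reg* and inter* variables were created as shown above, and the names direct and via_all are arbitrary labels chosen for this illustration.

```stata
* Assumes the homework data are in memory and that id_reg* and inter*
* were generated by the commands shown earlier in this post.

* Direct fit of model (2)
xtmixed math meanses || region: meanses
estimates store direct

* The same model written with the _all: notation
xtmixed math meanses ///
  || _all: inter*, cov(identity) nocons ///
  || _all: R.region
estimates store via_all

* Coefficients, log likelihood, and sample size for both fits
estimates table direct via_all, stats(ll N)
```

The fixed-effects coefficients and the log likelihoods should agree. The variance parameters appear in differently named equations, so they show up on separate rows of the table.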

Finally, we can use all these tools to include random coefficients in both levels, for example:

xtmixed math parented meanses public || _all: R.region || ///
   _all:inter*, cov(identity) nocons || urban: public

References:
Kreft, I. G. G., and J. de Leeuw. 1998. Introducing Multilevel Modeling. Sage.
Rabe-Hesketh, S., and A. Skrondal. 2008. Multilevel and Longitudinal Modeling Using Stata, Second Edition. Stata Press.


How to successfully ask a question on Statalist

As everyone knows, I am a big proponent of Statalist, and not just for selfish reasons, although those reasons play a role. Nearly every member of the technical staff at StataCorp, me included, is a member of Statalist. Even when we don’t participate in a particular thread, we do pay attention. The discussions on Statalist play an important role in Stata’s development.

Statalist is a discussion group, not just a question-and-answer forum. Nonetheless, new members often use it to obtain answers to questions, and that works because those questions sometimes become grist for subsequent discussions. In those cases, the questioners not only get answers, they get much more.

One of the best features of Statalist is that, no matter how poorly you ask a question, you are unlikely to be flamed. Not only are the members of Statalist nice — just as are the members of most lists — they act just as nice on the list as they really are. You are unlikely to be flamed if you ask a question poorly, but you are also unlikely to get an answer.

Here is my recipe to increase the chances of you getting a helpful response. You should also read the Statalist FAQ before writing your question.

Subject line

Make the subject line of your email meaningful. Some good subject lines are:

Survival analysis

Confusion about -stcox-

Unexpected error from -stcox-

-stcox- output

The first two sentences

The first two sentences are the most important, and they are the easiest to write.

In the first sentence, state your problem in Stata terms, but do not go into details. Here are some good first sentences:

I’m having a problem with -stcox-.

I’m getting an unexpected error message from -stcox-.

I’m using -stcox- and don’t know how to interpret the result.

I’m using -stcox- and getting a result I know is wrong, so I know I’m misunderstanding something.

I want to use -stcox- but don’t know how to start.

I think I want to use -stcox-, but I’m unsure.

I want to use -stcox- but my data is complicated and I’m unsure how to proceed.

I have a complicated dataset that I somehow need to transform into a form suitable for use with -stcox-.

Stata crashed!

I’m having a problem that may be more of a statistics issue than a Stata issue.

The purpose of the first sentence is to catch the attention of members who have an interest in your topic and let the others, who were never going to answer you anyway, move on.

The second sentence is even easier to write:

I am using Stata 11.1 for Windows.

I am using Stata 10 for Mac.

Even if you are certain that it’s unimportant which version of Stata you are using, state it anyway.

Write two sentences and you are done with the first paragraph.

The second paragraph

Now write more about your problem. Try not to be overly wordy, but it’s better to be wordy than curt to the point of unclearness. However you write this paragraph, be explicit. If you’re having a problem making Stata work, tell your readers exactly what you typed and exactly how Stata responded. For example,

I typed -stcox weight- and Stata responded “data not st”, r(119).

I typed -stcox weight sex- and Stata showed the usual output, except the standard error on weight was reported as dot.

The form of the second paragraph — which may extend into the third, fourth, … — depends on what you are asking. Describe the problem concisely but completely. Sacrifice conciseness for completeness if you must or you think it will help. To the extent possible, simplify your problem by getting rid of extraneous details. For instance,

I have 100,000 observations and 1,000 variables on firms, but 4
observations and 3 variables will be enough to show the problem.
My data looks like this

        firm_id     date      x
        -----------------------
          10043       17     12
          10043       18      5
          13944       17     10
          27394       16      1
        -----------------------

I need data that looks like this:

        date    no_of_firms   avg_x
        ---------------------------
          16              1       1
          17              2      11
          18              1       5
        ---------------------------

That is, for each date, I want the number of firms and the
average value of x.

Here’s another example for the second and subsequent paragraphs:

The substantive problem is this:  Patients enter and leave the 
hospital, sometimes more than once over the period.  I think  
in this case it would be appropriate to combine the 
separate stays so that a patient who was in for 2 days and 
later for 4 days could be treated as being simply in for 6 days,  
except I also record how many separate stays there were, too.

I'm evaluating cost, so for my purposes, I think treating 
cost as proportional to days in hospital, whatever their
distribution, will be adequate.  I'm looking at total days as a
function of number of stays.  The idea is that letting patients out
too early results in an increase in total days, and I want to
measure this.

I realize that more stays and days might also arise simply because
the patient was sicker.  Some patients die, and that obviously 
truncates stay, so I've omitted them from data.  I have disease
codes, but nothing about health status within code.  

Is there a way to incorporate this added information to improve  
the estimates?  I've got lots of data, so I was thinking of 
using death rate within disease code to somehow rank the codes 
as to seriousness of illness, and then using "seriousness" 
as an explanatory variable.  I guess my question is whether  
anyone knows a way I might do this. 

Or is there some way I could estimate the model separately within
disease code, somehow constraining the coefficient on number of 
stays to be the same?  I saw something in the manual about 
stratified estimates, but I'm unsure if this is the same thing.

You’re asking someone to invest their time, so invest yours

Before you hit the send key, read what you have written, and improve it. You are asking for someone to invest their time helping you. Help them by making your problem easy to understand.

The easier your problem is to understand, the more likely you are to get a response. Said differently, if you write in a disorganized way so that potential responders must work just to understand you, much less provide you with an answer, you are unlikely to get a response.

Sparkling prose is not required. Proper grammar is not even required, so nonnative English speakers can relax. My advice is that, unless you are often praised for how clearly and entertainingly you write, write short sentences. Organization is more important than the style of the individual sentences.

Avoid or explain jargon. Do not assume that the person who responds to your question will be in the same field as you. When dealing with a substantive problem, avoid jargon except for statistical jargon that is common across fields, or explain it. Potential responders like it when you teach them something new, and that makes them more likely to respond.

Tone

Write as if you are writing to a colleague whom you know well. Assume interest in your problem. The same thing said negatively: Do not write to list members as you might write to your research assistant, employee, servant, slave, or family member. Nothing is more likely to get you ignored than to write, “I’m busy and really I don’t have time to filter through all the Statalist postings, so respond to me directly, and soon. I need an answer by tomorrow.”

The positive approach, however, works. Just as when writing to a colleague, in general you do not need to apologize, beg, or play on sympathies. Sometimes when I write to colleagues, I do feel the need to explain that I know what I’m asking is silly. “I should know this,” I’ll write, or, “I can’t remember, but …”, or, “I know I should understand, but I don’t”. You can do that on Statalist, but it’s not required. Usually when I write to colleagues I know well, I just jump right in. The same rule works with Statalist.

What’s appropriate

Questions appropriate for Stata’s Technical Services are not appropriate for Statalist, and vice versa. Some questions aren’t appropriate for either one, but those are rare. If you ask an inappropriate question, and ask it well, someone will usually direct you to a better source.

Who can ask, and how

You must join Statalist to send questions. Yes, you can join, ask a question, get your answer, and quit, but if you do, don’t mention this at the outset. List members know this happens, but if you mention it when you ask the question, you’ll sound superior and condescending. Also, stick around for a few days after you get your response, because sometimes your question will generate discussion. If it does, you should participate. You should want to stick around and participate because if there is subsequent discussion, the final result is usually better than the initial reply.

I’ve previously written on how to join (and quit) Statalist. See http://blog.stata.com/2010/11/08/statalist/.


New Wooldridge edition just made available

Insiders have been waiting for the second edition of Econometric Analysis of Cross Section and Panel Data by Jeffrey M. Wooldridge. I have a copy and really recommend it; I will explain why in a later review.

The book is available at the Stata bookstore and the MIT Press bookstore. It is $84 at our bookstore and $94 at MIT. The book is not yet available from Amazon.

Automating web downloads and file unzipping

Andrew J. Dyck wrote a nice post on his blog on how to Download and unzip data files from Stata. He writes

Recently, I’ve been using Stata’s -shp2dta- command to convert some shapefiles to stata format, grabbing Lat/Lon data and merging into another dataset. There were several compressed shapefiles I wanted to download contained in a directory from the web. I could manually download each file and uncompress each one but that would be time consuming. Also, when the maps are updated, I’d have to do the download/uncompress all over again. I’ve found that the process can be automated from within Stata by using a combination of -shell- and some handy terminal commands. …

You should read the rest of his post. He goes on to show how you can script with Stata to automate shelling out to download and unzip a series of files from a website, and he introduces you to some cool Unix-like utilities for Windows.

We here at StataCorp use Stata for tasks like this all the time. In fact, we have built some tools into Stata to allow you to do much of what Andrew described without ever having to leave or shell out of Stata.

For example, Stata can access files over the Internet. Stata has a copy command. And, as of Stata 11, Stata can directly zip and unzip files and directories.

Putting all of those capabilities to use, you can accomplish Andrew’s goal by writing code directly in Stata such as

copy http://example.com/download1.zip download1.zip
copy http://example.com/download2.zip download2.zip
unzipfile download1.zip
unzipfile download2.zip

If there were a large number of files you wished to download and unzip, and they were all named in a regular manner (say, “download1.zip” through “download100.zip”), you could bring them all down and unzip them directly in Stata with a four-line loop:

forvalues i = 1/100 {
    copy http://example.com/download`i'.zip download`i'.zip
    unzipfile download`i'.zip
}
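
If some of the files might be missing, or if you expect to rerun the loop when the files are updated, a slightly more defensive version may be useful. The following is a sketch rather than a tested recipe: the URL is the same placeholder used above, and the local macro nfail is a name made up for this illustration. It relies on capture to trap a failed copy and on the replace options of copy and unzipfile to overwrite earlier downloads.

```stata
* Same loop as above, but a failed download does not stop the run.
* The URL is the placeholder from this post, not a real data source.
local nfail = 0
forvalues i = 1/100 {
    * capture suppresses the error and leaves the return code in _rc
    capture copy "http://example.com/download`i'.zip" "download`i'.zip", replace
    if _rc {
        display as error "download`i'.zip: copy failed, r(" _rc ")"
        local ++nfail
        continue
    }
    * replace overwrites files extracted by an earlier run
    unzipfile "download`i'.zip", replace
}
display "`nfail' of 100 downloads failed"
```

Skipping a failed file instead of stopping is a design choice; if every file is required, remove capture and let the loop stop at the first error.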