The fourth quarter
David Drukker and I just got back from the Italian Stata Users Group meeting in Bologna, arranged by TStat, the Stata distributor for Italy. It was wonderful, in part because of the beauty of Bologna and the tasty food. The scientific committee and TStat did a great job of selecting papers and organizing a smooth, interesting meeting.
The first day of the meeting had talks by users and StataCorp. There was good variety, with topics like investigating disease clustering, classification of prehistoric artifacts, small-area analysis, and the careful interpretation of marginal effects. This year, all the talks were in English — and it was once again amazing to see how well people can present in a second (or third) language. If you would like to see the slides that accompanied the talks, you can find them at http://www.stata.com/meeting/italy10/abstracts.html.
Recently, I have been thinking about how to interpret results from nonlinear models, so I found Maarten Buis’s talk on “Extracting effects from non-linear models” and David’s talk on “Estimating partial effects using margins in Stata 11” really useful. Both Maarten and David have thought carefully about this problem, and each presented a great introduction and easy-to-apply solutions. What is interesting is that they favor different solutions. Maarten leaned toward estimating and interpreting ratios that do not vary with the covariates. David recommended the potential-outcome framework, which can be implemented using the margins command. The similarities and differences in these two talks made them even more informative.
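To give a flavor of the margins-based approach David described, here is a minimal sketch (the dataset and model are my own illustration, not taken from the talk):

```stata
* fit a nonlinear model on a shipped example dataset
sysuse auto, clear
logit foreign mpg weight

* average marginal effect of mpg on Pr(foreign)
margins, dydx(mpg)

* predicted probabilities at chosen values of mpg
margins, at(mpg=(15 25 35))
```

The first margins call averages the effect over the observed covariate distribution; the second evaluates predictions at specific covariate values, which is exactly the kind of interpretation question the two talks approached differently.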
As is typical for the Italian meetings, the second day had two training sessions, one given by David on programming your own estimation command in Stata (starting from the basics of Stata programming), and one given by Laura Antolini from the Università di Milano Bicocca on competing risks in survival analysis. Both courses were booked full.
I was a Stata user for 15 years before I started working at Stata, and the most fun parts of the meeting are the same now as when I was a user: the wishes and grumbles followed by the conference dinner. The wishes and grumbles session is always interesting; it shows the wide variety of approaches to using Stata. The conference dinner is always fun, because of the conversation over excellent food. In Italy, of course, the food is beyond excellent; strolling through Bologna on marble sidewalks under colonnades while talking statistics, programming, and Stata made the evening, if in an intellectual fashion.
I gave a 1.5 hour talk on Mata at the 2010 UK Stata Users Group Meeting in September. The slides are available in pdf form here. The talk was well received, which of course pleased me. If you’re interested in Mata, I predict you will find the slides useful even if you didn’t attend the meeting.
The problem with the Mata Reference Manual is that, even though it tells you all the details, it never tells you how to put it all together, nor does it motivate you. We developers at StataCorp love the manual for just that reason: It gets right to the details that are so easy to forget.
Anyway, in outline, the talk and slides work like this:
- They start with the mechanics of including Mata code. It begins gently, at the end of Stata’s NetCourse 151, and ends up discussing big — really big — systems.
- Next is a section on appropriate and inappropriate use of Mata.
- That’s followed by Mata concepts, from basic to advanced.
- And the talk includes a section on debugging!
I was nervous about how the talk would be received before I gave it. It’s been on my to-do list to write a book on Mata, but I never really found a way to approach the subject. The problem is that it’s all so obvious to me that I tend to launch immediately into tedious details. I wrote drafts of a few chapters more than once, and even I didn’t want to reread them.
I don’t know why this overview approach didn’t occur to me earlier. My excuse is that it’s a strange (I claim novel) combination of basic and advanced material, but it seems to work. I titled the talk “missing manual” with the implied promise that I would write that book if the talk was well received. It was. Nowadays, I’m not promising when. Real Soon Now.
The materials for all the talks, not just mine, are available at the SSC(*) and on www.stata.com. For the UK 2010 meeting, go to http://ideas.repec.org/s/boc/usug10.html or http://www.stata.com/meeting/uk10/abstracts.html. For other Users Group Meetings, it’s easiest to start at the Stata page Meeting Proceedings.
If you have questions on the material, the appropriate place to post them is Statalist. I’m a member and am likely to reply, and that way, others who might be interested get to see the exchange, too. Please use “Mata missing manual” as the subject so that it will be easy for nonmembers to search the Statalist Archives and find the thread.
Finally, my “Stata, the missing manual” talk has no connection with the fine Missing-Manual series, “the book that should have been in the box”, created by Pogue Press and O’Reilly Media, whose website is http://missingmanuals.com/.
* The SSC is the Statistical Software Components archive, often called the Boston College Archive, provided by http://www.repec.org/. The SSC has become the premier Stata download site for user-written software on the Internet and also archives proceedings of Stata Users Group meetings and conferences.
I was reviewing some timings from the Stata/MP Performance Report this morning. (For those who don’t know, Stata/MP is the version of Stata that has been programmed to take advantage of multiprocessor and multicore computers. It is functionally equivalent to the largest version of Stata, Stata/SE, and it is faster on multicore computers.)
What was unusual this morning is that I was running Stata/MP interactively. We usually run MP for large batch jobs that run thousands of timings on large datasets — either to tune performance or to produce reports like the Performance Report. That is the type of work Stata/MP was designed for — big jobs on big datasets.
I will admit right now that I mostly run Stata interactively using the auto dataset, which has 74 observations. I run Stata/MP using all 4 cores of my quad-core computer, but I am mostly wasting 3 of them — there is no speeding up the computations on 74 observations. This morning I was running Stata/MP interactively on a 24-core computer using a somewhat larger dataset.
After a while, I was struck by the fact that I wasn’t noticing any annoying delays waiting for commands to run. It felt almost as though I were running on the auto dataset. But I wasn’t. I was running commands using 50 covariates on 1 million observations! Regressions, summary statistics, etc.; this was fun. I had never played interactively with a million-observation dataset before.
Out of curiosity, I turned off multicore support. The change was dramatic. Commands that had been taking less than a second now took too long. My coffee cup was full, but I contemplated fetching a snack. Running on only one processor was not so much fun.
For your information, I set rmsg on and ran a few timings:
| Analysis                | 24 cores | 1 core |
|-------------------------|----------|--------|
| generate a new variable | .03      | .33    |
| summarize 50 variables  | .88      | 19.55  |

All timings are on a 1-million-observation dataset. The two regressions included 50 covariates.
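Timings like these can be generated along the following lines (a sketch; the variable names are placeholders, and set processors requires Stata/MP):

```stata
set processors 24      // or 1, to compare single-core performance
set rmsg on            // print elapsed time after every command

generate double xnew = runiform()   // "generate a new variable"
summarize x1-x50                    // "summarize 50 variables"
```

With rmsg on, Stata reports the elapsed time of each command on a r; line, which is where the numbers in the table come from.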
OK, the timings with 24 cores are not quite the same as with the auto dataset, but well within comfortable interactive use.
Careful readers will have noticed that the 24-core and 1-core timings for twoway tabulation are the same. We have not rewritten the code for tabulate to support multiple cores, partly because tabulate is already very fast, and partly because the code for tabulate is isolated, so changing it will not improve the performance of other commands. Thus, parallelizing tabulate is on our long-run, not short-run, list of additions to Stata/MP. We have rewritten about 250 sections of Stata’s internal code to support symmetric multiprocessing (SMP). Each rewritten section typically improves the performance of many commands.
I switched back to using all 24 cores and returned to my original work — stress testing changes in the number of covariates and observations. My fun was quelled when I started running some timings of Cox proportional hazards regressions. With my 50 covariates and 1 million observations, a Cox regression took just over two minutes. Most estimators in Stata are parallelized, including the estimators for parametric survival models. The Cox proportional hazards estimator is not. It is not parallelized because it uses a clever algorithm that requires sequential computations. When I say sequential I mean that some computations are wholly dependent on previous computations so that they simply cannot be performed simultaneously, in parallel. There are other algorithms for fitting the Cox model, but they are orders of magnitude slower. Even parallelized, they would not be faster than our current sequential algorithm unless run on 20 or more processors. When more computers start shipping with dozens of cores, we will evaluate adding a parallelized algorithm for the Cox estimator.
The computer I was running on is about a year old. There have been a spate of new and faster server-grade processors from Intel and AMD in the past year. You can get reasonably close to the performance of my 24-core computer using just 8-cores and the newer chips. That means that with a newer 32-core computer, I could increase my threshold for interactive analysis to about 4 million observations.
There are four speed comparisons above. To see 450 more, including graphs and a discussion of SMP and its implementation in Stata, see the Stata/MP white paper, a.k.a. the Stata/MP Performance Report.
Stata’s odbc command allows you to import data from and export data to any ODBC data source on your computer. ODBC is a standardized way for applications to read data from and write data to different data sources such as databases and spreadsheets.
Until now, before you could use the odbc command, you had to add a named data source (DSN) to the computer via the ODBC Data Source Administrator. If you did not have administrator privileges on your computer, you could not do this.
In the update to Stata 11 released 4 November 2010, a new option, connectionstring(), was added to the odbc command. This option allows you to specify an ODBC data source on the fly using an ODBC connection string instead of having to first add a data source (DSN) to the computer. A connection string lets you specify all necessary parameters to establish a connection between Stata and the ODBC source. Connection strings have a standard syntax for all drivers but there are also driver-specific keyword/value pairs that you can specify. The three standard things that you will probably need in a connection string are DRIVER, SERVER, and DATABASE. For example,
odbc load, … ///
If you also need to specify a username and password to get access to your database you would type
odbc load, … ///
Again, there are driver-specific keyword/value pairs you can add to the connection string. You can search the Internet for “connection string” plus your database name to find what other options you can specify. Just remember to separate each keyword/value pair with a semicolon. You can read more about connection-string syntax on Microsoft’s website.
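Putting the pieces together, a complete call might look like the following (a sketch; the table name, driver, server, database, and credentials here are all hypothetical):

```stata
* load a table via an on-the-fly connection string, no DSN required
odbc load, table("sales") ///
    connectionstring("DRIVER={SQL Server};SERVER=myserver;DATABASE=mydb;UID=myuser;PWD=mypass;")
```

The DRIVER, SERVER, and DATABASE keywords are the standard pieces described above; UID and PWD supply the username and password when your database requires them.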
To get this capability in your copy of Stata 11, simply type update all and follow the instructions to complete the update. You can then type help odbc to read more about the connectionstring() option.
I just want to take a moment to plug Statalist. I’m a member and I hope to convince you to join Statalist, too, but even if I don’t succeed, you need to know about the web-based Statalist Archives because they’re a great resource for finding answers to questions about Stata, and you don’t have to join Statalist to access them.
Statalist’s Archives are found at http://www.stata.com/statalist/archive/, or you can click on “Statalist archives” on the right of this blog page, under Links.
Once at the Archives page, you can click on a year and month to get an idea of the flavor of Statalist. More importantly, you can search the archives. The search is Powered by Google and works well for highly specific, directed inquiries. For generic searches such as random numbers or survival analysis, however, I prefer to go to Advanced Search and ask that the results be sorted by date instead of relevance. It’s usually the most recent postings that are the most interesting, and by-date results are listed in just that order.
Anyway, the next time you are puzzling over something in Stata, I suggest that you search the Statalist Archives.
You probably already noticed the icons at the top right of the blog, but in case you didn’t, Stata is now on Facebook and Twitter. Follow us at @stata and join us on Facebook to keep up-to-date with the latest Stata-related happenings. We will share events, announcements, blog posts, and more.
We here at Stata are often asked to make recommendations on the “best” computer on which to run Stata, and such discussions sometimes pop up on Statalist. Of course, there is no simple answer, as it depends on the analyses a given user wishes to run, the size of their datasets, and their budget. And, we do not recommend particular computer or operating system vendors. Many manufacturers use similar components in their computers, and the choice of operating system comes down to personal preference of the user. We take pride in making sure Stata works well regardless of operating system and hardware configuration.
For some users, the analyses they wish to run are demanding, the datasets they have are huge, and their budgets are large. For these users, it is useful to know what kind of off-the-shelf hardware they can easily get their hands on. To give you an idea of what is available, HP makes a server with up to 1 TB of memory. Yes, 1 terabyte! This computer can be configured and ordered online at hp.com.
It can have up to 4 processors, each with 8 cores, for a total of 32 cores of processing power. A sample rack-mount configuration with the fastest 8-core Intel Xeon processors available for this computer and a full 1 TB of memory totals roughly $100,000. We mention HP because they were one of the first to allow such large memory configurations without going to a much more expensive completely custom-built solution. Wouldn’t you love to have one of these running Stata/MP (or Halo)?
You can run Windows or Linux on a computer like the above. If you prefer Mac OS X, the largest current configuration from Apple allows a total of 12 cores and 32 GB of memory. This is a tower case unit and costs around $10,000. Visit store.apple.com to configure such a computer.
The largest, fastest laptops easily purchased these days allow up to 4 cores and 16 GB of RAM. That much power in a small package will cost you, though: such a configuration runs over $7,000. Here is one such example you can configure from Dell: dell.com.
We’ll keep you updated periodically with the state of the high end of the computer market as memory capacities and number of cores increase.
The 2011 Mexican Stata Users Group meeting has been scheduled for May 12, 2011.
The Mexican Stata Users Group meeting is a one-day international conference about the use of Stata in a wide breadth of fields and environments, mixing theory and practice. The bulk of the conference is made up of selected submitted presentations. Together with the keynote address and a featured presentation by a member of StataCorp’s technical staff, these sessions provide fertile ground for learning about statistics and Stata. All users are encouraged to submit abstracts for possible presentations.
For the full meeting details, submission guidelines, and registration information, please see www.stata.com/meeting/mexico11/.
Date: May 12, 2011
Venue: Institute for Economic Research, National Autonomous University of Mexico, Circuito Mario de la Cueva, Ciudad de la Investigación en Humanidades, Ciudad Universitaria, C.P. 04510, México, D.F.
Submission deadline: March 19, 2011
More information: www.stata.com/meeting/mexico11/
Scientific committee:

Alfonso Miranda (chair)
Institute of Education, University of London
Armando Sánchez Vargas
Institute for Economic Research, National Autonomous University of Mexico
Graciela Teruel Belismelis
Economics Department, Iberoamerican University
Logistics organizer:

MultiON Consulting SA de CV, distributor of Stata in Mexico and Central America
Phone: +52 (55) 5559 4050 x 160