How to automate common tasks
Automating common tasks is crucial to effective data analysis. Automation saves you the time you would otherwise spend repeating the same sets of operations, and it reduces errors by reducing what you have to repeat.
Let’s automate something using Stata. The task we are automating doesn’t much matter. What matters is that we get comfortable with how to automate tasks.
We will automate the simple task of normalizing a variable. That is to say, subtracting the variable’s mean and dividing by its standard deviation.
Just so you know, there are already community-contributed commands to do this and to do it more flexibly than we will. Type search normalize variable in Stata, and you will see one of those commands. (You will see things about other types of normalization that have nothing to do with normalizing a variable, but the command of interest is easy to pick out.) You can also normalize a single variable using Stata’s egen command, but we are going to do more than that.
As with all the articles in this series, I assume the reader is new to automating tasks in Stata. So, if you are already an expert, these articles may hold little interest for you. Or perhaps you will still find something novel.
Scripting
First, we will just perform the normalization directly in the middle of our analysis script. In Stata, we call analysis scripts do-files because they do something.
Let’s normalize the variable named x. I don’t like to change the content of existing variables, so I am going to create a new variable xN, where the N suffix indicates normalization. If you don’t like the N suffix, use something else, perhaps _norm. Or use a prefix. Stata’s summarize command will give us the mean and standard deviation.
...
summarize x
generate xN = (x - r(mean)) / r(sd)
...
That’s it. It takes only two lines to normalize a variable.
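If you want to convince yourself that those two lines do what I claim, here is a minimal sketch. It assumes the auto dataset that ships with Stata, and mpg is just a stand-in for x; neither is part of our analysis.

```stata
* Assumption: Stata's shipped example dataset; mpg plays the role of x.
sysuse auto, clear

* The two lines from above, applied to mpg.
summarize mpg
generate mpgN = (mpg - r(mean)) / r(sd)

* A normalized variable should have mean 0 and standard deviation 1,
* up to rounding error (generate stores floats by default).
summarize mpgN
display "mean = " r(mean) "   sd = " r(sd)
```

The mean will not be exactly zero because of floating-point rounding, but it should be very close.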
What are r(mean) and r(sd), and how did I know about them? In Stata, almost all commands return results. Estimation commands return their results as e() values, and most other commands return their results as r() values. I learned the names r(mean) and r(sd) by typing help summarize and scrolling to the bottom of the help file. There I found all the results returned by summarize and their descriptions. I could also have simply typed return list after the summarize command. return list shows us each returned result and its value. Or I could have gone to the full manual entry for summarize and read about the returned results there. You don’t even need Stata to see the help file or documentation.
Click on this to see the help file,
https://www.stata.com/help.cgi?summarize
Or click on this to see the manual entry,
https://www.stata.com/manuals/rsummarize.pdf
That’s the quick way. If you just want to browse Stata’s documentation, click on
https://www.stata-press.com/manuals/documentation-set/
Click on your manual of interest, and browse the table of contents.
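Inside Stata, the returned results mentioned above can be inspected directly. This is only a sketch, again assuming the shipped auto dataset and its variable mpg.

```stata
* Assumption: Stata's shipped example dataset.
sysuse auto, clear

summarize mpg

* Show every result that summarize left behind in r().
return list

* Any of them can be used in an expression until the next
* r-class command replaces them.
display r(mean)
display r(sd)
```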
Sorry for the digression, but finding things is important. Back to our script.
Our task is only two lines long. Why on earth would we automate it? Even in those two lines, there is ample room for error. If you block copy the code to normalize another variable, or say you block copy it 100 times to normalize 100 other variables, take care that you change x to your new variable name everywhere it must be changed. Forget to change it in summarize, and your new variable, say, y, is normalized by x's mean and standard deviation. Forget to change it in xN, and you get an error message. Forget to change it in the expression for xN, and your new variable will be x normalized by y's mean and standard deviation. I have made all of these mistakes.
And you’re still listening to me?
Do-file automation
Let’s put our script into its own do-file.
normalize.do, version (1):
version 15.1
summarize x
generate xN = (x - r(mean)) / r(sd)
I added one thing. The version command at the top. Always, always, always version your do-files. I am running Stata 15.1, so that is what I put at the top. If I do that, this script will always work the way it does today, even if some future Stata, say, Stata version 42, does away with the summarize command or completely changes how summarize works.
We run our new script by typing
. do normalize
or by putting do normalize in an analysis do-file.
Our current normalize.do is not too interesting. We need it to work on variables other than x.
Here is a version that does just that:
normalize.do, version (2):
version 15.1
summarize `1'
generate `1'N = (`1' - r(mean)) / r(sd)
We then type
. do normalize y
What changed from (1) to (2)? All we did was replace every occurrence of x with `1'. Why `1'? Stata's do-files parse their arguments into local macros numbered 1, 2, 3, and so on. The first argument goes into local macro 1, the second into 2, and so on. What is a local macro? It's just a name that holds a value. Yes, 1 can be a local macro name. Why do we surround the 1 with a left tick and a right tick? If we just type 1, that would be the number 1. We need the value in 1, so we dereference it. Dereference is just a fancy word for get its value. Because we typed do normalize y, our first (and only) argument is y, so `1' dereferences to y. If you don't like the word “dereference”, just say `1' expands to y.
When you substitute y for `1' in our second version of normalize.do, it becomes the first version. That is exactly what Stata does.
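You can watch numbered macros expand without writing a do-file at all. The sketch below uses Stata's tokenize command, which splits a string into the local macros 1, 2, and so on. It is only a stand-in for what do does with its arguments, and the argument names y and income are made up for illustration.

```stata
* tokenize fills the numbered local macros, much as -do- does with
* the arguments typed after the do-file name.
* The names y and income are hypothetical.
tokenize "y income"

display "first argument:  `1'"
display "second argument: `2'"

* Macros are expanded before the line is executed, so this
* displays the text: summarize y
display "summarize `1'"
```

Run the lines together, for example from the Do-file Editor, so that the macros are still defined when the display commands run.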
With our new normalize.do, we can gleefully type
. do normalize myvariable
. do normalize myothervariable
. do normalize x1
. do normalize x2
...
. do normalize x100
I am much less likely to make mistakes.
There is still a lot of redundant typing. We’ll return to that later.
What I want to ask now is, Can we make this do-file respect Stata’s if qualifier? The answer must be “yes”, and easily. Otherwise, I wouldn’t have asked.
Why do we want an if qualifier? We might like to type
. do normalize income if male == 0
and restrict our normalization to females in the sample. That is what if male == 0 says.
Here’s a do-file that respects both the if and in qualifiers.
(If you don’t know what an in qualifier is, click on https://www.stata.com/help.cgi?in to see.)
normalize.do, version (3):
version 15.1
syntax varlist(min=1 max=1) [if] [in]
summarize `varlist' `if' `in'
generate `varlist'N = (`varlist' - r(mean)) / r(sd) `if' `in'
When we look at the last two lines of code, the ones adapted from our previous do-file, we see two changes: `1' has been replaced everywhere with `varlist', and both commands have `if' `in' added to the end of the command. Because we claimed our do-file now directly supports if and in qualifiers, that new syntax command seems to be performing a lot of magic, and indeed it is.
syntax parses commands that look like standard Stata commands. That is to say, commands that have a variable list (varlist), an optional if qualifier, an optional in qualifier, and options. We don’t have any options yet, but we have everything else. I am simplifying here; syntax can do even more.
What is truly cool about the syntax command is that you basically type what your command itself looks like, and syntax parses the command line, filling in local macros with relevant pieces of your syntax. It also issues error messages when what is typed does not match the syntax you have specified. That is why we went to the trouble to add (min=1 max=1) to varlist on our syntax command. We could have just typed varlist, but then syntax would have allowed more than one variable to be specified. And that would not work on our generate command. We want only one variable. The if and in qualifiers are optional, and that is why they appear in brackets on the syntax command. If we had typed if in rather than [if] [in], the if and in qualifiers would be required. Requiring both would be rare, but I have made the if qualifier required on some commands.
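To see what syntax actually puts into those local macros, here is a small sketch. The program name showparse is hypothetical and exists only for this illustration (programs are covered later in this article), and the example again assumes the shipped auto dataset.

```stata
* showparse is a hypothetical helper: it parses its command line
* with -syntax- and then just reports what landed in each macro.
capture program drop showparse
program showparse
    version 15.1
    syntax varlist(min=1 max=1) [if] [in]
    display "varlist: `varlist'"
    display "if:      `if'"
    display "in:      `in'"
end

sysuse auto, clear

* Note that `if' and `in' include the words "if" and "in" themselves,
* which is why they can be pasted onto the end of another command.
showparse mpg if foreign == 1 in 1/50

* Two variables violate max=1, so syntax issues an error message.
* capture noisily shows the message without stopping the run.
capture noisily showparse mpg price
```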
There was a problem with our first two do-files that I ignored. I never checked that `1' was an unabbreviated variable name. Stata allows abbreviated variable names. If you have a variable foreign and no other variables whose names begin with for, then typing
. do normalize for
would have created the new variable forN, not foreignN. You may be fine with that; you may not. Regardless, you would have to be careful. There are ways to fix that in our earlier versions, but we won’t bother.
Our current version does not have that problem. Even if for is typed on the command line, `varlist' expands to the unabbreviated variable name foreign. That is part of the magic of syntax.
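For the curious, one way to unabbreviate a name without syntax is Stata's unab command. This is only a sketch of the idea, not something our do-files need. It assumes the shipped auto dataset, where for uniquely abbreviates foreign, and the macro name fullname is my own choice.

```stata
* Assumption: the shipped auto dataset, in which "for" is an
* unambiguous abbreviation of the variable foreign.
sysuse auto, clear

* unab expands an abbreviated varlist and stores the full names
* in the local macro named before the colon.
unab fullname : for
display "`fullname'"
```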
Now, back to that redundant typing. What if we want to normalize a bunch of variables en masse? That, too, is easy enough, but we must finally add to our two lines of computational code.
Here’s a do-file that takes a list of variables and normalizes each while respecting if and in.
normalize.do, version (4):
version 15.1
syntax varlist [if] [in]
foreach var in `varlist' {
    summarize `var' `if' `in'
    generate `var'N = (`var' - r(mean)) / r(sd) `if' `in'
}
Taking it from the top. We removed the (min=1 max=1) from the syntax command because now we want to accept a varlist.
The foreach command is new but easy to understand. For each var in the variable list, run the two commands we have been running all along. `varlist' expands to the list of variables specified to our do-file. var is just a name to hold a single variable name as we loop over them one at a time. We could have used variable, or just v, or even z; it would not matter.
In our summarize and generate commands, we now use `var', so we are accessing a single variable.
That’s it.
We can now type things like
. do normalize x1 x2 x3 if male==0
or
. do normalize x*
normalize.do will now take valid Stata varlists. If you don’t already know, click on https://www.stata.com/help.cgi?varlist to see what all that means.
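To see what the if qualifier does to the result, here is the same loop written out inline, as a sketch. It assumes the shipped auto dataset, and I set the varlist and if macros by hand to mimic what syntax would have filled in.

```stata
* Assumption: the shipped auto dataset; the macros below mimic what
* -syntax- would produce for:  do normalize price mpg if foreign == 0
sysuse auto, clear
local varlist "price mpg"
local if "if foreign == 0"

foreach var in `varlist' {
    summarize `var' `if'
    generate `var'N = (`var' - r(mean)) / r(sd) `if'
}

* Observations excluded by the if qualifier are left missing
* in the new variables.
count if missing(priceN)
count if foreign == 1
```

The two counts should agree: the normalized variables are filled in only for the observations that the qualifier selects.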
Creating a new command
Our little automation process has led to something pretty flexible and useful. Perhaps too useful to keep it as a do-file. Maybe we should turn it into a new Stata command that we can use on any of our projects or even share with our colleagues.
Again, if that were hard, I would not have raised the possibility.
We are going to create an ado-file, an automated do-file. A program defined in an ado-file acts like a new command in Stata. It is automatically found and run.
Without further ado, here it is,
normalize.ado, version (a):
program normalize
    version 15.1
    syntax varlist [if] [in]
    foreach var in `varlist' {
        summarize `var' `if' `in'
        generate `var'N = (`var' - r(mean)) / r(sd) `if' `in'
    }
end
What did we do? We indented the code from version (4) of our do-file, but that was just for prettiness. We added program normalize at the top of the file. We added end at the bottom of the file. Those latter two things say to treat this as a command, so we do not have to type do in front of it.
At the risk of being repetitive, that’s it.
We now have a program that will be automatically found and run whenever we type normalize.
We can now type
. normalize x1 x2 x3 if male==0
or
. normalize x*
We can give the file normalize.ado to our colleagues, and it will work for them too.
Now get out there and automate some tasks of your own.
Some bookkeeping
I said that normalize.ado will just be automatically available. It will be, if you put it where it can be found. If it is in your current working directory, it can be found. But you may not want to place it in each working directory. And what if you make it better? Then you have to change it in several places. Instead, in Stata, type
. adopath
One of the directories on that path will be labeled (PERSONAL). Copy normalize.ado there. It will now be found in all of your projects, regardless of what directory you are working in.
If you give normalize.ado to colleagues, tell them to copy it to their (PERSONAL) directories.
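If you are unsure where (PERSONAL) is, or whether Stata can see your file, a few built-in commands will tell you. This is a sketch; what it prints depends entirely on your installation, and it assumes you have named your file normalize.ado.

```stata
* List the system directories, including PERSONAL.
sysdir

* The PERSONAL directory alone, as a string.
display "`c(sysdir_personal)'"

* Report where Stata finds normalize.ado. If the file is not on the
* ado-path, which issues an error message; capture noisily shows the
* message without stopping a do-file.
capture noisily which normalize
```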
I don’t always take automation this far. I have found it useful to stop at versions (1), (2), (3), or (4) of our do-files. Or go all the way to a new command.
Also, it was no accident that we called the program program normalize and put it in the file normalize.ado. The name of the program and the name of the file must be the same.
One more detail. Your do-file is reloaded from the normalize.do file every time you type do normalize …. Your ado-file program stays in Stata’s memory after you type normalize …. The next time you type normalize, Stata runs the program from memory without rereading the normalize.ado file. Great, that is faster. But … if you are debugging your program and editing the file, your changes will not be reloaded. You need to type discard before typing normalize …. That way, your program will be dropped from memory and will be reloaded from your file.
There are easy ways to share your new commands with the whole Stata community. Take a look at the FAQ “How do I share a new command with Stata users?”
Typical process
Here is a typical automation process:
1. Code the solution to a specific problem.
   a. Find you are copying that code over and over.
   b. Ask what you change from one problem to another.
2. Write a do-file that takes those things that change as arguments.
   a. Refine.
   b. Test.
   c. Repeat 2a and 2b until happy.
3. Maybe turn your do-file into an ado-file.
4. Maybe share your ado-file with your colleagues.
5. Maybe share your ado-file with the whole Stata community.
Congratulations, you can now automate common tasks in Stata. Whether you meant to or not, you’re on your way to becoming a programmer. Grab yourself a highly caffeinated beverage.
If you’re happy with what we have done so far, this would be a fine time to quit reading.
Addendum: One more addition
You might not like automatically attaching an N to the end of your original variable name to designate the normalized variable. Maybe you would like to use a different letter, or maybe a set of characters, say, _norm. Or you might prefer a prefix to a suffix. Goodness, maybe you want both.
We can accommodate that.
normalize.ado, version (b):
program normalize
    version 15.1
    syntax varlist [if] [in] [ , prefix(name) suffix(name) ]
    foreach var in `varlist' {
        summarize `var' `if' `in'
        generate `prefix'`var'`suffix' = (`var' - r(mean)) / r(sd) `if' `in'
    }
end
From version (a) to version (b), all we did was change
syntax varlist [if] [in]
to
syntax varlist [if] [in] [, prefix(name) suffix(name)]
and change
generate `var'N = ...
to
generate `prefix'`var'`suffix' = ...
Let’s understand the changes to the syntax line.
The square brackets again mean optional; users do not have to type anything here.
If they do type anything, they must first type a comma ,. They can then type either prefix(pstuff) or suffix(sstuff), or both. If they type prefix(pstuff), then the local macro prefix will contain whatever they type within the parentheses—pstuff. The local macro suffix will contain whatever users type in the parentheses of the suffix option.
We were careful when we wrote our syntax command. Because we wrote (name) and not (string), users cannot type just anything in the parentheses. Whatever they type must be a legal Stata variable name. We intend to use what is typed as a prefix or suffix to a variable name, so that string must itself not contain anything that would be illegal in a variable name.
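Here is a sketch of how the options are parsed. The program showopts is hypothetical and only displays what syntax stored; it assumes the shipped auto dataset so that the varlist has something to match.

```stata
* showopts is a hypothetical helper that reports the option macros.
capture program drop showopts
program showopts
    version 15.1
    syntax varlist [ , prefix(name) suffix(name) ]
    display "prefix: `prefix'"
    display "suffix: `suffix'"
end

sysuse auto, clear

* Both options are optional; an omitted option leaves its macro empty.
showopts mpg, prefix(std_)
showopts mpg, prefix(z_) suffix(_norm)
showopts mpg

* Text with spaces is not a legal name, so syntax rejects it.
capture noisily showopts mpg, prefix(not a name)
```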
Now, what does
generate `prefix'`var'`suffix' = ...
mean?
The macros `prefix' and `suffix' are just expanded to whatever the user typed in the prefix and suffix options. Our new variable will have the prefix and the suffix that the user typed.
With our new ado-file, we can now type things like
. normalize x1 x2 x3 x4 , prefix(norm_of_)
. normalize x* , prefix(norm_of_)
The first line creates four new variables: norm_of_x1, norm_of_x2, norm_of_x3, norm_of_x4. I am not fond of those names, but it’s pretty clear what they mean. Unless you are thinking matrices. Sometimes, these are called standardized variables, so you might prefer prefix(std_).
In the second line, x* matches all variables that begin with x. Each of them will be normalized and a new variable created with the designated prefix norm_of_.
You might have noticed a lurking bug. If the user types neither a prefix() nor a suffix() option, then both `prefix' and `suffix' will be blank. Our generate command is going to try to create a variable with the same name as the original variable. And that … is an error, because generate will not overwrite a variable that already exists.
One way to avoid that error is to default to our original behavior of suffixing the new variable with an “N”. We do that by adding the following three lines right below our syntax line,
if "`prefix'`suffix'" == "" {
    local suffix "N"
}
They simply say, if both prefix and suffix (`prefix'`suffix') are empty, then assign “N” to suffix.
Another nice little improvement would be to add a label to our new variable. Here’s a possibility:
label variable `prefix'`var'`suffix' "`var' normalized"
We would add this right after the generate command, inside the foreach loop.
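Putting the pieces from this addendum together, the ado-file might look like the sketch below. It is one possible assembly of the code shown above, with the default suffix and the label added; nothing in it is new.

```stata
program normalize
    version 15.1
    syntax varlist [if] [in] [ , prefix(name) suffix(name) ]

    // Default to the original behavior when no option is given.
    if "`prefix'`suffix'" == "" {
        local suffix "N"
    }

    foreach var in `varlist' {
        summarize `var' `if' `in'
        generate `prefix'`var'`suffix' = (`var' - r(mean)) / r(sd) `if' `in'
        label variable `prefix'`var'`suffix' "`var' normalized"
    }
end
```

A quick check, assuming the shipped auto dataset:

```stata
sysuse auto, clear
normalize price mpg, prefix(std_)
normalize price
describe std_price std_mpg priceN
```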
And that is how programs become long. You improve them, and you add features. Keep at this, and you will soon be writing blocks of code that intimidate your colleagues.