In nasdaq.dta, I have daily data on the NASDAQ index from February 5, 1971 to March 23, 2015 that I downloaded from the St. Louis Federal Reserve Economic Database (FRED).
. use http://www.stata.com/data/nasdaq

. describe

Contains data from http://www.stata.com/data/nasdaq.dta
  obs:        11,132
 vars:             2                          29 Jan 2016 16:21
 size:       155,848
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
date            str10   %10s                  Daily date
index           float   %9.0g                 NASDAQ Composite Index (1971=100)
-------------------------------------------------------------------------------
Sorted by:
date, the time variable in our data, is stored as a string ordered as year, month, and day. I use the date() function to convert the string daily date to a Stata numeric date and store the values in mydate. To find out more about converting string dates to numeric dates, you can read A Tour of Datetime in Stata.
. generate mydate = date(date,"YMD")

. format %td mydate
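As a quick sanity check (my addition, not in the original output), one can verify that every string date converted successfully; date() returns missing for any value it cannot parse with the "YMD" mask:

```stata
* Verify the string-to-numeric conversion: date() returns missing
* for any value it cannot parse with the "YMD" mask.
assert !missing(mydate) if !missing(date)

* Spot-check the first observation.
list date mydate in 1
```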
I tsset this data with mydate as the time variable and then list the first five observations, along with the first lag of index.
. tsset mydate
        time variable:  mydate, 05feb1971 to 23mar2015, but with gaps
                delta:  1 day

. list date mydate index l.index in 1/5

     +------------------------------------------+
     |                                       L. |
     |       date      mydate    index    index |
     |------------------------------------------|
  1. | 1971-02-05   05feb1971      100        . |
  2. | 1971-02-08   08feb1971   100.84        . |
  3. | 1971-02-09   09feb1971   100.76   100.84 |
  4. | 1971-02-10   10feb1971   100.69   100.76 |
  5. | 1971-02-11   11feb1971   101.45   100.69 |
     +------------------------------------------+
The first observation on l.index is missing; I expect this because there are no observations prior to the first observation on index. However, the second observation on l.index is also missing. As you may have already noticed, the dates are irregularly spaced in my dataset — the first observation corresponds to a Friday and the second observation to Monday.
I get missing data in this case because mydate is a regular date, and tsset-ing by a regular date treats all weekends and other holidays as if they were missing from the dataset instead of ignoring them in calculations. To avoid the problem of gaps inherent in business data, I can create a business calendar. Business calendars specify which dates are omitted. For daily financial data, a business calendar specifies the weekends and holidays on which the markets were closed.
Creating business calendars
Business calendars are defined in files named calname.stbcal. You can create your own calendars, use the ones provided by StataCorp, or obtain them directly from other users or via the SSC. Calendars can also be created automatically from the current dataset using the bcal create command.
Every stbcal-file requires you to specify the following four things: the version of Stata being used, the range of the calendar, the center date of the calendar, and the dates to be omitted.
I begin by creating nasdaq.stbcal, which omits all Saturdays and Sundays. I do this using the Do-file Editor, but you can use any text editor.
version 14.1
purpose "Converting daily financial data into business calendar dates"
dateformat dmy
range 05feb1971 23mar2015
centerdate 05feb1971
omit dayofweek (Sa Su)
The first line specifies the version of Stata I am using. The second line is optional, but the text typed there is displayed if I type bcal describe nasdaq and is good for record keeping when I have multiple calendars. Line 3 specifies the display date format and is also optional. Line 4 specifies the range of dates in the dataset.
Line 5 specifies the center date to be 05feb1971. I picked the first date in the sample, but I could have picked any date within the range specified for the business calendar. centerdate does not mean choosing a date that is in fact the center of the sample. For example, Stata's default %td calendar uses 01jan1960 as its center.
The last statement specifies that all weekends be omitted. Later I will show several variations of the omit command that omit other holidays. Once I have a business calendar, I can use it to convert regular dates to business dates, share the file with colleagues, and make further changes to the calendar.
Using a business calendar
. bcal load nasdaq
loading ./nasdaq.stbcal ...
  1. version 14.1
  2. purpose "Converting daily financial data into business calendar dates"
  3. dateformat dmy
  4. range 05feb1971 23mar2015
  5. centerdate 05feb1971
  6. omit dayofweek (Sa Su)
(calendar loaded successfully)

. generate bcaldate = bofd("nasdaq",mydate)

. assert !missing(bcaldate) if !missing(mydate)
In order to create business dates using bofd(), I specified two arguments: the name of the business calendar and the name of the variable containing regular dates. The assert statement verifies that all dates recorded in mydate appear in the business calendar. This is a way of checking that I created my calendar for the complete date range — the bofd() function returns a missing value when mydate does not appear on the specified calendar.
Business dates have a specific display format, %tbcalname, which in my case is %tbnasdaq. In order to display business dates in a Stata date format I will apply this format to bcaldate just as I would for a regular date.
. format %tbnasdaq bcaldate

. list in 1/5

     +---------------------------------------------+
     |       date    index      mydate    bcaldate |
     |---------------------------------------------|
  1. | 1971-02-05      100   05feb1971   05feb1971 |
  2. | 1971-02-08   100.84   08feb1971   08feb1971 |
  3. | 1971-02-09   100.76   09feb1971   09feb1971 |
  4. | 1971-02-10   100.69   10feb1971   10feb1971 |
  5. | 1971-02-11   101.45   11feb1971   11feb1971 |
     +---------------------------------------------+
Although mydate and bcaldate look similar, they have different encodings. Now I can tsset on the business date bcaldate and list the first five observations with the lag of index recalculated.
. tsset bcaldate
        time variable:  bcaldate, 05feb1971 to 23mar2015, but with gaps
                delta:  1 day

. list bcaldate index l.index in 1/5

     +-----------------------------+
     |                          L. |
     |  bcaldate    index    index |
     |-----------------------------|
  1. | 05feb1971      100        . |
  2. | 08feb1971   100.84      100 |
  3. | 09feb1971   100.76   100.84 |
  4. | 10feb1971   100.69   100.76 |
  5. | 11feb1971   101.45   100.69 |
     +-----------------------------+
As expected, the issue of gaps due to weekends is now resolved. Since I have a calendar that excludes Saturdays and Sundays, bcaldate skipped the weekend between 05feb1971 and 08feb1971 when calculating the lagged index value and will do the same for any subsequent weekends in the data.
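One payoff of the business calendar is that time-series operators now reference the previous trading day rather than the previous calendar day. As a minimal sketch (the variable name ret is mine, not from the original post), daily percentage returns can now be computed directly:

```stata
* Daily percent return relative to the previous trading day.
* l.index refers to the prior business date, so Monday's return
* is computed relative to Friday's close rather than to a gap.
generate ret = 100*(index - l.index)/l.index
summarize ret
```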
Excluding specific dates
So far I have not excluded gaps in the data due to other major holidays, such as Thanksgiving and Christmas. Stata has several variations on the omit command that let you exclude specific dates. For example, I use the omit command to omit the Thanksgiving holiday (the fourth Thursday of November in the U.S.) by adding the following statement in my business calendar.
omit dowinmonth +4 Th of Nov
dowinmonth stands for day of week in month and +4 Th of Nov refers to the fourth Thursday of November. This rule is applied to every year in the data.
Another major holiday is Christmas, with the NASDAQ closed on the 25th of December every year. I can omit this holiday in the calendar as
omit date 25dec*
The * in the statement above indicates that December 25th should be omitted for every year in my nasdaq calendar.
This rule is incomplete, since the 25th may fall on a weekend, in which case the holiday is observed on the preceding Friday or the following Monday. To capture these cases, I add the following statements
omit date 25dec* and (-1) if dow(Sa)
omit date 25dec* and (+1) if dow(Su)
The first statement omits December 24th if Christmas is on a Saturday, and the second statement omits December 26th if Christmas is on a Sunday.
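Putting these rules together, the full nasdaq.stbcal file with the weekend and holiday rules discussed so far might look like the following. This consolidated listing is my own; the original post presents the rules incrementally, and a production calendar would need rules for the other market holidays as well:

```stata
version 14.1
purpose "Converting daily financial data into business calendar dates"
dateformat dmy
range 05feb1971 23mar2015
centerdate 05feb1971
omit dayofweek (Sa Su)
omit dowinmonth +4 Th of Nov
omit date 25dec*
omit date 25dec* and (-1) if dow(Sa)
omit date 25dec* and (+1) if dow(Su)
```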
Encodings
I mentioned earlier that the encodings of the regular date mydate and the business date bcaldate are different. To see the encodings of my date variables, I apply a numeric display format and list the first five observations.
. format %8.0g mydate bcaldate

. list in 1/5

     +-----------------------------------------+
     |       date    index   mydate   bcaldate |
     |-----------------------------------------|
  1. | 1971-02-05      100     4053          0 |
  2. | 1971-02-08   100.84     4056          1 |
  3. | 1971-02-09   100.76     4057          2 |
  4. | 1971-02-10   100.69     4058          3 |
  5. | 1971-02-11   101.45     4059          4 |
     +-----------------------------------------+
The variable bcaldate starts at 0 because that was the centerdate of my calendar nasdaq.stbcal. The business-date encoding is consecutive, without gaps, which is why lags and other time-series operators yield correct values.
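Because the two encodings map one-to-one over the trading days, you can convert back from business dates to regular dates with dofb(), the inverse of bofd(). A short round-trip check (the variable name check is mine):

```stata
* dofb() converts business dates back to regular %td dates.
generate check = dofb(bcaldate, "nasdaq")

* Every business date should map back to the original regular date.
assert check == mydate
```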
Summary
Using regular dates with time-series data instead of business dates can be misleading when there are gaps in the data. In this post, I showed a convenient way to work with business dates by creating a business calendar. Once I loaded a calendar file into Stata, I created business dates using the bofd() function. I also showed some variations of the omit command used in business calendars to accommodate specific gaps due to different holidays.
I build on previous posts. I use the structure of Stata programs that use Mata work functions that I discussed previously in Programming an estimation command in Stata: A first ado-command using Mata and Programming an estimation command in Stata: An OLS command using Mata. You should be familiar with Poisson regression and using optimize(), which I discussed in Programming an estimation command in Stata: Using optimize() to estimate Poisson parameters.
A poisson command with Mata computations
The Stata command mypoisson1 computes the results in Mata. The syntax of the mypoisson1 command is
mypoisson1 depvar indepvars [if] [in] [, noconstant]
where indepvars can contain time-series variables. mypoisson1 does not allow for factor variables because they complicate the program. I discuss these complications, and present solutions, in my next post.
In the remainder of this post, I discuss the code for mypoisson1.ado. I recommend that you click on the file name to download the code. To avoid scrolling, view the code in the do-file editor, or your favorite text editor, to see the line numbers.
Code block 1: mypoisson1.ado
*! version 1.0.0 31Jan2016
program define mypoisson1, eclass sortpreserve
    version 14.1
    syntax varlist(numeric ts min=2) [if] [in] [, noCONStant ]
    marksample touse
    gettoken depvar indepvars : varlist
    _rmcoll `indepvars', `constant' forcedrop
    local indepvars "`r(varlist)'"
    tempname b V N rank
    mata: mywork("`depvar'", "`indepvars'", "`touse'", "`constant'", ///
        "`b'", "`V'", "`N'", "`rank'")
    if "`constant'" == "" {
        local cnames "`indepvars' _cons"
    }
    else {
        local cnames "`indepvars'"
    }
    matrix colnames `b' = `cnames'
    matrix colnames `V' = `cnames'
    matrix rownames `V' = `cnames'
    ereturn post `b' `V', esample(`touse') buildfvinfo
    ereturn scalar N    = `N'
    ereturn scalar rank = `rank'
    ereturn local  cmd  "mypoisson1"
    ereturn display
end

mata:

void mywork( string scalar depvar,  string scalar indepvars,
             string scalar touse,   string scalar constant,
             string scalar bname,   string scalar Vname,
             string scalar nname,   string scalar rname)
{
    real vector y, b, mo_v, cv
    real matrix X, V, Cm
    real scalar n, p, rank, ko

    y = st_data(., depvar, touse)
    n = rows(y)
    X = st_data(., indepvars, touse)
    if (constant == "") {
        X = X,J(n, 1, 1)
    }
    p = cols(X)

    S = optimize_init()
    optimize_init_argument(S, 1, y)
    optimize_init_argument(S, 2, X)
    optimize_init_evaluator(S, &plleval2())
    optimize_init_params(S, J(1, p, .01))
    b    = optimize(S)
    V    = optimize_result_V_oim(S)
    rank = p - diag0cnt(invsym(V))

    st_matrix(bname, b)
    st_matrix(Vname, V)
    st_numscalar(nname, n)
    st_numscalar(rname, rank)
}

void plleval2(real scalar todo, real vector b,  ///
              real vector y,    real matrix X,  ///
              val, grad, hess)
{
    real vector xb

    xb  = X*b'
    val = sum(-exp(xb) + y:*xb - lnfactorial(y))
}

end
mypoisson1.ado has the structure of Stata programs that compute their results in Mata, which I discussed in Programming an estimation command in Stata: A first ado-command using Mata and Programming an estimation command in Stata: An OLS command using Mata. Lines 1–35 define the ado-command mypoisson1. Lines 37–83 define the Mata work function mywork() used in mypoisson1 and the evaluator function plleval2() used in mywork().
The ado-command mypoisson1 has four parts:
The Mata work function mywork() has four parts.
Now, I discuss the ado-code in some detail. Lines 2–35 are almost the same as lines 2–35 of myregress11, which I discussed in Programming an estimation command in Stata: An OLS command using Mata. Most of the differences arise because myregress11 handles factor variables and mypoisson1 does not. Two minor differences arise because myregress11 stores the residual degrees of freedom and mypoisson1 does not.
myregress11 uses the matrix inverter to handle cases of collinear independent variables. In the Poisson-regression case discussed here, collinear independent variables define an unconstrained problem without a unique solution, so optimize() cannot converge. mypoisson1 drops collinear independent variables to avoid this problem. I discuss a better solution in my next post.
Lines 10–11 use _rmcoll to drop the collinear independent variables and store the list of linearly independent variables in the local macro indepvars.
Now, I discuss the Mata work function mywork() in some detail. The mywork() defined on lines 39–71 is remarkably similar to the mywork() function defined on lines 39–72 of myregress11.ado, discussed in Programming an estimation command in Stata: An OLS command using Mata. In mypoisson1.ado, lines 57–65 use optimize() to compute the results. In myregress11.ado, lines 58–64 use matrix computations to compute the results.
Lines 73–81 define the evaluator function plleval2(), which is used by optimize() to compute the results, as I discussed in Programming an estimation command in Stata: Using optimize() to estimate Poisson parameters.
Examples 1 and 2 illustrate that mypoisson1 produces the same results as poisson.
Example 1: mypoisson1
(Uses accident3.dta)
. clear all

. use accident3

. mypoisson1 accidents cvalue kids traffic
Iteration 0:   f(p) = -851.18669
Iteration 1:   f(p) = -556.66874
Iteration 2:   f(p) = -555.81708
Iteration 3:   f(p) = -555.81538
Iteration 4:   f(p) = -555.81538
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      cvalue |  -.6558871   .0706484    -9.28   0.000    -.7943554   -.5174188
        kids |  -1.009017   .0807961   -12.49   0.000    -1.167374   -.8506596
     traffic |   .1467115   .0313762     4.68   0.000     .0852153    .2082077
       _cons |   .5743543   .2839519     2.02   0.043     .0178187     1.13089
------------------------------------------------------------------------------
Example 2: poisson
. poisson accidents cvalue kids traffic
Iteration 0:   log likelihood = -555.86605
Iteration 1:   log likelihood =  -555.8154
Iteration 2:   log likelihood = -555.81538

Poisson regression                              Number of obs     =        505
                                                LR chi2(3)        =     340.20
                                                Prob > chi2       =     0.0000
Log likelihood = -555.81538                     Pseudo R2         =     0.2343

------------------------------------------------------------------------------
   accidents |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      cvalue |  -.6558871   .0706484    -9.28   0.000    -.7943553   -.5174188
        kids |  -1.009017   .0807961   -12.49   0.000    -1.167374   -.8506594
     traffic |   .1467115   .0313762     4.68   0.000     .0852153    .2082076
       _cons |    .574354   .2839515     2.02   0.043     .0178193    1.130889
------------------------------------------------------------------------------
Done and undone
I discussed mypoisson1, which computes Poisson-regression results in Mata. I highlighted how similar the code in mypoisson1.ado is to the code in myregress11.ado.
I also discussed that mypoisson1 drops collinear independent variables and noted that mypoisson1 does not handle factor variables. In my next post, I discuss a solution to the problems caused by collinear independent variables and discuss a command that handles factor variables.
Using optimize()
There are many optional choices that one may make when solving a nonlinear optimization problem, but there are very few that one must make. The optimize*() functions in Mata handle this problem by making a set of default choices for you, requiring that you specify a few things, and allowing you to change any of the default choices.
When I use optimize() to solve a nonlinear optimization problem, I do four steps:

1. Create the optimize() object by typing

: S = optimize_init()

which contains all the default choices.

2. Put the address of the evaluator function into S.

3. Put the vector of starting values into S.

4. Type

: betahat = optimize(S)

to perform the optimization.
Consider maximizing the log-likelihood function of a Poisson model. The contribution of the \(i\)th observation to the log-likelihood is
\[
f_i(\betab) = y_i{\bf x}_i\betab' - \exp({\bf x}_i\betab') - \ln(y_i!)
\]
where \(y_i\) is the dependent variable, \({\bf x}_i\) is the vector of covariates, and \(\betab\) is the row vector of parameters that we select to maximize the log-likelihood function given by \(F(\betab) =\sum_i f_i(\betab)\). I could drop \(\ln(y_i!)\), because it does not depend on the parameters. I include it to make the value of the log-likelihood function the same as that reported by Stata. Stata includes these terms so that the values of the log-likelihood functions are comparable across models.
Code block 1 copies the data from Stata to Mata and computes the Poisson log-likelihood function at the vector of parameter values b, which has been set to the arbitrary starting value of .01 for each parameter.
Code block 1: Computing the Poisson log-likelihood
clear all
use accident3
mata:
y  = st_data(., "accidents")
X  = st_data(., "cvalue kids traffic")
X  = X,J(rows(X), 1, 1)
b  = J(1, cols(X), .01)
xb = X*b'
f  = sum(-exp(xb) + y:*xb - lnfactorial(y))
end
The Mata function plleval() in code block 2 puts the value of the Poisson log-likelihood function at the vector of parameter values b into val.
Code block 2: An evaluator function for the Poisson log-likelihood
mata:
void plleval(real scalar todo, real vector b, val, grad, hess)
{
    real vector y, xb
    real matrix X

    y  = st_data(., "accidents")
    X  = st_data(., "cvalue kids traffic")
    X  = X,J(rows(X), 1, 1)
    xb = X*b'
    val = sum(-exp(xb) + y:*xb - lnfactorial(y))
}
end
plleval() has the default syntax of an evaluator function that optimize() can call. Evaluator functions have a default syntax so that optimize() can call them, which it must do to find the maximum. After describing the default syntax, I will show how to use evaluators with extra arguments.
plleval() is void; it returns nothing. The real scalar todo allows optimize() to tell the evaluator function what it must compute. The real vector b is the current value of the parameter vector. val is not typed because, no matter what it contains on input, it will contain the value of the objective function on output. grad is not typed because it will optionally contain the vector of first derivatives of the objective function at the current value of b on output. hess is not typed because it will optionally contain the matrix of second derivatives of the objective function at the current value of b on output. As plleval() illustrates, the evaluator function must put the value of the objective function into the third argument, but it need not compute either the vector of first derivatives or the matrix of second derivatives.
In example 1, I use optimize() to maximize the Poisson log-likelihood function computed in plleval().
Example 1: Using optimize() to estimate Poisson parameters
(Uses accident3.dta)
. clear all

. use accident3

. mata:
------------------------------------------------- mata (type end to exit) ------
: void plleval(real scalar todo, real vector b, val, grad, hess)
> {
>         real vector y, xb
>         real matrix X
>
>         y  = st_data(., "accidents")
>         X  = st_data(., "cvalue kids traffic")
>         X  = X,J(rows(X), 1, 1)
>         xb = X*b'
>         val = sum(-exp(xb) + y:*xb - lnfactorial(y))
> }
note: argument todo unused
note: argument grad unused
note: argument hess unused

: S = optimize_init()

: optimize_init_evaluator(S, &plleval())

: optimize_init_params(S, J(1, 4, .01))

: bh = optimize(S)
Iteration 0:   f(p) = -851.18669
Iteration 1:   f(p) = -556.66874
Iteration 2:   f(p) = -555.81708
Iteration 3:   f(p) = -555.81538
Iteration 4:   f(p) = -555.81538

: bh
                 1              2              3              4
    +-------------------------------------------------------------+
  1 |  -.6558871399   -1.009017051    .1467114648    .5743542793  |
    +-------------------------------------------------------------+

: optimize_result_params(S)
                 1              2              3              4
    +-------------------------------------------------------------+
  1 |  -.6558871399   -1.009017051    .1467114648    .5743542793  |
    +-------------------------------------------------------------+

: sqrt(diagonal(optimize_result_V_oim(S)))'
                1             2             3             4
    +---------------------------------------------------------+
  1 |  .0706483931   .0807960852   .0313761961   .2839519366  |
    +---------------------------------------------------------+

: end
--------------------------------------------------------------------------------
After defining plleval(), I use optimize_init() to create the optimize() object S. I must put information about how to call plleval() and the vector of starting values into S. Typing
optimize_init_evaluator(S, &plleval())
puts the address of the evaluator function plleval() into S; preceding the function name with an ampersand (&) yields its address. optimize() requires the function address instead of the function name because having the address speeds up finding the function. Typing
optimize_init_params(S, J(1, 4, .01))
puts the vector of starting values, J(1, 4, .01), into S.
Typing
bh = optimize(S)
causes optimize() to solve the optimization problem described in S, and it causes optimize() to put the vector of optimal parameters in bh. optimize() produces the default iteration log, because we did not change the default specification in S.
When optimize() has completed, the results are in S. For example, I display the bh returned by optimize() and use optimize_result_params(S) to display the result stored in S. I further illustrate by displaying the standard errors; optimize_result_V_oim() retrieves the observed-information-matrix (OIM) estimator of the variance–covariance of the estimator (VCE). Many other results are stored in S; type help mf optimize and look at the optimize_result*() functions for details.
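The other optimize_result_*() functions follow the same pattern of taking the object S as their only argument. As a short sketch, here are a few more results one might retrieve after the optimization above (the comments describe each return value):

```stata
mata:
optimize_result_value(S)        // value of the objective at the maximum
optimize_result_iterations(S)   // number of iterations performed
optimize_result_converged(S)    // 1 if the optimization converged, 0 otherwise
end
```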
Comparing the results in examples 1 and 2 shows that they are correct.
Example 2: Results from poisson
. poisson accidents cvalue kids traffic
Iteration 0:   log likelihood = -555.86605
Iteration 1:   log likelihood =  -555.8154
Iteration 2:   log likelihood = -555.81538

Poisson regression                              Number of obs     =        505
                                                LR chi2(3)        =     340.20
                                                Prob > chi2       =     0.0000
Log likelihood = -555.81538                     Pseudo R2         =     0.2343

------------------------------------------------------------------------------
   accidents |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      cvalue |  -.6558871   .0706484    -9.28   0.000    -.7943553   -.5174188
        kids |  -1.009017   .0807961   -12.49   0.000    -1.167374   -.8506594
     traffic |   .1467115   .0313762     4.68   0.000     .0852153    .2082076
       _cons |    .574354   .2839515     2.02   0.043     .0178193    1.130889
------------------------------------------------------------------------------
plleval() is slow because it copies the data from Stata to Mata every time optimize() calls it. I would much rather pass the data to the evaluator function, but this requires putting information about the syntax of the new evaluator function in S. For example, I would like to use the evaluator function plleval2(). In example 3, I use optimize_init_argument() to put information into S about the extra arguments accepted by the new evaluator function plleval2().
Code block 3: Passing data to the Poisson evaluator function
mata:
void plleval2(real scalar todo, real vector b,  ///
              real vector y,    real matrix X,  ///
              val, grad, hess)
{
    real vector xb

    xb  = X*b'
    val = sum(-exp(xb) + y:*xb - lnfactorial(y))
}
end
Line 3 declares the extra arguments, the real vector y and the real matrix X. The extra arguments come between the inputs that must always be present, the real scalar todo and the real vector b, and the always-present outputs: val, grad, and hess.
Example 3 uses optimize() to maximize the Poisson objective function coded in plleval2().
Example 3: Using optional arguments to pass data
. mata:
------------------------------------------------- mata (type end to exit) ------
: void plleval2(real scalar todo, real vector b,  ///
>               real vector y,    real matrix X,  ///
>               val, grad, hess)
> {
>         real vector xb
>
>         xb  = X*b'
>         val = sum(-exp(xb) + y:*xb - lnfactorial(y))
> }
note: argument todo unused
note: argument grad unused
note: argument hess unused

: y = st_data(., "accidents")

: X = st_data(., "cvalue kids traffic")

: X = X,J(rows(X), 1, 1)

: S = optimize_init()

: optimize_init_argument(S, 1, y)

: optimize_init_argument(S, 2, X)

: optimize_init_evaluator(S, &plleval2())

: optimize_init_params(S, J(1, 4, .01))

: bh = optimize(S)
Iteration 0:   f(p) = -851.18669
Iteration 1:   f(p) = -556.66874
Iteration 2:   f(p) = -555.81708
Iteration 3:   f(p) = -555.81538
Iteration 4:   f(p) = -555.81538

: optimize_result_params(S)
                 1              2              3              4
    +-------------------------------------------------------------+
  1 |  -.6558871399   -1.009017051    .1467114648    .5743542793  |
    +-------------------------------------------------------------+

: sqrt(diagonal(optimize_result_V_oim(S)))'
                1             2             3             4
    +---------------------------------------------------------+
  1 |  .0706483931   .0807960852   .0313761961   .2839519366  |
    +---------------------------------------------------------+

: end
--------------------------------------------------------------------------------
After defining plleval2(), I copy the data from Stata to Mata, and I use optimize_init() to put the default choices into the optimize() object S. When I typed
optimize_init_argument(S, 1, y)
I put information into S specifying that optimize() should pass y as the first extra argument to the evaluator function. When I typed
optimize_init_argument(S, 2, X)
I put information into S specifying that optimize() should pass X as the second extra argument to the evaluator function.
Analogous to example 1, typing
optimize_init_evaluator(S, &plleval2())
puts the address of plleval2() into S, and typing
optimize_init_params(S, J(1, 4, .01))
puts the vector of starting values, J(1, 4, .01), in S.
The results are the same as those in example 1.
Vector of observation-level contributions and robust VCE estimation
Robust estimators for the VCE of an estimator use the structure of observation-level contributions; see Wooldridge (2010, chapters 12 and 13) or Cameron and Trivedi (2005, chapter 5). When the evaluator function gives optimize() a vector of observation-level contributions, instead of a scalar summation, optimize() can use this structure to compute robust or cluster–robust estimators of the VCE.
Consider plleval3(), which puts the vector of observation-level contributions into val.
Code block 4: A vector of observation-level contributions
mata:
void plleval3(real scalar todo, real vector b,  ///
              real vector y,    real matrix X,  ///
              val, grad, hess)
{
    real vector xb

    xb  = X*b'
    val = -exp(xb) + y:*xb - lnfactorial(y)
}
end
To use plleval3(), I must put information in the optimize() object stating that the evaluator function computes a vector of observation-level contributions. In example 4, I use optimize_init_evaluatortype() to put this information into the optimize() object S.
Example 4: Robust VCE estimation
. mata:
------------------------------------------------- mata (type end to exit) ------
: void plleval3(real scalar todo, real vector b,  ///
>               real vector y,    real matrix X,  ///
>               val, grad, hess)
> {
>         real vector xb
>
>         xb  = X*b'
>         val = -exp(xb) + y:*xb - lnfactorial(y)
> }
note: argument todo unused
note: argument grad unused
note: argument hess unused

: y = st_data(., "accidents")

: X = st_data(., "cvalue kids traffic")

: X = X,J(rows(X), 1, 1)

: S = optimize_init()

: optimize_init_argument(S, 1, y)

: optimize_init_argument(S, 2, X)

: optimize_init_evaluator(S, &plleval3())

: optimize_init_evaluatortype(S, "gf0")

: optimize_init_params(S, J(1, 4, .01))

: bh = optimize(S)
Iteration 0:   f(p) = -851.18669
Iteration 1:   f(p) = -556.66874
Iteration 2:   f(p) = -555.81731
Iteration 3:   f(p) = -555.81538
Iteration 4:   f(p) = -555.81538

: optimize_result_params(S)
                 1              2              3              4
    +-------------------------------------------------------------+
  1 |  -.6558871527   -1.009017051    .1467114658    .5743542978  |
    +-------------------------------------------------------------+

: sqrt(diagonal(optimize_result_V_oim(S)))'
                1             2             3             4
    +---------------------------------------------------------+
  1 |  .0706483832   .0807960809    .031376176   .2839517337  |
    +---------------------------------------------------------+

: sqrt(diagonal(optimize_result_V_robust(S)))'
                1             2             3             4
    +---------------------------------------------------------+
  1 |  .1096020124    .188666044    .092431746   .6045057623  |
    +---------------------------------------------------------+

: end
--------------------------------------------------------------------------------
After defining plleval3(), I copy the data, create the optimize() object S, put the specifications for the extra arguments y and X in S, and put the address of plleval3() into S. Typing
optimize_init_evaluatortype(S, "gf0")
puts in S the information that the evaluator function returns a vector of observation-level contributions and that it computes zero derivatives; that is, the evaluator function is of type "gf0". Given this vector structure, I can type
optimize_result_V_robust(S)
to compute a robust estimator of the VCE.
sqrt(diagonal(optimize_result_V_robust(S)))'
returns the robust standard errors, which are the same as those reported by poisson in example 5.
Example 5: Robust VCE estimation by poisson
. poisson accidents cvalue kids traffic, vce(robust)
Iteration 0:   log pseudolikelihood = -555.86605
Iteration 1:   log pseudolikelihood =  -555.8154
Iteration 2:   log pseudolikelihood = -555.81538

Poisson regression                              Number of obs     =        505
                                                Wald chi2(3)      =      99.76
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -555.81538               Pseudo R2         =     0.2343

------------------------------------------------------------------------------
             |               Robust
   accidents |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      cvalue |  -.6558871   .1096019    -5.98   0.000    -.8707029   -.4410712
        kids |  -1.009017    .188666    -5.35   0.000    -1.378795   -.6392382
     traffic |   .1467115   .0924316     1.59   0.112    -.0344512    .3278741
       _cons |    .574354   .6045047     0.95   0.342    -.6104535    1.759162
------------------------------------------------------------------------------
Done and undone
I showed how to use optimize() to maximize a Poisson log-likelihood function. I also showed how to obtain a robust estimator of the VCE by coding the evaluator function to compute a vector of observation-level contributions. In my next post, I show how to write a Stata command that uses Mata to estimate the parameters of a Poisson regression model.
References
Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press.
Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, Massachusetts: MIT Press.
A quick review of nonlinear optimization
We want to maximize a real-valued function \(Q(\thetab)\), where \(\thetab\) is a \(p\times 1\) vector of parameters. Minimization is done by maximizing \(-Q(\thetab)\). We require that \(Q(\thetab)\) be twice continuously differentiable, so that we can use a second-order Taylor series to approximate \(Q(\thetab)\) in a neighborhood of the point \(\thetab_s\),
\[
Q(\thetab) \approx Q(\thetab_s) + \gb_s'(\thetab -\thetab_s)
+ \frac{1}{2} (\thetab -\thetab_s)'\Hb_s (\thetab -\thetab_s)
\tag{1}
\]
where \(\gb_s\) is the \(p\times 1\) vector of first derivatives of \(Q(\thetab)\) evaluated at \(\thetab_s\) and \(\Hb_s\) is the \(p\times p\) matrix of second derivatives of \(Q(\thetab)\) evaluated at \(\thetab_s\), known as the Hessian matrix.
Nonlinear maximization algorithms start with a vector of initial values and produce a sequence of updated values that converge to the parameter vector that maximizes the objective function. The algorithms I discuss here can only find local maxima. The function in figure 1 has a local maximum at .2 and another at 1.5. The global maximum is at .2.
Each update is produced by finding the \(\thetab\) that maximizes the approximation on the right-hand side of equation (1) and letting it be \(\thetab_{s+1}\). To find the \(\thetab\) that maximizes the approximation, we set to \({\bf 0}\) the derivative of the right-hand side of equation (1) with respect to \(\thetab\),
\[
\gb_s + \Hb_s (\thetab -\thetab_s) = {\bf 0}
\tag{2}
\]
Replacing \(\thetab\) with \(\thetab_{s+1}\) and solving yields the update rule for \(\thetab_{s+1}\).
\[
\thetab_{s+1} = \thetab_s - \Hb_s^{-1} \gb_s
\tag{3}
\]
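As a language-agnostic illustration (mine, not part of the post, which uses Mata), the update rule in equation (3) can be sketched in a few lines of Python for a one-dimensional concave objective:

```python
# Sketch of the Newton-Raphson update for a one-dimensional concave
# objective Q(x) = -(x - 2)^2 + 3, whose maximum is at x = 2.
# In the scalar case, theta_{s+1} = theta_s - H^{-1} g reduces to x - g/h.
# (Illustrative only; the post implements the same rule in Mata.)

def gradient(x):
    return -2.0 * (x - 2.0)    # Q'(x)

def hessian(x):
    return -2.0                # Q''(x) < 0, so Q is concave

def newton_raphson(x0, tol=1e-12, max_iter=50):
    x = x0
    for _ in range(max_iter):
        g = gradient(x)
        if abs(g) < tol:       # first derivative near 0: stop
            break
        x = x - g / hessian(x) # the Newton-Raphson update
    return x

x_star = newton_raphson(10.0)
```

Because this objective is exactly quadratic, a single update lands on the maximum; for general objectives the update is repeated until the gradient is near zero.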
Note that the update is uniquely defined only if the Hessian matrix \(\Hb_s\) is full rank. To ensure that we have a local maximum, we will require that the Hessian be negative definite at the optimum, which also implies that the symmetric Hessian is full rank.
The update rule in equation (3) does not guarantee that \(Q(\thetab_{s+1})>Q(\thetab_s)\). We want to accept only those \(\thetab_{s+1}\) that do produce such an increase, so in practice, we use
\[
\thetab_{s+1} = \thetab_s - \lambda \Hb_s^{-1} \gb_s
\tag{4}
\]
where \(\lambda\) is the step size. In the algorithm presented here, we start with \(\lambda\) equal to \(1\) and, if necessary, decrease \(\lambda\) until we find a value that yields an increase.
The previous sentence is vague. I clarify it by writing an algorithm in Mata. Suppose that real scalar Q( real vector theta ) is a Mata function that returns the value of the objective function at a value of the parameter vector theta. For the moment, suppose that g_s is the vector of derivatives at the current theta, denoted by theta_s, and that Hi_s is the inverse of the Hessian matrix at theta_s. These definitions allow us to define the update function
real vector tupdate( /// real scalar lambda, /// real vector theta_s, /// real vector g_s, /// real matrix Hi_s) { return (theta_s - lambda*Hi_s*g_s) }
For specified values of lambda, theta_s, g_s, and Hi_s, tupdate() returns a candidate value for theta_s1. But we only accept candidate values of theta_s1 that yield an increase, so instead of using tupdate() to get an update, we would use GetUpdate().
real vector GetUpdate( /// real vector theta_s, /// real vector g_s, /// real matrix Hi_s) { lambda = 1 theta_s1 = tupdate(lambda, theta_s, g_s, Hi_s) while ( Q(theta_s1) < Q(theta_s) ) { lambda = lambda/2 theta_s1 = tupdate(lambda, theta_s, g_s, Hi_s) } return(theta_s1) }
GetUpdate() starts by getting a candidate value for theta_s1 when lambda = 1. GetUpdate() returns this candidate theta_s1 if it produces an increase in Q(). Otherwise, GetUpdate() divides lambda by 2 and gets another candidate theta_s1 until it finds a candidate that produces an increase in Q(). GetUpdate() returns the first candidate that produces an increase in Q().
While these functions clarify the ambiguities in the original vague statement, GetUpdate() makes the unwise assumption that there is always a lambda for which the candidate theta_s1 produces an increase in Q(). The version of GetUpdate() in code block 3 does not make this assumption; it exits with an error if lambda becomes too small (less than \(10^{-11}\)).
real vector GetUpdate( /// real vector theta_s, /// real vector g_s, /// real matrix Hi_s) { lambda = 1 theta_s1 = tupdate(lambda, theta_s, g_s, Hi_s) while ( Q(theta_s1) < Q(theta_s) ) { lambda = lambda/2 if (lambda < 1e-11) { printf("{red}Cannot find parameters that produce an increase.\n") exit(error(3360)) } theta_s1 = tupdate(lambda, theta_s, g_s, Hi_s) } return(theta_s1) }
An outline of our algorithm for nonlinear optimization is the following: (1) set theta_s to a vector of starting values; (2) compute the vector of first derivatives and the inverse of the Hessian matrix at theta_s; (3) if the convergence criterion is met, stop and report theta_s; (4) otherwise, update theta_s using GetUpdate() and return to step 2.
Code block 4 contains a Mata version of this algorithm.
theta_s = J(p, 1, .01) GetDerives(theta_s, g_s, Hi_s) gz = g_s'*Hi_s*g_s while (abs(gz) > 1e-13) { theta_s = GetUpdate(theta_s, g_s, Hi_s) GetDerives(theta_s, g_s, Hi_s) gz = g_s'*Hi_s*g_s printf("gz is now %8.7g\n", gz) } printf("Converged value of theta is\n") theta_s
Line 2 puts the vector of starting values, a \(p\times 1\) vector with each element equal to .01, in theta_s. Line 3 uses GetDerives() to put the vector of first derivatives into g_s and the inverse of the Hessian matrix into Hi_s. In GetDerives(), I use cholinv() to calculate Hi_s. cholinv() returns missing values if the matrix is not positive definite. By calculating Hi_s = -1*cholinv(-H_s), I ensure that Hi_s contains missing values when the Hessian is not negative definite and full rank.
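The cholinv() trick can be mimicked in Python with NumPy (my sketch, not Stata's code): a Cholesky factorization of \(-\Hb\) succeeds only when \(\Hb\) is negative definite, so otherwise the inverse comes back filled with missing values.

```python
import numpy as np

# Mimic Hi_s = -1*cholinv(-H_s): attempt a Cholesky factorization of -H,
# which succeeds only if H is negative definite; otherwise return a matrix
# of NaN, the analogue of Mata's missing values. (Illustrative sketch only.)

def neg_def_inverse(H):
    try:
        L = np.linalg.cholesky(-H)                        # -H = L L' requires -H pos. def.
        inv_negH = np.linalg.inv(L.T) @ np.linalg.inv(L)  # (-H)^{-1} = L'^{-1} L^{-1}
        return -inv_negH                                  # -(-H)^{-1} = H^{-1}
    except np.linalg.LinAlgError:
        return np.full(H.shape, np.nan)

H_good = np.array([[-2.0, 0.5], [0.5, -1.0]])  # negative-definite Hessian
H_bad  = np.array([[ 2.0, 0.0], [0.0, -1.0]])  # indefinite: not a maximum
Hi_good = neg_def_inverse(H_good)              # a proper inverse of H_good
Hi_bad  = neg_def_inverse(H_bad)               # all NaN
```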
Line 4 calculates how different the vector of first derivatives is from 0. Instead of using a sum of squares, obtainable by g_s'g_s, I weight the first derivatives by the inverse of the Hessian matrix, which puts the \(p\) first derivatives on a similar scale and ensures that the Hessian matrix is negative definite at convergence. (If the Hessian matrix is not negative definite, GetDerives() will put a matrix of missing values into Hi_s, which causes gz to be missing, and a missing gz exceeds the tolerance.)
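A quick numeric check (mine, not from the post) of why the weighted criterion is preferable: gz = g'H^{-1}g is unchanged when the parameters are rescaled, whereas the raw sum of squares g'g is not.

```python
import numpy as np

# If theta = c*phi, the gradient scales by c and the Hessian by c^2, so
# g' H^{-1} g is invariant to the rescaling while g'g changes by c^2.
# (Illustration of the convergence criterion's scale invariance.)

rng = np.random.default_rng(0)
g = rng.normal(size=3)                      # gradient at the current point
A = rng.normal(size=(3, 3))
H = -(A @ A.T + 3.0 * np.eye(3))            # a negative-definite Hessian

c = 1000.0                                  # measure parameters in new units
g_scaled = c * g
H_scaled = c**2 * H

gz        = g @ np.linalg.solve(H, g)
gz_scaled = g_scaled @ np.linalg.solve(H_scaled, g_scaled)
ss        = g @ g                           # raw sum of squares
ss_scaled = g_scaled @ g_scaled             # inflated by c^2
```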
To flesh out the details, we need a specific problem. Consider maximizing the log-likelihood function of a Poisson model, which has a simple functional form. The contribution of each observation to the log-likelihood is
\[
f_i(\betab) = y_i{\bf x}_i\betab - \exp({\bf x}_i\betab) - \ln( y_i !)
\]
where \(y_i\) is the dependent variable, \({\bf x}_i\) is the vector of covariates, and \(\betab\) is the vector of parameters that we select to maximize the log-likelihood function given by \(F(\betab) =\sum_i f_i(\betab)\). I could drop ln(y_i!), because it does not depend on the parameters. I include it to make the value of the log-likelihood function the same as that reported by Stata. Stata includes these terms so that log-likelihood-function values are comparable across models.
The pll() function in code block 5 computes the Poisson log-likelihood function from the vector of observations on the dependent variable y, the matrix of observations on the covariates X, and the vector of parameter values b.
// Compute Poisson log-likelihood mata: real scalar pll(real vector y, real matrix X, real vector b) { real vector xb xb = X*b return(sum(-exp(xb) + y:*xb - lnfactorial(y))) } end
The vector of first derivatives is
\[
\frac{\partial F(\betab)}{\partial \betab}
= \sum_{i=1}^N (y_i - \exp(\xb_i\betab))\xb_i
\]
which I can compute in Mata as quadcolsum((y-exp(X*b)):*X), and the Hessian matrix is
\[
\label{eq:H}
\sum_{i=1}^N\frac{\partial^2~f(\xb_i,\betab)}{\partial\betab
\partial\betab^\prime}
= - \sum_{i=1}^N\exp(\xb_i\betab)\xb_i^\prime \xb_i
\tag{5}
\]
which I can compute in Mata as -quadcross(X, exp(X*b), X).
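Before trusting the analytic derivatives, it is worth checking them against finite differences. The following Python sketch (mine; the Mata equivalents are quadcolsum((y-exp(X*b)):*X) and -quadcross(X, exp(X*b), X)) does that on simulated data:

```python
import numpy as np
from math import lgamma

# Verify the analytic Poisson gradient against a central finite-difference
# approximation, and check that the Hessian is negative definite.
# (My check on simulated data, not code from the post.)

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([rng.normal(size=(n, p - 1)), np.ones(n)])  # covariates + constant
y = rng.poisson(np.exp(X @ np.array([0.2, -0.1, 0.5])))

def pll(b):
    xb = X @ b
    return np.sum(y * xb - np.exp(xb)) - sum(lgamma(v + 1.0) for v in y)

def gradient(b):
    return (y - np.exp(X @ b)) @ X          # sum_i (y_i - exp(x_i b)) x_i

def hessian(b):
    w = np.exp(X @ b)
    return -(X * w[:, None]).T @ X          # -sum_i exp(x_i b) x_i' x_i

b0 = np.array([0.1, 0.1, 0.1])
h = 1e-6
num_grad = np.array([(pll(b0 + h * e) - pll(b0 - h * e)) / (2 * h)
                     for e in np.eye(p)])
```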
Here is some code that implements this Newton-Raphson (NR) algorithm for the Poisson regression problem.
// Newton-Raphson for Poisson log-likelihood clear all use accident3 mata: real scalar pll(real vector y, real matrix X, real vector b) { real vector xb xb = X*b return(sum(-exp(xb) + y:*xb - lnfactorial(y))) } void GetDerives(real vector y, real matrix X, real vector theta, g, Hi) { real vector exb exb = exp(X*theta) g = (quadcolsum((y - exb):*X))' Hi = quadcross(X, exb, X) Hi = -1*cholinv(Hi) } real vector tupdate( /// real scalar lambda, /// real vector theta_s, /// real vector g_s, /// real matrix Hi_s) { return (theta_s - lambda*Hi_s*g_s) } real vector GetUpdate( /// real vector y, /// real matrix X, /// real vector theta_s, /// real vector g_s, /// real matrix Hi_s) { real scalar lambda real vector theta_s1 lambda = 1 theta_s1 = tupdate(lambda, theta_s, g_s, Hi_s) while ( pll(y, X, theta_s1) <= pll(y, X, theta_s) ) { lambda = lambda/2 if (lambda < 1e-11) { printf("{red}Cannot find parameters that produce an increase.\n") exit(error(3360)) } theta_s1 = tupdate(lambda, theta_s, g_s, Hi_s) } return(theta_s1) } y = st_data(., "accidents") X = st_data(., "cvalue kids traffic") X = X,J(rows(X), 1, 1) b = J(cols(X), 1, .01) GetDerives(y, X, b, g=., Hi=.) gz = . while (abs(gz) > 1e-11) { bs1 = GetUpdate(y, X, b, g, Hi) b = bs1 GetDerives(y, X, b, g, Hi) gz = g'*Hi*g printf("gz is now %8.7g\n", gz) } printf("Converged value of beta is\n") b end
Line 3 reads in the downloadable accident3.dta dataset before dropping down to Mata. I use variables from this dataset on lines 56 and 57.
Lines 6–11 define pll(), which returns the value of the Poisson log-likelihood function, given the vector of observations on the dependent variable y, the matrix of covariate observations X, and the current parameters b.
Lines 13–21 put the vector of first derivatives in g and the inverse of the Hessian matrix in Hi. Equation 5 specifies a matrix that is negative definite, as long as the covariates are not linearly dependent. As discussed above, cholinv() returns a matrix of missing values if the matrix is not positive definite. I multiply the right-hand side on line 20 by \(-1\) instead of on line 19.
Lines 23–30 implement the tupdate() function previously discussed.
Lines 32–53 implement the GetUpdate() function previously discussed, with the caveats that this version handles the data and uses pll() to compute the value of the objective function.
Lines 56–58 get the data from Stata and row-join a vector to X for the constant term.
Lines 60–71 implement the NR algorithm discussed above for this Poisson regression problem.
Running pnr1.do produces
Example 1: NR algorithm for Poisson
. do pnr1 . // Newton-Raphson for Poisson log-likelihood . clear all . use accident3 . . mata: [Output Omitted] : b = J(cols(X), 1, .01) : GetDerives(y, X, b, g=., Hi=.) : gz = . : while (abs(gz) > 1e-11) { > bs1 = GetUpdate(y, X, b, g, Hi) > b = bs1 > GetDerives(y, X, b, g, Hi) > gz = g'*Hi*g > printf("gz is now %8.7g\n", gz) > } gz is now -119.201 gz is now -26.6231 gz is now -2.02142 gz is now -.016214 gz is now -1.3e-06 gz is now -8.3e-15 : printf("Converged value of beta is\n") Converged value of beta is : b 1 +----------------+ 1 | -.6558870685 | 2 | -1.009016966 | 3 | .1467114652 | 4 | .5743541223 | +----------------+ : : end -------------------------------------------------------------------------------- . end of do-file
The point estimates in example 1 are equivalent to those produced by poisson.
Example 2: poisson results
. poisson accidents cvalue kids traffic Iteration 0: log likelihood = -555.86605 Iteration 1: log likelihood = -555.8154 Iteration 2: log likelihood = -555.81538 Poisson regression Number of obs = 505 LR chi2(3) = 340.20 Prob > chi2 = 0.0000 Log likelihood = -555.81538 Pseudo R2 = 0.2343 ------------------------------------------------------------------------------ accidents | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- cvalue | -.6558871 .0706484 -9.28 0.000 -.7943553 -.5174188 kids | -1.009017 .0807961 -12.49 0.000 -1.167374 -.8506594 traffic | .1467115 .0313762 4.68 0.000 .0852153 .2082076 _cons | .574354 .2839515 2.02 0.043 .0178193 1.130889 ------------------------------------------------------------------------------
Done and undone
I implemented a simple nonlinear optimizer to practice Mata programming and to review the theory behind nonlinear optimization. In future posts, I implement a command for Poisson regression that uses the optimizer in optimize().
]]>myregress12.ado performs ordinary least-squares (OLS) regression, and it extends myregress11.ado, which I discussed in Programming an estimation command in Stata: An OLS command using Mata. To get the most out of this post, you should be familiar with Programming an estimation command in Stata: Using a subroutine to parse a complex option and Programming an estimation command in Stata: Computing OLS objects in Mata.
Parsing the vce() option
I used ado-subroutines to simplify the parsing of the options vce(robust) and vce(cluster cvarname) in myregress10.ado; see Programming an estimation command in Stata: Using a subroutine to parse a complex option. Part of the point was to illustrate how to write ado-subroutines and the programming tricks that I used in these subroutines.
Here I use the undocumented command _vce_parse to simplify the parsing. There are many undocumented commands designed to help Stata programmers. They are undocumented in that they are tersely documented in the system help but not documented in the manuals. In addition, the syntax or behavior of these commands may change over Stata releases, although this rarely happens.
_vce_parse helps Stata programmers parse the vce() option. To see how it works, consider the problem of parsing the syntax of myregress12.
myregress12 depvar [indepvars] [if] [in] [, vce(robust | cluster clustervar) noconstant]
where indepvars can contain factor variables or time-series variables.
I can use the syntax command to put whatever the user specifies in the option vce() into the local macro vce, but I still must (1) check that what was specified makes sense and (2) create local macros that the code can use to do the right thing. Examples 1–7 create the local macro vce, simulating what syntax would do, and then use _vce_parse to perform tasks (1) and (2).
I begin with the case in which the user specified vce(robust); here the local macro vce would contain the word robust.
Example 1: parsing vce(robust)
. clear all . sysuse auto (1978 Automobile Data) . local vce "robust" . _vce_parse , optlist(Robust) argoptlist(CLuster) : , vce(`vce') . return list macros: r(robust) : "robust" r(vceopt) : "vce(robust)" r(vce) : "robust"
The command
_vce_parse , optlist(Robust) argoptlist(CLuster) : , vce(`vce')
has two pieces. The piece before the colon (:) specifies the rules; the piece after the colon specifies what the user typed. Each piece can have a Stata object followed by some options; note the commas before optlist(Robust) and before vce(`vce'). In the case at hand, the second piece only contains what the user specified – vce(robust) – and the first piece only contains the options optlist(Robust) and argoptlist(CLuster). The option optlist(Robust) specifies that the vce() option in the second piece may contain the option robust and that its minimal abbreviation is r. Note how the word Robust in optlist(Robust) mimics how syntax specifies minimum abbreviations. The option argoptlist(CLuster) specifies that the vce() option in the second piece may contain cluster clustervar, that the minimum abbreviation of cluster is cl, and that it will put the argument clustervar into a local macro.
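The capitalization convention is easy to emulate. Here is a hypothetical Python helper (mine, not Stata's implementation) that treats the capitalized prefix of a specification such as Robust or CLuster as the minimum abbreviation:

```python
# Hypothetical sketch of minimum-abbreviation matching: the capitalized
# prefix of the specification ("R" in "Robust", "CL" in "CLuster") is the
# shortest abbreviation accepted, as with Stata's syntax command. This is
# an illustration only, not Stata's actual parsing code.

def matches(user_word, spec):
    min_len = sum(1 for ch in spec if ch.isupper())  # length of capitalized prefix
    word = user_word.lower()
    return len(word) >= min_len and spec.lower().startswith(word)

# "r", "rob", and "robust" all abbreviate Robust;
# "c" is too short for CLuster, but "cl" works.
```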
After the command,
_vce_parse , optlist(Robust) argoptlist(CLuster) : , vce(`vce')
I use return list to show what _vce_parse stored in r(). Because local macro vce contains “robust”, _vce_parse stores robust in r(robust) and in r(vce) and stores vce(robust) in r(vceopt).
Examples 2 and 3 illustrate that _vce_parse stores the same values in these local macros when the user specifies vce(rob) or vce(r), which are valid abbreviations for vce(robust).
Example 2: parsing vce(rob)
. local vce "rob" . _vce_parse , optlist(Robust) argoptlist(CLuster) : , vce(`vce') . return list macros: r(robust) : "robust" r(vceopt) : "vce(robust)" r(vce) : "robust"
Example 3: parsing vce(r)
. local vce "r" . _vce_parse , optlist(Robust) argoptlist(CLuster) : , vce(`vce') . return list macros: r(robust) : "robust" r(vceopt) : "vce(robust)" r(vce) : "robust"
Now, consider parsing the option vce(cluster clustervar). Because the cluster variable clustervar may contain missing values, _vce_parse may need to update a sample-identification variable before it stores the name of the cluster variable in a local macro. In example 4, I use the command
_vce_parse mytouse, optlist(Robust) argoptlist(CLuster) : , vce(`vce')
to handle the case when the user specifies vce(cluster rep78). The results from the tabulate and summarize commands illustrate that _vce_parse updates the sample-identification variable mytouse to account for the missing observations in rep78.
Example 4: parsing vce(cluster rep78)
. generate byte mytouse = 1 . tabulate mytouse mytouse | Freq. Percent Cum. ------------+----------------------------------- 1 | 74 100.00 100.00 ------------+----------------------------------- Total | 74 100.00 . summarize rep78 Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- rep78 | 69 3.405797 .9899323 1 5 . local vce "cluster rep78" . _vce_parse mytouse, optlist(Robust) argoptlist(CLuster) : , vce(`vce') . return list macros: r(robust) : "robust" r(cluster) : "rep78" r(vceopt) : "vce(cluster rep78)" r(vceargs) : "rep78" r(vce) : "cluster" . tabulate mytouse mytouse | Freq. Percent Cum. ------------+----------------------------------- 0 | 5 6.76 6.76 1 | 69 93.24 100.00 ------------+----------------------------------- Total | 74 100.00
I use return list to show what _vce_parse stored in r(). Because local macro vce contains cluster rep78, _vce_parse stores cluster in r(vce), stores rep78 in r(cluster) and in r(vceargs), stores vce(cluster rep78) in r(vceopt), and stores robust in r(robust).
Examples 5 and 6 illustrate that _vce_parse stores the same values in these local macros when the user specifies vce(clus rep78) or vce(cl rep78), which are valid abbreviations for vce(cluster rep78).
Example 5: parsing vce(clus rep78)
. local vce "clus rep78" . _vce_parse mytouse, optlist(Robust) argoptlist(CLuster) : , vce(`vce') . return list macros: r(robust) : "robust" r(cluster) : "rep78" r(vceopt) : "vce(cluster rep78)" r(vceargs) : "rep78" r(vce) : "cluster"
Example 6: parsing vce(cl rep78)
. local vce "cl rep78" . _vce_parse mytouse, optlist(Robust) argoptlist(CLuster) : , vce(`vce') . return list macros: r(robust) : "robust" r(cluster) : "rep78" r(vceopt) : "vce(cluster rep78)" r(vceargs) : "rep78" r(vce) : "cluster"
Having illustrated how to make _vce_parse handle the cases when the user specifies something valid, I show in example 7 that it also produces a standard error message when the user specifies something invalid.
Example 7: parsing vce(silly)
. local vce "silly" . capture noisily _vce_parse mytouse, optlist(Robust) argoptlist(CLuster) : , vc > e(`vce') vcetype 'silly' not allowed . return list
_vce_parse can parse other types of vce() options; to see them type help _vce_parse.
Also, remember to type undocumented when you are looking for a programmer’s tool.
The code for myregress12
Here is the code for myregress12.ado, which uses _vce_parse. I describe how it works below.
I recommend that you click on the file name to download the code for my myregress12.ado. To avoid scrolling, view the code in the do-file editor, or your favorite text editor, to see the line numbers.
*! version 12.0.0 16Jan2016 program define myregress12, eclass sortpreserve version 14.1 syntax varlist(numeric ts fv) [if] [in] [, noCONStant vce(string) ] marksample touse _vce_parse `touse' , optlist(Robust) argoptlist(CLuster) : , vce(`vce') local vce "`r(vce)'" local clustervar "`r(cluster)'" if "`vce'" == "robust" | "`vce'" == "cluster" { local vcetype "Robust" } if "`clustervar'" != "" { capture confirm numeric variable `clustervar' if _rc { display in red "invalid vce() option" display in red "cluster variable {bf:`clustervar'} is " /// "string variable instead of a numeric variable" exit(198) } sort `clustervar' } gettoken depvar indepvars : varlist _fv_check_depvar `depvar' fvexpand `indepvars' local cnames `r(varlist)' tempname b V N rank df_r mata: mywork("`depvar'", "`cnames'", "`touse'", "`constant'", /// "`vce'", "`clustervar'", /// "`b'", "`V'", "`N'", "`rank'", "`df_r'") if "`constant'" == "" { local cnames `cnames' _cons } matrix colnames `b' = `cnames' matrix colnames `V' = `cnames' matrix rownames `V' = `cnames' ereturn post `b' `V', esample(`touse') buildfvinfo ereturn scalar N = `N' ereturn scalar rank = `rank' ereturn scalar df_r = `df_r' ereturn local vce "`vce'" ereturn local vcetype "`vcetype'" ereturn local clustvar "`clustervar'" ereturn local cmd "myregress12" ereturn display end mata: void mywork( string scalar depvar, string scalar indepvars, string scalar touse, string scalar constant, string scalar vcetype, string scalar clustervar, string scalar bname, string scalar Vname, string scalar nname, string scalar rname, string scalar dfrname) { real vector y, b, e, e2, cvar, ei real matrix X, XpXi, M, info, xi real scalar n, p, k, nc, i, dfr y = st_data(., depvar, touse) X = st_data(., indepvars, touse) n = rows(X) if (constant == "") { X = X,J(n,1,1) } XpXi = quadcross(X, X) XpXi = invsym(XpXi) b = XpXi*quadcross(X, y) e = y - X*b e2 = e:^2 p = cols(X) k = p - diag0cnt(XpXi) if (vcetype == "robust") { M = quadcross(X, e2, X) dfr = n - k 
V = (n/dfr)*XpXi*M*XpXi } else if (vcetype == "cluster") { cvar = st_data(., clustervar, touse) info = panelsetup(cvar, 1) nc = rows(info) M = J(k, k, 0) dfr = nc - 1 for(i=1; i<=nc; i++) { xi = panelsubmatrix(X,i,info) ei = panelsubmatrix(e,i,info) M = M + xi'*(ei*ei')*xi } V = ((n-1)/(n-k))*(nc/(nc-1))*XpXi*M*XpXi } else { // vcetype must be IID dfr = n - k V = (quadsum(e2)/dfr)*XpXi } st_matrix(bname, b') st_matrix(Vname, V) st_numscalar(nname, n) st_numscalar(rname, k) st_numscalar(dfrname, dfr) } end
Let’s break this 118-line program into familiar pieces. Lines 2-56 define the ado-command, and lines 58-118 define the Mata work function that is used by the ado-command. Despite the addition of details to handle the parsing and computation of a robust or cluster-robust VCE, the structures of the ado-command and of the Mata work function are the same as they were in myregress11.ado; see Programming an estimation command in Stata: An OLS command using Mata.
The ado-command has four parts: it parses what the user specifies, prepares the inputs for the computations, computes the results using the Mata work function, and posts and displays the results.
The Mata function mywork() also has four parts: it declares its arguments and the objects used inside the function, copies the data from Stata, computes the results, and returns the results to Stata.
Now, I address the details of the ado-code, although I do not discuss the details of myregress12.ado that I already covered when describing myregress11.ado in Programming an estimation command in Stata: An OLS command using Mata. Line 5 allows the user to specify the vce() option, and line 8 uses _vce_parse to parse what the user specifies. Lines 9 and 10 put the type of VCE found by _vce_parse in the local macro vce and the name of the cluster variable, if specified, in the local macro clustervar. Lines 11–13 put Robust in the local macro vcetype if the specified vce is either robust or cluster. If there is a cluster variable, lines 14–23 check that it is numeric and use it to sort the data.
Line 34 passes the new arguments for the type of VCE and the name of the cluster variable to the Mata work function mywork().
Lines 49–51 store the type of VCE, the output label for the VCE type, and the name of the cluster variable in e(), respectively.
Now, I address the details of the Mata work function mywork(), discussing only what I have added relative to the mywork() in myregress11.ado. Line 62 declares the new arguments. The string scalar vcetype is empty, or it contains “robust”, or it contains “cluster”. The string scalar clustervar is either empty or contains the name of the cluster variable.
Lines 68–70 declare the local-to-the-function vectors cvar and ei and the local-to-the-function matrices M, info, and xi that are needed now but not previously.
Lines 87, 91–92, 104–105, and 108 specify if-else blocks to compute the correct VCE. Lines 88–90 compute a robust estimator of the VCE if vcetype contains “robust”. Lines 93–103 compute a cluster-robust estimator of the VCE if vcetype contains “cluster”. Lines 106–107 compute an IID-based estimator of the VCE if vcetype contains neither “robust” nor “cluster”.
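The cluster-robust computation can be summarized numerically. This Python sketch (my translation of the logic above, with variable names of my own choosing) accumulates the cluster score outer products and applies the same small-sample factor:

```python
import numpy as np

# Sketch of the cluster-robust VCE: M accumulates xg'(eg eg')xg over
# clusters g, and V applies the small-sample factor used in mywork().
# (My Python translation of the Mata logic; not the ado-file's code.)

def cluster_robust_vce(X, e, cluster):
    n, k = X.shape
    XpXi = np.linalg.inv(X.T @ X)
    groups = np.unique(cluster)
    nc = len(groups)
    M = np.zeros((k, k))
    for g in groups:
        idx = cluster == g
        sg = X[idx].T @ e[idx]          # k-vector: xg' eg
        M += np.outer(sg, sg)           # equals xg' (eg eg') xg
    factor = ((n - 1) / (n - k)) * (nc / (nc - 1))
    return factor * XpXi @ M @ XpXi

# Tiny example: OLS residuals with 5 clusters of 4 observations each.
rng = np.random.default_rng(2)
X = np.column_stack([rng.normal(size=20), np.ones(20)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=20)
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
cluster = np.repeat(np.arange(5), 4)
V = cluster_robust_vce(X, e, cluster)
```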
Done and undone
I introduced the undocumented command _vce_parse and discussed the code for myregress12.ado, which uses Mata to compute OLS point estimates and an IID-based VCE, a robust VCE, or a cluster-robust VCE.
The structure of the code is the same as the one that I used in myregress11.ado and in mymean8.ado, which I discussed in Programming an estimation command in Stata: An OLS command using Mata and in Programming an estimation command in Stata: A first ado-command using Mata. That the structure remains the same makes it easier to handle the details that arise in more complicated problems.
]]>Table of Contents
Overview
1PL model
2PL model
3PL model
4PL model
5PL model
Conclusion
Overview
Item response theory (IRT) is used for modeling the relationship between the latent abilities of a group of subjects and the examination items used for measuring their abilities. Stata 14 introduced a suite of commands for fitting IRT models using maximum likelihood; see, for example, the blog post Spotlight on irt by Rafal Raciborski and the [IRT] Item Response Theory manual for more details. In this post, we demonstrate how to fit Bayesian binary IRT models by using the redefine() option introduced for the bayesmh command in Stata 14.1. If you are not familiar with the concepts and jargon of Bayesian statistics, you may want to watch the introductory videos on the Stata YouTube channel before proceeding.
Introduction to Bayesian analysis, part 1 : The basic concepts
Introduction to Bayesian analysis, part 2: MCMC and the Metropolis-Hastings algorithm
We use the abridged version of the mathematics and science data from De Boeck and Wilson (2004), masc1. The dataset includes 800 student responses to 9 test questions intended to measure mathematical ability.
The irt suite fits IRT models using data in the wide form – one observation per subject with items recorded in separate variables. To fit IRT models using bayesmh, we need data in the long form, where items are recorded as multiple observations per subject. We thus reshape the dataset in a long form: we have a single binary response variable, y, and two index variables, item and id, which identify the items and subjects, respectively. This allows us to formulate our IRT models as multilevel models. The following commands load and prepare the dataset.
. webuse masc1 (Data from De Boeck & Wilson (2004)) . generate id = _n . quietly reshape long q, i(id) j(item) . rename q y
To ensure that we include all levels of item and id in our models, we use fvset base none to keep the base categories.
. fvset base none id item
In what follows, we present eight Bayesian binary IRT models increasing in complexity and explanatory power. We perform Bayesian model comparison to gain insight into what would be the more appropriate model for the data at hand.
For high-dimensional models such as IRT models, you may see differences in the estimation results between different platforms or different flavors of Stata because of the nature of the Markov chain Monte Carlo (MCMC) sampling and finite numerical precision. These differences are not a source of concern; they will be within the range of the MCMC variability and will lead to similar inferential conclusions. The differences will diminish as the MCMC sample size increases. The results in this post are obtained from Stata/SE on the 64-bit Linux platform using the default 10,000 MCMC sample size.
Let the items be indexed by \(i=1,\dots,9\) and the subjects by \(j=1,\dots,800\). Let \(\theta_j\) be the latent mathematical ability of subject \(j\), and let \(Y_{ij}\) be the response of subject \(j\) to item \(i\).
1PL model
In the one-parameter logistic (1PL) model, the probability of getting a correct response is modeled as an inverse-logit function of location parameters \(b_i\), also called item difficulties, and a common slope parameter \(a\), also called item discrimination:
\[
P(Y_{ij}=1) = {\rm InvLogit}\{a(\theta_j-b_i)\} =
\frac{\exp\{a(\theta_j-b_i)\}}{1+\exp\{a(\theta_j-b_i)\}}
\]
Typically, the abilities are assumed to be normally distributed:
\[
\theta_j \sim {\rm N}(0,1)
\]
In a multilevel framework, the \(\theta_j\)’s represent random effects. In a Bayesian framework, we use the term “random effects” to refer to the parameters corresponding to levels of grouping variables identifying the hierarchy of the data.
A Bayesian formulation of the 1PL model also requires prior specification for the model parameters \(a\) and \(b_i\). The discrimination parameter \(a\) is assumed to be positive and is often modeled in the log scale. Because we have no prior knowledge about the discrimination and difficulty parameters, we assume that the prior distributions of \(\ln(a)\) and \(b_i\) have support on the whole real line, are symmetric, and are centered at 0. A normal prior distribution is thus a natural choice. We furthermore assume that \(\ln(a)\) and \(b_i\) are close to 0 and have prior variance of 1, which is an entirely subjective decision. We thus assign \(\ln(a)\) and \(b_i\) standard normal prior distributions:
\[\ln(a) \sim {\rm N}(0, 1)\] \[b_i \sim {\rm N}(0, 1) \]
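To make the data-generating process concrete, here is a short Python simulation (mine, an illustration of the model rather than the bayesmh fit) drawing from the 1PL model under exactly these priors:

```python
import numpy as np

# Simulate 1PL responses: ln(a) ~ N(0,1), b_i ~ N(0,1), theta_j ~ N(0,1),
# and P(Y_ij = 1) = InvLogit{a (theta_j - b_i)}.
# (My illustration, matching the dimensions in the post: 800 subjects, 9 items.)

rng = np.random.default_rng(14)
n_subj, n_item = 800, 9

a = np.exp(rng.normal(0.0, 1.0))             # discrimination a > 0
b = rng.normal(0.0, 1.0, size=n_item)        # item difficulties b_i
theta = rng.normal(0.0, 1.0, size=n_subj)    # latent abilities theta_j

def inv_logit(z):
    return 1.0 / (1.0 + np.exp(-z))

prob = inv_logit(a * (theta[:, None] - b[None, :]))   # shape (n_subj, n_item)
y = rng.binomial(1, prob)                             # binary responses
```

Because \(a>0\), higher ability raises the success probability on every item, and a harder item (larger \(b_i\)) lowers it for every subject.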
To specify the likelihood function of the 1PL model in bayesmh, we use a nonlinear equation specification for the response variable y. The direct nonlinear specification for this model is
bayesmh y = ({discrim}*({subj:i.id}-{diff:i.item})), likelihood(logit) ...
where {discrim} is the discrimination parameter \(a\), {subj:i.id} are latent abilities \(\theta_j\), and {diff:i.item} are item difficulties \(b_i\). The logit model is used for the probability of a success, \(P(Y_{ij}=1)\). Specification {subj:i.id} in the above nonlinear expression is viewed as a substitutable expression for linear combinations of indicators associated with the id variable and parameters \(\theta_j\). This specification may be computationally prohibitive with a large number of subjects. A more efficient solution is to use the redefine() option to include subject random effects \(\theta_j\) in the model. The same argument may apply to the {diff:i.item} specification when there are many items. Thus, it may be computationally convenient to treat the \(b_i\) parameters as “random effects” in the specification and use the redefine() option to include them in the model.
A more efficient specification is thus
bayesmh y = ({discrim}*({subj:}-{diff:})), likelihood(logit) /// redefine(subj:i.id) redefine(diff:i.item) ... ///
where {subj:} and {diff:} in the nonlinear specification now represent the \(\theta_j\) and \(b_i\) parameters, respectively, without using expansions into linear combinations of indicator variables.
Below, we show the full bayesmh specification of the 1PL model and the output summary. In our examples, we treat the abilities {subj:i.id} as nuisance parameters and exclude them from the final results. The discrimination model parameter {discrim} must be positive and is thus initialized with 1. A longer burn-in period, burnin(5000), allows for longer adaptation of the MCMC sampler, which is needed given the large number of parameters in the model. Finally, the estimation results are stored for later model comparison.
. set seed 14 . bayesmh y = ({discrim}*({subj:}-{diff:})), likelihood(logit) /// > redefine(diff:i.item) redefine(subj:i.id) /// > prior({subj:i.id}, normal(0, 1)) /// > prior({discrim}, lognormal(0, 1)) /// > prior({diff:i.item}, normal(0, 1)) /// > init({discrim} 1) exclude({subj:i.id}) /// > burnin(5000) saving(sim1pl, replace) Burn-in ... Simulation ... Model summary ------------------------------------------------------------------------------ Likelihood: y ~ logit({discrim}*(xb_subj-xb_diff)) Priors: {diff:i.item} ~ normal(0,1) (1) {subj:i.id} ~ normal(0,1) (2) {discrim} ~ lognormal(0,1) ------------------------------------------------------------------------------ (1) Parameters are elements of the linear form xb_diff. (2) Parameters are elements of the linear form xb_subj. Bayesian logistic regression MCMC iterations = 15,000 Random-walk Metropolis-Hastings sampling Burn-in = 5,000 MCMC sample size = 10,000 Number of obs = 7,200 Acceptance rate = .3074 Efficiency: min = .02691 avg = .06168 Log marginal likelihood = . max = .09527 ------------------------------------------------------------------------------ | Equal-tailed | Mean Std. Dev. MCSE Median [95% Cred. Interval] -------------+---------------------------------------------------------------- diff | item | 1 | -.6934123 .0998543 .003576 -.6934789 -.8909473 -.4917364 2 | -.1234553 .0917187 .002972 -.1241642 -.3030341 .0597863 3 | -1.782762 .1323252 .00566 -1.781142 -2.05219 -1.534451 4 | .3152835 .0951978 .003289 .3154714 .1279147 .4981263 5 | 1.622545 .127213 .005561 1.619388 1.377123 1.883083 6 | .6815517 .0978777 .003712 .6788345 .4911366 .881128 7 | 1.303482 .1173994 .005021 1.302328 1.084295 1.544913 8 | -2.353975 .1620307 .008062 -2.351207 -2.672983 -2.053112 9 | -1.168668 .1120243 .004526 -1.163922 -1.392936 -.9549209 -------------+---------------------------------------------------------------- discrim | .8644787 .0439804 .002681 .8644331 .7818035 .9494433 ------------------------------------------------------------------------------ file sim1pl.dta saved . estimates store est1pl
The sampling efficiency is acceptable, about 6% on average, with no indication of convergence problems. Although a detailed convergence inspection of all parameters is outside the scope of this post, we recommend performing one using, for example, the bayesgraph diagnostics command.
Though we used informative priors for the model parameters, the estimation results from our Bayesian model are not that different from the maximum likelihood estimates obtained using the irt 1pl command (see example 1 in [IRT] irt 1pl). For example, the posterior mean estimate for {discrim} is 0.86 with an MCMC standard error of 0.003, whereas irt 1pl reports 0.85 with a standard error of 0.05.
The log-marginal likelihood is reported as missing because we excluded the {subj:i.id} parameters from the simulation results, and the Laplace-Metropolis estimator of the log-marginal likelihood is not available in that case. This estimator requires simulation results for all model parameters to compute the log-marginal likelihood.
2PL model
The two-parameter logistic (2PL) model extends the 1PL model by allowing for item-specific discrimination. The probability of correct response is now modeled as a function of item-specific slope parameters \(a_i\):
\[
P(Y_{ij}=1) = {\rm InvLogit}\{a_i(\theta_j-b_i)\} =
\frac{\exp\{a_i(\theta_j-b_i)\}}{1+\exp\{a_i(\theta_j-b_i)\}}
\]
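As a quick numerical check of the item response function above, here is a minimal Python sketch with hypothetical parameter values (not part of the Stata session):

```python
import math

def invlogit(x):
    """Inverse-logit (logistic) function."""
    return math.exp(x) / (1 + math.exp(x))

def p_2pl(theta, a, b):
    """2PL probability of a correct response to an item."""
    return invlogit(a * (theta - b))

# When theta == b, the probability of a correct response is 1/2
assert p_2pl(0.5, 1, 0.5) == 0.5
# A larger discrimination a makes the curve steeper around b
assert p_2pl(1, 2, 0) > p_2pl(1, 1, 0)
```

With \(a_i = 1\) for all items, the expression reduces to the 1PL model of the previous section.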
The prior specification for \(\theta_j\) stays the same as in the 1PL model. We will, however, apply more elaborate prior specifications for the \(a_i\)’s and \(b_i\)’s. It is good practice to use proper priors that do not overwhelm the evidence from the data. The impact of the priors can be controlled by introducing additional hyperparameters. For example, Kim and Bolt (2007) proposed using a normal prior with unknown mean and variance for the difficulty parameters. Extending this approach to the discrimination parameters as well, we apply a hierarchical Bayesian model in which the \(\ln(a_i)\) and \(b_i\) parameters have the following prior specifications:
\[ \ln(a_i) \sim {\rm N}(\mu_a, \sigma_a^2) \] \[ b_i \sim {\rm N}(\mu_b, \sigma_b^2) \]
The mean hyperparameters, \(\mu_a\) and \(\mu_b\), and variance hyperparameters, \(\sigma_a^2\) and \(\sigma_b^2\), require informative prior specifications. We assume that the means are centered at 0 with a variance of 0.1:
\[
\mu_a, \mu_b \sim {\rm N}(0, 0.1)
\]
To lower the variability of the \(\ln(a_i)\) and \(b_i\) parameters, we apply an inverse-gamma prior with shape 10 and scale 1 for the variance parameters:
\[
\sigma_a^2, \sigma_b^2 \sim {\rm InvGamma}(10, 1)
\]
Thus, the prior mean of \(\sigma_a^2\) and \(\sigma_b^2\) is about 0.1.
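This follows from the mean of an inverse-gamma distribution, scale/(shape − 1) for shape > 1; a quick check in Python (not part of the Stata session):

```python
def invgamma_mean(shape, scale):
    """Mean of an inverse-gamma distribution; defined only for shape > 1."""
    if shape <= 1:
        raise ValueError("mean undefined for shape <= 1")
    return scale / (shape - 1)

# InvGamma(10, 1) has mean 1/9, i.e. about 0.1
assert abs(invgamma_mean(10, 1) - 1 / 9) < 1e-12
```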
In the bayesmh specification, the hyperparameters \(\mu_a\), \(\mu_b\), \(\sigma_a^2\), and \(\sigma_b^2\) are denoted as {mu_a}, {mu_b}, {var_a}, and {var_b}, respectively. We use the redefine(discrim:i.item) option to include in the model the discrimination parameters \(a_i\), referred to as {discrim:} in the likelihood specification.
Regarding the MCMC simulation, we change some of the default options. The hyperparameters {mu_a}, {mu_b}, {var_a}, and {var_b} are placed in separate blocks to improve the simulation efficiency. The discrimination parameters {discrim:i.item} must be positive and are thus initialized with 1s.
. set seed 14 . bayesmh y = ({discrim:}*({subj:}-{diff:})), likelihood(logit) /// > redefine(discrim:i.item) redefine(diff:i.item) /// > redefine(subj:i.id) /// > prior({subj:i.id}, normal(0, 1)) /// > prior({discrim:i.item}, lognormal({mu_a}, {var_a})) /// > prior({diff:i.item}, normal({mu_b}, {var_b})) /// > prior({mu_a} {mu_b}, normal(0, 0.1)) /// > prior({var_a} {var_b}, igamma(10, 1)) /// > block({mu_a mu_b var_a var_b}, split) /// > init({discrim:i.item} 1) /// > exclude({subj:i.id}) burnin(5000) saving(sim2pl, replace) Burn-in ... Simulation ... Model summary ------------------------------------------------------------------------------ Likelihood: y ~ logit(xb_discrim*(xb_subj-xb_diff)) Priors: {discrim:i.item} ~ lognormal({mu_a},{var_a}) (1) {diff:i.item} ~ normal({mu_b},{var_b}) (2) {subj:i.id} ~ normal(0,1) (3) Hyperpriors: {mu_a mu_b} ~ normal(0,0.1) {var_a var_b} ~ igamma(10,1) ------------------------------------------------------------------------------ (1) Parameters are elements of the linear form xb_discrim. (2) Parameters are elements of the linear form xb_diff. (3) Parameters are elements of the linear form xb_subj. Bayesian logistic regression MCMC iterations = 15,000 Random-walk Metropolis-Hastings sampling Burn-in = 5,000 MCMC sample size = 10,000 Number of obs = 7,200 Acceptance rate = .3711 Efficiency: min = .01617 avg = .04923 Log marginal likelihood = . max = .1698 ------------------------------------------------------------------------------ | Equal-tailed | Mean Std. Dev. MCSE Median [95% Cred. 
Interval] -------------+---------------------------------------------------------------- discrim | item | 1 | 1.430976 .1986011 .010953 1.413063 1.089405 1.850241 2 | .6954823 .1081209 .004677 .6897267 .4985004 .9276975 3 | .9838528 .1343908 .009079 .9780275 .7506566 1.259427 4 | .8167792 .1169157 .005601 .8136229 .5992495 1.067578 5 | .9402715 .1351977 .010584 .9370298 .6691103 1.214885 6 | .9666747 .1420065 .008099 .9616285 .7038868 1.245007 7 | .5651287 .0864522 .006201 .5617302 .3956216 .7431265 8 | 1.354053 .2048404 .015547 1.344227 .9791096 1.761437 9 | .7065096 .1060773 .006573 .6999745 .5102749 .9271799 -------------+---------------------------------------------------------------- diff | item | 1 | -.5070314 .0784172 .003565 -.507922 -.671257 -.3596057 2 | -.1467198 .117422 .003143 -.1456633 -.3895978 .0716841 3 | -1.630259 .1900103 .013494 -1.612534 -2.033169 -1.304171 4 | .3273735 .1073891 .003565 .3231703 .1248782 .5492114 5 | 1.529584 .1969554 .01549 1.507982 1.202271 1.993196 6 | .6325194 .115724 .005613 .6243691 .4272131 .8851649 7 | 1.827013 .2884057 .019582 1.79828 1.349654 2.490633 8 | -1.753744 .1939559 .014743 -1.738199 -2.211475 -1.438146 9 | -1.384486 .2059005 .012105 -1.361195 -1.838918 -1.059687 -------------+---------------------------------------------------------------- mu_a | -.1032615 .1148176 .003874 -.102376 -.3347816 .1277031 var_a | .1129835 .0356735 .001269 .1056105 .063403 .1981331 mu_b | -.0696525 .2039387 .004949 -.072602 -.4641566 .3298393 var_b | .6216005 .2023137 .008293 .5843444 .3388551 1.101153 ------------------------------------------------------------------------------ file sim2pl.dta saved . estimates store est2pl
The average simulation efficiency is about 5%, but some parameters converge more slowly than others; for example, {diff:7.item} has the largest MCMC standard error (0.02) among the difficulty parameters. In a rigorous study, we would recommend longer simulations, with MCMC sample sizes of at least 50,000, to lower the MCMC standard errors.
We can compare the 1PL and 2PL models by using the deviance information criterion (DIC) available with the bayesstats ic command.
. bayesstats ic est1pl est2pl, diconly Deviance information criterion ------------------------ | DIC -------------+---------- est1pl | 8122.428 est2pl | 8055.005 ------------------------
DIC is often used in Bayesian model selection as an alternative to AIC and BIC criteria and can be easily obtained from an MCMC sample. Larger MCMC samples produce more reliable DIC estimates. Because different MCMC samples produce different sample DIC values and the sample approximation error in calculating DIC is not known, one should not rely solely on DIC when choosing a model.
Lower DIC values indicate better fit. The DIC of the 2PL model (8,055) is markedly lower than the DIC of the 1PL model (8,122), implying better fit of the 2PL model.
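DIC combines the posterior mean deviance \(\bar D\) with an effective-parameter penalty \(p_D = \bar D - D(\bar\theta)\), so that \({\rm DIC} = \bar D + p_D\). A toy Python illustration on a hypothetical one-parameter Bernoulli model (not the IRT fits above):

```python
import math
import random

def deviance(p, y):
    """Bernoulli deviance: -2 times the log likelihood at success probability p."""
    return -2 * sum(math.log(p) if yi else math.log(1 - p) for yi in y)

random.seed(1)
y = [1] * 6 + [0] * 4                                     # toy data: 6 successes in 10 trials
draws = [random.betavariate(7, 5) for _ in range(5000)]   # Beta(1+6, 1+4) posterior draws

d_bar = sum(deviance(p, y) for p in draws) / len(draws)   # posterior mean deviance
d_hat = deviance(sum(draws) / len(draws), y)              # deviance at the posterior mean
p_d = d_bar - d_hat                                       # effective number of parameters
dic = d_bar + p_d
# The deviance is convex in p, so Jensen's inequality guarantees p_d > 0
assert p_d > 0
```

Here p_d lands near 1, the actual number of parameters in this toy model.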
3PL model
The three-parameter logistic (3PL) model introduces lower asymptote parameters \(c_i\), also called guessing parameters. The probability of giving a correct response is given by
\[
P(Y_{ij}=1) = c_i + (1-c_i){\rm InvLogit}\{a_i(\theta_j-b_i)\} ,\ c_i > 0
\]
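The effect of the guessing parameter is easy to see numerically; a minimal Python sketch with hypothetical values (independent of the Stata session):

```python
import math

def invlogit(x):
    """Inverse-logit (logistic) function."""
    return math.exp(x) / (1 + math.exp(x))

def p_3pl(theta, a, b, c):
    """3PL probability: the guessing parameter c lifts the lower asymptote."""
    return c + (1 - c) * invlogit(a * (theta - b))

# Even a subject with very low ability answers correctly with probability about c
assert abs(p_3pl(-10, 1, 0, 0.1) - 0.1) < 1e-3
# For very high ability, the probability still approaches 1
assert p_3pl(10, 1, 0, 0.1) > 0.99
```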
The guessing parameters may be difficult to estimate using maximum likelihood. Indeed, the irt 3pl command with the sepguessing option fails to converge, as you can verify by typing
. irt 3pl q1-q9, sepguessing
on the original dataset.
It is thus important to specify an informative prior for \(c_i\). We assume that the prior mean of the guessing parameters is about 0.1 and thus apply
\[
c_i \sim {\rm InvGamma}(10, 1)
\]
Similarly to the discrimination and difficulty parameters, the \(c_i\)’s are introduced as random-effects parameters in the bayesmh specification and are referred to as {gues:} in the likelihood specification.
Unlike with the 1PL and 2PL models, we cannot use the likelihood(logit) option, because the probability of correct response is no longer an inverse-logit transformation of the parameters. Instead, we use likelihood(binlogit(1), noglmtransform) to model the probability of success of a Bernoulli outcome directly.
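In other words, the supplied expression is treated as the success probability itself rather than being passed through the logit link, and the Bernoulli log likelihood is evaluated directly. A rough Python sketch of that log likelihood, with hypothetical values (not Stata internals):

```python
import math

def invlogit(x):
    """Inverse-logit (logistic) function."""
    return math.exp(x) / (1 + math.exp(x))

def bernoulli_loglik(p, y):
    """Log likelihood of binary outcomes y given success probability p."""
    return sum(math.log(p) if yi else math.log(1 - p) for yi in y)

# 3PL success probability: already on the probability scale, so no link is applied
p = 0.1 + (1 - 0.1) * invlogit(1.2 * (0.5 - 0.3))   # hypothetical c, a, theta, b
assert 0 < p < 1
ll = bernoulli_loglik(p, [1, 0, 1])
assert ll < 0   # log likelihood of any nondegenerate Bernoulli sample is negative
```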
To have a valid initialization of the MCMC sampler, we assign the \(c_i\)’s positive starting values, 0.1.
. set seed 14 . bayesmh y = ({gues:}+(1-{gues:})*invlogit({discrim:}*({subj:}-{diff:}))), /// > likelihood(binlogit(1), noglmtransform) /// > redefine(discrim:i.item) redefine(diff:i.item) /// > redefine(gues:i.item) redefine(subj:i.id) /// > prior({subj:i.id}, normal(0, 1)) /// > prior({discrim:i.item}, lognormal({mu_a}, {var_a})) /// > prior({diff:i.item}, normal({mu_b}, {var_b})) /// > prior({gues:i.item}, igamma(10, 1)) /// > prior({mu_a} {mu_b}, normal(0, 0.1)) /// > prior({var_a} {var_b}, igamma(10, 1)) /// > block({mu_a mu_b var_a var_b}, split) /// > init({discrim:i.item} 1 {gues:i.item} 0.1) /// > exclude({subj:i.id}) burnin(5000) saving(sim3pls, replace) Burn-in ... Simulation ... Model summary ------------------------------------------------------------------------------ Likelihood: y ~ binomial(xb_gues+(1-xb_gues)*invlogit(xb_discrim*(xb_subj-xb_diff)),1) Priors: {discrim:i.item} ~ lognormal({mu_a},{var_a}) (1) {diff:i.item} ~ normal({mu_b},{var_b}) (2) {gues:i.item} ~ igamma(10,1) (3) {subj:i.id} ~ normal(0,1) (4) Hyperpriors: {mu_a mu_b} ~ normal(0,0.1) {var_a var_b} ~ igamma(10,1) ------------------------------------------------------------------------------ (1) Parameters are elements of the linear form xb_discrim. (2) Parameters are elements of the linear form xb_diff. (3) Parameters are elements of the linear form xb_gues. (4) Parameters are elements of the linear form xb_subj. Bayesian binomial regression MCMC iterations = 15,000 Random-walk Metropolis-Hastings sampling Burn-in = 5,000 MCMC sample size = 10,000 Number of obs = 7,200 Acceptance rate = .3496 Efficiency: min = .0148 avg = .03748 Log marginal likelihood = . max = .2044 ------------------------------------------------------------------------------ | Equal-tailed | Mean Std. Dev. MCSE Median [95% Cred. 
Interval] -------------+---------------------------------------------------------------- discrim | item | 1 | 1.712831 .2839419 .018436 1.681216 1.232644 2.351383 2 | .8540871 .1499645 .008265 .8414399 .6058463 1.165732 3 | 1.094723 .1637954 .01126 1.081756 .817031 1.454845 4 | 1.090891 .2149095 .013977 1.064651 .7488589 1.588164 5 | 1.363236 .2525573 .014858 1.338075 .9348136 1.954695 6 | 1.388325 .3027436 .024245 1.336303 .9466695 2.068181 7 | .9288217 .2678741 .021626 .8750048 .5690308 1.603375 8 | 1.457763 .2201065 .01809 1.438027 1.068937 1.940431 9 | .7873631 .127779 .007447 .7796568 .563821 1.06523 -------------+---------------------------------------------------------------- diff | item | 1 | -.2933734 .0976177 .006339 -.2940499 -.4879558 -.0946848 2 | .2140365 .157158 .008333 .2037788 -.0553537 .5550411 3 | -1.326351 .1981196 .013101 -1.326817 -1.706671 -.9307443 4 | .6367877 .1486799 .007895 .6277349 .3791045 .9509913 5 | 1.616056 .1799378 .00966 1.606213 1.303614 2.006817 6 | .8354059 .124184 .00656 .8191839 .614221 1.097801 7 | 2.066205 .3010858 .018377 2.034757 1.554484 2.709601 8 | -1.555583 .1671435 .012265 -1.54984 -1.89487 -1.267001 9 | -.9775626 .2477279 .016722 -.9936727 -1.431964 -.4093629 -------------+---------------------------------------------------------------- gues | item | 1 | .1078598 .0337844 .0019 .1020673 .0581353 .1929404 2 | .1128113 .0372217 .002162 .1065996 .0596554 .2082417 3 | .123031 .0480042 .002579 .1127147 .0605462 .2516237 4 | .1190103 .0390721 .002369 .1123544 .0617698 .2095427 5 | .0829503 .0185785 .001275 .0807116 .0514752 .1232547 6 | .1059315 .0289175 .001708 .1022741 .0584959 .1709483 7 | .1235553 .0382661 .002964 .1186648 .0626495 .2067556 8 | .1142118 .0408348 .001733 .1062507 .0592389 .2134006 9 | .1270767 .0557821 .003939 .113562 .0621876 .2825752 -------------+---------------------------------------------------------------- mu_a | .109161 .1218499 .005504 .1126253 -.135329 .3501061 var_a | .108864 .0331522 
.001053 .1030106 .0604834 .1860996 mu_b | .0782094 .1974657 .004367 .0755023 -.3067717 .4638104 var_b | .5829738 .1803167 .006263 .5562159 .3260449 1.034225 ------------------------------------------------------------------------------ file sim3pls.dta saved . estimates store est3pls
The estimated posterior means of the \(c_i\)’s range between 0.08 and 0.13. Clearly, the introduction of guessing parameters has an impact on the item discrimination and difficulty parameters. For example, the estimated posterior means of \(\mu_a\) and \(\mu_b\) shift from -0.10 and -0.07, respectively, for the 2PL model to 0.11 and 0.08, respectively, for the 3PL model.
Because the estimated guessing parameters are not that different, one may ask whether item-specific guessing parameters are really necessary. To answer this question, we fit a model with a common guessing parameter, {gues}, and compare it with the previous model.
. set seed 14 . bayesmh y = ({gues}+(1-{gues})*invlogit({discrim:}*({subj:}-{diff:}))), /// > likelihood(binlogit(1), noglmtransform) /// > redefine(discrim:i.item) redefine(diff:i.item) /// > redefine(subj:i.id) /// > prior({subj:i.id}, normal(0, 1)) /// > prior({discrim:i.item}, lognormal({mu_a}, {var_a})) /// > prior({diff:i.item}, normal({mu_b}, {var_b})) /// > prior({gues}, igamma(10, 1)) /// > prior({mu_a} {mu_b}, normal(0, 0.1)) /// > prior({var_a} {var_b}, igamma(10, 1)) /// > block({mu_a mu_b var_a var_b gues}, split) /// > init({discrim:i.item} 1 {gues} 0.1) /// > exclude({subj:i.id}) burnin(5000) saving(sim3pl, replace) Burn-in ... Simulation ... Model summary ------------------------------------------------------------------------------ Likelihood: y ~ binomial({gues}+(1-{gues})*invlogit(xb_discrim*(xb_subj-xb_diff)),1) Priors: {discrim:i.item} ~ lognormal({mu_a},{var_a}) (1) {diff:i.item} ~ normal({mu_b},{var_b}) (2) {subj:i.id} ~ normal(0,1) (3) {gues} ~ igamma(10,1) Hyperpriors: {mu_a mu_b} ~ normal(0,0.1) {var_a var_b} ~ igamma(10,1) ------------------------------------------------------------------------------ (1) Parameters are elements of the linear form xb_discrim. (2) Parameters are elements of the linear form xb_diff. (3) Parameters are elements of the linear form xb_subj. Bayesian binomial regression MCMC iterations = 15,000 Random-walk Metropolis-Hastings sampling Burn-in = 5,000 MCMC sample size = 10,000 Number of obs = 7,200 Acceptance rate = .3753 Efficiency: min = .01295 avg = .03714 Log marginal likelihood = . max = .1874 ------------------------------------------------------------------------------ | Equal-tailed | Mean Std. Dev. MCSE Median [95% Cred. 
Interval] -------------+---------------------------------------------------------------- discrim | item | 1 | 1.692894 .2748163 .021944 1.664569 1.232347 2.299125 2 | .8313512 .1355267 .00606 .8218212 .5928602 1.125729 3 | 1.058833 .1611742 .014163 1.054126 .7676045 1.393611 4 | 1.041808 .1718472 .008782 1.029867 .7398569 1.397073 5 | 1.534997 .3208687 .023965 1.497019 1.019998 2.266078 6 | 1.38296 .2581948 .019265 1.355706 .9559487 1.979358 7 | .8310222 .1698206 .012896 .8107371 .5736484 1.248736 8 | 1.442949 .2266268 .017562 1.431204 1.066646 1.930829 9 | .77944 .1159669 .007266 .7750891 .5657258 1.014941 -------------+---------------------------------------------------------------- diff | item | 1 | -.3043161 .0859905 .005373 -.2968324 -.4870583 -.1407109 2 | .1814508 .1289251 .006543 .1832146 -.0723988 .4313265 3 | -1.391216 .1924384 .014986 -1.373093 -1.809343 -1.050919 4 | .5928491 .1262631 .006721 .5829347 .356614 .857743 5 | 1.617348 .1929263 .011604 1.601534 1.293032 2.061096 6 | .817635 .1172884 .006125 .812838 .5990503 1.064322 7 | 2.006949 .2743517 .01785 1.981052 1.556682 2.594236 8 | -1.576235 .1747855 .013455 -1.559435 -1.952676 -1.272108 9 | -1.039362 .1840773 .01138 -1.02785 -1.432058 -.7160181 -------------+---------------------------------------------------------------- gues | .1027336 .0214544 .001753 .1022211 .0627299 .1466367 mu_a | .1009741 .123915 .006567 .0965353 -.1343028 .3510697 var_a | .1121003 .0344401 .001154 .1059563 .0628117 .1970842 mu_b | .0632173 .1979426 .004572 .0666684 -.3292497 .4482957 var_b | .5861236 .1818885 .006991 .5574743 .3239369 1.053172 ------------------------------------------------------------------------------ file sim3pl.dta saved . estimates store est3pl
We can again compare the two 3PL models by using the bayesstats ic command:
. bayesstats ic est3pls est3pl, diconly Deviance information criterion ------------------------ | DIC -------------+---------- est3pls | 8049.425 est3pl | 8049.426 ------------------------
Although the estimated DICs of the two 3PL models are essentially the same, we decide for demonstration purposes to proceed with the model with item-specific guessing parameters.
4PL model
The four-parameter logistic (4PL) model extends the 3PL model by adding item-specific upper asymptote parameters \(d_i\):
\[
P(Y_{ij}=1) = c_i + (d_i-c_i){\rm InvLogit}\{a_i(\theta_j-b_i)\}
,\ c_i < d_i < 1
\]
The \(d_i\) parameter can be viewed as an upper limit on the probability of correct response to the \(i\)th item: even a subject with very high ability answers the item correctly with probability no greater than \(d_i\).
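A small Python sketch, with hypothetical values, illustrates the role of the upper asymptote (not part of the Stata session):

```python
import math

def invlogit(x):
    """Inverse-logit (logistic) function."""
    return math.exp(x) / (1 + math.exp(x))

def p_4pl(theta, a, b, c, d):
    """4PL probability: d caps the probability of a correct response."""
    return c + (d - c) * invlogit(a * (theta - b))

# For very high ability, the probability approaches d, not 1
assert abs(p_4pl(10, 1, 0, 0.1, 0.95) - 0.95) < 1e-3
assert p_4pl(10, 1, 0, 0.1, 0.95) < 0.95
```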
We restrict the \(d_i\)'s to the (0.8,1) range and assign them a \({\rm Uniform}(0.8,1)\) prior. For other parameters, we use the same priors as in the 3PL model.
In the bayesmh specification of the model, the condition \(c_i < d_i\) is incorporated in the likelihood, and the condition \(d_i < 1\) is implied by the specified prior for the \(d_i\)'s. We initialize the \(d_i\)'s to 0.9. We use the notable option to suppress the long table output.
. set seed 14 . bayesmh y = /// > (({gues:}+({d:}-{gues:})*invlogit({discrim:}*({subj:}-{diff:})))* /// > cond({gues:}<{d:},1,.)), likelihood(binlogit(1), noglmtransform) /// > redefine(discrim:i.item) redefine(diff:i.item) /// > redefine(gues:i.item) redefine(d:i.item) redefine(subj:i.id) /// > prior({subj:i.id}, normal(0, 1)) /// > prior({discrim:i.item}, lognormal({mu_a}, {var_a})) /// > prior({diff:i.item}, normal({mu_b}, {var_b})) /// > prior({gues:i.item}, igamma(10, 1)) /// > prior({d:i.item}, uniform(0.8, 1)) /// > prior({mu_a} {mu_b}, normal(0, 0.1)) /// > prior({var_a} {var_b}, igamma(10, 1)) /// > block({mu_a mu_b var_a var_b}, split) /// > init({discrim:i.item} 1 {gues:i.item} 0.1 {d:i.item} 0.9) /// > exclude({subj:i.id}) burnin(5000) saving(sim4pls, replace) notable Burn-in ... Simulation ... Model summary ------------------------------------------------------------------------------ Likelihood: y ~ binomial(<expr1>,1) Priors: {discrim:i.item} ~ lognormal({mu_a},{var_a}) (1) {diff:i.item} ~ normal({mu_b},{var_b}) (2) {gues:i.item} ~ igamma(10,1) (3) {d:i.item} ~ uniform(0.8,1) (4) {subj:i.id} ~ normal(0,1) (5) Hyperpriors: {mu_a mu_b} ~ normal(0,0.1) {var_a var_b} ~ igamma(10,1) Expression: expr1 : (xb_gues+(xb_d-xb_gues)*invlogit(xb_discrim*(xb_subj-xb_diff)))* con d(xb_gues<xb_d,1,.) ------------------------------------------------------------------------------ (1) Parameters are elements of the linear form xb_discrim. (2) Parameters are elements of the linear form xb_diff. (3) Parameters are elements of the linear form xb_gues. (4) Parameters are elements of the linear form xb_d. (5) Parameters are elements of the linear form xb_subj. Bayesian binomial regression MCMC iterations = 15,000 Random-walk Metropolis-Hastings sampling Burn-in = 5,000 MCMC sample size = 10,000 Number of obs = 7,200 Acceptance rate = .3639 Efficiency: min = .004825 avg = .0219 Log marginal likelihood = . max = .1203 Note: There is a high autocorrelation after 500 lags. 
file sim4pls.dta saved . estimates store est4pls
We use bayesstats summary to display results of selected model parameters.
. bayesstats summary {d:i.item} {mu_a var_a mu_b var_b} Posterior summary statistics MCMC sample size = 10,000 ------------------------------------------------------------------------------ | Equal-tailed | Mean Std. Dev. MCSE Median [95% Cred. Interval] -------------+---------------------------------------------------------------- d | item | 1 | .9598183 .0255321 .001948 .9621874 .9044441 .9981723 2 | .9024564 .0565702 .007407 .9019505 .8066354 .9944216 3 | .9525519 .0281878 .002845 .9551054 .8972454 .9971564 4 | .8887963 .0561697 .005793 .8859503 .8036236 .9916784 5 | .8815547 .0588907 .007215 .8708021 .8031737 .9926549 6 | .8891188 .0586482 .006891 .881882 .8024593 .9935512 7 | .874271 .0561718 .008087 .8635082 .8018176 .9880433 8 | .9663644 .0147606 .001121 .9667563 .9370666 .9950912 9 | .889164 .0486038 .005524 .8834207 .8084921 .9857415 -------------+---------------------------------------------------------------- mu_a | .3336887 .1436216 .009742 .334092 .0562924 .6164115 var_a | .1221547 .0406908 .002376 .1144729 .0642768 .2229326 mu_b | -.0407488 .1958039 .005645 -.0398847 -.4220523 .3323791 var_b | .4991736 .1612246 .00629 .4660071 .2802531 .9023824 ------------------------------------------------------------------------------
The bayesmh command issued a note indicating high autocorrelation for some of the model parameters. This may reflect slow MCMC convergence or a more substantial problem with the model specification. It is thus worthwhile to inspect the autocorrelation of the individual parameters, which we can do with the bayesstats ess command. Parameters with lower effective sample size (ESS) have higher autocorrelation, and vice versa.
. bayesstats ess {d:i.item} {mu_a var_a mu_b var_b} Efficiency summaries MCMC sample size = 10,000 ---------------------------------------------------- | ESS Corr. time Efficiency -------------+-------------------------------------- d | item | 1 | 171.82 58.20 0.0172 2 | 58.33 171.43 0.0058 3 | 98.17 101.87 0.0098 4 | 94.02 106.36 0.0094 5 | 66.62 150.11 0.0067 6 | 72.44 138.05 0.0072 7 | 48.25 207.26 0.0048 8 | 173.30 57.70 0.0173 9 | 77.41 129.19 0.0077 -------------+-------------------------------------- mu_a | 217.35 46.01 0.0217 var_a | 293.34 34.09 0.0293 mu_b | 1203.20 8.31 0.1203 var_b | 656.92 15.22 0.0657 ----------------------------------------------------
We observe that the parameters with ESS lower than 200 are all among the upper asymptote parameters \(d_i\). This may be caused, for example, by overparameterization of the likelihood model and subsequent nonidentifiability, which is not resolved by the specified priors.
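The ESS reported by bayesstats ess is essentially the MCMC sample size deflated by the chain's autocorrelation time, roughly \(N/(1+2\sum_k \rho_k)\). A rough Python sketch on simulated chains (an illustration of the idea, not Stata's exact estimator):

```python
import random

def ess(chain, max_lag=100):
    """Effective sample size from empirical lag autocorrelations."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    tau = 1.0                          # integrated autocorrelation time
    for k in range(1, max_lag + 1):
        rho = sum((chain[i] - mean) * (chain[i + k] - mean)
                  for i in range(n - k)) / (n * var)
        if rho <= 0:                   # truncate once autocorrelation dies out
            break
        tau += 2 * rho
    return n / tau

random.seed(2)
iid = [random.gauss(0, 1) for _ in range(2000)]
ar = [0.0]                             # AR(1) chain with strong autocorrelation
for _ in range(1999):
    ar.append(0.95 * ar[-1] + random.gauss(0, 1))
# The highly autocorrelated chain has a much smaller effective sample size
assert ess(ar) < ess(iid)
```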
We can also fit a model with a common upper asymptote parameter, \(d\), and compare it with the model with the item-specific upper asymptote.
. set seed 14 . bayesmh y = /// > (({gues:}+({d}-{gues:})*invlogit({discrim:}*({subj:}-{diff:})))* /// > cond({gues:}<{d},1,.)), likelihood(binlogit(1), noglmtransform) /// > redefine(discrim:i.item) redefine(diff:i.item) /// > redefine(gues:i.item) redefine(subj:i.id) /// > prior({subj:i.id}, normal(0, 1)) /// > prior({discrim:i.item}, lognormal({mu_a}, {var_a})) /// > prior({diff:i.item}, normal({mu_b}, {var_b})) /// > prior({gues:i.item}, igamma(10, 1)) /// > prior({d}, uniform(0.8, 1)) /// > prior({mu_a} {mu_b}, normal(0, 0.1)) /// > prior({var_a} {var_b}, igamma(10, 1)) /// > block({mu_a mu_b var_a var_b d}, split) /// > init({discrim:i.item} 1 {gues:i.item} 0.1 {d} 0.9) /// > exclude({subj:i.id}) burnin(5000) saving(sim4pl, replace) notable Burn-in ... Simulation ... Model summary ------------------------------------------------------------------------------ Likelihood: y ~ binomial(<expr>>,1) Priors: {discrim:i.item} ~ lognormal({mu_a},{var_a}) (1) {diff:i.item} ~ normal({mu_b},{var_b}) (2) {gues:i.item} ~ igamma(10,1) (3) {subj:i.id} ~ normal(0,1) (4) {d} ~ uniform(0.8,1) Hyperpriors: {mu_a mu_b} ~ normal(0,0.1) {var_a var_b} ~ igamma(10,1) Expression: expr1 : (xb_gues+({d}-xb_gues)*invlogit(xb_discrim*(xb_subj-xb_diff)))* cond (xb_gues<{d},1,.) ------------------------------------------------------------------------------ (1) Parameters are elements of the linear form xb_discrim. (2) Parameters are elements of the linear form xb_diff. (3) Parameters are elements of the linear form xb_gues. (4) Parameters are elements of the linear form xb_subj. Bayesian binomial regression MCMC iterations = 15,000 Random-walk Metropolis-Hastings sampling Burn-in = 5,000 MCMC sample size = 10,000 Number of obs = 7,200 Acceptance rate = .3877 Efficiency: min = .0107 avg = .03047 Log marginal likelihood = . max = .1626 file sim4pl.dta saved . estimates store est4pl . 
bayesstats summary {d mu_a var_a mu_b var_b} Posterior summary statistics MCMC sample size = 10,000 ------------------------------------------------------------------------------ | Equal-tailed | Mean Std. Dev. MCSE Median [95% Cred. Interval] -------------+---------------------------------------------------------------- d | .9664578 .0144952 .001293 .9668207 .9371181 .9924572 mu_a | .2206696 .1387873 .01113 .2208302 -.0483587 .4952625 var_a | .1245785 .0391551 .001806 .1188779 .0658243 .2187058 mu_b | .0371722 .2020157 .00501 .0331742 -.3481366 .4336587 var_b | .5603447 .1761812 .006817 .5279243 .3157048 .9805077 ------------------------------------------------------------------------------
We now compare the two 4PL models by using the bayesstats ic command:
. bayesstats ic est4pls est4pl, diconly Deviance information criterion ------------------------ | DIC -------------+---------- est4pls | 8050.805 est4pl | 8037.075 ------------------------
The DIC of the more complex 4PL model (8,051) is substantially higher than the DIC of the simpler model (8,037). This result, together with the potential nonidentifiability of the more complex est4pls model indicated by the high autocorrelation in the simulated MCMC sample, compels us to proceed with the model with a common upper asymptote, est4pl.
The posterior distribution of \(d\) has an estimated 95% equal-tailed credible interval of (0.93, 0.99) and is concentrated about 0.97. The \({\rm Uniform}(0.8,1)\) prior on \(d\) does not seem to be too restrictive. The estimated DIC of the est4pl model (8,037) is lower than the DIC of the est3pls 3PL model from the previous section (8,049), implying that the introduction of the upper asymptote parameter \(d\) does improve the model fit.
5PL model
The five-parameter logistic (5PL) model extends the 4PL model by adding item-specific asymmetry parameters \(e_i\):
\[
P(Y_{ij}=1) = c_i + (d_i-c_i)\big[{\rm InvLogit}\{a_i(\theta_j-b_i)\}\big]^{e_i}
,\ c_i < d_i < 1, \ 0 < e_i < 1
\]
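A brief Python sketch, with hypothetical values, shows how the asymmetry exponent acts on the response curve; it mirrors the invlogit(...)^e expression used in the bayesmh specification (not Stata internals):

```python
import math

def invlogit(x):
    """Inverse-logit (logistic) function."""
    return math.exp(x) / (1 + math.exp(x))

def p_5pl(theta, a, b, c, d, e):
    """5PL probability: the exponent e makes the response curve asymmetric."""
    return c + (d - c) * invlogit(a * (theta - b)) ** e

# With e = 1, the 5PL reduces to the 4PL model
assert abs(p_5pl(1, 1, 0, 0.1, 0.95, 1) - (0.1 + 0.85 * invlogit(1))) < 1e-12
# Because invlogit(.) < 1, an exponent e < 1 raises the probability
assert p_5pl(0, 1, 0, 0.1, 0.95, 0.9) > p_5pl(0, 1, 0, 0.1, 0.95, 1)
```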
In the previous section, we found the 4PL model with common upper asymptote \(d\), est4pl, to be the best one so far. We thus consider here a 5PL model with common upper asymptote \(d\).
Typically, we expect the \(e_i\) parameters to be close to 1. Similarly to the upper asymptote parameter \(d\), the \(e_i\) parameters are assumed to be in the (0.8,1) range and are assigned a \({\rm Uniform}(0.8,1)\) prior. We initialize the \(e_i\)'s to 0.9. We again use the notable option to suppress the long table output, and we display a subset of results by using bayesstats summary. (We could have used bayesmh's noshow() option instead to achieve the same result.)
. set seed 14 . bayesmh y = /// > (({gues:}+({d}-{gues:})*(invlogit({discrim:}*({subj:}-{diff:})))^{e:})* /// > cond({gues:}<{d},1,.)), likelihood(binlogit(1), noglmtransform) /// > redefine(discrim:i.item) redefine(diff:i.item) /// > redefine(gues:i.item) redefine(e:i.item) redefine(subj:i.id) /// > prior({subj:i.id}, normal(0, 1)) /// > prior({discrim:i.item}, lognormal({mu_a}, {var_a})) /// > prior({diff:i.item}, normal({mu_b}, {var_b})) /// > prior({gues:i.item}, igamma(10, 1)) /// > prior({d}, uniform(0.8, 1)) /// > prior({e:i.item}, uniform(0.8, 1)) /// > prior({mu_a} {mu_b}, normal(0, 0.1)) /// > prior({var_a} {var_b}, igamma(10, 1)) /// > block({mu_a mu_b var_a var_b d}, split) /// > init({discrim:i.item} 1 {gues:i.item} 0.1 {d} {e:i.item} 0.9) /// > exclude({subj:i.id}) burnin(5000) saving(sim5pls, replace) notable Burn-in ... Simulation ... Model summary ------------------------------------------------------------------------------ Likelihood: y ~ binomial(<expr1>,1) Priors: {discrim:i.item} ~ lognormal({mu_a},{var_a}) (1) {diff:i.item} ~ normal({mu_b},{var_b}) (2) {gues:i.item} ~ igamma(10,1) (3) {e:i.item} ~ uniform(0.8,1) (4) {subj:i.id} ~ normal(0,1) (5) {d} ~ uniform(0.8,1) Hyperpriors: {mu_a mu_b} ~ normal(0,0.1) {var_a var_b} ~ igamma(10,1) Expression: expr1 : (xb_gues+({d}-xb_gues)*(invlogit(xb_discrim*(xb_subj-xb_diff)))^xb_e )* cond(xb_gues<{d},1,.) ------------------------------------------------------------------------------ (1) Parameters are elements of the linear form xb_discrim. (2) Parameters are elements of the linear form xb_diff. (3) Parameters are elements of the linear form xb_gues. (4) Parameters are elements of the linear form xb_e. (5) Parameters are elements of the linear form xb_subj. 
Bayesian binomial regression MCMC iterations = 15,000 Random-walk Metropolis-Hastings sampling Burn-in = 5,000 MCMC sample size = 10,000 Number of obs = 7,200 Acceptance rate = .3708 Efficiency: min = .007341 avg = .02526 Log marginal likelihood = . max = .1517 file sim5pls.dta saved . estimates store est5pls . bayesstats summary {e:i.item} {d mu_a var_a mu_b var_b} Posterior summary statistics MCMC sample size = 10,000 ------------------------------------------------------------------------------ | Equal-tailed | Mean Std. Dev. MCSE Median [95% Cred. Interval] -------------+---------------------------------------------------------------- e | item | 1 | .897859 .0578428 .006083 .8939272 .8050315 .9957951 2 | .9042669 .0585023 .005822 .90525 .8053789 .9956565 3 | .88993 .0562398 .005013 .887011 .803389 .9930454 4 | .9010241 .0574186 .006492 .9042044 .8030981 .9925598 5 | .9126369 .0545625 .00521 .9178927 .8098596 .9964487 6 | .9037269 .0583833 .006814 .9086704 .8054932 .9961268 7 | .9136308 .0558911 .005373 .9203899 .8112029 .996217 8 | .889775 .0568656 .005119 .8849938 .803912 .9938777 9 | .8808435 .056257 .004743 .8727194 .8030522 .9904972 -------------+---------------------------------------------------------------- d | .9671374 .0144004 .001165 .9670598 .9382404 .9933374 mu_a | .2770211 .1353777 .00832 .2782552 .0141125 .5418087 var_a | .122635 .0404159 .002148 .1160322 .0666951 .2208711 mu_b | .1211885 .1929743 .004955 .1199136 -.2515431 .503733 var_b | .5407642 .1747674 .006353 .5088269 .3016315 .9590086 ------------------------------------------------------------------------------
We also want to compare the above model with a simpler one using a common asymmetry parameter \(e\).
. set seed 14

. bayesmh y = ///
> (({gues:}+({d}-{gues:})*(invlogit({discrim:}*({subj:}-{diff:})))^{e})* ///
> cond({gues:}<{d},1,.)), likelihood(binlogit(1), noglmtransform) ///
> redefine(discrim:i.item) redefine(diff:i.item) ///
> redefine(gues:i.item) redefine(subj:i.id) ///
> prior({subj:i.id}, normal(0, 1)) ///
> prior({discrim:i.item}, lognormal({mu_a}, {var_a})) ///
> prior({diff:i.item}, normal({mu_b}, {var_b})) ///
> prior({gues:i.item}, igamma(10, 1)) ///
> prior({d} {e}, uniform(0.8, 1)) ///
> prior({mu_a} {mu_b}, normal(0, 0.1)) ///
> prior({var_a} {var_b}, igamma(10, 1)) ///
> block({mu_a mu_b var_a var_b d e}, split) ///
> init({discrim:i.item} 1 {gues:i.item} 0.1 {d e} 0.9) ///
> exclude({subj:i.id}) burnin(5000) saving(sim5pl, replace) notable

Burn-in ...
Simulation ...

Model summary
------------------------------------------------------------------------------
Likelihood:
  y ~ binomial(<expr1>,1)

Priors:
  {discrim:i.item} ~ lognormal({mu_a},{var_a})                             (1)
     {diff:i.item} ~ normal({mu_b},{var_b})                                (2)
     {gues:i.item} ~ igamma(10,1)                                          (3)
       {subj:i.id} ~ normal(0,1)                                           (4)
             {d e} ~ uniform(0.8,1)

Hyperpriors:
    {mu_a mu_b} ~ normal(0,0.1)
  {var_a var_b} ~ igamma(10,1)

Expression:
  expr1 : (xb_gues+({d}-xb_gues)*(invlogit(xb_discrim*(xb_subj-xb_diff)))^{e})*
          cond(xb_gues<{d},1,.)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_discrim.
(2) Parameters are elements of the linear form xb_diff.
(3) Parameters are elements of the linear form xb_gues.
(4) Parameters are elements of the linear form xb_subj.

Bayesian binomial regression                     MCMC iterations  =     15,000
Random-walk Metropolis-Hastings sampling         Burn-in          =      5,000
                                                 MCMC sample size =     10,000
                                                 Number of obs    =      7,200
                                                 Acceptance rate  =      .3805
                                                 Efficiency:  min =    .008179
                                                              avg =     .02768
Log marginal likelihood =         .                           max =     .08904

file sim5pl.dta saved

. estimates store est5pl

. bayesstats summary {e d mu_a var_a mu_b var_b}

Posterior summary statistics                      MCMC sample size =    10,000

------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
           e |  .9118363   .0558178   .004194   .9175841   .8063153   .9960286
           d |  .9655166   .0147373   .001495   .9659029   .9354708   .9924492
        mu_a |  .2674271   .1368926   .008485    .270597   .0102798   .5443345
       var_a |  .1250759   .0428095   .002635   .1173619   .0654135   .2340525
        mu_b |  .1015121   .2048178   .006864    .103268  -.3052377   .4934158
       var_b |  .5677309   .1824591   .006981   .5331636   .3079868   1.016762
------------------------------------------------------------------------------
We use bayesstats ic to compare the DIC values of the two 5PL models:
. bayesstats ic est5pls est5pl, diconly

Deviance information criterion
------------------------
             |      DIC
-------------+----------
     est5pls | 8030.894
      est5pl | 8034.517
------------------------
The estimated DIC of the more complex est5pls model (8,031) is lower than the DIC of the simpler model (8,035), suggesting a better fit.
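Recall that the deviance information criterion reported by bayesstats ic trades off model fit against model complexity. Writing \(D(\theta)\) for the deviance, it is

\[\begin{equation*}
\text{DIC} = \overline{D(\theta)} + p_D, \qquad p_D = \overline{D(\theta)} - D(\bar{\theta}),
\end{equation*}\]

where \(\overline{D(\theta)}\) is the posterior mean deviance, \(\bar{\theta}\) is the posterior mean of the parameters, and \(p_D\) is the effective number of parameters; smaller DIC values are preferred.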
Conclusion
Finally, we compare all eight fitted models.
. bayesstats ic est1pl est2pl est3pl est3pls est4pl est4pls est5pl est5pls, ///
>         diconly

Deviance information criterion
------------------------
             |      DIC
-------------+----------
      est1pl | 8122.428
      est2pl | 8055.005
      est3pl | 8049.426
     est3pls | 8049.425
      est4pl | 8037.075
     est4pls | 8050.805
      est5pl | 8034.517
     est5pls | 8030.894
------------------------
The est5pls model has the lowest overall DIC. To confirm this result, we run another set of simulations with a larger MCMC sample size of 50,000. (We simply added the mcmcsize(50000) option to the bayesmh specification of the above eight models.) The following DIC values, based on the larger MCMC sample size, are more reliably estimated.
. bayesstats ic est1pl est2pl est3pl est3pls est4pl est4pls est5pl est5pls, ///
>         diconly

Deviance information criterion
------------------------
             |      DIC
-------------+----------
      est1pl | 8124.015
      est2pl | 8052.068
      est3pl | 8047.067
     est3pls | 8047.738
      est4pl | 8032.417
     est4pls | 8049.712
      est5pl | 8031.375
     est5pls | 8031.905
------------------------
Again, the 5PL models have the lowest DIC values and seem to provide the best fit. However, the DIC differences between models est4pl, est5pl, and est5pls are minimal and may very well be within the estimation error. Regardless, these three models appear to be better than the simpler 1PL, 2PL, and 3PL models.
Additional model checking may be needed to assess the models' fit, and we should not rely solely on the DIC values to make our final model selection. A practitioner may still prefer the simpler est4pl 4PL model to the 5PL models even though it has a slightly higher DIC. In fact, given that the posterior mean estimate of the upper asymptote parameter \(d\) is 0.96 with a 95% equal-tailed credible interval of (0.94, 0.99), some practitioners may prefer the even simpler est3pl 3PL model.
References
De Boeck, P., and M. Wilson, ed. 2004. Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach. New York: Springer.
Kim, J.-S., and D. M. Bolt. 2007. Estimating item response theory models using Markov chain Monte Carlo methods. Educational Measurement: Issues and Practice 26: 38-51.
Programming estimators in Stata: Why you should
To help you write Stata commands that people want to use, I illustrate how Stata syntax is predictable and give an overview of the estimation-postestimation structure that you will want to emulate in your programs.
Programming an estimation command in Stata: Where to store your stuff
I discuss the difference between scripts and commands, and I introduce some essential programming concepts and constructions that I use to write the scripts and commands.
Programming an estimation command in Stata: Global macros versus local macros
I discuss a pair of examples that illustrate the differences between global macros and local macros.
Programming an estimation command in Stata: A first ado-command
I discuss the code for a simple estimation command to focus on the details of how to implement an estimation command. The command that I discuss estimates the mean by the sample average. I begin by reviewing the formulas and a do-file that implements them. I subsequently introduce ado-file programming and discuss two versions of the command. Along the way, I illustrate some of the postestimation features that work after the command.
Programming an estimation command in Stata: Using Stata matrix commands and functions to compute OLS objects
I present the formulas for computing the ordinary least-squares (OLS) estimator, and I discuss some do-file implementations of them. I discuss the formulas and the computation of independence-based standard errors, robust standard errors, and cluster-robust standard errors. I introduce the Stata matrix commands and matrix functions that I use in ado-commands that I discuss in upcoming posts.
Programming an estimation command in Stata: A first command for OLS
I show how to write a Stata estimation command that implements the OLS estimator by explaining the code.
Programming an estimation command in Stata: A better OLS command
I use the syntax command to improve the command that implements the OLS estimator that I discussed in Programming an estimation command in Stata: A first command for OLS. I show how to require that all variables be numeric variables and how to make the command accept time-series operated variables.
Programming an estimation command in Stata: Allowing for sample restrictions and factor variables
I modify the OLS command discussed in Programming an estimation command in Stata: A better OLS command to allow for sample restrictions, to handle missing values, to allow for factor variables, and to deal with perfectly collinear variables.
Programming an estimation command in Stata: Allowing for options
I make three improvements to the command that implements the OLS estimator that I discussed in Programming an estimation command in Stata: Allowing for sample restrictions and factor variables. First, I allow the user to request a robust estimator of the variance-covariance of the estimator. Second, I allow the user to suppress the constant term. Third, I store the residual degrees of freedom in e(df_r) so that test will use the t or F distribution instead of the normal or chi-squared distribution to compute the p-value of Wald tests.
Programming an estimation command in Stata: Using a subroutine to parse a complex option
I make two improvements to the command that implements the OLS estimator that I discussed in Programming an estimation command in Stata: Allowing for options. First, I add an option for a cluster-robust estimator of the variance-covariance of the estimator (VCE). Second, I make the command accept the modern syntax for either a robust or a cluster-robust estimator of the VCE. In the process, I use subroutines in my ado-program to facilitate the parsing, and I discuss some advanced parsing tricks.
Programming an estimation command in Stata: Mata 101
I introduce Mata, the matrix programming language that is part of Stata.
Programming an estimation command in Stata: Mata functions
I show how to write a function in Mata, the matrix programming language that is part of Stata.
Programming an estimation command in Stata: A first ado-command using Mata
I discuss a sequence of ado-commands that use Mata to estimate the mean of a variable. The commands illustrate a general structure for Stata/Mata programs.
Programming an estimation command in Stata: Computing OLS objects in Mata
I present the formulas for computing the OLS estimator and show how to compute them in Mata. This post is a Mata version of Programming an estimation command in Stata: Using Stata matrix commands and functions to compute OLS objects. I discuss the formulas and the computation of independence-based standard errors, robust standard errors, and cluster-robust standard errors.
Programming an estimation command in Stata: An OLS command using Mata
I discuss a command that computes OLS results in Mata, paying special attention to the structure of Stata programs that use Mata work functions.
I show how to use the undocumented command _vce_parse to parse the options for robust or cluster-robust estimators of the VCE. I then discuss myregress12.ado, which performs its computations in Mata and computes an IID-based, a robust, or a cluster-robust estimator of the VCE.
Programming an estimation command in Stata: A review of nonlinear optimization using Mata
I review the theory behind nonlinear optimization and get some practice in Mata programming by implementing an optimizer in Mata. This post is designed to help you develop your Mata programming skills and to improve your understanding of how the Mata optimization suites optimize() and moptimize() work.
Programming an estimation command in Stata: Using optimize() to estimate Poisson parameters
I show how to use optimize() in Mata to maximize a Poisson log-likelihood function and to obtain estimators of the VCE based on IID observations or on robust methods.
Programming an estimation command in Stata: A poisson command using Mata
I discuss mypoisson1, which computes Poisson-regression results in Mata. The code in mypoisson1.ado is remarkably similar to the code in myregress11.ado, which computes OLS results in Mata, as I discussed in Programming an estimation command in Stata: An OLS command using Mata.
My simulations show that when the true model is a probit or a logit, using a linear probability model can produce inconsistent estimates of the marginal effects of interest to researchers. The conclusions hinge on the probit or logit model being the true model.
Simulation results
For all simulations below, I use a sample size of 10,000 and 5,000 replications. The true data-generating processes (DGPs) are constructed using one discrete covariate and one continuous covariate. I study the average effect of a change in the continuous variable on the conditional probability (AME) and the average effect of a change in the discrete covariate on the conditional probability (ATE). I also look at the effect of a change in the continuous variable on the conditional probability, evaluated at the mean value of the covariates (MEM), and the effect of a change in the discrete covariate on the conditional probability, evaluated at the mean value of the covariates (TEM).
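In symbols, writing the conditional probability as \(P(y=1\mid x) = F(x\beta)\) with derivative \(f\), continuous covariate \(x_1\), and binary covariate \(x_2\), these four quantities are

\[\begin{equation*}
\text{AME} = \frac{1}{N}\sum_{i=1}^N f(x_i\beta)\beta_1, \qquad
\text{MEM} = f(\bar{x}\beta)\beta_1,
\end{equation*}\]

\[\begin{equation*}
\text{ATE} = \frac{1}{N}\sum_{i=1}^N \left\{F(x_i\beta)\big|_{x_2=1} - F(x_i\beta)\big|_{x_2=0}\right\}, \qquad
\text{TEM} = F(\bar{x}\beta)\big|_{x_2=1} - F(\bar{x}\beta)\big|_{x_2=0},
\end{equation*}\]

where \(\bar{x}\) denotes the sample means of the covariates.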
In Table 1, I present the results of a simulation when the true DGP satisfies the assumptions of a logit model. I show the averages of the AME and ATE estimates and the 5% rejection rates of the true null hypotheses. I also provide an approximate true value of the AME and ATE. I obtain the approximate true values by computing the ATE and AME, at the true values of the coefficients, using a sample of 20 million observations. I will provide more details on the simulation in a later section.
Table 1: Average Marginal and Treatment Effects: True DGP Logit
Statistic | Approximate True Value | Logit | Regress (LPM) |
---|---|---|---|
AME of x1 | -.084 | -.084 | -.094 |
5% Rejection Rate | | .050 | .99 |
ATE of x2 | .092 | .091 | .091 |
5% Rejection Rate | | .058 | .058 |
From Table 1, we see that the logit model estimates are close to the true value and that the rejection rate of the true null hypothesis is close to 5%. For the linear probability model, the rejection rate is 99% for the AME. For the ATE, the rejection rate and point estimates are close to what is estimated using a logit.
For the MEM and TEM, we have the following:
Table 2: Marginal and Treatment Effects at Mean Values: True DGP Logit
Statistic | Approximate True Value | Logit | Regress (LPM) |
---|---|---|---|
MEM of x1 | -.099 | -.099 | -.094 |
5% Rejection Rate | | .054 | .618 |
TEM of x2 | .109 | .109 | .092 |
5% Rejection Rate | | .062 | .073 |
Again, logit estimates behave as expected. For the linear probability model, the rejection rate of the true null hypothesis is 62% for the MEM. For the TEM, the rejection rate is 7.3%, and the estimated effect is smaller than the true effect.
For the AME and ATE, when the true DGP is a probit, we have the following:
Table 3: Average Marginal and Treatment Effects: True DGP Probit
Statistic | Approximate True Value | Probit | Regress (LPM) |
---|---|---|---|
AME of x1 | -.094 | -.094 | -.121 |
5% Rejection Rate | | .047 | 1 |
ATE of x2 | .111 | .111 | .111 |
5% Rejection Rate | | .065 | .061 |
The probit model estimates are close to the true value, and the rejection rate of the true null hypothesis is close to 5%. For the linear probability model, the rejection rate is 100% for the AME. For the ATE, the rejection rate and point estimates are close to what is estimated using a probit.
For the MEM and TEM, we have the following:
Table 4: Marginal and Treatment Effects at Mean Values: True DGP Probit
Statistic | Approximate True Value | Probit | Regress (LPM) |
---|---|---|---|
MEM of x1 | -.121 | -.122 | -.121 |
5% Rejection Rate | | .063 | .054 |
TEM of x2 | .150 | .150 | .110 |
5% Rejection Rate | | .059 | .158 |
For the MEM, the probit and linear probability model produce reliable inference. For the TEM, the probit marginal effects behave as expected, but the linear probability model has a rejection rate of 16%, and the point estimates are not close to the true value.
Simulation design
Below is the code I used to generate the data for my simulations. In the first part, lines 6 to 13, I generate outcome variables that satisfy the assumptions of the logit model, y, and the probit model, yp. In the second part, lines 15 to 19, I compute the marginal effects for the logit and probit models. I have a continuous and a discrete covariate. For the discrete covariate, the marginal effect is a treatment effect. In the third part, lines 21 to 29, I compute the marginal effects evaluated at the means. I will use these estimates later to compute approximations to the true values of the effects.
program define mkdata
    syntax, [n(integer 1000)]
    clear
    quietly set obs `n'
    // 1. Generating data from probit, logit, and misspecified
    generate x1 = rchi2(2)-2
    generate x2 = rbeta(4,2)>.2
    generate u  = runiform()
    generate e  = ln(u) - ln(1-u)
    generate ep = rnormal()
    generate xb = .5*(1 - x1 + x2)
    generate y  = xb + e > 0
    generate yp = xb + ep > 0
    // 2. Computing probit & logit marginal and treatment effects
    generate m1  = exp(xb)*(-.5)/(1+exp(xb))^2
    generate m2  = exp(1 -.5*x1)/(1+ exp(1 -.5*x1)) - ///
                   exp(.5 -.5*x1)/(1+ exp(.5 -.5*x1))
    generate m1p = normalden(xb)*(-.5)
    generate m2p = normal(1 -.5*x1) - normal(.5 -.5*x1)
    // 3. Computing marginal and treatment effects at means
    quietly mean x1 x2
    matrix A  = r(table)
    scalar a  = .5 -.5*A[1,1] + .5*A[1,2]
    scalar b1 = 1 -.5*A[1,1]
    scalar b0 = .5 -.5*A[1,1]
    generate mean1  = exp(a)*(-.5)/(1+exp(a))^2
    generate mean2  = exp(b1)/(1+ exp(b1)) - exp(b0)/(1+ exp(b0))
    generate mean1p = normalden(a)*(-.5)
    generate mean2p = normal(b1) - normal(b0)
end
I approximate the true marginal effects using a sample of 20 million observations. This is a reasonable strategy in this case. For example, take the average marginal effect for a continuous covariate, \(x_{k}\), in the case of the probit model:
\[\begin{equation*}
\frac{1}{N}\sum_{i=1}^N \phi\left(x_{i}\boldsymbol{\beta}\right)\beta_{k}
\end{equation*}\]
The expression above is an approximation of \(E\left(\phi\left(x_{i}\boldsymbol{\beta}\right)\beta_{k}\right)\). To obtain this expected value, we would need to integrate over the distribution of all the covariates. This is not practical and would limit my choice of covariates. Instead, I draw a sample of 20 million observations, compute \(\frac{1}{N}\sum_{i=1}^N \phi\left(x_{i}\boldsymbol{\beta}\right)\beta_{k}\), and take it to be the true value. I follow the same logic for the other marginal effects.
Below is the code I use to compute the approximate true marginal effects. I draw the 20 million observations, compute the averages that I will use in my simulation, and create locals for each approximate true value.
. mkdata, n(`L')
(2 missing values generated)

. local values "m1 m2 mean1 mean2 m1p m2p mean1p mean2p"

. local means "mx1 mx2 meanx1 meanx2 mx1p mx2p meanx1p meanx2p"

. local n : word count `values'

. forvalues i= 1/`n' {
  2.         local a: word `i' of `values'
  3.         local b: word `i' of `means'
  4.         sum `a', meanonly
  5.         local `b' = r(mean)
  6. }
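The locals `L`, `N`, and `R` that appear in the logs are assumed to be defined earlier in the do-file; based on the sample sizes reported in the text, the definitions would look something like this (the names are taken from the logs, the values from the text):

```stata
local L = 20000000   // observations used to approximate the true effects
local N = 10000      // sample size in each simulation replication
local R = 5000       // number of replications
```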
Now, I am ready to run all the simulations that I used to produce the results in the previous section. The code that I used for the simulations for the TEM and the MEM when the true DGP is a logit is given by:
. postfile lpm y1l y1l_r y1lp y1lp_r y2l y2l_r y2lp y2lp_r ///
>         using simslpm, replace

. forvalues i=1/`R' {
  2.         quietly {
  3.                 mkdata, n(`N')
  4.                 logit y x1 i.x2, vce(robust)
  5.                 margins, dydx(*) atmeans post vce(unconditional)
  6.                 local y1l = _b[x1]
  7.                 test _b[x1] = `meanx1'
  8.                 local y1l_r = (r(p)<.05)
  9.                 local y2l = _b[1.x2]
 10.                 test _b[1.x2] = `meanx2'
 11.                 local y2l_r = (r(p)<.05)
 12.                 regress y x1 i.x2, vce(robust)
 13.                 margins, dydx(*) atmeans post vce(unconditional)
 14.                 local y1lp = _b[x1]
 15.                 test _b[x1] = `meanx1'
 16.                 local y1lp_r = (r(p)<.05)
 17.                 local y2lp = _b[1.x2]
 18.                 test _b[1.x2] = `meanx2'
 19.                 local y2lp_r = (r(p)<.05)
 20.                 post lpm (`y1l') (`y1l_r') (`y1lp') (`y1lp_r') ///
>                         (`y2l') (`y2l_r') (`y2lp') (`y2lp_r')
 21.         }
 22. }

. postclose lpm

. use simslpm, clear

. sum

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         y1l |      5,000   -.0985646      .00288  -.1083639  -.0889075
       y1l_r |      5,000       .0544     .226828          0          1
        y1lp |      5,000   -.0939211    .0020038  -.1008612  -.0868043
      y1lp_r |      5,000       .6182    .4858765          0          1
         y2l |      5,000    .1084959     .065586  -.1065291   .3743112
-------------+---------------------------------------------------------
       y2l_r |      5,000       .0618     .240816          0          1
        y2lp |      5,000    .0915894     .055462  -.0975456   .3184061
      y2lp_r |      5,000       .0732    .2604906          0          1
For the results for the AME and the ATE when the true DGP is a logit, I use margins without the atmeans option. The other cases are similar. I use robust standard errors for all computations because my likelihood model is an approximation to the true likelihood, and I use the option vce(unconditional) to account for the fact that I am using two-step M-estimation. See Wooldridge (2010) for more details on two-step M-estimation.
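Concretely, the only change needed inside the simulation loop to switch from the MEM/TEM to the AME/ATE is to drop atmeans from the margins calls; a sketch of the two variants:

```stata
* effects evaluated at the covariate means (MEM and TEM)
margins, dydx(*) atmeans post vce(unconditional)

* effects averaged over the sample (AME and ATE)
margins, dydx(*) post vce(unconditional)
```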
You can obtain the code used to produce these results here.
Conclusion
Using a probit or a logit model yields equivalent marginal effects. I provide evidence that the same cannot be said of the marginal effect estimates of the linear probability model when compared with those of the logit and probit models.
Acknowledgment
This post was inspired by a question posed by Stephen Jenkins after my previous post.
Reference
Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, Massachusetts: MIT Press.
This command builds on several previous posts; at a minimum, you should be familiar with Programming an estimation command in Stata: A first ado-command using Mata and Programming an estimation command in Stata: Computing OLS objects in Mata.
An OLS command with Mata computations
The Stata command myregress11 computes the results in Mata. The syntax of the myregress11 command is
myregress11 depvar [indepvars] [if] [in] [, noconstant]
where indepvars can contain factor variables or time-series variables.
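For example, assuming myregress11.ado is saved on the ado-path, a hypothetical run on the auto dataset shipped with Stata would look like this:

```stata
. sysuse auto, clear
. myregress11 price mpg i.rep78 trunk
. ereturn list    // shows e(b), e(V), e(N), e(rank), e(df_r), and e(cmd)
```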
In the remainder of this post, I discuss the code for myregress11.ado. I recommend that you click on the file name to download the code. To avoid scrolling, view the code in the do-file editor, or your favorite text editor, to see the line numbers.
I do not discuss the formulas for the OLS objects. See Programming an estimation command in Stata: Computing OLS objects in Mata for the formulas and Mata implementations.
*! version 11.0.0 11Jan2016
program define myregress11, eclass sortpreserve
    version 14
    syntax varlist(numeric ts fv) [if] [in] [, noCONStant ]
    marksample touse
    gettoken depvar indepvars : varlist
    _fv_check_depvar `depvar'
    fvexpand `indepvars'
    local cnames `r(varlist)'
    tempname b V N rank df_r
    mata: mywork("`depvar'", "`cnames'", "`touse'", "`constant'", ///
        "`b'", "`V'", "`N'", "`rank'", "`df_r'")
    if "`constant'" == "" {
        local cnames `cnames' _cons
    }
    matrix colnames `b' = `cnames'
    matrix colnames `V' = `cnames'
    matrix rownames `V' = `cnames'
    ereturn post `b' `V', esample(`touse') buildfvinfo
    ereturn scalar N = `N'
    ereturn scalar rank = `rank'
    ereturn scalar df_r = `df_r'
    ereturn local cmd "myregress11"
    ereturn display
end

mata:
void mywork( string scalar depvar,  string scalar indepvars,
             string scalar touse,   string scalar constant,
             string scalar bname,   string scalar Vname,
             string scalar nname,   string scalar rname,
             string scalar dfrname)
{
    real vector y, b, e, e2
    real matrix X, XpXi
    real scalar n, k

    y    = st_data(., depvar, touse)
    X    = st_data(., indepvars, touse)
    n    = rows(X)
    if (constant == "") {
        X = X,J(n,1,1)
    }
    XpXi = quadcross(X, X)
    XpXi = invsym(XpXi)
    b    = XpXi*quadcross(X, y)
    e    = y - X*b
    e2   = e:^2
    k    = cols(X) - diag0cnt(XpXi)
    V    = (quadsum(e2)/(n-k))*XpXi
    st_matrix(bname, b')
    st_matrix(Vname, V)
    st_numscalar(nname, n)
    st_numscalar(rname, k)
    st_numscalar(dfrname, n-k)
}
end
Let’s break this 74-line program into familiar pieces to make it easier to understand. Lines 2–35 define the ado-command, and lines 37–74 define the Mata work function that is used by the ado-command. Although there are more details, I used this structure in mymean8.ado, which I discussed in Programming an estimation command in Stata: A first ado-command using Mata.
The ado-command has four parts.
The Mata function mywork() also has four parts.
Now let’s discuss the ado-code in some detail. Line 5 uses the syntax command to put the names of the variables specified by the user into the local macro varlist, to parse an optional if restriction, to parse an optional in restriction, and to parse the noconstant option. The variables specified by the user must be numeric, may contain time-series variables, and may contain factor variables. For more detailed discussions of similar syntax usages, see Programming an estimation command in Stata: Allowing for sample restrictions and factor variables, Programming an estimation command in Stata: Allowing for options, and Programming an estimation command in Stata: Using a subroutine to parse a complex option.
Line 6 creates a temporary sample-identification variable and stores its name in the local macro touse. I discussed this usage in detail in the section surrounding code block 1 in Programming an estimation command in Stata: Allowing for sample restrictions and factor variables.
Line 8 uses gettoken to store the name of the dependent variable in the local macro depvar and the names of the independent variables in the local macro indepvars. Line 9 uses _fv_check_depvar to check that the name of the dependent variable is not a factor variable.
Line 11 uses fvexpand to expand the factor variables in indepvars. Line 12 puts the expanded names stored in r(varlist) by fvexpand in the local macro cnames. A single factor variable can imply more than one coefficient. fvexpand finds the canonical names for these coefficients and returns them in r(varlist). I have not used fvexpand until now because the Stata commands that I used to compute the results automatically created the coefficient names. Mata functions are designed for speed, so I must create the coefficient names when I use them.
Example 1 illustrates how one factor variable can imply more than one coefficient.
Example 1: fvexpand
. sysuse auto
(1978 Automobile Data)

. tabulate rep78

     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.90        2.90
          2 |          8       11.59       14.49
          3 |         30       43.48       57.97
          4 |         18       26.09       84.06
          5 |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00

. fvexpand i.rep78

. return list

macros:
              r(fvops) : "true"
            r(varlist) : "1b.rep78 2.rep78 3.rep78 4.rep78 5.rep78"

. summarize 2.rep78

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     2.rep78 |         69     .115942    .3225009          0          1
The tabulate results show that there are five levels in rep78. fvexpand finds the levels and creates a list of the names of the implied indicator variables 1b.rep78, 2.rep78, 3.rep78, 4.rep78, and 5.rep78. Comparing the results from summarize 2.rep78 and tabulate rep78 illustrates this notation. The b in 1b.rep78 identifies level 1 as the base category to be omitted when there is a constant in the model. Type help fvvarlist for more details.
Line 14 creates the temporary names for the results. For example, it stores a safe, temporary name in the local macro b that can be used for the matrix storing the point estimates. I discussed this usage in the section Using temporary names for global objects in Programming an estimation command in Stata: A first ado-command using Mata.
Lines 16 and 17 call the Mata function mywork(), which uses the information contained in the local macros depvar, cnames, touse, and constant to compute the results that are returned in the Stata objects whose names are stored in the local macros b, V, N, rank, and df_r.
Line 20 appends _cons to the local macro cnames, if the user specified the option noconstant.
Lines 23–25 put column names on the row vector of point estimates and row and column names on the matrix containing the estimated variance-covariance of the estimator (VCE).
Lines 27–31 post the results to e().
Line 33 displays a standard Stata output table, using the results in e(b), e(V), and e(df_r).
Note that the local macro b created on line 14 contains a temporary name that is passed to mywork() on line 17 and that the Stata matrix whose name is contained in the local macro b is used on lines 23 and 27. mywork() puts the vector of point estimates in the Stata matrix whose name is stored in the local macro b. Also note that the local macro V created on line 14 contains a temporary name that is passed to mywork() on line 17 and that the Stata matrix whose name is contained in the local macro V is used on lines 24, 25, and 27. mywork() puts the estimated VCE in the Stata matrix whose name is stored in the local macro V.
To see how this works, let’s discuss the mywork() function in detail. Lines 39–43 declare that mywork() returns nothing (it is void) and that it accepts nine arguments, each of which is a string scalar. The first four arguments are inputs; depvar contains the name of the dependent variable, indepvars contains the names of the independent variables, touse contains the name of the sample-identification variable, and constant contains either noconstant or is empty. The values of these arguments are used on lines 50, 51, and 54 to create the vector y and the matrix X.
The last five arguments contain names used to write the results back to Stata. mywork() writes the results back to Stata using the passed-in temporary names. For example, line 17 shows that the Mata string scalar bname contains the temporary name stored in the local macro b. Line 66 copies the results stored in the transpose of the Mata vector b to a Stata matrix whose name is stored in the Mata string scalar bname. (Line 60 shows that the vector b contains the OLS point estimates.) Lines 23 and 27 then use this Stata vector whose name is stored in the local macro b. Similarly, line 17 shows that the Mata string scalar Vname contains the temporary name stored in the local macro V. Line 67 copies the results stored in the Mata matrix V to a Stata matrix whose name is stored in the Mata string scalar Vname. (Line 64 shows that V contains the estimated VCE.) Lines 24, 25, and 27 then use this Stata matrix whose name is stored in the local macro V. The arguments nname, rname, and dfrname are used analogously to return the results for the number of observations, the rank of the VCE, and the degrees of freedom of the residuals.
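This two-way pattern, Stata creating a safe temporary name and Mata writing a result under that name, can be seen in a minimal standalone sketch (a toy example, not part of myregress11.ado):

```stata
tempname b                          // safe, temporary name for a Stata matrix
mata: st_matrix("`b'", (1, 2, 3))   // Mata writes a row vector to that name
matrix list `b'                     // the ado-code can now use the result
```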
Lines 50–64 compute the point estimates and the VCE. Except for line 54, I discussed these computations in Programming an estimation command in Stata: A first ado-command using Mata. Line 54 causes a column of 1s to be joined to the covariate matrix X when the string scalar constant is empty. Lines 5 and 16 imply that the Mata string scalar constant contains noconstant when the user specifies the noconstant option and that it is empty otherwise.
Done and undone
I discussed the code for myregress11.ado, which uses Mata to compute OLS point estimates and a VCE that assumes independent and identically distributed observations. The structure of the code is the same as the one that I used in mymean7.ado and mymean8.ado, discussed in Programming an estimation command in Stata: A first ado-command using Mata, although there are more details in the OLS program.
Key to this structure is that the Mata work function accepts two types of arguments: the names of Stata objects that are inputs and temporary names that are used to write the results back to Stata from Mata.
In the next post, I extend myregress11.ado to allow for robust or cluster-robust estimators of the VCE.
I show that the estimates from a probit and logit model are similar for the computation of a set of effects that are of interest to researchers. I focus on the effects of changes in the covariates on the probability of a positive outcome for continuous and discrete covariates. I evaluate these effects on average and at the mean value of the covariates. In other words, I study the average marginal effects (AME), the average treatment effects (ATE), the marginal effects at the mean values of the covariates (MEM), and the treatment effects at the mean values of the covariates (TEM).
First, I present the results. Second, I discuss the code used for the simulations.
Results
In Table 1, I present the results of a simulation with 4,000 replications when the true data-generating process (DGP) satisfies the assumptions of a probit model. I show the averages of the AME and ATE estimates and the 5% rejection rates of the true null hypotheses after probit and logit estimation. I also provide an approximate true value of the AME and ATE. I obtain the approximate true values by computing the ATE and AME, at the true values of the coefficients, using a sample of 20 million observations. I will provide more details on the simulation in a later section.
Table 1: Average Marginal and Treatment Effects: True DGP Probit
Statistic | Approximate True Value | Probit | Logit |
---|---|---|---|
AME of x1 | -.1536 | -.1537 | -.1537 |
5% Rejection Rate | | .050 | .052 |
ATE of x2 | .1418 | .1417 | .1417 |
5% Rejection Rate | | .050 | .049 |
For the MEM and TEM, we have the following:
Table 2: Marginal and Treatment Effects at Mean Values: True DGP Probit
Statistic | Approximate True Value | Probit | Logit |
---|---|---|---|
MEM of x1 | -.1672 | -.1673 | -.1665 |
5% Rejection Rate | | .056 | .060 |
TEM of x2 | .1499 | .1498 | .1471 |
5% Rejection Rate | | .053 | .058 |
The logit estimates are close to the true value and have a rejection rate that is close to 5%. Fitting the parameters of our model using logit when the true DGP satisfies the assumptions of a probit model does not lead us astray.
If the true DGP satisfies the assumptions of the logit model, the conclusions are the same. I present the results in the next two tables.
Table 3: Average Marginal and Treatment Effects: True DGP Logit
Statistic | Approximate True Value | Probit | Logit |
---|---|---|---|
AME of x1 | -.1090 | -.1088 | -.1089 |
5% Rejection Rate | | .052 | .052 |
ATE of x2 | .1046 | .1044 | .1045 |
5% Rejection Rate | | .053 | .051 |
Table 4: Marginal and Treatment Effects at Mean Values: True DGP Logit
Statistic | Approximate True Value | Probit | Logit |
---|---|---|---|
MEM of x1 | -.1146 | -.1138 | -.1146 |
5% Rejection Rate | | .050 | .051 |
TEM of x2 | .1086 | .1081 | .1085 |
5% Rejection Rate | | .058 | .058 |
Why?
Maximum likelihood estimators find the parameter values that make the observed data most likely under the distributional assumptions that we make. The likelihood chosen is an approximation to the true likelihood, and it is a helpful approximation if the true likelihood and our approximating likelihood are close to each other. Viewing likelihood-based models as useful approximations, instead of as models of a true likelihood, is the basis of quasilikelihood theory. For more details, see White (1996) and Wooldridge (2010).
The probit model assumes that the unobservable random variable follows a standard normal distribution, while the logit model assumes that it follows a standard logistic distribution. The cumulative distribution functions (CDFs) of these two distributions are close to each other, especially around the mean. Therefore, estimators derived under these two sets of assumptions produce similar results. To illustrate these arguments, we can plot the two CDFs and their difference as follows:
Graph 1: Normal and Logistic CDFs and Their Difference
The difference between the CDFs approaches zero as you get closer to the mean, from the right or from the left, and it is always smaller than .15.
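The size of this difference can be checked numerically. Below is a small sketch using SciPy's standard normal and standard logistic distributions (the standard logistic, with scale 1, is the same distribution as the one implied by the logit error e2 = ln(u) - ln(1-u) used later in this post):

```python
import numpy as np
from scipy.stats import norm, logistic

# Grid over which to compare the standard normal and standard logistic CDFs
x = np.linspace(-8, 8, 160001)

# Pointwise difference between the two CDFs
diff = norm.cdf(x) - logistic.cdf(x)

# Both CDFs equal .5 at the mean, so the difference vanishes there ...
print(norm.cdf(0.0) - logistic.cdf(0.0))   # 0.0

# ... and the largest absolute difference (about .117) stays below .15
print(np.abs(diff).max())
```

The maximum absolute difference occurs a bit more than one standard-normal unit away from the mean, consistent with the claim that the two CDFs are closest near the mean.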
Simulation design
Below is the code I used to generate the data for my simulations. In the first part, I generate outcome variables that satisfy the assumptions of the probit model, y1, and the logit model, y2. In the second part, I compute the closed-form marginal effects for the probit and logit models. I have a continuous and a discrete covariate; for the discrete covariate, the marginal effect is a treatment effect. In the third part, I compute the marginal effects evaluated at the means of the covariates. I use these quantities later to compute approximations to the true values of the effects.
program define mkdata
        syntax, [n(integer 1000)]
        clear
        quietly set obs `n'
        // 1. Generating data from probit and logit models
        generate x1 = rnormal()
        generate x2 = rbeta(2,4)>.5
        generate e1 = rnormal()
        generate u  = runiform()
        generate e2 = ln(u) - ln(1-u)
        generate xb = .5*(1 - x1 + x2)
        generate y1 = xb + e1 > 0
        generate y2 = xb + e2 > 0
        // 2. Computing probit & logit marginal and treatment effects
        generate m1  = normalden(xb)*(-.5)
        generate m2  = normal(1 - .5*x1) - normal(.5 - .5*x1)
        generate m1l = exp(xb)*(-.5)/(1+exp(xb))^2
        generate m2l = exp(1 - .5*x1)/(1 + exp(1 - .5*x1)) - ///
                exp(.5 - .5*x1)/(1 + exp(.5 - .5*x1))
        // 3. Computing probit & logit marginal and treatment effects at means
        quietly mean x1 x2
        matrix A  = r(table)
        scalar a  = .5 - .5*A[1,1] + .5*A[1,2]
        scalar b1 = 1 - .5*A[1,1]
        scalar b0 = .5 - .5*A[1,1]
        generate mean1  = normalden(a)*(-.5)
        generate mean2  = normal(b1) - normal(b0)
        generate mean1l = exp(a)*(-.5)/(1+exp(a))^2
        generate mean2l = exp(b1)/(1 + exp(b1)) - exp(b0)/(1 + exp(b0))
end
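The first two parts of mkdata can be cross-checked outside Stata. The following is a hypothetical Python translation (not part of the original do-file; the variable names simply mirror the Stata code), assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(12345)
n = 2_000_000

# 1. Generate data satisfying the probit (y1) and logit (y2) assumptions
x1 = rng.normal(size=n)                            # continuous covariate
x2 = (rng.beta(2, 4, size=n) > .5).astype(float)   # discrete covariate
e1 = rng.normal(size=n)                            # standard normal error
u  = rng.uniform(size=n)
e2 = np.log(u) - np.log(1 - u)                     # inverse-CDF logistic draw
xb = .5 * (1 - x1 + x2)
y1 = (xb + e1 > 0).astype(int)
y2 = (xb + e2 > 0).astype(int)

# 2. Closed-form marginal effect of x1 under the probit DGP: phi(xb)*beta_1
m1 = norm.pdf(xb) * (-.5)

# Its sample average approximates the AME of x1 from Table 1 (about -.154)
print(m1.mean())
```

The line e2 = ln(u) - ln(1-u) is the inverse-CDF method: if u is uniform on (0,1), then ln(u/(1-u)) follows the standard logistic distribution, which is what the logit model assumes for its error.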
I approximate the true marginal effects using a sample of 20 million observations. This is a reasonable strategy in this case. For example, take the average marginal effect for a continuous covariate, \(x_{k}\), in the case of the probit model:
\[\begin{equation*}
\frac{1}{N}\sum_{i=1}^N \phi\left(x_{i}\beta\right)\beta_{k}
\end{equation*}\]
The expression above is an approximation to \(E\left(\phi\left(x_{i}\beta\right)\beta_{k}\right)\). To obtain this expected value, we would need to integrate over the distribution of all the covariates. This is not practical and would limit my choice of covariates. Instead, I draw a sample of 20 million observations, compute \(\frac{1}{N}\sum_{i=1}^N \phi\left(x_{i}\beta\right)\beta_{k}\), and take it to be the true value. I follow the same logic for the other marginal effects.
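As a sketch of this logic (with hypothetical, smaller sample sizes rather than the 20 million observations used in the post), the sample average of \(\phi\left(x_{i}\beta\right)\beta_{k}\) under the mkdata DGP stabilizes as the sample grows, which is what justifies treating a very large draw as an approximate truth:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1971)

def ame_probit(n):
    """Sample average of phi(x_i*beta)*beta_1 under the DGP used in mkdata."""
    x1 = rng.normal(size=n)
    x2 = (rng.beta(2, 4, size=n) > .5).astype(float)
    xb = .5 * (1 - x1 + x2)
    return (norm.pdf(xb) * (-.5)).mean()

# The average settles down as n grows; a very large draw therefore serves
# as an approximate true value of the AME (about -.1536 in the post)
for n in (1_000, 100_000, 5_000_000):
    print(n, ame_probit(n))
```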
Below is the code I use to compute the approximate true marginal effects. I draw the 20 million observations, then I compute the averages that I am going to use in my simulation, and I create locals for each approximate true value.
. mkdata, n(20000000)
. local values "m1 m2 m1l m2l mean1 mean2 mean1l mean2l"
. local means  "mx1 mx2 mx1l mx2l meanx1 meanx2 meanx1l meanx2l"
. local n : word count `values'
. forvalues i = 1/`n' {
  2.         local a : word `i' of `values'
  3.         local b : word `i' of `means'
  4.         sum `a', meanonly
  5.         local `b' = r(mean)
  6. }
Now I am ready to run all the simulations that I used to produce the results in the previous section. The code that I used for the simulations for the ATE and the AME when the true DGP is a probit is given by
. postfile mprobit y1p y1p_r y1l y1l_r y2p y2p_r y2l y2l_r ///
>         using simsprobit, replace
. forvalues i=1/4000 {
  2.         quietly {
  3.                 mkdata, n(10000)
  4.                 probit y1 x1 i.x2, vce(robust)
  5.                 margins, dydx(*) post vce(unconditional)
  6.                 local y1p = _b[x1]
  7.                 test _b[x1] = `mx1'
  8.                 local y1p_r = (r(p)<.05)
  9.                 local y2p = _b[1.x2]
 10.                 test _b[1.x2] = `mx2'
 11.                 local y2p_r = (r(p)<.05)
 12.                 logit y1 x1 i.x2, vce(robust)
 13.                 margins, dydx(*) post vce(unconditional)
 14.                 local y1l = _b[x1]
 15.                 test _b[x1] = `mx1'
 16.                 local y1l_r = (r(p)<.05)
 17.                 local y2l = _b[1.x2]
 18.                 test _b[1.x2] = `mx2'
 19.                 local y2l_r = (r(p)<.05)
 20.                 post mprobit (`y1p') (`y1p_r') (`y1l') (`y1l_r') ///
>                         (`y2p') (`y2p_r') (`y2l') (`y2l_r')
 21.         }
 22. }
. use simsprobit
. summarize

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         y1p |      4,000   -.1536812    .0038952  -.1697037  -.1396532
       y1p_r |      4,000         .05    .2179722          0          1
         y1l |      4,000   -.1536778    .0039179  -.1692524  -.1396366
       y1l_r |      4,000      .05175    .2215496          0          1
         y2p |      4,000     .141708    .0097155   .1111133   .1800973
-------------+---------------------------------------------------------
       y2p_r |      4,000       .0495    .2169367          0          1
         y2l |      4,000    .1416983    .0097459   .1102069   .1789895
       y2l_r |      4,000        .049     .215895          0          1
For the results in the case of the MEM and the TEM when the true DGP is a probit, I use margins with the option atmeans. The other cases are similar. I use robust standard errors for all computations to account for the fact that my likelihood model is an approximation to the true likelihood, and I use the option vce(unconditional) to account for the fact that I am using two-step M-estimation. See Wooldridge (2010) for more details on two-step M-estimation.
You can download the code used to produce the results by clicking this link: pvsl.do
Concluding Remarks
I provided simulation evidence illustrating that the difference between estimated effects after probit and logit is negligible. The reason lies in the theory of quasilikelihood and, specifically, in the fact that the cumulative distribution functions underlying the probit and logit models are close to each other, especially around the mean.
References
White, H. 1996. Estimation, Inference, and Specification Analysis. Cambridge: Cambridge University Press.
Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, Massachusetts: MIT Press.