## Flexible discrete choice modeling using a multinomial probit model, part 1

\(\newcommand{\xb}{{\bf x}}
\newcommand{\betab}{\boldsymbol{\beta}}
\newcommand{\zb}{{\bf z}}
\newcommand{\gammab}{\boldsymbol{\gamma}}\)

**We have no choice but to choose**

We make choices every day, and often these choices are made among a finite number of potential alternatives. For example, do we take the car or ride a bike to get to work? Will we have dinner at home or eat out, and if we eat out, where do we go? Scientists, marketing analysts, or political consultants, to name a few, wish to find out why people choose what they choose.

In this post, I provide some background about discrete choice models, specifically, the multinomial probit model. I discuss this model from a random utility model perspective and show you how to simulate data from it. This is helpful for understanding the underpinnings of this model. In my next post, we will use the simulated data to demonstrate how to estimate and interpret effects of interest.

**Random utility model and discrete choice**

A person confronted with a discrete set of alternatives is assumed to choose the alternative that maximizes his or her utility in some defined way. Utilities are typically conceived of as the result of a function that consists of an observed deterministic and an unobserved random part, because not all factors that may be relevant for a given decision can be observed. The frequently used linear random utility model is

\[U_{ij} = V_{ij} + \epsilon_{ij}, \hspace{5mm} j = 1,\ldots,J\]

where \(U_{ij}\) is the utility of the \(i\)th individual related to the \(j\)th alternative, \(V_{ij}\) is the observed component, and \(\epsilon_{ij}\) is the unobserved component. In the context of regression modeling, the observed part, \(V_{ij}\), is typically construed as some linear or nonlinear combination of observed characteristics of individuals and alternatives and corresponding parameters, while the parameters are estimated based on a model that makes certain assumptions about the distribution of the unobserved components, \(\epsilon_{ij}\).
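The decision rule implied by the random utility model is simply an argmax over utilities. A minimal Python sketch with made-up numbers (not the health plan example below) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(12345)

# Hypothetical observed components V_ij for one individual, J = 3 alternatives
V = np.array([1.2, 0.4, -0.3])

# Unobserved components eps_ij; standard normal purely for illustration
eps = rng.standard_normal(3)

# U_ij = V_ij + eps_ij; the individual picks the alternative with highest utility
U = V + eps
choice = np.argmax(U)
```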

**Motivating example**

Let’s take a look at an example. Suppose that individuals can enroll in one of three health insurance plans: Sickmaster, Allgood, and Cowboy Health. Thus we have the following set of alternatives:

\[s=\{\mathrm{Sickmaster},\mathrm{Allgood},\mathrm{Cowboy\, Health}\}\]

We would expect a person’s utility related to each of the three alternatives to be a function of both personal characteristics (such as income or age) and characteristics of the health care plan (such as its price).

We might sample individuals and ask them which health plan they would prefer if they had to enroll in one of them. If we collected data on the person’s age (in decades), the person’s household income (in $10,000), and the price of a plan (in $100/month), then our data might look something like the first three cases from the simulated data below:

```
. list in 1/9, sepby(id)

     +-----------------------------------------------------------+
     | id             alt   choice   hhinc   age   price       U |
     |-----------------------------------------------------------|
  1. |  1      Sickmaster        1    3.66   2.1    2.05    2.38 |
  2. |  1         Allgood        0    3.66   2.1    1.73   -1.04 |
  3. |  1   Cowboy Health        0    3.66   2.1    1.07   -2.61 |
     |-----------------------------------------------------------|
  4. |  2      Sickmaster        0    3.75   4.2    2.19   -2.97 |
  5. |  2         Allgood        1    3.75   4.2    1.12    0.29 |
  6. |  2   Cowboy Health        0    3.75   4.2    0.78   -2.22 |
     |-----------------------------------------------------------|
  7. |  3      Sickmaster        0    2.32   2.4    2.25   -4.49 |
  8. |  3         Allgood        0    2.32   2.4    1.31   -5.76 |
  9. |  3   Cowboy Health        1    2.32   2.4    1.02    1.19 |
     +-----------------------------------------------------------+
```

Taking the first case (**id==1**), we see that the case-specific variables **hhinc** and **age** are constant across alternatives and that the alternative-specific variable **price** varies over alternatives.

The variable **alt** labels the alternatives, and the binary variable **choice** indicates the chosen alternative (it is coded 1 for the chosen plan, and 0 otherwise). Because this is a simulated dataset, we know the underlying utilities that correspond to each alternative, and those are given in variable **U**. The first respondent’s utility is highest for the first alternative, and so the outcome variable **choice** takes the value 1 for **alt=="Sickmaster"** and 0 otherwise. This is the marginal distribution of cases over alternatives:

```
. tabulate alt if choice == 1

    Insurance |
         plan |      Freq.     Percent        Cum.
--------------+-----------------------------------
   Sickmaster |      6,315       31.57       31.57
      Allgood |      8,308       41.54       73.11
Cowboy Health |      5,377       26.89      100.00
--------------+-----------------------------------
        Total |     20,000      100.00
```

As we will see below, a useful model for analyzing these types of data is the multinomial probit model.

**Multinomial probit model**

The multinomial probit model is a discrete choice model that is based on the assumption that the unobserved components in \(\epsilon_{ij}\) come from a normal distribution. Different probit models arise from different specifications of \(V_{ij}\) and different assumptions about \(\epsilon_{ij}\). For example, with a basic multinomial probit model, as is implemented in Stata’s **mprobit** command (see **[R] mprobit**), we specify \(V_{ij}\) to be

\[V_{ij} = \xb_{i}\betab_{j}'\]

where \(\xb_{i}\) is a vector of individual-specific covariates, and \(\betab_{j}\) is the corresponding parameter vector for alternative \(j\). The random components \(\epsilon_{ij}\) are assumed to come from a multivariate normal distribution with mean zero and identity variance–covariance matrix. For example, if we had three alternatives, we would assume

\begin{equation*}
\epsilon_{ij} \sim \mathcal{N}(0,\Sigma), \hspace{5mm}
\Sigma =
\begin{bmatrix}
1 & 0 & 0 \\
  & 1 & 0 \\
  &   & 1 \\
\end{bmatrix}
\end{equation*}

Specifying the above covariance structure means that the unobserved components, \(\epsilon_{ij}\), are assumed to be homoskedastic and independent across alternatives.

Independence implies that differences in utility between any two alternatives depend on those two alternatives but not on any of the other alternatives. This property is known as the independence from irrelevant alternatives (IIA) assumption. When the IIA assumption holds, it affords conveniences such as estimating the model on only a subset of alternatives (see Train [2009, 48]). However, IIA is a fairly restrictive assumption that might not hold.

Continuing with our health care plan example, suppose that **Sickmaster** and **Allgood** both favor people with health problems, while **Cowboy Health** favors people who only rarely see a doctor. In this case, we would expect the utilities that correspond to alternatives **Sickmaster** and **Allgood** to be positively correlated while being negatively correlated with the utility corresponding to **Cowboy Health**. In other words, utilities with respect to alternatives **Sickmaster** and **Allgood** are related to those of **Cowboy Health**. In this case, we must use a model that relaxes the IIA assumption and allows for correlated utilities across alternatives.

Another potential limitation of our multinomial probit specification concerns the observed component \(V_{ij}\), which consists of a linear combination of individual-specific variables and alternative-specific parameters; that is, it includes only observed variables that vary over persons but not over alternatives. To accommodate variables that also vary over alternatives, we use

\[V_{ij} = \xb_{i}\betab_{j}' + \zb_{ij}\gammab'\]

where \(\zb_{ij}\) are alternative-specific variables that vary both over individuals and alternatives and \(\gammab\) is the corresponding parameter vector. Combining this with our more flexible assumptions about the unobservables, we can write our model as

\[U_{ij} = \xb_{i}\betab_{j}' + \zb_{ij}\gammab' + \epsilon_{ij}, \hspace{5mm} j = 1,\ldots,J\]

with \(\epsilon_{ij} \sim \mathcal{N}(0,\Sigma)\).

Assuming unstructured correlation and heteroskedastic errors across \(J=3\) alternatives, for example, \(\Sigma\) is given by

\begin{equation*}
\Sigma =
\begin{bmatrix}
\sigma_{11} & \sigma_{12} & \sigma_{13} \\
            & \sigma_{22} & \sigma_{23} \\
            &             & \sigma_{33} \\
\end{bmatrix}
\end{equation*}

As we will see later, we can fit this model in Stata with the **asmprobit** command; see **[R] asmprobit** for details about the command and implemented methods.

Recall from our health plan example that the price individual \(i\) has to pay for a plan is important and varies both over individuals and alternatives. We can therefore write our utility model for three alternatives as

\[U_{ij} = \beta_{j,\mathtt{cons}} + \beta_{j,\mathtt{hhinc}}{\tt hhinc}_{i} + \beta_{j,\mathtt{age}}{\tt age}_{i} + \gamma {\tt price}_{ij} + \epsilon_{ij}, \hspace{5mm} j = 1,2,3\]

**Simulation**

We can simulate data assuming the data-generating process given in the above model. We will specify the two case-specific variables, household income (**hhinc**) and age (**age**), and we will take the price of the plan (**price**) as the alternative-specific variable. The case-specific variables **hhinc** and **age** will be constant across alternatives within each individual, while the alternative-specific variable **price** will vary over individuals and within individuals over alternatives.

We specify the following population parameters for \(\betab_{j}\) and \(\gamma\):

\begin{align*}
\beta_{1,\mathtt{cons}} &= -1, &\beta_{1,\mathtt{hhinc}} &= \hspace{2.7mm} 1, &\beta_{1,\mathtt{age}} &= -1 \\
\beta_{2,\mathtt{cons}} &= -6, &\beta_{2,\mathtt{hhinc}} &= \hspace{2.7mm} 0.5, &\beta_{2,\mathtt{age}} &= \hspace{2.7mm} 1 \\
\beta_{3,\mathtt{cons}} &= \hspace{2.7mm} 2, &\beta_{3,\mathtt{hhinc}} &= -1, &\beta_{3,\mathtt{age}} &= \hspace{2.7mm} 0.5 \\
\gamma &= -0.5
\end{align*}

For \(\epsilon_{ij}\), we will specify the following:

\begin{equation*}
\epsilon_{ij} \sim \mathcal{N}(0,\Sigma), \hspace{5mm}
\Sigma =
\begin{bmatrix}
2.1 & 0.6 & -0.5 \\
    & 1.7 & -0.8 \\
    &     & 1.4 \\
\end{bmatrix}
\end{equation*}

With these specifications, we can now create a simulated dataset. We start by drawing our three error terms and two case-specific covariates:

```
. clear

. set seed 65482

. set obs 20000
number of observations (_N) was 0, now 20,000

. generate id = _n

. scalar s11 = 2.1

. scalar s22 = 1.7

. scalar s33 = 1.4

. scalar s12 = 0.6

. scalar s13 = -0.5

. scalar s23 = -0.8

. mat C = (s11,s12,s13)\
>         (s12,s22,s23)\
>         (s13,s23,s33)

. drawnorm e1 e2 e3, cov(C)

. generate double hhinc = max(0,rnormal(5,1.5))

. generate double age = runiformint(20,60)/10
```

To allow for alternative-specific covariates, we will **expand** our data so that we have one observation for each alternative for each case, then create an index for the alternatives, and then generate our variable \({\tt price}_{ij}\):

```
. expand 3
(40,000 observations created)

. bysort id : gen alt = _n

. generate double price = rbeta(2,2) + 1.50 if alt == 1
(40,000 missing values generated)

. replace price = rbeta(2,2) + 0.75 if alt == 2
(20,000 real changes made)

. replace price = rbeta(2,2) + 0.25 if alt == 3
(20,000 real changes made)
```

We can now go ahead and generate three variables for the observed utility components, one for each alternative:

```
. generate double xb1 = -1.0 + 1.0*hhinc - 1.0*age - 0.5*price

. generate double xb2 = -6.0 + 0.5*hhinc + 1.0*age - 0.5*price

. generate double xb3 =  2.0 - 1.0*hhinc + 0.5*age - 0.5*price
```

To calculate the utilities that correspond to each alternative, we add the unobserved to the observed components:

```
. local snorm = sqrt((s11 + s22 - 2*s12)/2)

. generate double U1 = xb1*`snorm' + e1

. generate double U2 = xb2*`snorm' + e2

. generate double U3 = xb3*`snorm' + e3
```

Looking at the code above, you will notice that we included a factor to scale our specified population parameters. This is due to identification details related to our model that I explain further in the Identification section. One thing we need to know now, however, is that for the model to be identified, the utilities need to be normalized for level and scale. Normalizing for level is straightforward: because we are only interested in the utilities relative to each other, we can define a base alternative and take differences of utilities with respect to that base. If we set the first alternative as the base, we can rewrite our model as follows:

\begin{align*}
U^{*}_{ij} &= \beta_{j,\mathtt{cons}}-\beta_{1,\mathtt{cons}} + (\beta_{j,\mathtt{hhinc}}-\beta_{1,\mathtt{hhinc}}){\tt hhinc}_{i} + (\beta_{j,\mathtt{age}}-\beta_{1,\mathtt{age}}){\tt age}_{i} \\
&\quad + \gamma ({\tt price}_{ij}-{\tt price}_{i1}) + \epsilon_{ij}-\epsilon_{i1}, \hspace{5mm} j = 2,3
\end{align*}

This implies that only \(J-1\) parameter vectors in \(\betab\) are identified. Let’s define these parameters as

\begin{align*}
\Delta \beta_{j,\mathtt{cons}} &= \beta_{j,\mathtt{cons}}-\beta_{1,\mathtt{cons}} \\
\Delta \beta_{j,\mathtt{hhinc}} &= \beta_{j,\mathtt{hhinc}}-\beta_{1,\mathtt{hhinc}} \\
\Delta \beta_{j,\mathtt{age}} &= \beta_{j,\mathtt{age}}-\beta_{1,\mathtt{age}}
\end{align*}

for \(j = 2,3\). The parameters in \(\betab_{j}\) that we will try to recover will then be the following differences:

\begin{align*}
\Delta \beta_{2,\mathtt{cons}} &= -5 \\
\Delta \beta_{3,\mathtt{cons}} &= \hspace{2.7mm} 3 \\
\Delta \beta_{2,\mathtt{hhinc}} &= -0.5 \\
\Delta \beta_{3,\mathtt{hhinc}} &= -2 \\
\Delta \beta_{2,\mathtt{age}} &= \hspace{2.7mm} 2 \\
\Delta \beta_{3,\mathtt{age}} &= \hspace{2.7mm} 1.5
\end{align*}
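These differences follow directly from the population parameters specified earlier; a quick numerical cross-check (Python, using the values from above):

```python
import numpy as np

# Population parameters (cons, hhinc, age) per alternative, as specified above
beta = np.array([
    [-1.0,  1.0, -1.0],   # alternative 1 (base)
    [-6.0,  0.5,  1.0],   # alternative 2
    [ 2.0, -1.0,  0.5],   # alternative 3
])

# Identified parameters are differences relative to the base alternative
# rows: (Delta cons, Delta hhinc, Delta age) = (-5, -0.5, 2) and (3, -2, 1.5)
delta = beta[1:] - beta[0]
```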

What is left to complete our simulated dataset is to generate the outcome variable that takes the value 1 if individual \(i\) chooses alternative \(j\), and 0 otherwise. To do this, we will first create a single variable for the utilities and then determine the alternative with the highest utility:

```
. quietly generate double U = .

. quietly generate y = .

. forval i = 1/3 {
  2.     quietly replace U = U`i' if alt==`i'
  3. }

. bysort id : egen double umax_i = max(U)

. forval i = 1/3 {
  2.     quietly bysort id : replace y = alt if umax_i == U
  3. }

. generate choice = alt == y
```
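For readers without Stata, the same data-generating process can be sketched in Python with NumPy. This is a hedged translation of the Stata code above, not a drop-in replacement: the seed and random-number generators differ, so the simulated values will not match the Stata dataset.

```python
import numpy as np

rng = np.random.default_rng(65482)
n = 20_000

# Error covariance across the J = 3 alternatives, as specified above
Sigma = np.array([[ 2.1,  0.6, -0.5],
                  [ 0.6,  1.7, -0.8],
                  [-0.5, -0.8,  1.4]])
eps = rng.multivariate_normal(np.zeros(3), Sigma, size=n)

# Case-specific covariates: hhinc ~ max(0, N(5, 1.5)), age in decades
hhinc = np.maximum(0, rng.normal(5, 1.5, size=n))
age = rng.integers(20, 61, size=n) / 10     # uniform integers 20..60, like runiformint

# Alternative-specific covariate: shifted Beta(2,2) draws per alternative
price = np.column_stack([rng.beta(2, 2, n) + 1.50,
                         rng.beta(2, 2, n) + 0.75,
                         rng.beta(2, 2, n) + 0.25])

# Observed utility components, using the population parameters from above
xb = np.column_stack([-1.0 + 1.0*hhinc - 1.0*age,
                      -6.0 + 0.5*hhinc + 1.0*age,
                       2.0 - 1.0*hhinc + 0.5*age]) - 0.5*price

# Scale for identification (see the Identification section), then add errors
snorm = np.sqrt((2.1 + 1.7 - 2*0.6) / 2)
U = xb * snorm + eps

# Chosen alternative per case: the one with the highest utility
choice = U.argmax(axis=1)
```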

We obtain the following by using **asmprobit**:

```
. asmprobit choice price, case(id) alternatives(alt) casevars(hhinc age)
>     basealternative(1) scalealternative(2) nolog

Alternative-specific multinomial probit         Number of obs      =    60,000
Case variable: id                               Number of cases    =    20,000

Alternative variable: alt                       Alts per case: min =         3
                                                               avg =       3.0
                                                               max =         3

Integration sequence:      Hammersley
Integration points:               150           Wald chi2(5)       =   4577.15
Log simulated-likelihood = -11219.181           Prob > chi2        =    0.0000

------------------------------------------------------------------------------
      choice |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
alt          |
       price |  -.4896106   .0523626    -9.35   0.000    -.5922394   -.3869818
-------------+----------------------------------------------------------------
Sickmaster   |  (base alternative)
-------------+----------------------------------------------------------------
Allgood      |
       hhinc |  -.5006212   .0302981   -16.52   0.000    -.5600043    -.441238
         age |   2.001367   .0306663    65.26   0.000     1.941262    2.061472
       _cons |  -4.980841   .1968765   -25.30   0.000    -5.366711     -4.59497
-------------+----------------------------------------------------------------
Cowboy_Hea~h |
       hhinc |  -1.991202   .1092118   -18.23   0.000    -2.205253     -1.77715
         age |   1.494056   .0446662    33.45   0.000     1.406512    1.581601
       _cons |   3.038869   .4066901     7.47   0.000     2.241771    3.835967
-------------+----------------------------------------------------------------
     /lnl2_2 |   .5550228   .0742726     7.47   0.000     .4094512    .7005944
-------------+----------------------------------------------------------------
       /l2_1 |    .667308   .1175286     5.68   0.000     .4369562    .8976598
------------------------------------------------------------------------------
(alt=Sickmaster is the alternative normalizing location)
(alt=Allgood is the alternative normalizing scale)
```

Looking at the above output, we see that the coefficient of the alternative-specific variable **price** is \(\widehat\gamma = -0.49\), which is close to our specified population parameter of \(\gamma = -0.50\). We can say the same about our case-specific variables. The estimated coefficients of **hhinc** are \(\widehat{\Delta\beta}_{2,\mathtt{hhinc}} = -0.50\) for the second and \(\widehat{\Delta\beta}_{3,\mathtt{hhinc}} = -1.99\) for the third alternative. The estimates for **age** are \(\widehat{\Delta\beta}_{2,\mathtt{age}} = 2.00\) and \(\widehat{\Delta\beta}_{3,\mathtt{age}} = 1.49\). The estimated differences in alternative-specific constants are \(\widehat{\Delta\beta}_{2,\mathtt{cons}} = -4.98\) and \(\widehat{\Delta\beta}_{3,\mathtt{cons}} = 3.04\).

**Identification**

Now let me shed more light on the identification details related to our model that we needed to consider when we simulated our dataset. An important feature of \(U_{ij}\) is that the level as well as the scale of utility is irrelevant with respect to the chosen alternative because shifting the level by some constant amount, or multiplying it by a (positive) constant, does not change the rank order of utilities and thus would have no impact on the chosen alternative. This has important ramifications for modeling utilities because without a set level and scale of \(U_{ij}\), there are an infinite number of parameters in \(V_{ij}\) that yield the same outcome in terms of the chosen alternatives. Therefore, utilities need to be normalized to identify the parameters of the model.
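This invariance is easy to verify numerically: adding a constant to all utilities, or multiplying them by a positive constant, never changes which one is largest. A minimal Python sketch with arbitrary numbers:

```python
import numpy as np

U = np.array([0.7, -1.2, 2.3])   # arbitrary utilities for J = 3 alternatives

base = np.argmax(U)
shifted = np.argmax(U + 100.0)   # shift the level by a constant
scaled = np.argmax(3.5 * U)      # multiply by a positive constant

# base, shifted, and scaled are all identical: the chosen alternative is unchanged
```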

We already saw how to normalize for level. Normalizing for scale is a bit more difficult, though, because we assume correlated and heteroskedastic errors. Because of the heteroskedasticity, we need to set the scale for one of the variances and then estimate the other variances relative to the set variance. We must also account for the nonzero covariance between the errors, which makes additional identifying restrictions necessary. It turns out that given our model assumptions, only \(J(J-1)/2-1\) parameters of our variance–covariance matrix are identifiable (see chapter 5 in Train [2009] for details about identifying restrictions in the context of probit models). To be concrete, our original variance–covariance matrix was the following:

\begin{equation*}
\Sigma =
\begin{bmatrix}
\sigma_{11} & \sigma_{12} & \sigma_{13} \\
            & \sigma_{22} & \sigma_{23} \\
            &             & \sigma_{33} \\
\end{bmatrix}
\end{equation*}

Taking differences of correlated errors reduces the \(3 \times 3\) matrix of error variances to a \(2 \times 2\) variance–covariance matrix of error differences:

\begin{equation*}
\Sigma^{*} =
\begin{bmatrix}
\sigma_{11}+\sigma_{22}-2\sigma_{12} & \sigma_{11}+\sigma_{23}-\sigma_{12}-\sigma_{13} \\
 & \sigma_{11}+\sigma_{33}-2\sigma_{13} \\
\end{bmatrix}
\end{equation*}

If we normalize this matrix with respect to the second alternative, we get

\begin{equation*}
\widetilde \Sigma^{*} =
\begin{bmatrix}
1 & (\sigma_{11}+\sigma_{23}-\sigma_{12}-\sigma_{13})/\nu \\
 & (\sigma_{11}+\sigma_{33}-2\sigma_{13})/\nu \\
\end{bmatrix}
\end{equation*}

where \(\nu = \sigma_{11}+\sigma_{22}-2\sigma_{12}\). Because we also have to set the scale for our base alternative, our normalized matrix becomes

\begin{equation*}
\check \Sigma^{*} =
\begin{bmatrix}
2 & 2(\sigma_{11}+\sigma_{23}-\sigma_{12}-\sigma_{13})/\nu \\
 & 2(\sigma_{11}+\sigma_{33}-2\sigma_{13})/\nu \\
\end{bmatrix}
\end{equation*}

Thus, because utilities are scaled by the standard deviation, they are divided by \(\sqrt{\nu/2}\). Now, getting back to our simulation, if we wish to recover our specified parameters, we need to scale them accordingly. We start from the variance–covariance matrix of error differences:

\begin{equation*}
\Sigma^{*} =
\begin{bmatrix}
2.1 + 1.7 - 2(0.6) & 2.1 - 0.8 - 0.6 + 0.5 \\
 & 2.1 + 1.4 - 2(-0.5) \\
\end{bmatrix}
=
\begin{bmatrix}
2.6 & 1.2 \\
 & 4.5 \\
\end{bmatrix}
\end{equation*}
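The differencing and normalization arithmetic can be verified with a few lines of matrix algebra (Python sketch): differencing against base alternative 1 corresponds to pre- and post-multiplying \(\Sigma\) by the differencing matrix whose rows pick out \(\epsilon_{2}-\epsilon_{1}\) and \(\epsilon_{3}-\epsilon_{1}\).

```python
import numpy as np

Sigma = np.array([[ 2.1,  0.6, -0.5],
                  [ 0.6,  1.7, -0.8],
                  [-0.5, -0.8,  1.4]])

# Differencing against base alternative 1: rows are (e2 - e1) and (e3 - e1)
M = np.array([[-1.0, 1.0, 0.0],
              [-1.0, 0.0, 1.0]])
Sigma_star = M @ Sigma @ M.T
# equals [[2.6, 1.2], [1.2, 4.5]] up to floating point

# Normalize so the (1,1) element is 1, then set the base scale to 2
Sigma_tilde = Sigma_star / Sigma_star[0, 0]
Sigma_check = 2 * Sigma_tilde
# equals [[2, 0.9231], [0.9231, 3.4615]] up to rounding
```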

Normalizing with respect to the second alternative yields

\begin{equation*}
\widetilde \Sigma^{*} =
\begin{bmatrix}
1 & 1.2/2.6 \\
 & 4.5/2.6 \\
\end{bmatrix}
=
\begin{bmatrix}
1 & 0.4615 \\
 & 1.7308 \\
\end{bmatrix}
\end{equation*}

and then multiplying \(\widetilde \Sigma^{*}\) by 2 yields

\begin{equation*}
\check \Sigma^{*} =
\begin{bmatrix}
2 & 0.9231 \\
 & 3.4615 \\
\end{bmatrix}
\end{equation*}

which contains the true values of the normalized variance–covariance parameters. Our scaling term is \(\sqrt{2.6/2}\), and because utilities will be divided by this term, we need to multiply our specified parameters by it.

Finally, we check whether we can recover our variance–covariance parameters. We use the postestimation command **estat covariance** to display the estimated variance–covariance matrix of the error differences:

```
. estat covariance

  +-------------------------------------+
  |              |   Allgood  Cowboy_~h |
  |--------------+----------------------|
  |      Allgood |         2            |
  | Cowboy_Hea~h |   .943716   3.479797 |
  +-------------------------------------+
Note: Covariances are for alternatives differenced with Sickmaster.
```

We see that our estimate is close to the true normalized covariance matrix.

**Conclusion**

I discussed multinomial probit models in a discrete choice context and showed how to generate a simulated dataset accordingly. In my next post, we will use our simulated dataset and discuss estimation and interpretation of model results, which is not as straightforward as one might think.

**Reference**

Train, K. E. 2009. *Discrete Choice Methods with Simulation*. 2nd ed. New York: Cambridge University Press.