Home > Statistics > Flexible discrete choice modeling using a multinomial probit model, part 1

## Flexible discrete choice modeling using a multinomial probit model, part 1


We make choices every day, and often these choices are made among a finite number of potential alternatives. For example, do we take the car or ride a bike to get to work? Will we have dinner at home or eat out, and if we eat out, where do we go? Scientists, marketing analysts, or political consultants, to name a few, wish to find out why people choose what they choose.

In this post, I provide some background about discrete choice models, specifically, the multinomial probit model. I discuss this model from a random utility model perspective and show you how to simulate data from it. This is helpful for understanding the underpinnings of this model. In my next post, we will use the simulated data to demonstrate how to estimate and interpret effects of interest.

Random utility model and discrete choice

A person confronted with a discrete set of alternatives is assumed to choose the alternative that maximizes his or her utility in some defined way. Utilities are typically conceived of as the result of a function that consists of an observed deterministic and an unobserved random part, because not all factors that may be relevant for a given decision can be observed. The frequently used linear random utility model is

$U_{ij} = V_{ij} + \epsilon_{ij}, \hspace{5mm} j = 1,…,J$

where $$U_{ij}$$ is the utility of the $$i$$th individual related to the $$j$$th alternative, $$V_{ij}$$ is the observed component, and $$\epsilon_{ij}$$ is the unobserved component. In the context of regression modeling, the observed part, $$V_{ij}$$, is typically construed as some linear or nonlinear combination of observed characteristics related to individuals and alternatives and corresponding parameter estimates, while the parameters are estimated based on a model that makes certain assumptions about the distribution of the unobserved components, $$\epsilon_{ij}$$.

Motivating example

Let’s take a look at an example. Suppose that individuals can enroll in one of three health insurance plans: Sickmaster, Allgood, and Cowboy Health. Thus we have the following set of alternatives:

$s=\{\mathrm{Sickmaster},\mathrm{Allgood},\mathrm{Cowboy\, Health}\}$

We would expect a person’s utility related to each of the three alternatives to be a function of both personal characteristics (such as income or age) and characteristics of the health care plan (such as its price).

We might sample individuals and ask them which health plan they would prefer if they had to enroll in one of them. If we collected data on the person’s age (in decades), the person’s household income (in $10,000), and the price of a plan (in$100/month), then our data might look something like the first three cases from the simulated data below:

. list in 1/9, sepby(id)

+-----------------------------------------------------------+
| id             alt   choice   hhinc   age   price       U |
|-----------------------------------------------------------|
1. |  1      Sickmaster        1    3.66   2.1    2.05    2.38 |
2. |  1         Allgood        0    3.66   2.1    1.73   -1.04 |
3. |  1   Cowboy Health        0    3.66   2.1    1.07   -2.61 |
|-----------------------------------------------------------|
4. |  2      Sickmaster        0    3.75   4.2    2.19   -2.97 |
5. |  2         Allgood        1    3.75   4.2    1.12    0.29 |
6. |  2   Cowboy Health        0    3.75   4.2    0.78   -2.22 |
|-----------------------------------------------------------|
7. |  3      Sickmaster        0    2.32   2.4    2.25   -4.49 |
8. |  3         Allgood        0    2.32   2.4    1.31   -5.76 |
9. |  3   Cowboy Health        1    2.32   2.4    1.02    1.19 |
+-----------------------------------------------------------+


Taking the first case (id==1), we see that the case-specific variables hhinc and age are constant across alternatives and that the alternative-specific variable price varies over alternatives.

The variable alt labels the alternatives, and the binary variable choice indicates the chosen alternative (it is coded 1 for the chosen plan, and 0 otherwise). Because this is a simulated dataset, we know the underlying utilties that correspond to each alternative, and those are given in variable U. The first respondent’s utility is highest for the first alternative, and so the outcome variable choice takes the value 1 for alt==”Sickmaster” and 0 otherwise. This is the marginal distribution of cases over alternatives:

. tabulate alt if choice == 1

Insurance |
plan |      Freq.     Percent        Cum.
--------------+-----------------------------------
Sickmaster |      6,315       31.57       31.57
Allgood |      8,308       41.54       73.11
Cowboy Health |      5,377       26.89      100.00
--------------+-----------------------------------
Total |     20,000      100.00


As we will see below, a useful model for analyzing these types of data is the multinomial probit model.

Multinomial probit model

The multinomial probit model is a discrete choice model that is based on the assumption that the unobserved components in $$\epsilon_{ij}$$ come from a normal distribution. Different probit models arise from different specifications of $$V_{ij}$$ and different assumptions about $$\epsilon_{ij}$$. For example, with a basic multinomial probit model, as is implemented in Stata’s mprobit command (see [R] mprobit), we specify $$V_{ij}$$ to be

$V_{ij} = \xb_{i}\betab_{j}^{\,’}$

where $$\xb_{i}$$ is a vector of individual-specific covariates, and $$\betab_{j}$$ is the corresponding parameter vector for alternative $$j$$. The random components $$\epsilon_{ij}$$ are assumed to come from a multivariate normal distribution with mean zero and identity variance–covariance matrix. For example, if we had three alternatives, we would assume

\begin{equation*}
\epsilon_{ij} {\raise.17ex\hbox{$\scriptstyle\sim$}} \mathcal{N}(0,\Sigma) , \hspace{5mm}
\Sigma =
\begin{bmatrix}
1 & 0 & 0 \\
& 1 & 0 \\
& & 1 \\
\end{bmatrix}
\end{equation*}

Specifying the above covariance structure means that the unobserved components, $$\epsilon_{ij}$$, are assumed to be homoskedastic and independent across alternatives.

Independence implies that differences in utility between any two alternatives depend on these two alternatives but not on any of the other alternatives. This property is known as the independence from irrelevant alternatives (IIA) assumption. When the IIA assumption holds, it can lead to a number of convenient advantages such as studying only a subset of alternatives (see Train [2009, 48]). However, IIA is a fairly restrictive assumption that might not hold.

Continuing with our health care plan example, suppose that Sickmaster and Allgood both favor people with health problems, while Cowboy Health favors people who only rarely see a doctor. In this case, we would expect the utilities that correspond to alternatives Sickmaster and Allgood to be positively correlated while being negatively correlated with the utility corresponding to Cowboy Health. In other words, utilities with respect to alternatives Sickmaster and Allgood are related to those of Cowboy Health. In this case, we must use a model that relaxes the IIA assumption and allows for correlated utilities across alternatives.

Another potential limitation of our multinomial probit specification concerns the observed $$V_{ij}$$, which consists of the linear combination of individual-specific variables and alternative-specific parameters. In other words, we only consider observed variables that vary over persons but not over alternatives. In a setting like this, we would use

$V_{ij} = \xb_{i}\betab_{j}’ + \zb_{ij}\gammab’$

where $$\zb_{ij}$$ are alternative-specific variables that vary both over individuals and alternatives and $$\gammab$$ is the corresponding parameter vector. Combining this with our more flexible assumptions about the unobservables, we can write our model as

$U_{ij} = \xb_{i}\betab_{j}’ + \zb_{ij}\gammab’ + \epsilon_{ij}, \hspace{5mm} j = 1,…,J$

with $$\epsilon_{ij} {\raise.17ex\hbox{\(\scriptstyle\sim$$}} \mathcal{N}(0,\Sigma)\).

Assuming unstructured correlation and heteroskedastic errors across $$J=3$$ alternatives, for example, $$\Sigma$$ is given by

\begin{equation*}
\Sigma =
\begin{bmatrix}
\sigma_{11} & \sigma_{12} & \sigma_{13} \\
& \sigma_{22} & \sigma_{23} \\
& & \sigma_{33} \\
\end{bmatrix}
\end{equation*}

As we will see later, we can fit this model in Stata with the asmprobit command; see [R] asmprobit for details about the command and implemented methods.

We said in our health plan example that we think that the price that individual $$i$$ has to pay for the plan is important and it could vary both over individuals and alternatives. We can therefore write our utility model for three alternatives as

$U_{ij} = \beta_{j,\mathtt{cons}} + \beta_{j,\mathtt{hhinc}}{\tt hhinc}_{i} + \beta_{j,\mathtt{age}}{\tt age}_{i} + \gamma {\tt price}_{ij} + \epsilon_{ij}, \hspace{5mm} j = 1,2,3$

Simulation

We can simulate data assuming the data-generating process given in the above model. We will specify the two case-specific variables, household income (hhinc) and age (age), and we will take the price of the plan (price) as the alternative-specific variable. The case-specific variables hhinc and age will be constant across alternatives within each individual, while the alternative-specific variable price will vary over individuals and within individuals over alternatives.

We specify the following population parameters for $$\betab_{j}$$ and $$\gamma$$:

\begin{align*}
\beta_{1,\mathtt{cons}} &= -1, &\beta_{1,\mathtt{hhinc}} &= \hspace{2.7mm} 1, &\beta_{1,\mathtt{age}} &= -1 \\
\beta_{2,\mathtt{cons}} &= -6, &\beta_{2,\mathtt{hhinc}} &= \hspace{2.7mm} 0.5, &\beta_{2,\mathtt{age}} &= \hspace{2.7mm} 1 \\
\beta_{3,\mathtt{cons}} &= \hspace{2.7mm} 2, &\beta_{3,\mathtt{hhinc}} &= -1, &\beta_{3,\mathtt{age}} &= \hspace{2.7mm} 0.5 \\
\gamma &= -0.5 \\
\end{align*}

For $$\epsilon_{ij}$$, we will specify the following:

\begin{equation*}
\epsilon_{ij} {\raise.17ex\hbox{$\scriptstyle\sim$}} \mathcal{N}(0,\Sigma), \hspace{5mm}
\Sigma =
\begin{bmatrix}
2.1 & 0.6 & -0.5 \\
& 1.7 & -0.8 \\
& & 1.4 \\
\end{bmatrix}
\end{equation*}

With these specifications, we can now create a simulated dataset. We start by drawing our three error terms and two case-specific covariates:

. clear

. set seed 65482

. set obs 20000
number of observations (_N) was 0, now 20,000

. generate id = _n

. scalar s11 =  2.1

. scalar s22 =  1.7

. scalar s33 =  1.4

. scalar s12 =  0.6

. scalar s13 = -0.5

. scalar s23 = -0.8

. mat C = (s11,s12,s13)\
>         (s12,s22,s23)\
>         (s13,s23,s33)

. drawnorm e1 e2 e3, cov(C)

. generate double hhinc = max(0,rnormal(5,1.5))

. generate double age = runiformint(20,60)/10


To allow for alternative specific covariates, we will expand our data so that we will have one observation for each alternative for each case, then create an index for the alternatives, and then generate our variables $${\tt price}_{ij}$$:

. expand 3
(40,000 observations created)

. bysort id : gen alt = _n

. generate double price = rbeta(2,2) + 1.50 if alt == 1
(40,000 missing values generated)

. replace         price = rbeta(2,2) + 0.75 if alt == 2
(20,000 real changes made)

. replace         price = rbeta(2,2) + 0.25 if alt == 3
(20,000 real changes made)


We can now go ahead and generate three variables for the observed utility components, one for each alternative:

. generate double xb1 = -1.0 + 1.0*hhinc - 1.0*age - 0.5*price

. generate double xb2 = -6.0 + 0.5*hhinc + 1.0*age - 0.5*price

. generate double xb3 =  2.0 - 1.0*hhinc + 0.5*age - 0.5*price


To calculate the utilities that correspond to each alternative, we add the unobserved to the observed components:

. local snorm = sqrt((s11 + s22 - 2*s12)/2)

. generate double U1 = xb1*snorm' + e1

. generate double U2 = xb2*snorm' + e2

. generate double U3 = xb3*snorm' + e3


Looking at the code above, you will notice that we included a factor to scale our specified population parameters. This is due to identification details related to our model that I explain further in the Identification section. One thing we need to know now, however, is that for the model to be identified, the utilities need to be normalized for level and scale. Normalizing for level is straightforward because, since we are only interested in the utilities relative to each other, we can define a base-level alternative and then take the differences of utilities with respect to the set base. If we set the first alternative as the base, we can rewrite our model as follows:

\begin{align*}U^{*}_{ij} &= \beta_{j,\mathtt{cons}}-\beta_{1,\mathtt{cons}} + (\beta_{j,\mathtt{hhinc}}-\beta_{1,\mathtt{hhinc}}){\tt hhinc}_{i} +
(\beta_{j,\mathtt{age}}-\beta_{1,\mathtt{age}}){\tt age}_{i} \\
&\quad +
\gamma ({\tt price}_{ij}-{\tt price}_{i1}) + \epsilon_{ij}-\epsilon_{i1}, \hspace{5mm} j = 2,3
\end{align*}

This implies that only $$J-1$$ parameter vectors in $$\betab$$ are identified. Let’s define these parameters as

\begin{align*}
\Delta \beta_{j,\mathtt{cons}} &= \beta_{j,\mathtt{cons}}-\beta_{1,\mathtt{cons}} \\
\Delta \beta_{j,\mathtt{hhinc}} &= \beta_{j,\mathtt{hhinc}}-\beta_{1,\mathtt{hhinc}} \\
\Delta \beta_{j,\mathtt{age}} &= \beta_{j,\mathtt{age}}-\beta_{1,\mathtt{age}} \\
\end{align*}

for $$j = 2,3$$. The parameters in $$\betab_{j}$$ that we will try to recover will then be the following differences:

\begin{align*}
\Delta \beta_{2,\mathtt{cons}} &= -5 \\
\Delta \beta_{3,\mathtt{cons}} &= \hspace{2.7mm} 3 \\
\Delta \beta_{2,\mathtt{hhinc}} &= -0.5 \\
\Delta \beta_{3,\mathtt{hhinc}} &= -2 \\
\Delta \beta_{2,\mathtt{age}} &= \hspace{2.7mm} 2 \\
\Delta \beta_{3,\mathtt{age}} &= \hspace{2.7mm} 1.5 \\
\end{align*}

What is left to complete our simulated dataset is to generate the outcome variable that takes the value 1 if observation $$i$$ chooses alternative $$k$$, and 0 otherwise. To do this, we will first create a single variable for the utilities and then determine the alternative with the highest utility:

. quietly generate double U = .

. quietly generate y = .

. forval i = 1/3 {
2.     quietly replace U = Ui' if alt==i'
3. }

. bysort id : egen double umax_i = max(U)

. forval i = 1/3 {
2.     quietly bysort id : replace y = alt if umax_i == U
3. }

. generate choice = alt == y


We obtain the following by using asmprobit:

. asmprobit choice price, case(id) alternatives(alt) casevars(hhinc age)
> basealternative(1) scalealternative(2) nolog

Alternative-specific multinomial probit      Number of obs      =     60,000
Case variable: id                            Number of cases    =     20,000

Alternative variable: alt                    Alts per case: min =          3
avg =        3.0
max =          3
Integration sequence:      Hammersley
Integration points:               150           Wald chi2(5)    =    4577.15
Log simulated-likelihood = -11219.181           Prob > chi2     =     0.0000

----------------------------------------------------------------------------
choice |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
-------------+--------------------------------------------------------------
alt          |
price | -.4896106   .0523626    -9.35   0.000   -.5922394   -.3869818
-------------+---------------------------------------------------------------
Sickmaster   | (base alternative)
-------------+--------------------------------------------------------------
Allgood      |
hhinc | -.5006212   .0302981   -16.52   0.000   -.5600043    -.441238
age |  2.001367   .0306663    65.26   0.000    1.941262    2.061472
_cons | -4.980841   .1968765   -25.30   0.000   -5.366711    -4.59497
-------------+--------------------------------------------------------------
Cowboy_Hea~h |
hhinc | -1.991202   .1092118   -18.23   0.000   -2.205253    -1.77715
age |  1.494056   .0446662    33.45   0.000    1.406512    1.581601
_cons |  3.038869   .4066901     7.47   0.000    2.241771    3.835967
-------------+--------------------------------------------------------------
/lnl2_2 |  .5550228   .0742726     7.47   0.000    .4094512    .7005944
-------------+--------------------------------------------------------------
/l2_1 |   .667308   .1175286     5.68   0.000    .4369562    .8976598
----------------------------------------------------------------------------
(alt=Sickmaster is the alternative normalizing location)
(alt=Allgood is the alternative normalizing scale)


Looking at the above output, we see that the coefficient of the alternative-specific variable price is $$\widehat \gamma = -0.49$$, which is close to our specified population parameter of $$\gamma = -0.50$$. We can say the same about our case-specific variables. The estimated coefficients of hhinc are $$\widehat \Delta \beta_{2,\mathtt{hhinc}} = -0.50$$ for the second and $$\widehat \Delta \beta_{3,\mathtt{hhinc}} = -1.99$$ for the third alternative. The estimates for age are $$\widehat\Delta \beta_{2,\mathtt{age}} = 2.00$$ and $$\widehat \Delta \beta_{3,\mathtt{age}} = 1.49$$. The estimated differences in alternative-specific constants are $$\widehat \Delta \beta_{2,\mathtt{cons}} = -4.98$$ and $$\widehat \Delta \beta_{3,\mathtt{cons}} = 3.04$$.

Identification

Now let me shed more light on the identification details related to our model that we needed to consider when we simulated our dataset. An important feature of $$U_{ij}$$ is that the level as well as the scale of utility is irrelevant with respect to the chosen alternative because shifting the level by some constant amount, or multiplying it by a (positive) constant, does not change the rank order of utilities and thus would have no impact on the chosen alternative. This has important ramifications for modeling utilities because without a set level and scale of $$U_{ij}$$, there are an infinite number of parameters in $$V_{ij}$$ that yield the same outcome in terms of the chosen alternatives. Therefore, utilities need to be normalized to identify the parameters of the model.

We already saw how to normalize for level. Normalizing for scale is a bit more difficult, though, because we assume correlated and heteroskedastic errors. Because of the hetersokedasticity, we need to set the scale for one of the variances and then estimate the other variances in relation to the set variance. We must also account for the nonzero covariance between the errors, which makes additional identifying restrictions necessary. It turns out that given our model assumptions, only $$J(J-1)/2-1$$ parameters of our variance–covariance matrix are identifiable (see chapter 5 in Train [2009] for details about identifying restrictions in the context of probit models). To be concrete, our original variance–covariance matrix was the following:

\begin{equation*}
\Sigma =
\begin{bmatrix}
\sigma_{11} & \sigma_{12} & \sigma_{13} \\
& \sigma_{22} & \sigma_{23} \\
& & \sigma_{33} \\
\end{bmatrix}
\end{equation*}

Taking differences of correlated errors reduces the $$3 \times 3$$ matrix of error variances to a $$2 \times 2$$ variance–covariance matrix of error differences:

\begin{equation*}
\Sigma^{*} =
\begin{bmatrix}
& \sigma_{11}+\sigma_{22}-2\sigma_{12} & \sigma_{11}+\sigma_{23}-\sigma_{12}-\sigma_{13} \\
& & \sigma_{11}+\sigma_{33}-2\sigma_{13} \\
\end{bmatrix}
\end{equation*}

If we normalize this matrix with respect to the second alternative, we get

\begin{equation*}
\widetilde \Sigma^{*} =
\begin{bmatrix}
& 1 & (\sigma_{11}+\sigma_{23}-\sigma_{12}-\sigma_{13})/\nu \\
& & (\sigma_{11}+\sigma_{33}-2\sigma_{13})/\nu \\
\end{bmatrix}
\end{equation*}

where $$\nu = \sigma_{11}+\sigma_{22}-2\sigma_{12}$$. Because we also have to set the scale for our base alternative, our normalized matrix becomes

\begin{equation*}
\check \Sigma^{*} =
\begin{bmatrix}
& 2 & 2(\sigma_{11}+\sigma_{23}-\sigma_{12}-\sigma_{13})/\nu \\
& & 2(\sigma_{11}+\sigma_{33}-2\sigma_{13})/\nu \\
\end{bmatrix}
\end{equation*}

Thus, because utilities are scaled by the standard deviation, they are divided by $$\sqrt{\nu/2}$$. Now, getting back to our simulation, if we wish to recover our specified parameters, we need to scale them accordingly. We start from the variance–covariance matrix of error differences:

\begin{equation*}
\Sigma^{*} =
\begin{bmatrix}
& 2.1 + 1.7 – 2*0.6 & 2.1 – 0.8 – 0.6 + 0.5 \\
& & 2.1 + 1.4 – 2*-0.5 \\
\end{bmatrix}
=
\begin{bmatrix}
& 2.6 & 1.2 \\
& & 4.5 \\
\end{bmatrix}
\end{equation*}

Normalizing with respect to the second alternative yields

\begin{equation*}
\widetilde \Sigma^{*} =
\begin{bmatrix}
& 1 & 1.2/2.6 \\
& & 4.5/2.6 \\
\end{bmatrix}
=
\begin{bmatrix}
& 1 & 0.4615 \\
& & 1.7308 \\
\end{bmatrix}
\end{equation*}

and then multiplying $$\tilde \Sigma^{*}$$ by 2 yields

\begin{equation*}
\check \Sigma^{*} =
\begin{bmatrix}
& 2 & 0.9231 \\
& & 3.4615 \\
\end{bmatrix}
\end{equation*}

which are the true variance–covariance parameters. Our scaling term is $$\sqrt{2.6/2}$$, and because utilities will be divided by this term, we will need to multiply our parameters by this term.

Finally, we check if we can recover our variance–covariance parameters. We use the postestimation command estat covariance to display the estimated variance–covariance matrix of error differences:

. estat covariance

+-------------------------------------+
|              |   Allgood  Cowboy_~h |
|--------------+----------------------|
|      Allgood |         2            |
| Cowboy_Hea~h |   .943716   3.479797 |
+-------------------------------------+
Note: Covariances are for alternatives differenced with Sickmaster.
`

We see that our estimate is close to the true normalized covariance matrix.

Conclusion

I discussed multinomial probit models in a discrete choice context and showed how to generate a simulated dataset accordingly. In my next post, we will use our simulated dataset and discuss estimation and interpretation of model results, which is not as straightforward as one might think.

Reference

Train, K. E. 2009. Discrete Choice Methods with Simulation. 2nd ed. New York: Cambridge University Press.

• Syed Basher

I am working on a model of demand for alternative transportation modes. So your discussion has been extremely helpful. I look forward to the second part. Thanks.

• Joerg Luedicke

Glad to hear that it’s helpful. Thanks for the comment!

• Bo Chen

Hi, could you elaborate a little bit more on why multiplying 2 “to set the scale for our base alternative”? Thanks!