### Archive

Posts Tagged ‘influential points’

## Using resampling methods to detect influential points

As stated in the documentation for jackknife, an often forgotten utility for this command is the detection of overly influential observations.

Some commands, like logit or stcox, come with their own set of prediction tools to detect influential points. However, these kinds of predictions can be computed for virtually any regression command. In particular, we will see that the dfbeta statistics can be easily computed for any command that accepts the jackknife prefix. dfbeta statistics allow us to visualize how influential some observations are compared with the rest, concerning a specific parameter.

We will also compute Cook’s likelihood displacement, which is an overall measure of influence, and it can also be compared with a specific threshold.

### Using jackknife to compute dfbeta

The main task of jackknife is to fit the model while suppressing one observation at a time, which allows us to see how much results change when each observation is suppressed; in other words, it allows us to see how much each observation influences the results. A very intuitive measure of influence is dfbeta, which is the amount that a particular parameter changes when an observation is suppressed. There will be one dfbeta variable for each parameter. If $$\hat\beta$$ is the estimate for parameter $$\beta$$ obtained from the full data and $$\hat\beta_{(i)}$$ is the corresponding estimate obtained when the $$i$$th observation is suppressed, then the $$i$$th element of variable dfbeta is obtained as

$dfbeta = \hat\beta - \hat\beta_{(i)}$

Parameters $$\hat\beta$$ are saved by the estimation commands in matrix e(b) and also can be obtained using the _b notation, as we will show below. The leave-one-out values $$\hat\beta_{(i)}$$ can be saved in a new file by using the option saving() with jackknife. With these two elements, we can compute the dfbeta values for each variable.

Let’s see an example with the probit command.

. sysuse auto, clear
(1978 Automobile Data)

. *preserve original dataset
. preserve

. *generate a variable with the original observation number
. gen obs =_n

. probit foreign mpg weight

Iteration 0:   log likelihood =  -45.03321
Iteration 1:   log likelihood = -27.914626
Iteration 2:   log likelihood = -26.858074
Iteration 3:   log likelihood = -26.844197
Iteration 4:   log likelihood = -26.844189
Iteration 5:   log likelihood = -26.844189

Probit regression                                 Number of obs   =         74
LR chi2(2)      =      36.38
Prob > chi2     =     0.0000
Log likelihood = -26.844189                       Pseudo R2       =     0.4039

------------------------------------------------------------------------------
foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
mpg |  -.1039503   .0515689    -2.02   0.044    -.2050235   -.0028772
weight |  -.0023355   .0005661    -4.13   0.000     -.003445   -.0012261
_cons |   8.275464   2.554142     3.24   0.001     3.269437    13.28149
------------------------------------------------------------------------------

. *keep the estimation sample so each observation will be matched
. *with the corresponding replication
. keep if e(sample)
(0 observations deleted)

. *use jackknife to generate the replications, and save the values in
. *file b_replic
. jackknife, saving(b_replic, replace):  probit foreign mpg weight
(running probit on estimation sample)

Jackknife replications (74)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
........................

Probit regression                               Number of obs      =        74
Replications       =        74
F(   2,     73)    =     10.36
Prob > F           =    0.0001
Log likelihood = -26.844189                     Pseudo R2          =    0.4039

------------------------------------------------------------------------------
|              Jackknife
foreign |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
mpg |  -.1039503   .0831194    -1.25   0.215     -.269607    .0617063
weight |  -.0023355   .0006619    -3.53   0.001    -.0036547   -.0010164
_cons |   8.275464   3.506085     2.36   0.021     1.287847    15.26308
------------------------------------------------------------------------------

. *verify that all the replications were successful
. assert e(N_misreps) ==0

. merge 1:1 _n using b_replic

Result                           # of obs.
-----------------------------------------
not matched                             0
matched                                74  (_merge==3)
-----------------------------------------

. *see how values from replications are stored
. describe, fullnames

Contains data from .../auto.dta
obs:            74                          1978 Automobile Data
vars:            17                          13 Apr 2013 17:45
size:         4,440                          (_dta has notes)
--------------------------------------------------------------------------------
storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
make            str18   %-18s                 Make and Model
price           int     %8.0gc                Price
mpg             int     %8.0g                 Mileage (mpg)
rep78           int     %8.0g                 Repair Record 1978
trunk           int     %8.0g                 Trunk space (cu. ft.)
weight          int     %8.0gc                Weight (lbs.)
length          int     %8.0g                 Length (in.)
turn            int     %8.0g                 Turn Circle (ft.)
displacement    int     %8.0g                 Displacement (cu. in.)
gear_ratio      float   %6.2f                 Gear Ratio
foreign         byte    %8.0g      origin     Car type
obs             float   %9.0g
foreign_b_mpg   float   %9.0g                 [foreign]_b[mpg]
foreign_b_weight
float   %9.0g                 [foreign]_b[weight]
foreign_b_cons  float   %9.0g                 [foreign]_b[_cons]
_merge          byte    %23.0g     _merge
--------------------------------------------------------------------------------
Sorted by:
Note:  dataset has changed since last saved

. *compute the dfbeta for each covariate
. foreach var in mpg weight {
2.  gen dfbeta_var' = (_b[var'] -foreign_b_var')
3. }

. gen dfbeta_cons = (_b[_cons] - foreign_b_cons)

. label var obs "observation number"
. label var dfbeta_mpg "dfbeta for mpg"
. label var dfbeta_weight "dfbeta for weight"
. label var dfbeta_cons "dfbeta for the constant"

. *plot dfbeta values for variable mpg
. scatter dfbeta_mpg obs, mlabel(obs) title("dfbeta values for variable mpg")

. *restore original dataset
. restore


Based on the impact on the coefficient for variable mpg, observation 71 seems to be the most influential. We could create a similar plot for each parameter.

jackknife prints a dot for each successful replication and an ‘x’ for each replication that ends with an error. By looking at the output immediately following the jackknife command, we can see that all the replications were successful. However, we added an assert line in the code to avoid relying on visual inspection. If some replications failed, we would need to explore the reasons.

### A computational shortcut to obtain the dfbeta values

The command jackknife allows us to save the leave-one-out values in a different file. To use these, we would need to do some data management and merge the two files. On the other hand, the same command called with the option keep saves pseudovalues, which are defined as follows:

$\hat{\beta}_i^* = N\hat\beta - (N-1)\hat\beta_{(i)}$

where $$N$$ is the number of observations involved in the computation, returned as e(N). Therefore, using the pseudovalues, $$\beta_{(i)}$$ values can be computed as $\hat\beta_{(i)} = \frac{ N \hat\beta - \hat\beta^*_i}{N-1}$

Also, dfbeta values can be computed directly from the pseudovalues as $\hat\beta - \hat\beta_{(i)} = \frac{\hat\beta_{i}^* -\hat\beta} {N-1}$

Using the pseudovalues instead of the leave-one-out values simplifies our program because we don’t have to worry about matching each pseudovalue to the correct observation.

Let’s reproduce the previous example.

. sysuse auto, clear
(1978 Automobile Data)

. jackknife, keep: probit foreign  mpg weight
(running probit on estimation sample)

Jackknife replications (74)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
........................

Probit regression                               Number of obs      =        74
Replications       =        74
F(   2,     73)    =     10.36
Prob > F           =    0.0001
Log likelihood = -26.844189                     Pseudo R2          =    0.4039

------------------------------------------------------------------------------
|              Jackknife
foreign |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
mpg |  -.1039503   .0831194    -1.25   0.215     -.269607    .0617063
weight |  -.0023355   .0006619    -3.53   0.001    -.0036547   -.0010164
_cons |   8.275464   3.506085     2.36   0.021     1.287847    15.26308
------------------------------------------------------------------------------

. *see how pseudovalues are stored
. describe, fullnames

> dta
obs:            74                          1978 Automobile Data
vars:            15                          13 Apr 2013 17:45
size:         4,070                          (_dta has notes)
--------------------------------------------------------------------------------
storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
make            str18   %-18s                 Make and Model
price           int     %8.0gc                Price
mpg             int     %8.0g                 Mileage (mpg)
rep78           int     %8.0g                 Repair Record 1978
trunk           int     %8.0g                 Trunk space (cu. ft.)
weight          int     %8.0gc                Weight (lbs.)
length          int     %8.0g                 Length (in.)
turn            int     %8.0g                 Turn Circle (ft.)
displacement    int     %8.0g                 Displacement (cu. in.)
gear_ratio      float   %6.2f                 Gear Ratio
foreign         byte    %8.0g      origin     Car type
foreign_b_mpg   float   %9.0g                 pseudovalues: [foreign]_b[mpg]
foreign_b_weight
float   %9.0g                 pseudovalues: [foreign]_b[weight]
foreign_b_cons  float   %9.0g                 pseudovalues: [foreign]_b[_cons]
--------------------------------------------------------------------------------
Sorted by:  foreign
Note:  dataset has changed since last saved

. *verify that all the replications were successful
. assert e(N_misreps)==0

. *compute the dfbeta for each covariate
. local N = e(N)

. foreach var in  mpg weight {
2. gen dfbeta_var' = (foreign_b_var' - _b[var'])/(N'-1)
3. }

. gen dfbeta_cons' = (foreign_b_cons - _b[_cons])/(N'-1)

. *plot deff values for variable weight
. gen obs = _n

. label var obs "observation number"

. label var dfbeta_mpg "dfbeta for mpg"

. scatter dfbeta_mpg obs, mlabel(obs) title("dfbeta values for variable mpg")


### Dfbeta for grouped data

If you have panel data or a situation where each individual is represented by a group of observations (for example, conditional logit or survival models), you might be interested in influential groups. In this case, you would look at the changes on the parameters when each group is suppressed. Let’s see an example with xtlogit.

. webuse towerlondon, clear
. xtset family

. jackknife, cluster(family) idcluster(newclus) keep: xtlogit dtlm difficulty
. assert e(N_misreps)==0


The group-level pseudovalues will be saved on the first observations corresponding to each group, and there will be missing values on the rest. To compute the dfbeta value for the coefficient for difficulty, we type

. local N = e(N_clust)
. gen dfbeta_difficulty = (dtlm_b_difficulty - _b[difficulty])/(N'-1)


We can then plot those values:

. scatter dfbeta_difficulty newclus, mlabel(family) ///
title("dfbeta values for variable difficulty") xtitle("family")


Option idcluster() for jackknife generates a new variable that assigns consecutive integers to the clusters; using this variable produces a plot where families are equally spaced on the horizontal axis.

As before, we can see that some groups are more influential than others. It would require some research to find out whether this is a problem.

### Likelihood displacement

If we want a global measure of influence (that is, not tied to a particular parameter), we can compute the likelihood displacement values. We consider the likelihood displacement value as defined by Cook (1986):

$LD_i = 2[L(\hat\theta) - L(\hat\theta_{(i)})]$

where $$L$$ is the log-likelihood function (evaluated on the full dataset), $$\hat\theta$$ is the set of parameter estimates obtained from the full dataset, and $$\hat\theta_{(i)}$$ is the set of the parameter estimates obtained when leaving out the $$i$$th observation. Notice that what changes is the parameter vector. The log-likelihood function is always evaluated on the whole sample; provided that $$\hat\theta$$ is the set of parameters that maximizes the log likelihood, the log-likelihood displacement is always positive. Cook suggested, as a confidence region for this value, the interval $$[0, \chi^2_p(\alpha))$$, where $$\chi^2_p(\alpha)$$ is the ($$1-\alpha$$) quantile from a chi-squared distribution with $$p$$ degrees of freedom, and $$p$$ is the number of parameters in $$\theta$$.

To perform our assessment based on the likelihood displacement, we will need to do the following:

1. Create an $$N\times p$$ matrix B, where the $$i$$th row contains the vector of parameter estimates obtained by leaving the $$i$$th observation out.
2. Create a new variable L1 such that its $$i$$th observation contains the log likelihood evaluated at the parameter estimates in the $$i$$th row of matrix B.
3. Use variable L1 to obtain the LD matrix, containing the likelihood displacement values.
4. Construct a plot for the values in LD, and add the $$\chi^2_p(\alpha)$$ as a reference.

Let's do it with our probit model.

#### Step 1.

We first create the macro cmdline containing the command line for the model we want to use. We fit the model and save the original log likelihood in macro ll0.

With a loop, the leave-one-out parameters are saved in consecutive rows of matrix B. It is useful to have those values in a matrix, because we will then extract each row to evaluate the log likelihood at those values.

**********Step 1
sysuse auto, clear
set more off
local cmdline probit foreign weight mpg
cmdline'
keep if e(sample)
local ll0 = e(ll)
mat b0 = e(b)
mat b = b0

local N = _N

forvalues i = 1(1)N'{
cmdline' if _n !=i'
mat b1 = e(b)
mat b = b \ b1
}

mat B = b[2...,1...]
mat list B


#### Step 2.

In each iteration of a loop, a row from B is stored as matrix b. To evaluate the log likelihood at these values, the trick is to use them as initial values and invoke the command with 0 iterations. This can be done for any command that is based on ml.

**********Step 2

gen L1 = .

forvalues i = 1(1)N'{
mat b = B[i',1...]
cmdline', from(b) iter(0)
local ll = e(ll)
replace L1 = ll' in i'
}


#### Step 3.

Using variable L1 and the macro with the original log likelihood, we compute Cook's likehood displacement.

**********Step 3

gen LD = 2*(ll0' - L1)


#### Step 4.

Create the plot, using as a reference the 90% quantile for the $$\chi^2$$ distribution. $$p$$ is the number of columns in matrix b0 (or equivalently, the number of columns in matrix B).

**********Step 4

local k = colsof(b0)
gen upper_bound = invchi2tail(k', .1)
gen n = _n

twoway scatter LD n, mlabel(n) || line upper_bound n, ///
title("Likelihood displacement")
`

We can see that observation 71 is the most influential, and its likelihood displacement value is within the range we would normally expect.

### Reference

Cook, D. 1986. Assessment of local influence. Journal of the Royal Statistical Society, Series B 48: 133–169.

Categories: Statistics Tags: