## Using resampling methods to detect influential points

As the documentation for **jackknife** notes, an often-forgotten use of this command is the detection of overly influential observations.

Some commands, like **logit** or **stcox**, come with their own set of prediction tools to detect influential points. However, these kinds of predictions can be computed for virtually any regression command. In particular, we will see that the **dfbeta** statistics can be easily computed for any command that accepts the **jackknife** prefix. The **dfbeta** statistics show how much influence each observation has on a specific parameter, compared with the other observations.

We will also compute Cook’s likelihood displacement, an overall measure of influence that can be compared with a specific threshold.

### Using jackknife to compute dfbeta

The main task of **jackknife** is to refit the model leaving out one observation at a time. This shows how much the results change when each observation is omitted; in other words, how much each observation influences the results. A very intuitive measure of influence is **dfbeta**, the amount by which a particular parameter estimate changes when an observation is omitted. There is one **dfbeta** variable for each parameter. If \(\hat\beta\) is the estimate of parameter \(\beta\) obtained from the full data and \(\hat\beta_{(i)}\) is the corresponding estimate obtained when the \(i\)th observation is omitted, then the \(i\)th element of the variable **dfbeta** is

\[\mathit{dfbeta}_i = \hat\beta - \hat\beta_{(i)}\]
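To build intuition for this definition, here is a small worked case that is not part of the original example: suppose the only parameter is a mean, so that \(\hat\beta = \bar y\) for observations \(y_1, \dots, y_n\) (the symbols \(y_i\), \(\bar y\), and \(n\) are introduced here only for illustration). Leaving out the \(i\)th observation gives

\[\hat\beta_{(i)} = \frac{n\bar y - y_i}{n-1}, \qquad \hat\beta - \hat\beta_{(i)} = \frac{(n-1)\bar y - n\bar y + y_i}{n-1} = \frac{y_i - \bar y}{n-1}\]

In this simple case, the value is positive when \(y_i\) lies above the mean and negative when it lies below, and its magnitude shrinks as \(n\) grows. For models such as probit there is no closed form like this, which is why the leave-one-out estimates are obtained by refitting.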

The estimates \(\hat\beta\) are stored by estimation commands in the matrix **e(b)** and can also be accessed with the **_b** notation, as we will show below. The leave-one-out values \(\hat\beta_{(i)}\) can be saved in a new file by using the option **saving()** with **jackknife**. With these two elements, we can compute the **dfbeta** values for each parameter.
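Before relying on **jackknife**, it may help to see the same quantity computed by brute force. The following is a minimal sketch, not the method used in this post: it refits a model once per observation and stores the difference between the full-sample and leave-one-out coefficients. The model (**regress price mpg weight**), the scalar name `b_full`, and the variable name `dfb_mpg` are assumptions chosen only for illustration.

```stata
* Minimal sketch: dfbeta for one coefficient by explicit refitting.
* regress is used here only because it is fast to refit.
sysuse auto, clear

* full-sample estimate, read with the _b notation
quietly regress price mpg weight
scalar b_full = _b[mpg]

* hypothetical variable that will hold the dfbeta values
generate double dfb_mpg = .

* refit without observation i and store the change in the coefficient
forvalues i = 1/`=_N' {
    quietly regress price mpg weight if _n != `i'
    quietly replace dfb_mpg = b_full - _b[mpg] in `i'
}

summarize dfb_mpg
```

The loop assumes that every observation is in the estimation sample, which holds for these three variables in the auto dataset. The **jackknife** prefix with **saving()** performs the same refitting and stores the leave-one-out estimates in a file.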

Let’s see an example with the **probit** command.

    . sysuse auto, clear
    (1978 Automobile Data)

    . *preserve original dataset
    . preserve

    . *generate a variable with the original observation number
    . gen obs =_n

    . probit foreign mpg weight

    Iteration 0:   log likelihood =  -45.03321
    Iteration 1:   log likelihood = -27.914626
    Iteration 2:   log likelihood = -26.858074
    Iteration 3:   log likelihood = -26.844197
    Iteration 4:   log likelihood = -26.844189
    Iteration 5:   log likelihood = -26.844189

    Probit regression                                 Number of obs   =         74
                                                      LR chi2(2)      =      36.38
                                                      Prob > chi2     =     0.0000
    Log likelihood = -26.844189                       Pseudo R2       =     0.4039

    ------------------------------------------------------------------------------
         foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             mpg |  -.1039503   .0515689    -2.02   0.044    -.2050235   -.0028772
          weight |  -.0023355   .0005661    -4.13   0.000     -.003445   -.0012261
           _cons |   8.275464   2.554142     3.24   0.001     3.269437    13.28149
    ------------------------------------------------------------------------------

    . *keep the estimation sample so each observation will be matched
    . *with the corresponding replication
    . keep if e(sample)
    (0 observations deleted)

    . *use jackknife to generate the replications, and save the values in
    . *file b_replic
    . jackknife, saving(b_replic, replace): probit foreign mpg weight
    (running probit on estimation sample)

    Jackknife replications (74)
    ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
    ..................................................    50
    ........................

    Probit regression                               Number of obs      =        74
                                                    Replications       =        74
                                                    F(   2,     73)    =     10.36
                                                    Prob > F           =    0.0001
    Log likelihood = -26.844189                     Pseudo R2          =    0.4039

    ------------------------------------------------------------------------------
                 |              Jackknife
         foreign |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             mpg |  -.1039503   .0831194    -1.25   0.215     -.269607    .0617063
          weight |  -.0023355   .0006619    -3.53   0.001    -.0036547   -.0010164
           _cons |   8.275464   3.506085     2.36   0.021     1.287847    15.26308
    ------------------------------------------------------------------------------

    . *verify that all the replications were successful
    . assert e(N_misreps) ==0

    . merge 1:1 _n using b_replic

        Result                           # of obs.
        -----------------------------------------
        not matched                             0
        matched                                74  (_merge==3)
        -----------------------------------------

    . *see how values from replications are stored
    . describe, fullnames

    Contains data from .../auto.dta
      obs:            74                          1978 Automobile Data
     vars:            17                          13 Apr 2013 17:45
     size:         4,440                          (_dta has notes)
    --------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    --------------------------------------------------------------------------------
    make            str18   %-18s                 Make and Model
    price           int     %8.0gc                Price
    mpg             int     %8.0g                 Mileage (mpg)
    rep78           int     %8.0g                 Repair Record 1978
    headroom        float   %6.1f                 Headroom (in.)
    trunk           int     %8.0g                 Trunk space (cu. ft.)
    weight          int     %8.0gc                Weight (lbs.)
    length          int     %8.0g                 Length (in.)
    turn            int     %8.0g                 Turn Circle (ft.)
    displacement    int     %8.0g                 Displacement (cu. in.)
    gear_ratio      float   %6.2f                 Gear Ratio
    foreign         byte    %8.0g      origin     Car type
    obs             float   %9.0g
    foreign_b_mpg   float   %9.0g                 [foreign]_b[mpg]
    foreign_b_weight
                    float   %9.0g                 [foreign]_b[weight]
    foreign_b_cons  float   %9.0g                 [foreign]_b[_cons]
    _merge          byte    %23.0g     _merge
    --------------------------------------------------------------------------------
    Sorted by:
         Note:  dataset has changed since last saved

    . *compute the dfbeta for each covariate
    . foreach var in mpg weight {
      2.   gen dfbeta_`var' = (_b[`var'] -foreign_b_`var')
      3. }

    . gen dfbeta_cons = (_b[_cons] - foreign_b_cons)

    . label var obs "observation number"
    . label var dfbeta_mpg "dfbeta for mpg"
    . label var dfbeta_weight "dfbeta for weight"
    . label var dfbeta_cons "dfbeta for the constant"

    . *plot dfbeta values for variable mpg
    . scatter dfbeta_mpg obs, mlabel(obs) title("dfbeta values for variable mpg")

    . *restore original dataset
    . restore
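A scatterplot shows which points stand out, but it can also be handy to list them. The sketch below is one possible way to do that, written as a self-contained do-file rather than as a continuation of the session above. It assumes that the saved replication variable is named `foreign_b_mpg`, as in the **describe** output shown earlier (the name may differ across Stata versions, so it is worth checking with **describe**). The names `reps`, `b_mpg`, and `abs_dfbeta` are hypothetical.

```stata
* Minimal sketch: list the five observations with the largest |dfbeta| for mpg.
sysuse auto, clear

* store the leave-one-out estimates in a temporary file
tempfile reps
quietly jackknife, saving("`reps'"): probit foreign mpg weight

* full-sample estimate (jackknife leaves the observed coefficients in e(b))
scalar b_mpg = _b[mpg]

* match each observation with its replication by position
merge 1:1 _n using "`reps'", nogenerate

generate double dfbeta_mpg = b_mpg - foreign_b_mpg
generate double abs_dfbeta = abs(dfbeta_mpg)

* sort from most to least influential and list the top five
gsort -abs_dfbeta
list make mpg weight foreign dfbeta_mpg in 1/5
```

Matching by position works here only because all 74 observations are in the estimation sample and the data are not re-sorted between the **jackknife** run and the **merge**. With missing values, the **keep if e(sample)** step from the session above would be needed first.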
