Home > Statistics > Teach statistical concepts with Stata

Teach statistical concepts with Stata


My first statistics course primarily consisted of plugging numbers into formulas. I did not leave that course with any real idea of how statistics differed from basic algebra. The next course I took is what put it all together, and I’ve loved statistics ever since. In that course, we examined the relationships between a population and its samples. I learned that using just a small amount of data, statistics enables us to make inferences about the entire population. Now that is cool!

In this blog, I provide tools for introductory statistics course instructors to demonstrate the concepts of sampling distributions, standard errors, confidence intervals (CIs), and p-values. To illustrate these concepts, we’ll use the following research question:

Do residents of College Station, TX (CSTX), who have been diagnosed with diabetes have a higher average body mass index (BMI) than CSTX residents without a diabetes diagnosis?

The population

If you have a dataset to use as a population—wonderful! Use that. Or you can have your students collect data, for example, from every 11th grader. Otherwise, you’ll need to simulate a population dataset. I have no means to collect data from everyone in CSTX, so I will demonstrate using a simulated population.

First, I create a new data frame called population and change into it.

. frame create population
. frame change population

I set a random-number seed to ensure reproducibility and set the number of observations. I’m using 135,000 because that’s the estimated number of residents in CSTX, my population.

. set seed 12
. set obs 135000
Number of observations (_N) was 0, now 135,000.

Then we will use Stata’s random-number generators to simulate BMI and diabetes incidence in the population. See [FN] Random-number functions, or type help random number functions into Stata’s Command window for a full list of available distributions.

We will use a normal distribution for BMI and a binomial distribution for diabetes. I do a quick internet search to get hyperparameters for these distributions, that is, the mean and standard deviation of BMI in the US population and the proportion with a diabetes diagnosis. The number of events in the binomial distribution is 1 because we are simulating only one event: diabetes. I add 2 to everyone in the diabetes group so that their average BMI is 2 higher than the nondiabetic group.

. generate diabetes = rbinomial(1,0.12)
. generate bmi = rnormal(27,6) + 2 * diabetes

It’s good practice to assign labels and value labels (if applicable) to generated variables. Let’s do
that now.

. label variable bmi "Body mass index"
. label variable diabetes "Diabetes diagnosis"
. label define diab 0 "Not diabetic" 1 "Diabetic"
. label values diabetes diab

Samples

We can extract a single sample from this population using the sample command. We’ll sample just 0.1% of the population. We first use preserve to preserve our population.

. preserve
. sample 0.1
(134,865 observations deleted)

Let’s conduct a t test on this sample to test whether the mean BMI differs by diabetes group.

. ttest bmi, by(diabetes)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
Not diab |     114    28.31035    .5315609    5.675517    27.25723    29.36347
Diabetic |      21    29.49885      .91868    4.209921    27.58252    31.41518
---------+--------------------------------------------------------------------
Combined |     135    28.49523    .4713703    5.476828    27.56294    29.42752
---------+--------------------------------------------------------------------
    diff |           -1.188497    1.301377                -3.76257    1.385575
------------------------------------------------------------------------------
    diff = mean(Not diab) - mean(Diabetic)                        t =  -0.9133
H0: diff = 0                                     Degrees of freedom =      133

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.1814         Pr(|T| > |t|) = 0.3628          Pr(T > t) = 0.8186

The first two sections give some information about each group and their combined sample. The last row, labeled diff, is what we are testing. The difference in means in BMI in this sample is 1.19 (SE = 1.30; 95% CI [-3.76, 1.39]). The difference in means estimate in the results table is negative because it subtracts the diabetic mean BMI from the nondiabetic mean BMI.

The t statistic is just the difference in means divided by its standard error. We can calculate this manually by using the difference in means and standard error in the returned results.

. return list

scalars:
            r(ub_diff) =  1.385575269027033
            r(lb_diff) =  -3.762570056926808
            r(mu_diff) =  -1.188497393949888
        r(ub_combined) =  29.42751874623384
        r(lb_combined) =  27.56294203811313
        r(se_combined) =  .4713703167472302
        r(mu_combined) =  28.49523039217349
         r(N_combined) =  135
               r(ub_2) =  31.41518328412929
               r(lb_2) =  27.58251754333305
               r(se_2) =  .9186799859364339
               r(ub_1) =  29.36347104716379
               r(lb_1) =  27.25723499239878
               r(se_1) =  .5315609062940896
              r(level) =  95
                 r(sd) =  5.476828159975613
               r(sd_2) =  4.209920574994674
               r(sd_1) =  5.675517392222678
                 r(se) =  1.301376679922003
                r(p_u) =  .8186212430198624
                r(p_l) =  .1813787569801376
                  r(p) =  .3627575139602752
                  r(t) =  -.9132616346107564
               r(df_t) =  133
               r(mu_2) =  29.49885041373117
                r(N_2) =  21
               r(mu_1) =  28.31035301978128
                r(N_1) =  114

We use the difference in means, saved as r(mu_diff), and the standard error, saved as r(se), to calculate t.

. display r(mu_diff)/r(se)
-.91326163

This is the same t statistic reported under the results table. If you’re ever unsure how a statistic is being calculated, visit the Methods and formulas in [R] ttest. Using this sample, we don’t have enough evidence to conclude that there is a difference between groups, t(133) = 0.91, p = 0.181. We use the lower-tail p-value because our hypothesis is that the diabetic group had a higher mean BMI than the nondiabetic group.

What if we had sampled a different set of observations? First, let’s restore and then preserve our population again.

. restore
. preserve

Now, let’s use a different random-number seed, then sample and test again.

. set seed 78

. sample 0.1
(134,865 observations deleted)

. ttest bmi, by(diabetes)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
Not diab |     119    27.38515    .5024667    5.481265    26.39013    28.38017
Diabetic |      16    30.22711    1.023486    4.093946     28.0456    32.40862
---------+--------------------------------------------------------------------
Combined |     135    27.72197    .4649424    5.402142     26.8024    28.64155
---------+--------------------------------------------------------------------
    diff |           -2.841961    1.422678               -5.655963   -.0279588
------------------------------------------------------------------------------
    diff = mean(Not diab) - mean(Diabetic)                        t =  -1.9976
H0: diff = 0                                     Degrees of freedom =      133

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0239         Pr(|T| > |t|) = 0.0478          Pr(T > t) = 0.9761

Using this sample, we have evidence that there is a group difference in mean BMI, t(133) = 2.00, p = 0.024.

Let’s put this sample into a frame called sample, then restore our population.

. frame put *, into(sample)
. restore

Sample-to-sample variability is key to being able to make statistical inferences. Because we have the population, we can build a sampling distribution to see this variability firsthand.

The sampling distribution

We start by creating a new data frame named sampling with variables diff, se, ub, and lb.

. frame create sampling diff se ub lb

Then we use a for loop to collect 100 samples, each containing 0.1% of the population. We conduct t tests on each of these samples and post the estimated difference in means, standard error, and upper and lower bounds of the CI to the sampling frame. We add quietly in front of sample and ttest to suppress their outputs.

. forvalues i = 1/100 {
  2. preserve
  3. quietly sample 0.1
  4. quietly ttest bmi, by(diabetes)
  5. frame post sampling (r(mu_diff)) (r(se)) (r(ub_diff)) (r(lb_diff))
  6. restore
  7. }

We change to the sampling frame we just created.

. frame change sampling

Let’s begin by looking at a histogram of the difference in means.

. histogram diff
(bin=10, start=-6.039753, width=.72761365)

graph1

This is a sampling distribution of the difference in means in BMI by diabetes diagnosis. We can see that most of our samples are around -2, as expected, but there is a substantial spread. Also note that it is approximately normally distributed. This is due to the central limit theorem, which states that the distribution of sample means of independent, identically distributed random variables approaches a normal distribution as the sample size increases, regardless of the original population’s distribution. (Having students start with other population distributions can be a fun exercise!)

Standard error

The standard deviation of the sampling distribution is called the standard error. When we conduct a t test on a sample, we also get an estimate of the standard error. It tells us about the expected distance between our sample estimate and the population parameter.

. summarize se diff

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          se |        100    1.610917    .2193768   1.206266   2.400811
        diff |        100   -1.956383    1.477907  -6.039753   1.236384

The estimated standard error according to this sampling distribution is 1.48. We see that the average estimated standard error from the samples, 1.61, is close to this number, even with only 100 samples. As the number of samples in our sampling distribution increases, these two numbers will converge, and the difference in means will approach -2.

You can also have your students use different sample sizes when building a sampling distribution to see how this impacts the standard error. Use the count option of the sample command to directly sample a specific number of observations rather than a percentage.

CI

The CI is another way of thinking about the variability across samples. If we repeat the sampling process many times, 95% of the 95% CIs constructed will contain the true, unknown population parameter. Because we have 100 samples in our sampling distribution, we expect 95 of them to contain -2, the true difference.

Let’s look at the CIs we collected from each of the samples in our sampling distribution.

We sort, then create variable i to order them on the x axis from smallest to largest point estimate, and then use twoway rpcap to visualize them.

. sort diff
. generate i = _n
. twoway (rpcap lb ub diff i)

graph2

Let’s add a reference line at the true difference, -2, and add titles.

. twoway (rpcap lb ub diff i), ytitle("Difference in means") yline(-2) 
> xtitle("") xlabel(none, nolabels) title("100 95% confidence intervals") 
> subtitle("95 contain the true population parameter (dashed line)")

graph3

We can see that the first CI and the last four CIs don’t contain the true value.

When we calculate a CI in a sample, it is the estimated range of values within which we are 95% confident the true population parameter lies. With this demonstration, we can easily see that “95% confident” refers to the percentage of intervals containing the true parameter across repeated samples rather than the often incorrectly used interpretation that there is a 95% chance (or 0.95 probability) that the true parameter lies within the interval constructed from a single sample.

p-values

There is one more concept I would like to cover: p-values. They are infamously misunderstood and misused. Thus, I think it’s important that students understand what they are and what they really tell us.

To understand p-values will require building a null sampling distribution, that is, a sampling distribution from a population in which there is no effect. To create a null sampling distribution, we can follow the same process we used for the alternative sampling distribution without adding 2 to the diabetes group, but we don’t have to. We know exactly what the distribution of t statistics will look like from a null population; they will be t distributed! That’s why we bother with creating test statistics in the first place.

Let’s first visualize a t distribution using a twoway function graph.

. twoway (function density = tden(133,x), range(-3 3)), xtitle({it:t})

graph4

This is the sampling distribution of t values we would expect from a population in which there is no difference in mean BMI between diabetics and nondiabetics. As expected, it centers around 0.

Let’s see the t test on our last sample once more.

. frame sample: ttest bmi, by(diabetes)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
Not diab |     119    27.38515    .5024667    5.481265    26.39013    28.38017
Diabetic |      16    30.22711    1.023486    4.093946     28.0456    32.40862
---------+--------------------------------------------------------------------
Combined |     135    27.72197    .4649424    5.402142     26.8024    28.64155
---------+--------------------------------------------------------------------
    diff |           -2.841961    1.422678               -5.655963   -.0279588
------------------------------------------------------------------------------
    diff = mean(Not diab) - mean(Diabetic)                        t =  -1.9976
H0: diff = 0                                     Degrees of freedom =      133

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0239         Pr(|T| > |t|) = 0.0478          Pr(T > t) = 0.9761

We get a t statistic of -1.99, and at the bottom we see three p-values: lower-, two-, and upper-tailed.

The lower-tailed p-value tells us that if there were no difference in means in BMI in the population, the probability of calculating a t statistic that is less than -1.99 is 0.024, or about 2%. We can replicate this number using the cumulative Student’s t distribution function, t(df, t), where df is the degrees of freedom from our sample, 133, and t is the t statistic from our sample, -1.9976.

. display t(133,- 1.9976)
.02390008

This function calculates the cumulative probability of the t distribution up to -1.9976. To visualize this, we add an x line to our t distribution graph.

. twoway (function density = tden(133,x), range(-3 3)), 
> xtitle({it:t}) xline(-2)

graph5

2.4% of the distribution is to the left of the dashed line.

The upper-tailed p-value tells us that if there were no difference in means in BMI in the population, the probability of calculating a t statistic greater than -1.99 is 0.976, or about 98%. Again, we can replicate this number using the cumulative distribution function, this time subtracting from 1.

. display 1 - t(133,- 1.9976)
.97609992

Now we are calculating the probability to the right of the dashed line in our graph, which is 1 – the probability to the left.

The two-tailed p-value is simply twice the lower-valued p-value, in this case the lower-tailed.

. display 2 * t(133,- 1.9976)
.04780016

It tells us the probability of calculating a t statistic less than -1.99 or greater than 1.99. We can add a second line to our graph to visualize this.

. twoway (function density = tden(133,x), range(-3 3)), xtitle({it:t}) 
> xline(-2) xline(2)

graph6

The two-tailed p-value is the proportion to the left of the first dashed line plus the proportion to the right of the second dashed line. In other words, it is telling us that, if there is no true difference between groups, there is 5% probability of getting a t statistic as or more extreme than the statistic we calculated in this sample.

Conclusion

In this blog, we have explored how to demonstrate the following core pillars of statistical inference through simulation and visualization: sampling distributions, standard errors, CIs, and p-values.

Statistics becomes compelling for students when they can see the relationships between a population and its samples for themselves. To bring these demonstrations into your classroom and help your students learn much faster, you can download the do-file here.