Home > Statistics > Fun with frames

Fun with frames

I have a confession. I wasn’t excited about the addition of frames to Stata 16. Yes, frames has been one of the most requested features for many years, and our website analytics show that frames is wildly popular. Adding frames was a smart decision and our customers are excited. But I have used Stata for over 20 years, and I have been perfectly happy using one dataset at a time. So I ignored frames.

Then I started working on an example for lasso using genetic data. I simulated patient data along with genetic data for each of 22 chromosomes saved in 22 separate datasets. Working with 23 datasets became cumbersome, so I thought I’d check out frames. I began by reading the manual and then tinkered with my genetic data. Along the way, I discovered a feature of frames that completely blew my mind. I’m going to show you that feature below, and I expect that it will blow your mind as well.

This blog post is not meant to be an introduction to frames. There is a detailed introduction to frames in the Stata 16 manual that will make you an expert. I simply want to show you some of the useful things that you can do with frames, including the following:

  1. Use multiple datasets simultaneously
  2. Frames and postestimation
  3. Linked frames and frval()
  4. Linked frames and frget
  5. Many frames

Back to table of contents

Use multiple datasets simultaneously

In this example, I have simulated patient data in one dataset and data from the NHANES study, which is a nationally representative sample of people in the United States. I would like to compare the results of regression models using my data with the results using the NHANES data. The NHANES data contain survey weights and my data do not. It might be possible to append the two datasets into a single dataset and use an indicator variable to differentiate between the observations from each study. But it would be difficult and frustrating to fit models with survey weights for some observations and not others. This is easy using frames.

Let’s begin by using frame create to create a new data frame named patients. Then, we can use the frame prefix to use the patient data in the data frame patients.

. frame create patients

. frame patients: use patients

We can then create another data frame named nhanes2 and webuse the dataset nhanes2.

. frame create nhanes2

. frame nhanes2: webuse nhanes2

We can type frame dir to verify that we have three data frames in memory. One is named default and the other two are the frames we created. We can also see the number of observations, the number of variables, and the datasets that are contained in each frame.

. frame dir
  default   0 x 0
  nhanes2   10351 x 58; nhanes2.dta
  patients  10000 x 41; patients.dta

Now, we can use the frames prefix to fit a linear regression model without survey weights using the patient data. And we can use the frames prefix to fit the same regression model with survey weights using the NHANES data. Note that I have used estimates store to store the parameter estimates in memory.

frame patients: regress sbp c.age##c.bmi
(output omitted)
estimates store patients
frame nhanes2: svy linearized : regress bpsystol c.age##c.bmi
(output omitted)
estimates store nhanes

Now, we can compare the results of our model fit to the two datasets using estimates table.

. estimates table patients nhanes

----------------------------------------
    Variable |  patients      nhanes
-------------+--------------------------
         age |   .5587357    .51511664
         bmi |   1.259132    1.2066527
             |
 c.age#c.bmi |  .00742543    .00200162
             |
       _cons |  70.882924     72.51606
----------------------------------------

Back to table of contents

Frames and postestimation

Next, we might wish to graph the predictions from the model using our patient data. We can do this using a technique that I describe in a Stata News article titled Visualizing continuous-by-continuous interactions with margins and twoway contour.

The basic strategy is to fit the model, use margins to estimate and save the predictions to a dataset, then use graph twoway contour to create the graph. In the Stata News article, I had to clear the main dataset to open the predictions dataset. Here we can simply create another frame to use the dataset that contains our predictions.

Let’s begin by using frame change so that we can run multiple commands using the patient data. Then we can fit our regression model and estimate the predictions using margins. Note that I have included the option saving(predictions, replace), which tells margins to store the predictions to a dataset named predictions.

. frame change patients

. regress sbp c.age##c.bmi

      Source |       SS           df       MS      Number of obs   =    10,000
-------------+----------------------------------   F(3, 9996)      =  14015.35
       Model |   1670892.7         3  556964.234   Prob > F        =    0.0000
    Residual |  397237.042     9,996  39.7396001   R-squared       =    0.8079
-------------+----------------------------------   Adj R-squared   =    0.8079
       Total |  2068129.74     9,999  206.833658   Root MSE        =    6.3039

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .5587357   .0865853     6.45   0.000     .3890112    .7284602
         bmi |   1.259132   .1758975     7.16   0.000     .9143375    1.603927
             |
 c.age#c.bmi |   .0074254   .0034766     2.14   0.033     .0006105    .0142404
             |
       _cons |   70.88292   4.305602    16.46   0.000     62.44308    79.32277
------------------------------------------------------------------------------

. margins, at(age=(20(10)60) bmi=(10(5)40)) saving(predictions, replace)
(Output omitted)

Now, we can create a new frame named contour, change to that frame, and use the dataset predictions.

. frame create contour

. frame change contour

. use predictions
(Created by command margins; also see char list)

The variable _at1 in the predictions dataset contains age, _at2 contains BMI, and _margin contains the linear predictions.

. describe _at1 _at2 _margin

              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------
_at1            byte    %9.0g                 age in years
_at2            byte    %9.0f                 Body Mass Index (BMI)
_margin         float   %9.0g                 Linear prediction, predict()

. list _at1 _at2 _margin in 1/5

     +------------------------+
     | _at1   _at2    _margin |
     |------------------------|
  1. |   20     10   96.13405 |
  2. |   20     15   103.1722 |
  3. |   20     20   110.2104 |
  4. |   20     25   117.2487 |
  5. |   20     30   124.2869 |
     +------------------------+

We can rename those variables and create a contour plot of the predictions.

. rename _at1 age

. rename _at2 bmi

. rename _margin pr_sbp

. twoway (contour pr_sbp bmi age, ccuts(90(10)170)),            ///
>         xlabel(20(10)60)  ylabel(10(5)40, angle(horizontal))  ///
>         xtitle("Age (years)") ytitle("BMI")                   ///
>         ztitle("Predicted Systolic Blood Pressure")           ///
>         title("Predicted SBP by Age and BMI")

Figure 1: Contour Plot for Predicted SBP by Age and BMI
graph1

Back to table of contents

Linked frames and frval()

We can also link two frames together using frlink. Let’s open another dataset named hospital in a new frame named hospital.

. frame create hospital

. frame change hospital

. use hospital

. list in 1/5, abbrev(10)

     +----------------------+
     | hospitalid   avgcost |
     |----------------------|
  1. |          1   135,242 |
  2. |          2   135,372 |
  3. |          3   134,467 |
  4. |          4   132,953 |
  5. |          5   134,622 |
     +----------------------+

The hospital dataset contains the variable hospitalid and a variable named avgcost, which is the average cost of heart bypass surgery at each hospital.

The patients dataset contains a variable named payment, which is the amount that each patient paid for heart bypass surgery. The variable hospitalid tells us the hospital at which each patient had surgery.

. frame change patients

. list patientid hospitalid payment in 1/5, abbrev(10)

     +----------------------------------+
     | patientid   hospitalid   payment |
     |----------------------------------|
  1. |         1           10   121,255 |
  2. |         2           18    90,164 |
  3. |         3            5   156,731 |
  4. |         4            6    98,256 |
  5. |         5           13   140,604 |
     +----------------------------------+

We can link the patient frame with the hospital frame using frlink. Each value of hospitalid appears many times in the patients frame and only once in the hospital frame. We can specify a “many-to-one” link by typing frlink m:1. And frame(hospital) tells frlink to link the current frame (patients) with the hospital frame.

. frlink m:1 hospitalid, frame(hospital)
  (all observations in frame patients matched)

Now that our frames are linked, we can use frval() to access values of variables in frames outside the current frame. For example, we might wish to calculate how much a patient paid for heart bypass surgery relative to the average cost at the hospital where the surgery was performed. We can generate a new variable named rel_payment, which is equal to the ratio of payment in the current frame (patients), divided by avgcost in the hospital frame. The function frval(hospital, avgcost) accesses the value of avgcost in the hospital frame.

. generate rel_payment = payment / frval(hospital, avgcost)

. list patientid hospitalid payment rel_payment in 1/5, abbrev(16)

     +------------------------------------------------+
     | patientid   hospitalid   payment   rel_payment |
     |------------------------------------------------|
  1. |         1           10   121,255      .8999608 |
  2. |         2           18    90,164      .6621906 |
  3. |         3            5   156,731      1.164228 |
  4. |         4            6    98,256      .7454461 |
  5. |         5           13   140,604      1.033123 |
     +------------------------------------------------+

Patient 3 in the list above paid $156,731 for surgery at Hospital 5. The hospital data above show that the average cost of the surgery at Hospital 5 was $134,622. The value of rel_payment tells us that patient 3 paid 16% more for their surgery than the average cost at hospital 5.

Back to table of contents

Linked frames and frget

We can also copy variables from one linked frame to another frame using frget. In this example, we have another dataset named longitudinal, which contains repeated observations of bmi and sbp over five years. Let’s create a new data frame named long for this dataset.

. frame create long

. frame change long

. use longitudinal

. list in 1/10, abbrev(10)

     +-----------------------------+
     | patientid   age   bmi   sbp |
     |-----------------------------|
  1. |         1    44    25   139 |
  2. |         1    45    26   144 |
  3. |         1    46    27   147 |
  4. |         1    47    25   137 |
  5. |         1    48    25   139 |
     |-----------------------------|
  6. |         2    69    26   164 |
  7. |         2    70    25   166 |
  8. |         2    71    28   169 |
  9. |         2    72    27   171 |
 10. |         2    73    24   165 |
     +-----------------------------+

We would like to fit a multilevel model for these longitudinal data stored in the long frame, adjusting for the effects of sex and race, which are stored in the patients frame. Let’s begin by using frlink to link the two frames.

. frlink m:1 patientid, frame(patients)
  (all observations in frame long matched)

Next, we can use frget to copy the variables sex and race from the patients frame to the current frame (long).

. frget sex race, from(patients)
  (2 variables copied from linked frame)

. list in 1/10, abbrev(10)

     +---------------------------------------------------------+
     | patientid   age   bmi   sbp   patients      sex    race |
     |---------------------------------------------------------|
  1. |         1    44    25   139          1     Male   White |
  2. |         1    45    26   144          1     Male   White |
  3. |         1    46    27   147          1     Male   White |
  4. |         1    47    25   137          1     Male   White |
  5. |         1    48    25   139          1     Male   White |
     |---------------------------------------------------------|
  6. |         2    69    26   164          2   Female   White |
  7. |         2    70    25   166          2   Female   White |
  8. |         2    71    28   169          2   Female   White |
  9. |         2    72    27   171          2   Female   White |
 10. |         2    73    24   165          2   Female   White |
     +---------------------------------------------------------+

Now, we can fit our multilevel model.

. mixed sbp i.sex i.race c.age##c.bmi || patientid:, nolog noheader

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |
     Female  |   .1787433   .1213193     1.47   0.141    -.0590382    .4165247
             |
        race |
      Black  |   .0270491   .1957962     0.14   0.890    -.3567045    .4108026
      Other  |   .0539184   .4375286     0.12   0.902    -.8036219    .9114587
             |
         age |     .48519   .0158596    30.59   0.000     .4541058    .5162742
         bmi |   1.175005   .0322024    36.49   0.000     1.111889     1.23812
             |
 c.age#c.bmi |    .010324   .0006217    16.61   0.000     .0091055    .0115424
             |
       _cons |    72.9325   .8130494    89.70   0.000     71.33895    74.52605
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
patientid: Identity          |
                  var(_cons) |    35.8953   .5189054      34.89254    36.92689
-----------------------------+------------------------------------------------
               var(Residual) |   3.973389   .0280961      3.918702     4.02884
------------------------------------------------------------------------------
LR test vs. linear model: chibar2(01) = 76974.75      Prob >= chibar2 = 0.0000

Back to table of contents

Many frames

I mentioned earlier that I’ve been experimenting with simulated genetic data and lasso. Let’s assume that I sent blood samples from 500 patients to a lab and they sent us 22 datasets. Each of the 22 datasets contains indicator variables for the presence of a particular genetic marker. Let’s take a look at the files.

. ls

  79.6M   7/22/19  9:40  chromosome1.dta
  42.8M   7/22/19  9:40  chromosome10.dta
  43.2M   7/22/19  9:40  chromosome11.dta
  42.6M   7/22/19  9:40  chromosome12.dta
  36.6M   7/22/19  9:40  chromosome13.dta
  34.2M   7/22/19  9:40  chromosome14.dta
  32.6M   7/22/19  9:40  chromosome15.dta
  28.8M   7/22/19  9:40  chromosome16.dta
  26.6M   7/22/19  9:40  chromosome17.dta
  25.7M   7/22/19  9:40  chromosome18.dta
  18.7M   7/22/19  9:40  chromosome19.dta
  77.4M   7/22/19  9:40  chromosome2.dta
  20.6M   7/22/19  9:40  chromosome20.dta
  14.9M   7/22/19  9:40  chromosome21.dta
  16.2M   7/22/19  9:40  chromosome22.dta
  63.4M   7/22/19  9:40  chromosome3.dta
  60.8M   7/22/19  9:40  chromosome4.dta
  58.0M   7/22/19  9:40  chromosome5.dta
  54.6M   7/22/19  9:40  chromosome6.dta
  51.0M   7/22/19  9:40  chromosome7.dta
  46.4M   7/22/19  9:40  chromosome8.dta
  44.3M   7/22/19  9:41  chromosome9.dta

The first thing that I notice is that each file is fairly large. The file chromosome1.dta is 79.6 megabytes. Let’s describe the contents of the file chromosome1.dta without opening the file.

. describe using chromosome1, short

Contains data
  obs:           500                          15 Jul 2019 17:30
 vars:        73,090
Sorted by:

This dataset contains 500 observations on 73,090 variables, which tells me that I will need to use Stata/MP to open this file. Stata/MP can open files with up to 120,000 variables and over 20 billion observations. All the examples below require Stata/MP.

We will also need to change some of Stata’s defaults to work with these large datasets.

. set segmentsize 100m    // Default is 32m

. set maxvar 80000       // Default is 5000

set segmentsize allocates memory for data in units of segmentsize. Smaller values of segmentsize can result in more efficient use of available memory but require Stata to jump around more. Our largest file is chromosome1.dta, which is 79.6 megabytes, so I have changed the segmentsize to 100 megabytes.

set maxvar changes the maximum number of variables allowed from the default of 5,000 to 80,000.

You can read more about the details of memory management in the manual.

Now we are ready to open the dataset chromosome1.dta.

. frame create chr1

. frame change chr1

. use chromosome1

. list id gene1_snp1 gene1_snp2 gene1_snp3 gene1_snp4 gene1_snp5 in 1/5, abbrev(12)

     +---------------------------------------------------------------------+
     | id   gene1_snp1   gene1_snp2   gene1_snp3   gene1_snp4   gene1_snp5 |
     |---------------------------------------------------------------------|
  1. |  1            0            0            0            1            0 |
  2. |  2            0            0            0            0            0 |
  3. |  3            0            0            0            0            0 |
  4. |  4            1            0            0            1            1 |
  5. |  5            0            0            0            1            0 |
     +---------------------------------------------------------------------+

The list above includes only 6 of the 73,090 variables in the dataset. The variable id in the long frame corresponds to the variable patientid in the patients frame. So we will need to rename the variable patientid to id before we can link the two frames.

. frame patients: rename patientid id

. frlink 1:1 id, frame(patients)
  (all observations in frame chr1 matched)

Next, we can use frget to copy the variables sbp, age, sex, race, and bmi from the patients frame to the current frame (chr1).

. frget sbp age sex race bmi, from(patients)
  (5 variables copied from linked frame)

Now, we’re ready to use dsregress to fit an inferential lasso model that includes the dependent variable sbp, variables of interest age, sex, race, and bmi, and controlling for the 73,089 genetic variables.

. dsregress sbp i.sex i.race c.age##c.bmi, controls(gene*) rseed(16)

Estimating lasso for sbp using plugin

(notes omitted)

Double-selection linear model         Number of obs               =        500
                                      Number of controls          =     73,089
                                      Number of selected controls =          1
                                      Wald chi2(6)                =    2214.13
                                      Prob > chi2                 =     0.0000

------------------------------------------------------------------------------
             |               Robust
         sbp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |
     Female  |    .509223   .5573234     0.91   0.361    -.5831107    1.601557
             |
        race |
      Black  |  -.5006748   .6594229    -0.76   0.448     -1.79312    .7917703
      Other  |   .1005796   .7350127     0.14   0.891    -1.340019    1.541178
             |
         age |  -.2644798   .3941441    -0.67   0.502    -1.036988    .5080285
         bmi |  -.5820806    .806618    -0.72   0.471    -2.163023    .9988617
             |
 c.age#c.bmi |   .0416917   .0158851     2.62   0.009     .0105574     .072826
------------------------------------------------------------------------------
Note: Chi-squared test is a Wald test of the coefficients of the variables
      of interest jointly equal to zero. Lassos select controls for model
      estimation. Type lassoinfo to see number of selected variables in each
      lasso.

We just fit a model that includes over 70,000 variables!

I told you at the beginning that I discovered a feature of frames that blew my mind, and you may be guessing that was it. But it wasn’t.

Stata/MP can open datasets with up to 120,000 variables and over 20 billion observations. While I was experimenting with my genetics examples, I wondered “Is it 120,000 variables total or 120,000 variables in each frame?” So I created one frame for chromosome 1 with 73,090 variables and another frame for chromosome 2 with 71,101. That’s 144,191 variables in memory at one time! So I got carried away and wrote the following loop:

forvalues i = 1/22 {
   frame create chr`i'
   frame change chr`i'
   use chromosome`i'
   frlink 1:1 id, frame(patients)
   frget sbp age sex race bmi, from(patients)
}

I had over 900,000 variables open in Stata simultaneously! This blew my mind. Perhaps it blows your mind too. If it doesn’t, consider that Stata/MP can use up to 100 frames simultaneously. Which means that Stata/MP can theoretically use 12,000,000 variables! That’s a lot of data. And that’s some of what you can do with frames in Stata 16.