From datasets to framesets and alias variables: Data management advances in Stata
The aim of this blog is to describe two novel features introduced in Stata 18 (released in 2023): 1) framesets and 2) alias variables across frames. These features enable Stata to deal with a multiplicity of potentially very large datasets efficiently and conveniently. Framesets allow you to bundle, save on file, and load in memory a set of related frames that hold datasets. Alias variables allow you to access variables in other frames as if they were part of the current frame, with very little memory overhead.
Data management in Stata
When Stata 1.0 was released in 1985, data were organized in a tabular form as observations (rows) and variables (columns) and were called a dataset. Datasets were kept entirely in memory (then measured in kilobytes) and saved on disk as .dta files. Data types, like integers, real numbers, and especially strings, were frugally managed. Most of the maiden 44 commands were for data management, including the still inescapable generate, replace, and list. This underlying framework has remained the bedrock for the 17 versions of Stata that followed: datasets are still kept as tables entirely in memory, with strongly typed languages to process the data. This makes Stata fast and allows billions of observations to be processed in milliseconds. However, holding entire datasets in memory is restrictive with very large datasets. Nonetheless, leveraging the phenomenal growth of affordable memory, Stata’s data management capabilities kept getting bigger, stronger, and faster. In this blog, I discuss new features for handling large datasets, namely, frames, framesets, and alias variables. I describe these features in detail in the next three sections. In an appendix at the end of this blog, I provide an overview of how Stata’s data management capabilities have grown over time.
Frames: A framework for multiple datasets
With large and complex data, there is often a need to work with multiple, and potentially huge, datasets concurrently. You may want to multitask and work with various datasets for various projects. Or you may be working with a set of related datasets and want to consolidate statistics across them. There are Stata commands, like preserve and restore, that enable you to switch from one dataset to another. But those require some careful coding and entail a time penalty for saving and restoring datasets to and from disk.
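For contrast with the frames approach, here is a minimal sketch of the preserve and restore pattern mentioned above, assuming it is run as a do-file (the // comments are not accepted at the interactive prompt). The choice of collapse as the temporary detour is only for illustration.

```stata
* Sketch: a temporary detour from the dataset in memory
sysuse auto, clear
preserve                                // snapshot the data currently in memory
collapse (mean) price mpg, by(foreign)  // data are now one summary row per group
list
restore                                 // bring back the original dataset
count
```

Only one dataset is visible at any moment in this pattern; the original data are out of reach between preserve and restore.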
In Stata 16 (2019), a new framework for handling multiple datasets was introduced: frames. Multiple datasets can be kept in memory in multiple frames. For example, here is how you can create a frame with frame create, make that frame the current (working) frame with frame change, and load a dataset into it:
. frame create auto
. frame change auto
. sysuse auto
(1978 automobile data)
You can make a copy of a frame and rename a frame:
. frame copy auto auto1
. frame rename auto1 cars
Names of datasets and frames that hold them can be different. Also, even if there are multiple frames in memory, you can interactively work with one frame (the current frame) at a time. You can identify the current frame with pwf (print working frame):
. pwf
  (current frame is auto)
You always work with the current frame, by default. That said, the frame prefix capability allows you to run a command on a frame other than the current one. For example, you can generate a new variable, say, newvar (with random values here), in frame cars:
. frame cars: generate newvar = runiform()
You can also use frlink to create a link between the current frame and another frame. For instance, you can create a one-to-one link (by specifying 1:1) between current frame auto and frame cars by matching the observations on variable make (which holds the makes of the cars):
. frlink 1:1 make, frame(cars)
  (all observations in frame auto matched)
You can delete a frame (if it’s not the current one) with frame drop:
. frame drop cars
You can reset frames with
. frames reset
That will reset Stata to a state where a single, empty frame is in memory.
You can do more with frames: copy data variables and observations with frame put, add new observations with frame post, etc.
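As a rough sketch of these two commands, assuming a do-file run on the auto data: frame put copies a subset of variables and observations into a new frame, and frame post appends one observation at a time to a frame created with a predefined variable list. The frame names imports and results, and the variable names varname and mean, are made up for this illustration.

```stata
clear all
sysuse auto

* frame put: copy selected variables and observations into a new frame
frame put make price mpg if foreign, into(imports)
frame imports: describe, short

* frame post: collect results, one observation per call, in another frame
frame create results str10 varname double mean
foreach v in price mpg weight {
    summarize `v', meanonly
    frame post results ("`v'") (r(mean))
}
frame results: list
```

The current frame stays untouched throughout; the collected means live in results until that frame is dropped or reset.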
Note that commands related to frames and framesets work exactly the same way whether you type frame or frames; they are synonymous. For a good introduction to frames, see help frames intro.
Stata supports up to 100 frames. Just like individual datasets, all frames are kept entirely in memory. This makes working with frames, too, very fast. But it assumes that the data in all frames fit in memory, a constraint that motivated the two new features in Stata 18.
New in Stata 18: Framesets
Stata 18 adds a natural evolution of the frames concept: users can now save on disk, in a memory-efficient way, a set of frames. A new data file format is introduced for framesets: .dtas, the plural of .dta.
For example, let’s create three frames and load into them three different datasets (related to life expectancy):
. frame create life0
. frame create life1
. frame create life2
. frame life0: sysuse lifeexp
(Life expectancy, 1998)
. frame life1: sysuse uslifeexp
(U.S. life expectancy, 1900-1999)
. frame life2: sysuse uslifeexp2
(U.S. life expectancy, 1900-1940)
You can save these three frames in one frameset file, say, life.dtas, with
. frames save life, frames(life0 life1 life2)
  file life.dtas saved
You can later reset or clear all frames and load frames saved in life.dtas with
. frames reset
. frames use life
  life0  68 x 6; Life expectancy, 1998
  life1  100 x 10; U.S. life expectancy, 1900-1999
  life2  41 x 2; U.S. life expectancy, 1900-1940
When working with a set of frames, you have to consider a number of factors. For example, what if the frames you want to load from disk have the same names as those in memory? Which frame becomes the current frame when a frameset is loaded? What if you try to load a previously linked frame that does not exist anymore?
We provide frames describe, which takes stock of frames and the variables they hold, both in memory and on disk. For example, the following gives a (short) description of frames in frameset life.dtas:
. frames describe using life, short
-------------------------------------------------------------------------------
Frame: life0
Contains data                                 Life expectancy, 1998
 Observations:            68                  26 Aug 2023 20:06
    Variables:             6
Sorted by:
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Frame: life1
Contains data                                 U.S. life expectancy, 1900-1999
 Observations:           100                  26 Aug 2023 20:06
    Variables:            10
Sorted by: year
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Frame: life2
Contains data                                 U.S. life expectancy, 1900-1940
 Observations:            41                  26 Aug 2023 20:06
    Variables:             2
Sorted by: year
-------------------------------------------------------------------------------
Frameset commands also store numerous r-results to keep track of what is happening, for example, the subset of frames being saved or loaded, whether data in each frame has changed in memory, and so on.
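One way to see what is stored is to run return list right after a frameset command. The sketch below assumes a do-file; the file name life_demo is a placeholder, and no particular r() names are assumed here, since return list simply prints whatever the preceding command left behind.

```stata
clear all
frame create life1
frame life1: sysuse uslifeexp

* Save a one-frame frameset, then inspect the stored results
frames save life_demo, frames(life1) replace
return list

* Describe the frameset on disk, then inspect the stored results again
frames describe using life_demo, short
return list
```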
As with .dta files, we provide the low-level description of .dtas files. help dtas provides all the details needed to have other software read and write .dtas files.
The syntax and options of frameset commands follow, quite naturally, those of dataset commands, like save, use, and describe. For example, dataset and frameset commands handle, in the same way, things like labels, empty datasets, the level of detail in describing datasets, etc.
Stata uses its native zipfile to compress frameset files in frames save, and unzipfile to extract files in frames use. The user can specify the compression level for frames save. This can be done in two ways: through the complevel(#) option or through set dtascomplevel #. # is an integer between 0 and 9—0 means no compression and 9 means maximum compression. The default is 1. For example, life.dtas can be saved and replaced on disk with maximum compression by typing
. frames save life, frames(life0 life1 life2) complevel(9) replace
  file life.dtas saved
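The second route, set dtascomplevel, changes the level used by subsequent frames save commands that do not specify complevel(). A minimal sketch, assuming a do-file and the same three life-expectancy frames:

```stata
clear all
frame create life0
frame create life1
frame create life2
frame life0: sysuse lifeexp
frame life1: sysuse uslifeexp
frame life2: sysuse uslifeexp2

set dtascomplevel 9                                   // maximum compression from here on
frames save life, frames(life0 life1 life2) replace   // no complevel() needed
set dtascomplevel 1                                   // back to the default level
```

Setting the level once suits scripts that save many framesets; the complevel() option suits a one-off save.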
Note that frames and framesets are built on top of datasets. That means you can keep working with datasets in exactly the same way as you did before, if frames and framesets are not of practical interest to you. The only thing you probably need to know is that when you use a dataset, it goes into a frame by default—and this frame is, unsurprisingly, named default. At the end of the day, even with frames, you interactively work with one dataset or one frame at any given time.
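A quick way to see this is to load a dataset without creating any frame and then ask which frame is current. This is only a small sketch using commands already introduced above:

```stata
clear all
sysuse auto     // no frame commands issued: the data go into frame default
pwf             // report the current frame
frames dir      // list all frames in memory with their dimensions
```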
New in Stata 18: Alias variables across frames
In this section, I describe how alias variables can be used to access variables across frames in a memory-efficient manner.
Two datasets in different frames can be related by having matching variables. As mentioned earlier, you can link frames with frlink by matching observations in the current frame with observations in the related frame, based on common variables.
After creating links with frlink, you can use fralias add to define variable aliases—names that reference variables in a linked frame.
Here is an example of adding an alias variable. First, let’s set up the auto and cars frames in memory as we did above.
. clear all
. frame create auto
. frame change auto
. sysuse auto
(1978 automobile data)
. frame copy auto cars
. frame cars: generate newvar = runiform()
. pwf
  (current frame is auto)
The two frames are the same, except for variable newvar added to cars. From the current frame auto, you can create a one-to-one link with cars, based on common variable make:
. frlink 1:1 make, frame(cars)
  (all observations in frame auto matched)
Now, an alias variable, say, newvar, can be created in current frame auto to access variable newvar in cars:
. fralias add newvar, from(cars)
  (1 variable aliased from linked frame)
Here the alias variable has the same name as the variable it points to. But it can be different. We’ll show how in the next example.
In essence, fralias add defines references from the current frame to variables in linked frames. The references enable you to work with the linked variables without copying them in the current frame. These references consume very little memory; the variables are actually stored only in one frame or dataset but can be made available in different frames.
Here are a few more comments about frlink, on which fralias is predicated. When you use frlink, a new variable is created in the current frame. It references the linked frame. By default, the new variable is named after the linked frame. But a different variable name can be generated with option generate().
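The following sketch illustrates the generate() option, assuming a do-file run on the auto data. The link variable name lnk is arbitrary; note that fralias add then refers to the link variable in from(), not to the frame name.

```stata
clear all
sysuse auto
frame copy default cars
frame cars: generate newvar = runiform()

* Name the link variable lnk instead of the default (the linked frame's name)
frlink 1:1 make, frame(cars) generate(lnk)
frlink describe lnk

* from() takes the link variable
fralias add newvar, from(lnk)
describe lnk newvar
```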
Also, the matching of observations with common variables done by frlink can be one to one or many to one. Rather usefully, frlink will also handle variables that are common in different frames but with different names. Furthermore, frlink can match groups of variables using wildcard * in variable names. Should there be changes in data, or frames renamed, links can be rebuilt with frlink rebuild or dropped by dropping the link variable.
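Here is a small sketch of two of these points, assuming a do-file run on the auto data: linking on variables with different names by listing the other frame's variable inside frame(), and rebuilding a link after the linked data change. Renaming make to carmake is done only to create the mismatch for this illustration.

```stata
clear all
sysuse auto
frame copy default cars
frame cars: rename make carmake

* Match make (current frame) with carmake (frame cars)
frlink 1:1 make, frame(cars carmake)

* Data in the linked frame change, so the stored matches go stale
frame cars: drop if foreign
frlink rebuild cars        // rematch using the original link definition

* Dropping the link variable removes the link
drop cars
```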
Alias variables created by fralias add are treated like any other variable in your dataset, with the caveat that you are not allowed to change their values. For a given alias variable, if you change the corresponding variable’s values in the linked frame where they reside, the changed values are automatically available the next time you use the alias variable. So changing the variables in one frame is sufficient, and the change is reflected in all frames that reference them.
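The sketch below, assuming a do-file run on the auto data, shows both behaviors: an attempt to modify the alias is rejected, while a change made in the frame that stores the variable shows up through the alias. capture noisily is used so that the do-file continues past the expected error.

```stata
clear all
sysuse auto
frame copy default cars
frame cars: generate newvar = runiform()
frlink 1:1 make, frame(cars)
fralias add newvar, from(cars)

summarize newvar
capture noisily replace newvar = 0          // expected to fail: aliases are read-only
frame cars: replace newvar = 100 * newvar   // change the stored variable instead
summarize newvar                            // the alias now reflects the new values
```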
Alias variables allow many frames to have the same variable as if it belongs to all of them, but the variable is stored in only one frame. This avoids creating duplicates of variables or using expensive commands like merge or frget. The latter, for example, copies variables from a linked frame into the current frame, which can carry a large memory footprint, especially with expensive data types like double and string. In contrast, alias variables, being mere references in memory, have small, fixed memory footprints. Using alias variables is therefore memory efficient and helps afford holding all frames in memory, which keeps Stata quick and nimble.
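To make the contrast concrete, here is a minimal sketch, assuming a do-file run on the auto data, that brings the same linked variable into the current frame both ways. The names copyvar and aliasvar are made up for this illustration.

```stata
clear all
sysuse auto
frame copy default cars
frame cars: generate double newvar = runiform()
frlink 1:1 make, frame(cars)

frget copyvar = newvar, from(cars)          // physical copy stored in this frame
fralias add aliasvar = newvar, from(cars)   // reference only; data stay in cars

describe copyvar aliasvar
```

With 74 observations the difference is negligible; it matters when the linked variable is a long string or double across many millions of rows.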
Example of frameset and alias variable
In this section, I provide a more complete example and delve into additional features of frameset and alias variable commands.
Suppose you are working on a project about the income level in the state of Texas in the United States and would like to analyze the data at person and county level (each United States state comprises counties).
You are using two Stata datasets: persons.dta and txcounty.dta. You can load the two datasets in two frames, say, persons and counties, as follows:
. clear all
. frame create persons
. frame change persons
. webuse persons
. frame create counties
. frame change counties
. webuse txcounty
(Median income in Texas counties)
You can describe the two frames with the frame prefix:
. frame persons: describe
Contains data from https://www.stata-press.com/data/r18/persons.dta
 Observations:            20
    Variables:             3                  16 Apr 2022 13:36
                                              (_dta has notes)
----------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
----------------------------------------------------------------------
personid        byte    %9.0g                 Person ID
countyid        byte    %9.0g                 County ID
income          float   %9.0g                 Household income
----------------------------------------------------------------------
Sorted by:

. frame counties: describe
Contains data from https://www.stata-press.com/data/r18/txcounty.dta
 Observations:             8                  Median income in Texas counties
    Variables:             2                  30 Dec 2022 06:13
                                              (_dta has notes)
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
countyid        byte    %9.0g      cty        County ID
median_income   float   %9.0g                 Household median income
-------------------------------------------------------------------------------
Sorted by:
With clear all above, we automatically started with an empty working frame called default. We then added two frames on top of default. We can list the frames in memory and identify the current frame with
. frames dir
  counties  8 x 2; Median income in Texas counties
  default   0 x 0
  persons   20 x 3; persons.dta
. pwf
  (current frame is counties)
counties is the current frame because it is the last frame we changed to. If we want to work with persons, we have to change to that frame:
. frame change persons
Because frames persons and counties have common variable countyid, we can use frlink to link current frame persons to frame counties, based on countyid. Because many persons belong to the same county, the matching here is many to one (m:1):
. frlink m:1 countyid, frame(counties)
  (all observations in frame persons matched)
The matching variables do not have to have the same name. It is rather straightforward to do the linking in such cases. help frlink has the details.
Note that the frlink command above created a new variable in persons, named counties after the linked frame. Option generate() could have been used in frlink to create a different variable name. The values of the new variable are the matching observation numbers in counties.
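To look at the link variable directly, you could list it next to the matching key and ask frlink to describe it. This sketch assumes a do-file and, like the example itself, needs Internet access for webuse:

```stata
clear all
frame create persons
frame persons: webuse persons
frame create counties
frame counties: webuse txcounty
frame change persons

frlink m:1 countyid, frame(counties)

* Each value of counties is an observation number in frame counties
list personid countyid counties in 1/5
frlink describe counties
```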
You can now use frames save to save, on disk, frame persons and all other frames linked to it by specifying option linked; all frames are saved in file myproject.dtas:
. frames save myproject, frames(persons) linked
  file myproject.dtas saved
Note that only frame counties is linked to the current frame in this case, given the frlink command above. So counties is also saved in myproject.dtas, besides persons.
Next, you can reset all frames in memory and later remind yourself what is there in myproject.dtas with frames describe (we use option simple for a compact description):
. frames reset
. frames describe using myproject, simple
--------------------------------------------
Frame: persons
personid  countyid  income    counties
--------------------------------------------
--------------------------------------------
Frame: counties
countyid       median_income
--------------------------------------------
You can later load all frames saved in myproject.dtas in memory with frames use:
. frames use myproject, frames(_all)
  counties  8 x 2; Median income in Texas counties
  persons   20 x 4
Note that, at this point, the current frame is default, as pwf reveals:
. pwf
  (current frame is default)
Even though two frames have been loaded in memory, the current frame (default in this case) did not change with frames use. To work with one of the loaded frames, say, persons, you have to explicitly specify it as the working frame:
. frame change persons
Next, you would like to compare incomes of individual persons to the median income of the county. The median income is available in frame counties. We know that persons is linked to counties based on the frlink command above. We can verify the existing linkages from the current frame (persons) with
. frlink dir
  (1 frlink variable found)
  -----------------------------------------------------------------------------
  counties created by frlink m:1 countyid, frame(counties)
  -----------------------------------------------------------------------------
  Note: Type "frlink describe varname" to find out more, including whether
        the variable is still valid.
To access variable median_income in frame counties, you can add an alias variable, say, median, to reference the variable as follows:
. fralias add median = median_income, from(counties)
  (1 variable aliased from linked frame)
You can describe the alias variable with
. fralias describe median
----------------------------------------------------
Alias     Type    Target          Link      Frame
----------------------------------------------------
median    float   median_income   counties  counties
----------------------------------------------------
You can now run analyses in frame persons that include variable median. Very simply here, you can find the ratio of individual income to the corresponding county median income:
. generate ratio = income/median
Note that alias variable median merely references median_income in counties, which consumes little memory. So you can work with the variable as if it were part of the frame, with very little memory overhead. But you cannot change the variable; it can be changed only in frame counties. Any change in the variable will be available in all frames that reference it.
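One practical consequence, sketched below under the assumption of a do-file with Internet access for webuse: the alias follows changes made in counties, but a variable computed from it with generate, like ratio, is an ordinary stored variable and keeps its old values until you recompute it. The 10% increase is an arbitrary change made only for this illustration.

```stata
clear all
frame create persons
frame persons: webuse persons
frame create counties
frame counties: webuse txcounty
frame change persons

frlink m:1 countyid, frame(counties)
fralias add median = median_income, from(counties)
generate ratio = income/median

* Change the stored variable in its home frame
frame counties: replace median_income = 1.1 * median_income

summarize median ratio          // median reflects the change; ratio does not
replace ratio = income/median   // recompute the derived variable
summarize ratio
```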
Summary
In this post, I described two data management features recently introduced in Stata: framesets and alias variables. While sticking to basic principles that make data processing in Stata simple, intuitive, and fast (like keeping the entire dataset in memory), we kept augmenting Stata’s capabilities in data management. The capability to handle large and complex datasets took a leap in Stata 16 with the introduction of frames: multiple, and presumably related, datasets can be simultaneously kept in memory as frames. In Stata 18, we followed up with a natural evolution of frames: the ability to save multiple datasets or frames in a single, compressed file and later restore the saved frames in memory. We introduced a new file format, the .dtas file. We also provided alias variables that enable access to variables in linked frames. Alias variables across frames are a powerful tool that conveniently and efficiently allow access to variables in different frames without spending memory by generating copies of the variables or using expensive commands to combine very big datasets.
Appendix: A summary of Stata’s data management capabilities
In this section, I describe the key stations along Stata’s journey in data management. This provides a context for the recent features introduced. While the core design and principles that make Stata intuitive and fast have not changed, the data management capabilities have consistently made major strides to handle increasingly large and complex datasets. Here are the highlights.
- The limit in the number of observations has grown steadily, to over a billion in the MP (multiprocessor) version of Stata 14 (2015) and to over a trillion currently. Terabytes of memory are now supported. The maximum number of variables was increased to 120,000 in Stata/MP 15 (2017). Other data maxima also kept growing: length of various names, number of options for a command, length of value labels, macros, etc. In practice, with automatic memory management introduced in Stata 12 (2011) and enhanced in Stata 14 (2015), the maximum size limits, like the number of observations and variables, are essentially constrained by how much memory is available. help limits will tell you more.
- Before Stata 13 (2013), strings were limited to 244 characters. Stata 13 introduced a new data type called strL (long strings), which increased the maximum length of strings to 2 billion characters. This enabled reading large files into strings and writing long strings to files. Thus, a variety of files could be handled in Stata commands and functions: Word documents, JPEG images, plain text ASCII, EBCDIC, binary, VARCHARs (variable character fields), BLOBs (binary large objects), CLOBs (character large object strings), and more. help datatypes has more details.
- Stata gradually introduced support for an increasingly wide variety of specialized data: longitudinal/panel, survival/duration, time series, survey, discrete choice, spatial, and multiple imputations (to handle missing data).
- All editions of Stata are available on all major operating systems and hardware platforms—with full compatibility. Stata datasets, programs, and other data can be shared across editions and platforms without translation.
- With Stata/MP, the multiprocessor edition of Stata introduced in version 9 (2005), massive speedup was achieved. Up to 64 cores/processors can be supported. Subsequently, many commands and built-in routines have been modified to take advantage of parallelization, wherever possible—from data management tasks like adding variables and sorting to analysis tasks like regression and other computationally intense estimation commands.
- Data can be imported from, and exported to, a growing number of popular file formats, including Excel, SAS, SPSS, dBase—besides standard formats like comma-separated values (.csv) and fixed column data. Stata provides support for JDBC and ODBC and database products like Oracle, MySQL, Amazon Redshift, Snowflake, Microsoft SQL Server, and DB2. Stata also provides access to data repositories like the Federal Reserve Economic Data, Wharton Research Data Services, Haver Analytics, International Statistical Classification of Diseases and Related Health Problems (ICD-9 and ICD-10).
- Stata’s interoperability capabilities also made significant inroads. There has been growing integration with other development platforms like Java, Python, and H2O (for machine learning and predictive analytics). Stata became web aware in version 8 (2003) with commands like webuse. Thereafter, Stata made strides to seamlessly and efficiently access and interoperate with data sources and platforms over the Internet and the Cloud.
- The graphics-driven data editor was introduced in Stata 8 (2003) and then consistently improved. Spreadsheet editing capabilities, like live view of data, adding and changing observations/variables/cells, importing data, and copying and pasting, have been continuously enhanced.
- Stata 14 (2015) introduced support for Unicode (UTF-8). Subsequently, Stata added support for several languages in its interface, menus, and dialogs. Besides English, Stata speaks Chinese, Japanese, Korean, Spanish, and Swedish.
- Mata is a programming language introduced in Stata 9 (2005) with powerful matrix capabilities. The matrices may contain portions of or entire datasets. In fact, Mata matrices can be made with views of Stata datasets and frames and can have up to 281 trillion rows and columns, if the computer has sufficient memory. Mata is compiled and is very efficient. It can run up to 40 times faster than Stata’s interpreted languages and is useful for CPU and memory-intensive numerical methods involving large vectors and matrices.
- To handle larger and more complex projects, Stata introduced a Project Manager in release 13 (2013) to organize data and analysis files under multiple projects.
- Stata 16 (2019) introduced frames. This provides the ability to keep multiple datasets in memory and work with them concurrently. Building on the framework for frames, two new features (the focus of this blog) were introduced in Stata 18 (2023): the ability to save and load sets of frames (or framesets) and the ability to access variables in different frames through alias variables.
Reference
Cox, N. J. 2015. A short history of Stata on its 30th anniversary. In Thirty Years with Stata: A Retrospective, ed. E. Pinzon, 135–147. College Station, TX: Stata Press.
Resources
[D] frames
[D] frames intro
[D] frames save
[D] frames use
[D] frames describe
set dtascomplevel
[D] frlink
[D] fralias
https://www.stata.com/new-in-stata/frameset/
https://www.stata.com/new-in-stata/alias-variables-across-frames/
https://www.stata.com/features/overview/multiple-datasets-in-memory/
https://www.stata.com/features/data-management/