Compatibility and reproducibility
I saw a tweet the other day where someone claimed that StataCorp ensures that the dataset format in Stata X is always different from Stata X-1.
This reminded me of an email I wrote a few years ago to a user who had questions about backward compatibility and reproducibility. I’m going to use large parts of that email in this blog post to share my thoughts on those topics.
I understand the frustration of incompatibilities between software versions. While it may not ease the inevitable difficulties that arise, I would like to explain our efforts in this regard.
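There are two concepts for a software developer to consider with respect to compatibility: forward compatibility and backward compatibility. Forward compatibility means that an older version of the software can work with files produced by a newer version; backward compatibility means that a newer version can work with files and code produced by an older one. Forward compatibility is all but impossible to achieve in general, while backward compatibility is achievable with some effort.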
Regarding the tweet about dataset formats, nothing could be further from the truth. Stata 16, Stata 15, and Stata 14 share the same dataset format, so there should be no issues with compatibility of datasets between the three most recent versions of Stata. Moreover, as painful as a dataset format change is for users, I can promise you it causes our developers and testers even more pain. We strive to not change our format unless we absolutely have to, both for your sake and for our own.
Although I am of course biased, I believe Stata has the best backward compatibility record of any statistical software. Stata is the only statistical software package I am aware of, commercial or open source, that has a strong built-in version control system allowing scripts and programs written years ago to continue to work in modern versions of the software.
You can take a do-file written, say, almost 30 years ago in Stata 3, and as long as that do-file is marked with “version 3” at the top, it can be run, as-is, with no modification, in a modern Stata 16. No broken scripts. No broken programs. No additional effort. Stata was designed from its very first version with reproducible research in mind, and we want users to be confident that years down the road, the files they used to produce a particular analysis will continue to work even if they change operating systems or computer architecture and move to a much newer version of Stata.
You don’t have to keep multiple installations of old versions of Stata, hoping they will still run on a modern operating system, to be able to run code from years or decades before. You can simply use a modern Stata, and it will understand any old code or dataset from the past.
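As a minimal sketch of what such a version-marked do-file might look like, consider the following. The do-file name, dataset name, and variable names are hypothetical and chosen only for illustration; the one element taken from the discussion above is the “version 3” line at the top.

```stata
* legacy_analysis.do  (hypothetical do-file; file and variable names are illustrative)
* Declare the Stata release this script was written for, so that
* later releases interpret its commands the way that release did.
version 3

* Load the dataset saved alongside the original analysis (hypothetical name).
use mystudy, clear

* The commands below are interpreted under version 3 rules.
summarize y x1 x2
regress y x1 x2
```

Assuming the dataset is in the working directory, typing “do legacy_analysis” in a modern Stata should run the script as written, with no edits to the commands themselves.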
Regarding dataset formats, we have always followed three important principles:
1. Never change Stata’s dataset format unless we absolutely have to.
Just because a new Stata version comes out doesn’t mean that we change the dataset format. We only change the dataset format if it is absolutely necessary to support some new feature of the new version. This is why Stata 16 shares the same dataset format as Stata 15 and Stata 14; there were no changes to Stata capabilities that required a format change.
2. Always have complete backward compatibility and cross-platform compatibility.
When Stata 16 came out, it had to be certified to read every dataset format Stata ever produced, all the way back to Stata 1. A modern Stata must always be able to read any dataset produced by any older Stata. In addition, Stata on Windows, Stata on Mac, Stata on Linux (all of which are currently 64-bit systems), and Stata on any other or future hardware platform or operating system must be able to read datasets created on any other hardware platform or operating system, including older legacy 32-bit systems.
We want a researcher who did the original analysis for a journal article back in Stata 4 on Windows 3.1 to be able to run their analysis and load their datasets on an up-to-date Stata 16 running on 64-bit macOS or any other operating system we support.
3. When possible, provide forward compatibility with at least the immediately preceding Stata version.
We do this in two ways, the first of which is always possible. When a version of Stata requires a dataset format change, as Stata 14 did for its Unicode capability, we make sure that version can save the dataset in memory in at least the previous format. We do this with the “saveold” command. In Stata 14, 15, and 16, we took this a bit further: “saveold” can write datasets not just in Stata 13 format but in formats understood by Stata all the way back to Stata 11.
The second way to provide forward compatibility is, when possible, to make the last update to the previous version of Stata able to read datasets created by the most recent version of Stata. For example, the last free update we released to Stata 11 included the ability to read Stata 12 format datasets.
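To illustrate the first of these two approaches, here is a short sketch of how “saveold” might be used in Stata 14, 15, or 16 to hand a dataset to a colleague running an older release. The dataset names are hypothetical; the version() option and the Stata 13 default reflect my understanding of the command's documented behavior in those releases, so check “help saveold” in your own installation before relying on them.

```stata
* Run in Stata 14, 15, or 16 (dataset names are hypothetical).
use mystudy, clear

* With no version() option, saveold is assumed to write the Stata 13 format.
saveold mystudy_v13, replace

* Ask explicitly for a format that Stata 11 can read.
saveold mystudy_v11, version(11) replace
```

The copy in the current format is left untouched; “saveold” only writes additional files that older releases can open.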
We take reproducibility and compatibility—forward, backward, and cross-platform—seriously.