Essential tools for data quality checks
Before we fit statistical models with our datasets, we typically go through a few checks to confirm that our data are accurate and complete. Regardless of whether you have obtained data from an organization or built the dataset yourself, it is worthwhile to check for data entry errors. Below, we will show you four essential Stata commands for performing quality checks on your data: duplicates, isid, assert, and misstable.
Duplicates
We have fictional data on patients that underwent corrective eye surgery. For each patient we have an identification number, the date they were admitted for surgery, their age and sex, and their systolic blood pressure.
. use datacheck1
. describe
Contains data from datacheck1.dta
Observations: 19
Variables: 7 22 Apr 2026 12:06
-----------------------------------------------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-----------------------------------------------------------------------------------------------------------------------------------
patient_id float %9.0g Patient ID
sex float %9.0g Sex
age float %9.0g Age
surgery_date float %td Surgery date
birth_date float %td Birth date
bpsystol float %9.0g Systolic BP
highbp float %9.0g BP 160+
-----------------------------------------------------------------------------------------------------------------------------------
Sorted by:
Patients can have the surgery only once, with an enhancement years later. Therefore, we first want to make sure that our information on patient IDs and dates is correct. We begin by checking for any duplicate observations.
. duplicates report
Duplicates in terms of all variables
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 15 0
2 | 4 2
--------------------------------------
Out of the 19 observations in this dataset, 15 are unique. For these 15 observations, we have a single copy of the information. However, there are four observations that are duplicates. There are two patients for which we have two copies. We list them below:
. duplicates list
Duplicates in terms of all variables
+--------------------------------------------------------------------------------+
| Group Obs patien~d sex age surgery~e birth_d~e bpsystol highbp |
|--------------------------------------------------------------------------------|
| 1 4 3 1 51 09sep2025 13aug1974 135 0 |
| 1 5 3 1 51 09sep2025 13aug1974 135 0 |
| 2 8 6 1 38 18nov2025 10oct1987 125 0 |
| 2 9 6 1 38 18nov2025 10oct1987 125 0 |
+--------------------------------------------------------------------------------+
We have two copies of patient IDs 3 and 6; we can see that the same information is repeated for all variables. Below, we drop the duplicates.
. duplicates drop
Duplicates in terms of all variables
(2 observations deleted)
A community-contributed command that is also useful is distinct; this command will report the number of distinct values for one or more variables. You can also report the number of distinct groups defined by multiple variables, such as the number of unique groups defined by patient ID and surgery date. Type search distinct to learn more, and follow the instructions to install it if you would like to use this command.
Unique identifiers
With the duplicates removed, we now check whether observations are uniquely identified by the combination of patient ID and surgery date; if they are, isid will report nothing. We use the prefix capture to capture a return code in case isid does produce an error; this is useful when placed in do-files because it allows your do-file to continue to run despite any errors. We also use the noisily prefix so we can see the error message.
. capture noisily: isid patient_id surgery_date, sort
variables patient_id and surgery_date do not uniquely identify the observations
We see that patient_id and surgery_date do not uniquely identify observations. Let’s check whether we have any duplicates for patient ID:
. duplicates report patient_id
Duplicates in terms of patient_id
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 13 0
2 | 4 2
--------------------------------------
These duplicates are observations that have the same value for patient ID but different values for other variables; otherwise, they would have been reported in our prior call to duplicates report. Let’s take a closer look at the duplicates:
. duplicates list patient_id
Duplicates in terms of patient_id
+------------------------+
| Group Obs patien~d |
|------------------------|
| 1 1 1 |
| 1 2 1 |
| 2 10 9 |
| 2 11 9 |
+------------------------+
. list if patient_id == 1 | patient_id == 9, abbrev(14)
+------------------------------------------------------------------------+
| patient_id sex age surgery_date birth_date bpsystol highbp |
|------------------------------------------------------------------------|
1. | 1 0 34 15mar2020 10feb1986 163 1 |
2. | 1 0 39 20mar2025 10feb1986 165 1 |
10. | 9 1 45 25sep2025 20jun1980 140 0 |
11. | 9 0 47 25sep2025 17jul1978 135 0 |
+------------------------------------------------------------------------+
Observations 1 and 2 both have patient IDs equal to 1; they have the same value for birth_date and sex. This seems to be the same patient; they originally had surgery in 2020 and visited in 2025 for a touch-up. Therefore, these two observations are duplicates for patient_id but not truly duplicates because they differ for other variables, like age and surgery_date. For some data applications, you may want to drop these types of observations; you could do so by typing the following:
duplicates drop patient_id, force
The force option is required here because you are dropping observations that are duplicates in terms of one variable but that are unique based on values of other variables. If we were to issue this command, we would be losing information about this patient’s enhancement surgery, which we don’t want to do; therefore, be aware that you are losing data when dropping these types of observations.
We also see that observations 10 and 11 both have patient IDs equal to 9. They have the same surgery date but different values for birth date and sex, so this seems to be a data entry error. We need to change the patient ID for one of these observations to another value; let’s check the current range of ID numbers.
. codebook patient_id
-----------------------------------------------------------------------------------------------------------------------------------
patient_id Patient ID
-----------------------------------------------------------------------------------------------------------------------------------
Type: Numeric (float)
Range: [1,15] Units: 1
Unique values: 15 Missing .: 0/17
Mean: 7.64706
Std. dev.: 4.52688
Percentiles: 10% 25% 50% 75% 90%
1 4 8 11 14
. replace patient_id = 16 in 11
(1 real change made)
We have patient IDs ranging from 1 to 15. To make sure that the ID number is unique to each patient, we can change the patient ID to 0 or 16; we choose 16.
codebook is useful for checking the range, units, and number of missing values for a variable. If you want a closer look at the frequency for each value, consider using fre; this community-contributed command creates one-way frequency tables, and it is especially useful if you are using value labels. For example, you might want to check how many observations there are per county; fre would display the county number and label, such as ‘‘1 Los Angeles’’ and ‘‘2 Bronx’’. Type search fre to learn more, and follow the instructions to install it if you would like to use this command.
We now run isid once more to confirm that we can uniquely identify each patient.
. isid patient_id surgery_date
Nothing is reported. We can confirm that we have one observation per patient and surgery date.
Verify truth of claim
Next, we want to make sure that our variable highbp was coded correctly. We consider systolic blood pressures of 160, or greater, to be high. Let’s confirm that we have a value of 1 for highbp for observations with a systolic blood pressure of at least 160. We specify the expression that highbp is equal to 1 when bpsystol is greater than or equal to 160; if the assertion is true for all observations, assert will report nothing. However, if it is not true, even for just one observation, the output will let us know that it is false.
. capture noisily: assert highbp == 1 if bpsystol >= 160
2 contradictions in 9 observations
assertion is false
assert checks whether our expression is true for each observation, and it reports that there are 2 contradictions. If you are working with a large dataset, consider using the fast option, which forces assert to stop at the first contradiction. This way, you don’t have to wait while assert checks every observation.
There are two observations for which our expression is false. This might be because systolic blood pressure was in fact low but highbp was mistakenly coded as 1 or because blood pressure was high but highbp was mistakenly coded as 0. We check for both below.
. list if highbp == 1 & bpsystol <= 160
. list if highbp == 0 & bpsystol >= 160
+------------------------------------------------------------------+
| patien~d sex age surgery~e birth_d~e bpsystol highbp |
|------------------------------------------------------------------|
13. | 11 0 24 12dec2025 11nov2001 179 0 |
15. | 13 1 34 26oct2025 15sep1991 . 0 |
+------------------------------------------------------------------+
For observation 13, highbp should instead have been coded as 1. We make that change below.
. replace highbp = 1 in 13
(1 real change made)
Check for missing values
For observation 15, the blood pressure was missing, so highbp should be missing too. Let’s see how many missing values we have in our dataset.
. misstable summarize
Obs<.
+------------------------------
| | Unique
Variable | Obs=. Obs>. Obs<. | values Min Max
-------------+--------------------------------+------------------------------
bpsystol | 2 15 | 12 115 187
-----------------------------------------------------------------------------
The variable bpsystol is the only one with missing values. Let’s make sure that highbp is missing for the other observation for which bpsystol is also missing.
. list if missing(bpsystol)
+------------------------------------------------------------------+
| patien~d sex age surgery~e birth_d~e bpsystol highbp |
|------------------------------------------------------------------|
14. | 12 1 26 14nov2025 10oct1999 . 1 |
15. | 13 1 34 26oct2025 15sep1991 . 0 |
+------------------------------------------------------------------+
We need to replace both values with the system missing value.
. replace highbp = . if bpsystol == .
(2 real changes made, 2 to missing)
With that final change, we check the truth of our claim once more. We assert that highbp is equal to 1 when bpsystol is not missing and greater than or equal to 160.
. assert highbp == 1 if bpsystol >= 160 & bpsystol != .
Our assertion is now true.
That is how you can check for duplicates and missing values and how you can confirm whether you have a unique identifier and whether statements about your data are in fact true.