Essential tools for data quality checks

13 May 2026 Gabriela Ortiz, Senior Applied Econometrician, and Hua Peng, Executive Director of Software Engineering and Data Science

Before we fit statistical models with our datasets, we typically go through a few checks to confirm that our data are accurate and complete. Regardless of whether you have obtained data from an organization or built the dataset yourself, it is worthwhile to check for data entry errors. Below, we will show you four essential Stata commands for performing quality checks on your data: duplicates, isid, assert, and misstable.

Duplicates

We have fictional data on patients who underwent corrective eye surgery. For each patient, we have an identification number, the date they were admitted for surgery, their birth date, age, and sex, and their systolic blood pressure, along with an indicator for high blood pressure.

. use datacheck1

. describe

Contains data from datacheck1.dta
 Observations:            19
    Variables:             7                  22 Apr 2026 12:06
-----------------------------------------------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-----------------------------------------------------------------------------------------------------------------------------------
patient_id      float   %9.0g                 Patient ID
sex             float   %9.0g                 Sex
age             float   %9.0g                 Age
surgery_date    float   %td                   Surgery date
birth_date      float   %td                   Birth date
bpsystol        float   %9.0g                 Systolic BP
highbp          float   %9.0g                 BP 160+
-----------------------------------------------------------------------------------------------------------------------------------
Sorted by:

Patients can have the surgery only once, possibly followed by an enhancement years later. Therefore, we first want to make sure that our information on patient IDs and dates is correct. We begin by checking for any duplicate observations.

. duplicates report

Duplicates in terms of all variables

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |           15             0
        2 |            4             2
--------------------------------------

Of the 19 observations in this dataset, 15 are unique; that is, we have a single copy of each. The remaining four observations are duplicates: two patients each appear twice, so two observations are surplus. We list them below:

. duplicates list

Duplicates in terms of all variables

  +--------------------------------------------------------------------------------+
  | Group   Obs   patien~d   sex   age   surgery~e   birth_d~e   bpsystol   highbp |
  |--------------------------------------------------------------------------------|
  |     1     4          3     1    51   09sep2025   13aug1974        135        0 |
  |     1     5          3     1    51   09sep2025   13aug1974        135        0 |
  |     2     8          6     1    38   18nov2025   10oct1987        125        0 |
  |     2     9          6     1    38   18nov2025   10oct1987        125        0 |
  +--------------------------------------------------------------------------------+

We have two copies of patient IDs 3 and 6; we can see that the same information is repeated for all variables. Below, we drop the duplicates.

. duplicates drop

Duplicates in terms of all variables

(2 observations deleted)
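
If you would rather inspect the copies before deleting anything, duplicates tag can mark them first. The following is a minimal sketch meant to be run before duplicates drop; the variable name dup is our own choice and is not part of the dataset.

```stata
* Sketch: flag exact duplicates before dropping them.
* dup is a hypothetical variable name; it holds the number of
* other observations that are identical on all variables.
duplicates tag, generate(dup)

* Review the flagged observations (dup is 0 for unique observations)
list if dup > 0, sepby(patient_id)

* Once satisfied, drop the surplus copies and remove the helper variable
duplicates drop
drop dup
```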

A community-contributed command that is also useful is distinct; this command will report the number of distinct values for one or more variables. You can also report the number of distinct groups defined by multiple variables, such as the number of unique groups defined by patient ID and surgery date. Type search distinct to learn more, and follow the instructions to install it if you would like to use this command.
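
If you prefer to stay with official commands, a similar count can be obtained with egen. This is a sketch under the assumption that you only need the number of distinct patient ID and surgery date combinations; first_pair is a hypothetical variable name.

```stata
* Sketch: count distinct (patient_id, surgery_date) groups without
* installing anything. tag() marks exactly one observation per group.
egen byte first_pair = tag(patient_id surgery_date)

* The number of tagged observations equals the number of distinct groups
count if first_pair

* Remove the helper variable so it does not affect later checks
drop first_pair
```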

Unique identifiers

With the duplicates removed, we now check whether observations are uniquely identified by the combination of patient ID and surgery date; if they are, isid will report nothing. We use the capture prefix to trap the return code in case isid does produce an error; this is useful in do-files because it allows the do-file to continue running despite the error. Because capture also suppresses output, we add the noisily prefix so that we can still see the error message.

. capture noisily: isid patient_id surgery_date, sort
variables patient_id and surgery_date do not uniquely identify the observations
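
In a do-file, the return code that capture stores in _rc can be used to decide what happens next. The sketch below is one possible pattern, not the only one; the message text is our own.

```stata
* Sketch: branch on the return code left behind by capture.
capture noisily isid patient_id surgery_date

if _rc != 0 {
    * isid failed, so report the problem and show where the duplicates are
    display as error "isid failed with return code " _rc
    duplicates report patient_id surgery_date
}
```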

We see that patient_id and surgery_date do not uniquely identify observations. Let’s check whether we have any duplicates for patient ID:

. duplicates report patient_id

Duplicates in terms of patient_id

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |           13             0
        2 |            4             2
--------------------------------------

These duplicates are observations that have the same value for patient ID but different values for other variables; otherwise, they would have been reported in our prior call to duplicates report. Let’s take a closer look at the duplicates:

. duplicates list patient_id 

Duplicates in terms of patient_id

  +------------------------+
  | Group   Obs   patien~d | 
  |------------------------|
  |     1     1          1 | 
  |     1     2          1 | 
  |     2    10          9 |
  |     2    11          9 |
  +------------------------+

. list if patient_id == 1 | patient_id == 9, abbrev(14)

     +------------------------------------------------------------------------+      
     | patient_id   sex   age   surgery_date   birth_date   bpsystol   highbp |
     |------------------------------------------------------------------------|
  1. |          1     0    34      15mar2020    10feb1986        163        1 |
  2. |          1     0    39      20mar2025    10feb1986        165        1 |
 10. |          9     1    45      25sep2025    20jun1980        140        0 |
 11. |          9     0    47      25sep2025    17jul1978        135        0 |
     +------------------------------------------------------------------------+

Observations 1 and 2 both have patient IDs equal to 1; they have the same value for birth_date and sex. This seems to be the same patient; they originally had surgery in 2020 and visited in 2025 for a touch-up. Therefore, these two observations are duplicates for patient_id but not truly duplicates because they differ for other variables, like age and surgery_date. For some data applications, you may want to drop these types of observations; you could do so by typing the following:

duplicates drop patient_id, force

The force option is required here because you are dropping observations that are duplicates in terms of one variable but that are unique based on values of other variables. If we were to issue this command, we would be losing information about this patient’s enhancement surgery, which we don’t want to do; therefore, be aware that you are losing data when dropping these types of observations.
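
An alternative that keeps every record is to number each patient's surgeries and decide later which ones to analyze. This is a minimal sketch; visit and n_visits are hypothetical variable names. Note that for patient ID 9 the two records share a surgery date, so their ordering within the ID is arbitrary until the data entry error is fixed.

```stata
* Sketch: number the records within each patient, ordered by surgery date.
bysort patient_id (surgery_date): generate visit = _n
bysort patient_id: generate n_visits = _N

* Inspect patients with more than one record
list patient_id surgery_date visit n_visits if n_visits > 1, sepby(patient_id)

* If an analysis needs only the first surgery, subset at that point
* instead of deleting data permanently, for example:
* keep if visit == 1
```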

We also see that observations 10 and 11 both have patient IDs equal to 9. They have the same surgery date but different values for birth date and sex, so this seems to be a data entry error. We need to change the patient ID for one of these observations to another value; let’s check the current range of ID numbers.

. codebook patient_id
     
-----------------------------------------------------------------------------------------------------------------------------------
patient_id                                                                                                               Patient ID
-----------------------------------------------------------------------------------------------------------------------------------

                  Type: Numeric (float)

                 Range: [1,15]                        Units: 1 
         Unique values: 15                        Missing .: 0/17

                  Mean: 7.64706
             Std. dev.: 4.52688

           Percentiles:     10%       25%       50%       75%       90%
                              1         4         8        11        14

We have patient IDs ranging from 1 to 15. To make sure that the ID number is unique to each patient, we can change the patient ID to 0 or 16; we choose 16.

. replace patient_id = 16 in 11 
(1 real change made)
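
Rather than hard-coding the new ID and the observation number, the same fix could be written so that it does not depend on the sort order. This sketch is an alternative to the replace command shown here, and it assumes that the record to change is the patient ID 9 entry with the 17jul1978 birth date; new_id is a hypothetical local macro name.

```stata
* Sketch: assign the next unused ID to the miscoded record.
summarize patient_id, meanonly
local new_id = r(max) + 1

* Identify the record by its contents instead of its position
* (assumption: the 17jul1978 entry is the one to reassign)
replace patient_id = `new_id' if patient_id == 9 & birth_date == td(17jul1978)
```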

codebook is useful for checking the range, units, and number of missing values for a variable. If you want a closer look at the frequency for each value, consider using fre; this community-contributed command creates one-way frequency tables, and it is especially useful if you are using value labels. For example, you might want to check how many observations there are per county; fre would display the county number and label, such as “1 Los Angeles” and “2 Bronx”. Type search fre to learn more, and follow the instructions to install it if you would like to use this command.

We now run isid once more to confirm that patient ID and surgery date together uniquely identify each observation.

. isid patient_id surgery_date

Nothing is reported. We can confirm that we have one observation per patient and surgery date.

Verify truth of claim

Next, we want to make sure that our variable highbp was coded correctly. We consider systolic blood pressures of 160, or greater, to be high. Let’s confirm that we have a value of 1 for highbp for observations with a systolic blood pressure of at least 160. We specify the expression that highbp is equal to 1 when bpsystol is greater than or equal to 160; if the assertion is true for all observations, assert will report nothing. However, if it is not true, even for just one observation, the output will let us know that it is false.

. capture noisily: assert highbp == 1 if bpsystol >= 160
2 contradictions in 9 observations
assertion is false

assert checks whether our expression is true for each observation, and it reports that there are 2 contradictions. If you are working with a large dataset, consider using the fast option, which forces assert to stop at the first contradiction. This way, you don’t have to wait while assert checks every observation.
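
As a sketch of how these options might be used in a do-file: fast is described above, and rc0 is another assert option that reports contradictions but returns a code of 0 so that the do-file keeps running.

```stata
* Sketch: two ways to run the same check inside a do-file.

* fast: stop at the first contradiction instead of checking every observation
capture noisily assert highbp == 1 if bpsystol >= 160, fast

* rc0: report contradictions but force a return code of 0,
* so execution continues without needing capture
assert highbp == 1 if bpsystol >= 160, rc0
```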

There are two observations for which our expression is false. assert flags only observations that satisfy bpsystol >= 160 but do not have highbp equal to 1. The opposite error, a low systolic blood pressure with highbp mistakenly coded as 1, would not contradict this assertion, so we check for both below.

. list if highbp == 1 & bpsystol < 160

. list if highbp == 0 & bpsystol >= 160

     +------------------------------------------------------------------+
     | patien~d   sex   age   surgery~e   birth_d~e   bpsystol   highbp |
     |------------------------------------------------------------------|
 13. |       11     0    24   12dec2025   11nov2001        179        0 |
 15. |       13     1    34   26oct2025   15sep1991          .        0 |
     +------------------------------------------------------------------+

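For observation 13, highbp should instead have been coded as 1. Observation 15 is listed because Stata treats missing values as larger than any number, so a missing bpsystol satisfies bpsystol >= 160; we deal with it in the next section. We make the change for observation 13 below.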

. replace highbp = 1 in 13 
(1 real change made)
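
Both directions of miscoding can also be checked with a single expression by comparing highbp with the logical result of bpsystol >= 160, which evaluates to 1 or 0. This is a sketch that deliberately excludes observations with a missing bpsystol, because those are handled separately.

```stata
* Sketch: find any observation where highbp disagrees with the
* 160+ rule, in either direction, among nonmissing blood pressures.
count if highbp != (bpsystol >= 160) & !missing(bpsystol)

* List the disagreements, if any, for inspection
list patient_id bpsystol highbp if highbp != (bpsystol >= 160) & !missing(bpsystol)
```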

Check for missing values

For observation 15, the blood pressure was missing, so highbp should be missing too. Let’s see how many missing values we have in our dataset.

. misstable summarize
                                                               Obs<.
                                                +------------------------------ 
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
      bpsystol |         2                  15  |     12        115         187
  -----------------------------------------------------------------------------

The variable bpsystol is the only one with missing values, and it is missing for two observations. Let’s check whether highbp is also missing for the other observation with a missing bpsystol.

. list if missing(bpsystol)

     +------------------------------------------------------------------+
     | patien~d   sex   age   surgery~e   birth_d~e   bpsystol   highbp |
     |------------------------------------------------------------------|
 14. |       12     1    26   14nov2025   10oct1999          .        1 |
 15. |       13     1    34   26oct2025   15sep1991          .        0 |
     +------------------------------------------------------------------+

We need to replace both values with the system missing value.

. replace highbp = . if bpsystol == . 
(2 real changes made, 2 to missing)

With that final change, we check the truth of our claim once more. We assert that highbp is equal to 1 when bpsystol is not missing and greater than or equal to 160.

. assert highbp == 1 if bpsystol >= 160 & bpsystol != .

Our assertion is now true.
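
Once the data are clean, the checks from this post could be collected at the top of an analysis do-file so that any future change to the data that breaks them stops the run. This is a minimal sketch of that idea using only the variables described above.

```stata
* Sketch: data quality checks to run before any analysis.
* Each command is silent when the check passes and stops the
* do-file with an error when it fails.

* One observation per patient and surgery date
isid patient_id surgery_date

* highbp is a 0/1 indicator that matches the 160+ rule
assert inlist(highbp, 0, 1) if !missing(bpsystol)
assert highbp == (bpsystol >= 160) if !missing(bpsystol)

* highbp is missing whenever blood pressure is missing
assert missing(highbp) if missing(bpsystol)
```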

That is how you can check for duplicates and missing values and how you can confirm whether you have a unique identifier and whether statements about your data are in fact true.
