Archive for the ‘Data Management’ Category

Data management made easy

Data management and data cleaning are critically important steps in any data analysis. Many of us learned this lesson the hard way. Have you ever fit a model that includes age as a covariate and forgotten to convert the missing value codes of -99 to missing values? I have. Or maybe you overlooked a data entry error that resulted in an age of 354 that should have been 54. I’ve done that too. Careful data management and cleaning can help us avoid these kinds of embarrassing mistakes.
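Both slips can be caught with a couple of commands. The following is a minimal sketch on made-up data; the variable names, the -99 code, and the plausible age range of 0 to 120 are assumptions chosen for illustration.

```stata
* Toy data (hypothetical): -99 is a missing-value code, 354 is a typo for 54
clear
input id age
1 34
2 -99
3 354
4 54
end

* Convert the numeric code -99 to Stata's system missing value (.)
mvdecode age, mv(-99)

* Flag nonmissing ages outside an assumed plausible range of 0 to 120
generate byte suspect = !missing(age) & !inrange(age, 0, 120)
list id age if suspect
```

Running checks like these before fitting a model keeps coded missing values and obvious entry errors out of the estimates.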

I recently recorded a series of data management videos for the Stata YouTube channel. You can click on the links below to watch the videos. I included topics that I think are important, but the list is far from exhaustive. If you would like to see videos on additional topics, please leave your suggestions in the comments below.

Data management playlist

You can learn more about these topics and many others in the Data Management Reference Manual.

Importing WRDS data into Stata

Wharton Research Data Services (WRDS) is a leading research platform and business intelligence tool for 400+ corporate, academic, and government researchers. If your institution subscribes to WRDS, you can now easily access WRDS data remotely via Stata’s odbc command. Read more…

Importing data with import fred

Introduction

Federal Reserve Economic Data (FRED), a database maintained by the Federal Reserve Bank of St. Louis, makes available hundreds of thousands of time series measuring economic and social outcomes. The new Stata 15 command import fred imports data from this repository.

In this post, I show how to use import fred to import data from FRED. I also discuss some of the metadata that import fred provides that can be useful in data management. I then demonstrate how to use an advanced feature: importing multiple revisions of series whose observations are updated over time. Read more…

Categories: Data Management

Importing Twitter data into Stata

In the past, we’ve had users ask if Stata could import Twitter data. So we asked one of our interns, Dawson Deere (currently working on his computer science degree at Texas A&M University), to see if he could write a new command to do this. He used Stata 15’s improved Java plugins feature to write a new twitter2stata command. To install twitter2stata, type

ssc install twitter2stata, replace

Read more…

Handling gaps in time series using business calendars

Time-series data, such as financial data, often have known gaps because there are no observations on days such as weekends or holidays. Using regular Stata datetime formats with time-series data that have gaps can result in misleading analysis. Rather than treating these gaps as missing values, we should adjust our calculations appropriately. I illustrate a convenient way to work with irregularly spaced dates by using Stata’s business calendars.
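As a rough sketch of the idea, the example below builds a business calendar from the dates present in a small made-up price series and then takes a lag across a weekend. The data, the calendar name toycal, and the variable names are hypothetical; note that bcal create writes a toycal.stbcal file to the current working directory.

```stata
* Toy daily closing prices (hypothetical); 9-10 Jan 2021 is a weekend
clear
input str10 sdate double close
"2021-01-07" 100
"2021-01-08" 102
"2021-01-11" 101
"2021-01-12" 104
end
generate date = date(sdate, "YMD")
format date %td

* Build a business calendar from the dates present in the data
* (writes toycal.stbcal to the current working directory)
bcal create toycal, from(date) generate(bdate) replace

* With business dates, Friday and the following Monday are one period apart
tsset bdate
generate change = close - L.close
list sdate bdate close change
```

Had the data been tsset on the regular %td variable instead, the lag for Monday would refer to Sunday and the change would be missing.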

In nasdaq.dta, I have daily data on Read more…

A tour of datetime in Stata

Converting a string date

Stata has a wide array of tools to work with dates. You can have dates in years, months, or even milliseconds. In this post, I will provide a brief tour of working with dates that will help you get started using all of Stata’s tools.
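To give a flavor of what such a conversion looks like, here is a small sketch on made-up strings. The variable names and the month/day/year layout are assumptions for illustration.

```stata
* Toy string dates (hypothetical) in month/day/year order
clear
input str10 strdate
"01/15/2011"
"03/02/2012"
end

* Convert the strings to a numeric daily date and give it a display format
generate daily = date(strdate, "MDY")
format daily %td

* Derive a monthly date from the daily date
generate monthly = mofd(daily)
format monthly %tm

list
```

The stored values are plain numbers; the display format only controls how they are shown.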

When you load a dataset, you will notice that every variable has a display format. For date variables, the display format is %td for daily dates, %tm for monthly dates, etc. Let’s load the wpi1 dataset as Read more…

Using import excel with real world data

Stata 12’s new import excel command can help you easily import real-world Excel files into Stata. Excel files often contain header and footer information in the first few and last few rows of a sheet, and you may not want that information loaded. Also, the column labels used in the sheet are invalid Stata variable names and therefore cannot be loaded. Both of these issues can be easily solved using import excel. Read more…
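The sketch below illustrates the general approach with a small round trip: it writes a few rows of the auto dataset to a sheet whose data block starts at cell A3, then reads back only that block. The filename toy.xlsx and the cell range are assumptions for illustration, and the file is written to the current working directory.

```stata
* Write a small sheet whose data block starts at A3; rows 1-2 are left
* empty here to stand in for header text in a real-world file
sysuse auto, clear
export excel make price mpg using toy.xlsx in 1/5, cell(A3) firstrow(variables) replace

* Read back only the data block; the first row of the range (row 3)
* supplies the variable names
import excel using toy.xlsx, cellrange(A3:C8) firstrow clear
describe
```

Restricting the cell range skips the rows above the data, and firstrow turns the column labels into variable names.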

The next leap second will be on June 30th, maybe

Leap seconds are the extra seconds inserted every so often to keep precise atomic clocks better synchronized with the rotation of the Earth. Scheduled for June 30th is the extra second 23:59:60 inserted between 23:59:59 and 00:00:00. Or maybe not.

Tomorrow or Friday a vote may be held at the International Telecommunication Union (ITU) meeting in Geneva to abolish the leap second from the definition of UTC (Coordinated Universal Time). That would mean StataCorp would not have to post an update to Stata to keep the %tC format working correctly. Read more…
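For readers unfamiliar with the two datetime encodings, the sketch below compares them across an earlier leap second, the one inserted at the end of 31 December 2008. It assumes your copy of Stata has that leap second in its table, in which case the %tC gap should come out one second longer than the %tc gap.

```stata
* The same two instants, encoded without (%tc) and with (%tC) leap seconds
scalar gap_tc = clock("01jan2009 00:00:00", "DMYhms") - clock("31dec2008 23:59:59", "DMYhms")
scalar gap_TC = Clock("01jan2009 00:00:00", "DMYhms") - Clock("31dec2008 23:59:59", "DMYhms")

* Both gaps are in milliseconds
display "tc gap in ms: " gap_tc
display "tC gap in ms: " gap_TC
```

Because %tC values depend on the list of leap seconds known to Stata, a newly announced leap second requires an update for %tC to stay correct.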

Merging data, part 2: Multiple-key merges

Multiple-key merges arise when more than one variable is required to uniquely identify the observations in your data. In Merging data, part 1, I discussed single-key merges such as

        . merge 1:1 personid using ...

In that discussion, each observation in the dataset could be uniquely identified on the basis of a single variable. In panel or longitudinal datasets, there are multiple observations on each person or thing, and to uniquely identify the observations, we need at least two key variables, such as Read more…
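A two-key merge can be sketched as follows on made-up panel data. The variable names (personid, year, income, hours) are hypothetical, and the example is meant to be run as a do-file because it relies on a temporary file.

```stata
* First toy panel (hypothetical): income by person and year
clear
input personid year income
1 2010 100
1 2011 110
2 2010 200
2 2011 210
end
* Confirm that personid and year together uniquely identify observations
isid personid year
tempfile income
save `income'

* Second toy panel (hypothetical): hours by person and year
clear
input personid year hours
1 2010 40
1 2011 38
2 2010 35
2 2011 42
end
isid personid year

* Match on both keys
merge 1:1 personid year using `income'
tabulate _merge
```

Checking the keys with isid before merging makes a violation of the 1:1 assumption fail early rather than surface later in the results.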

Merging data, part 1: Merges gone bad

Merging concerns combining datasets on the same observations to produce a result with more variables. We will call the datasets one.dta and two.dta.

When it comes to combining datasets, the alternative to merging is appending, which is combining datasets on the same variables to produce a result with more observations. Appending datasets is not the subject for today. But just to fix ideas, appending looks like this: Read more…
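The distinction can be sketched in code with toy data. The variable names and values below are hypothetical, temporary files stand in for one.dta and two.dta, and the example is meant to be run as a do-file.

```stata
* Toy dataset standing in for one.dta
clear
input personid age
1 30
2 40
end
tempfile one
save `one'

* Merging: same observations, more variables
clear
input personid income
1 100
2 200
end
merge 1:1 personid using `one'
list

* Toy dataset standing in for two.dta, with different people
clear
input personid age
3 50
4 60
end
tempfile two
save `two'

* Appending: same variables, more observations
use `one', clear
append using `two'
list
```

After the merge there are still two observations but an extra variable; after the append there are four observations on the same two variables.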
