Web scraping NFL data into Stata
The nfl2stata command no longer works due to website changes.
Football season is around the corner, and I could not be more excited. We have a pretty competitive StataCorp fantasy football league. I’m always looking for an edge in our league, so I challenged one of our interns, Chris Hassell, to write a command to web scrape http://www.nfl.com for data on the NFL. The new command is nfl2stata. To install the command, type
net install http://www.stata.com/users/kcrow/nfl2stata, replace
With this new command, you can easily find the running backs who had the most touchdowns last season,
. nfl2stata player "running back", season(2017) clear
177 observation(s) loaded

. gsort -touchdowns -yards

. list name team touchdowns in 1/10

     +-------------------------------------+
     | name                team   touchd~s |
     |-------------------------------------|
  1. | Todd Gurley         LA           13 |
  2. | Mark Ingram         NO           12 |
  3. | Le'Veon Bell        PIT           9 |
  4. | Jordan Howard       CHI           9 |
  5. | Leonard Fournette   JAX           9 |
     |-------------------------------------|
  6. | Kareem Hunt         KC            8 |
  7. | Melvin Gordon       LAC           8 |
  8. | Carlos Hyde         SF            8 |
  9. | Latavius Murray     MIN           8 |
 10. | Alvin Kamara        NO            8 |
     +-------------------------------------+
You can find the top-5 field goal kickers (by field goals made) from last season.
. nfl2stata player "field goal kicker", season(2017) clear
54 observation(s) loaded

. list name team fieldgoalsmade in 1/5

     +--------------------------------------+
     | name                 team   f~lsmade |
     |--------------------------------------|
  1. | Robbie Gould         SF           39 |
  2. | Greg Zuerlein        LA           38 |
  3. | Harrison Butker      KC           38 |
  4. | Stephen Gostkowski   NE           37 |
  5. | Ryan Succop          TEN          35 |
     +--------------------------------------+
You can generate a graph of the top passing leaders from last regular season.
. nfl2stata player quarterback, season(2017) seasontype(reg) clear
71 observation(s) loaded

. graph bar (asis) yards if yards >= 4000, exclude0 ///
        over(name, sort(yards) descending label(angle(forty_five) labsize(small))) ///
        blabel(bar) title(2017 Passing Yard Leaders)
There is a lot of interesting data to pore through, especially if you’re interested in fantasy football, as I am. Though this seems like a simple command, it actually is not, because of the time it takes to fetch, parse, and load the data from http://www.nfl.com via web scraping.
Web scraping
You may have heard of the term “web scraping”. A simple definition of web scraping is extracting data from websites. Most of the time, a website’s copyright prevents people from distributing data obtained from scraping the website, but you can keep a personal copy of the data on your own computer. This is what the NFL’s copyright states. Because of this, users must scrape the website themselves. To do this for the NFL data, you type
nfl2stata scrape, season(_all)
This command will scrape all data from 2009 to the current year and save the data as Stata datasets to your local computer along your Stata adopath. Specifically, it will save them in your PLUS directory where subsequent nfl2stata commands will be able to find them. The first year of NFL data stored on http://www.nfl.com is 2009. Currently, there are no data to scrape before this. Web scraping is an expensive and time-consuming process. Depending on several factors (computer speed, computer memory, network connection, etc.), this initial data scrape can take hours to complete. You might want to run the above command overnight. Once you have scraped the historical data, you can just type
nfl2stata scrape
This updates your locally stored datasets with the current week’s data and runs faster than the initial scrape.
As of the writing of this blog, the scraping command works. If the NFL changes the HTML page format, however, the command will break; if this happens, we will fix it if we can. Also, the scraped data will change over time as the NFL updates previous data on its site, so sometimes the data you scraped a few weeks ago will not match what you see on the ESPN or NFL website. In addition, the same data can exist in more than one place and be inconsistent when one site gets updated stats and another does not. You can rescrape the data by using nfl2stata scrape, season(_all) replace to create new clean datasets. These problems are what make web scraping a volatile process.
Command
The command nfl2stata scrape produces game, game summary, play-by-play, player, player profile, roster, and team Stata datasets for each year. To load those data into Stata, you must use the following commands:
- To load game-by-game data into Stata, use
nfl2stata game "position" [, game_options]
- To load game summary data into Stata, use
nfl2stata gamesummary [, game_summary_options]
- To load play-by-play data into Stata, use
nfl2stata playbyplay [, playbyplay_options]
- To load player-specific data into Stata, use
nfl2stata player "position" [, player_options]
- To load player profile data into Stata, use
nfl2stata profile [, profile_options]
- To load team roster data into Stata, use
nfl2stata roster [, roster_options]
- To load team game-by-game data into Stata, use
nfl2stata team [, team_options]
These commands each search their respective datasets. Often you will need to use Stata commands like collapse, gsort, and merge to generate the statistics, sort the data, and merge two or more NFL datasets together to examine the data. Let’s look at a few more examples.
Examples
I have found that the two Stata commands I use most frequently with these data are gsort, which sorts data in ascending or descending order, and collapse, which makes a dataset of summary statistics. collapse is especially useful when working with multiple games’ or multiple seasons’ data. For example, to find out which wide receiver led the NFL in receiving last year, you would type
. nfl2stata game "wide receiver", season(2017) seasontype(reg) clear
2764 observation(s) loaded

. collapse (sum) receivingyards, by(name)

. gsort -receivingyards

. list in 1/5

     +----------------------------+
     | name              receiv~s |
     |----------------------------|
  1. | Antonio Brown         1533 |
  2. | Julio Jones           1444 |
  3. | Keenan Allen          1393 |
  4. | DeAndre Hopkins       1378 |
  5. | Adam Thielen          1276 |
     +----------------------------+
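If you want to try the collapse-then-gsort pattern without any scraped data, here is a minimal sketch on a toy dataset you can type in yourself. The player names and yardage are invented for illustration; only the variable names name and receivingyards mirror the example above.

```stata
* Toy per-game receiving data (invented values, not NFL data)
clear
input str10 name byte week int receivingyards
"Player A" 1 100
"Player A" 2  80
"Player B" 1 120
"Player B" 2  30
"Player C" 1  95
end

* Reduce to one row per player holding the season total
collapse (sum) receivingyards, by(name)

* Sort so the largest totals come first, then list
gsort -receivingyards
list, noobs
```

After collapse, the dataset has one observation per player (totals of 180, 150, and 95 here), which is why gsort can then rank players rather than individual games.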
Sometimes, you need to merge two or more NFL datasets to answer a question about the data. For example, to find the average weight of an NFL running back over the last nine years, you must merge the roster data and the profile data to get the player position and player weight variables together in the same dataset. To do this, type
. nfl2stata roster, clear
18299 observation(s) loaded

. duplicates drop playerid, force

Duplicates in terms of playerid

(13,964 observations deleted)

. drop team teamname seasontype

. save temp_roster.dta, replace
file temp_roster.dta saved

. nfl2stata profile, clear
4335 observation(s) loaded

. merge 1:1 playerid using temp_roster.dta

    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                             4,335  (_merge==3)
    -----------------------------------------

. sum weight if position == "RB"

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      weight |        384    215.9036    14.20637        173        269
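The same deduplicate-then-merge steps can be sketched on toy data. Everything below is an assumption made for illustration: the player IDs, positions, and weights are invented, and the season variable simply stands in for whatever makes a player appear more than once in the roster data. The sketch also uses a tempfile instead of temp_roster.dta so that nothing is left behind in the working directory.

```stata
* Toy roster: one row per player per season, so playerid repeats
clear
input int playerid str2 position int season
1 "RB" 2016
1 "RB" 2017
2 "WR" 2017
3 "RB" 2017
end

* Keep the first row for each player, then drop the per-season column
duplicates drop playerid, force
drop season
tempfile roster
save "`roster'"

* Toy profile: exactly one row per player
clear
input int playerid int weight
1 215
2 198
3 224
end

* Both sides are now unique on playerid, so 1:1 is the right merge type
merge 1:1 playerid using "`roster'"

* Stop with an error if any player failed to match
assert _merge == 3

summarize weight if position == "RB"
```

Run this in one go as a do-file so the local macro behind the tempfile stays defined. Note that duplicates drop with the force option keeps only the first observation for each playerid, so a player whose position changed between seasons would keep whichever position happens to come first.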
To find who led the NFL in receiving or rushing, you need to append all offensive player data into one dataset. For example, to list the receiving leaders, type
. nfl2stata game "quarterback", season(2017) seasontype(reg) clear
1042 observation(s) loaded

. tempfile tmp

. qui save "`tmp'", replace

. nfl2stata game "running back", season(2017) seasontype(reg) clear
2018 observation(s) loaded

. qui append using "`tmp'"

. qui save "`tmp'", replace

. nfl2stata game "wide receiver", season(2017) seasontype(reg) clear
2764 observation(s) loaded

. qui append using "`tmp'"

. qui save "`tmp'", replace

. nfl2stata game "tight end", season(2017) seasontype(reg) clear
1554 observation(s) loaded

. qui append using "`tmp'"

. collapse (sum) receivingyards, by(name position)

. gsort -receivingyards

. list name position receivingyards in 1/30

     +-------------------------------------------+
     | name                  position   receiv~s |
     |-------------------------------------------|
  1. | Antonio Brown         WR             1533 |
  2. | Julio Jones           WR             1444 |
  3. | Keenan Allen          WR             1393 |
  4. | DeAndre Hopkins       WR             1378 |
  5. | Adam Thielen          WR             1276 |
     |-------------------------------------------|
  6. | Michael Thomas        WR             1245 |
  7. | Tyreek Hill           WR             1183 |
  8. | Larry Fitzgerald      WR             1156 |
  9. | Marvin Jones          WR             1101 |
 10. | Rob Gronkowski        TE             1084 |
     |-------------------------------------------|
 11. | Brandin Cooks         WR             1082 |
 12. | A.J. Green            WR             1078 |
 13. | Travis Kelce          TE             1038 |
 14. | Golden Tate           WR             1003 |
 15. | Mike Evans            WR             1001 |
     |-------------------------------------------|
 16. | Doug Baldwin          WR              991 |
 17. | Jarvis Landry         WR              987 |
 18. | T.Y. Hilton           WR              966 |
 19. | Marquise Goodwin      WR              962 |
 20. | Demaryius Thomas      WR              949 |
     |-------------------------------------------|
 21. | Robby Anderson        WR              941 |
 22. | JuJu Smith-Schuster   WR              917 |
 23. | Davante Adams         WR              885 |
 24. | Cooper Kupp           WR              869 |
 25. | Stefon Diggs          WR              849 |
     |-------------------------------------------|
 26. | Kenny Stills          WR              847 |
 27. | Devin Funchess        WR              840 |
 28. | Dez Bryant            WR              838 |
 29. | Alvin Kamara          RB              826 |
 30. | Zach Ertz             TE              824 |
     +-------------------------------------------+
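The repeated load, append, and save steps can also be written as a loop. The sketch below is only an illustration of that pattern: it generates invented rows for each position in place of a real nfl2stata game call, and the position codes, player names, and yardage are all made up.

```stata
* Start with an empty tempfile to append onto
clear
tempfile stacked
save "`stacked'", emptyok

foreach pos in QB RB WR TE {
    * Stand-in for loading one position's game data:
    * two invented players with two games each
    clear
    set obs 4
    generate str2 position = "`pos'"
    generate str12 name = "`pos' player " + string(ceil(_n/2))
    generate int receivingyards = 50 * _n

    * Stack what has been collected so far under the new rows
    append using "`stacked'"
    save "`stacked'", replace
}

* Same final steps as in the example above
collapse (sum) receivingyards, by(name position)
gsort -receivingyards
list name position receivingyards, noobs
```

Saving an empty dataset first with the emptyok option means every pass through the loop can append and save in the same way, with no special case for the first position. Run it as a single do-file so the tempfile's local macro stays defined.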
Implementation
Chris used Stata’s Java plugins to write the majority of the command. The other Java libraries he used to write the command are
There are a lot of Java libraries out there for web scraping data. These are just the ones we used.