Web scraping NFL data into Stata

Football season is around the corner, and I could not be more excited. We have a pretty competitive StataCorp fantasy football league. I’m always looking for an edge in our league, so I challenged one of our interns, Chris Hassell, to write a command to web scrape http://www.nfl.com for data on the NFL. The new command is nfl2stata. To install the command, type

net install http://www.stata.com/users/kcrow/nfl2stata, replace

With this new command, you can easily find the running backs who had the most touchdowns last season.

. nfl2stata player "running back", season(2017) clear
177 observation(s) loaded

. gsort -touchdowns -yards

. list name team touchdowns in 1/10

     +-------------------------------------+
     |              name   team   touchd~s |
     |-------------------------------------|
  1. |       Todd Gurley     LA         13 |
  2. |       Mark Ingram     NO         12 |
  3. |      Le'Veon Bell    PIT          9 |
  4. |     Jordan Howard    CHI          9 |
  5. | Leonard Fournette    JAX          9 |
     |-------------------------------------|
  6. |       Kareem Hunt     KC          8 |
  7. |     Melvin Gordon    LAC          8 |
  8. |       Carlos Hyde     SF          8 |
  9. |   Latavius Murray    MIN          8 |
 10. |      Alvin Kamara     NO          8 |
     +-------------------------------------+

You can find the top-5 field goal kickers (by field goals made) from last season.

. nfl2stata player "field goal kicker", season(2017) clear
54 observation(s) loaded

. list name team fieldgoalsmade in 1/5

     +--------------------------------------+
     |               name   team   f~lsmade |
     |--------------------------------------|
  1. |       Robbie Gould     SF         39 |
  2. |      Greg Zuerlein     LA         38 |
  3. |    Harrison Butker     KC         38 |
  4. | Stephen Gostkowski     NE         37 |
  5. |        Ryan Succop    TEN         35 |
     +--------------------------------------+

You can generate a graph of the top passing leaders from last regular season.

. nfl2stata player quarterback, season(2017) seasontype(reg) clear
71 observation(s) loaded

. graph bar (asis) yards if yards >= 4000, exclude0                        ///
over(name, sort(yards) descending label(angle(forty_five) labsize(small))) ///
blabel(bar) title(2017 Passing Yard Leaders)

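[Figure: bar graph titled "2017 Passing Yard Leaders", showing passing yards for quarterbacks with at least 4,000 yards]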

There is a lot of interesting data to pore over, especially if you’re interested in fantasy football, as I am. Although nfl2stata looks like a simple command, it is not, because it takes time to fetch, parse, and load the data from http://www.nfl.com via web scraping.

Web scraping

You may have heard of the term “web scraping”. A simple definition of web scraping is extracting data from websites. Most of the time, a website’s copyright prevents people from distributing data obtained from scraping the website, but you can keep a personal copy of the data on your own computer. This is what the NFL’s copyright states. Because of this, users must scrape the website themselves. To do this for the NFL data, you type

        nfl2stata scrape, season(_all)

This command will scrape all data from 2009 to the current year and save the data as Stata datasets to your local computer along your Stata adopath. Specifically, it will save them in your PLUS directory where subsequent nfl2stata commands will be able to find them. The first year of NFL data stored on http://www.nfl.com is 2009. Currently, there are no data to scrape before this. Web scraping is an expensive and time-consuming process. Depending on several factors (computer speed, computer memory, network connection, etc.), this initial data scrape can take hours to complete. You might want to run the above command overnight. Once you have scraped the historical data, you can just type

        nfl2stata scrape

to update your locally stored datasets with the current week’s data. This update runs faster than the initial scrape.

As of this writing, the scraping command works. If the NFL changes the HTML page format, however, the command will break; if that happens, we will fix it if we can. Also, the data that are scraped will change over time as the NFL updates previous data on its site, so the data you scraped a few weeks ago will sometimes not match what you see on the ESPN or NFL website. In addition, the same data can exist in more than one place and can be inconsistent when one site gets updated stats and another does not. You can rescrape the data by using nfl2stata scrape, season(_all) replace to create new clean datasets. These problems are what make web scraping a volatile process.

Command

The command nfl2stata scrape produces game, game summary, play-by-play, player, player profile, roster, and team Stata datasets for each year. To load those data into Stata, you must use the following commands:

  • To load game-by-game data into Stata, use
            nfl2stata game "position" [, game_summary_options]
    
  • To load game summary data into Stata, use
            nfl2stata gamesummary [, game_summary_options]
    
  • To load play-by-play data into Stata, use
            nfl2stata playbyplay [, playbyplay_options]
    
  • To load player-specific data into Stata, use
            nfl2stata player "position" [, player_options]
    
  • To load player profile data into Stata, use
            nfl2stata profile [, profile_options]
    
  • To load team roster data into Stata, use
            nfl2stata roster [, roster_options]
    
  • To load team game-by-game data into Stata, use
            nfl2stata team [, team_options]
    

These commands each search their respective datasets. Often you will need to use Stata commands like collapse, gsort, and merge to generate the statistics, sort the data, and merge two or more NFL datasets together to examine the data. Let’s look at a few more examples.

Examples

I have found that the two Stata commands I use most frequently with these data are gsort, which sorts data in ascending or descending order, and collapse, which makes a dataset of summary statistics. collapse is especially useful when working with multiple games’ or multiple seasons’ data. For example, to find out which wide receiver led the NFL in receiving last year, you would type

. nfl2stata game "wide receiver", season(2017) seasontype(reg) clear
2764 observation(s) loaded

. collapse (sum) receivingyards, by(name)

. gsort -receivingyards

. list in 1/5

     +----------------------------+
     |            name   receiv~s |
     |----------------------------|
  1. |   Antonio Brown       1533 |
  2. |     Julio Jones       1444 |
  3. |    Keenan Allen       1393 |
  4. | DeAndre Hopkins       1378 |
  5. |    Adam Thielen       1276 |
     +----------------------------+
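
If you have not scraped any data yet, you can still see what collapse and gsort do by running them on a toy dataset. The sketch below is meant to be run as a do-file; the player names and yardage figures are made up for illustration and are not NFL data.

```stata
* Toy per-game data: one row per player per week (all values are made up)
clear
input str10 name byte week int receivingyards
"Player A" 1 80
"Player A" 2 120
"Player B" 1 150
"Player B" 2 30
"Player C" 1 95
end

* Collapse to one row per player, summing yards across games
collapse (sum) receivingyards, by(name)

* Sort in descending order of total yards and list the result
gsort -receivingyards
list, noobs
```

After collapse, the dataset holds one observation per player, which is why the real example above drops from 2764 game-level observations to a short leaderboard.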

Sometimes, you need to merge two or more NFL datasets to answer a question about the data. For example, to find the average weight of an NFL running back over the last nine years, you must merge the roster data and the profile data to get the player position and player weight variables together in the same dataset. To do this, type

. nfl2stata roster, clear
18299 observation(s) loaded

. duplicates drop playerid, force

Duplicates in terms of playerid

(13,964 observations deleted)

. drop team teamname seasontype

. save temp_roster.dta, replace
file temp_roster.dta saved

. nfl2stata profile, clear
4335 observation(s) loaded

. merge 1:1 playerid using temp_roster.dta

    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                             4,335  (_merge==3)
    -----------------------------------------

. sum weight if position == "RB"

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      weight |        384    215.9036    14.20637        173        269
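
The same merge pattern can be tried on toy data. The sketch below, written as a do-file, uses made-up player IDs, positions, and weights; only the variable names playerid, position, and weight come from the example above. It also adds merge's assert(match) option, which stops with an error if any observation fails to match, so you would notice immediately if the two datasets were out of sync.

```stata
* Toy roster data: one row per player with a position (values are made up)
clear
input int playerid str2 position
1 "RB"
2 "WR"
3 "RB"
end
tempfile roster
save `roster'

* Toy profile data: one row per player with a weight (values are made up)
clear
input int playerid int weight
1 210
2 195
3 220
end

* One-to-one merge on playerid; error out unless every row matches
merge 1:1 playerid using `roster', assert(match) nogenerate

* Average weight of the toy running backs
summarize weight if position == "RB"
```

Note that in the real example, duplicates drop playerid, force keeps only the first roster observation for each player, so a player listed at more than one position over the years is counted under whichever position appears first.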

To find who led the NFL in receiving or rushing, you need to append all offensive player data into one dataset. For example, to list the receiving leaders, type

. nfl2stata game "quarterback", season(2017) seasontype(reg) clear
1042 observation(s) loaded

. tempfile tmp 

. qui save "`tmp'", replace

. nfl2stata game "running back", season(2017) seasontype(reg) clear
2018 observation(s) loaded

. qui append using "`tmp'"

. qui save "`tmp'", replace

. nfl2stata game "wide receiver", season(2017) seasontype(reg) clear
2764 observation(s) loaded

. qui append using "`tmp'"

. qui save "`tmp'", replace

. nfl2stata game "tight end", season(2017) seasontype(reg) clear
1554 observation(s) loaded

. qui append using "`tmp'"

. collapse (sum) receivingyards, by(name position)

. gsort -receivingyards

. list name position receivingyards in 1/30

     +-------------------------------------------+
     |                name   position   receiv~s |
     |-------------------------------------------|
  1. |       Antonio Brown         WR       1533 |
  2. |         Julio Jones         WR       1444 |
  3. |        Keenan Allen         WR       1393 |
  4. |     DeAndre Hopkins         WR       1378 |
  5. |        Adam Thielen         WR       1276 |
     |-------------------------------------------|
  6. |      Michael Thomas         WR       1245 |
  7. |         Tyreek Hill         WR       1183 |
  8. |    Larry Fitzgerald         WR       1156 |
  9. |        Marvin Jones         WR       1101 |
 10. |      Rob Gronkowski         TE       1084 |
     |-------------------------------------------|
 11. |       Brandin Cooks         WR       1082 |
 12. |          A.J. Green         WR       1078 |
 13. |        Travis Kelce         TE       1038 |
 14. |         Golden Tate         WR       1003 |
 15. |          Mike Evans         WR       1001 |
     |-------------------------------------------|
 16. |        Doug Baldwin         WR        991 |
 17. |       Jarvis Landry         WR        987 |
 18. |         T.Y. Hilton         WR        966 |
 19. |    Marquise Goodwin         WR        962 |
 20. |    Demaryius Thomas         WR        949 |
     |-------------------------------------------|
 21. |      Robby Anderson         WR        941 |
 22. | JuJu Smith-Schuster         WR        917 |
 23. |       Davante Adams         WR        885 |
 24. |         Cooper Kupp         WR        869 |
 25. |        Stefon Diggs         WR        849 |
     |-------------------------------------------|
 26. |        Kenny Stills         WR        847 |
 27. |      Devin Funchess         WR        840 |
 28. |          Dez Bryant         WR        838 |
 29. |        Alvin Kamara         RB        826 |
 30. |           Zach Ertz         TE        824 |
     +-------------------------------------------+
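
The repeated load, append, and save steps above can also be written as a loop. The following is a sketch rather than part of nfl2stata itself: it assumes the 2017 data have already been scraped, it reuses only the nfl2stata game syntax and position names shown above, and the tempfile name combined is arbitrary. Run it as a do-file.

```stata
* Start from an empty dataset so the first append has something to add to
clear
tempfile combined
save `combined', emptyok

* Load each offensive position in turn and stack it onto the running file
foreach pos in "quarterback" "running back" "wide receiver" "tight end" {
    nfl2stata game "`pos'", season(2017) seasontype(reg) clear
    append using `combined'
    save `combined', replace
}

* Same summary as above: total receiving yards per player
collapse (sum) receivingyards, by(name position)
gsort -receivingyards
list name position receivingyards in 1/10
```

Adding or removing a position then means editing one list instead of copying three commands.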

Implementation

Chris used Stata’s Java plugins to write the majority of the command. The other Java libraries he used to write the command are

There are a lot of Java libraries out there for web scraping data. These are just the ones we used.
