Web scraping NBA data into Stata

Since our intern, Chris Hassell, finished nfl2stata earlier than expected, he went ahead and created another command, nba2stata, which scrapes https://stats.nba.com for data on the NBA. To install the command, type

net install http://www.stata.com/users/kcrow/nba2stata, replace

When Chris first wrote the command, I knew I wanted to look at how the three-point shot has changed the way the game is played. For example, I can find the players who made the most three-pointers last season.

. nba2stata playerstats _all, season(2017) seasontype(reg) stat(season) clear
Processing x/543 requests
.........x.........x.........x.........x.........50
.........x.........x.........x.........x.........100
.........x.........x.........x.........x.........150
.........x.........x.........x.........x.........200
.........x.........x.........x.........x.........250
.........x.........x.........x.........x.........300
.........x.........x.........x.........x.........350
.........x.........x.........x.........x.........400
.........x.........x.........x.........x.........450
.........x.........x.........x.........x.........500
.........x.........x.........x.........x...
660 observation(s) loaded

. gsort -threepointfieldgoalsmade

. list playername teamname threepointfieldgoalsmade in 1/10

     +----------------------------------------------------+
     |      playername                teamname   three~de |
     |----------------------------------------------------|
  1. |    James Harden         Houston Rockets        265 |
  2. |     Paul George   Oklahoma City Thunder        244 |
  3. |      Kyle Lowry         Toronto Raptors        238 |
  4. |    Kemba Walker       Charlotte Hornets        231 |
  5. |   Klay Thompson   Golden State Warriors        229 |
     |----------------------------------------------------|
  6. | Wayne Ellington              Miami Heat        227 |
  7. |  Damian Lillard   Portland Trailblazers        227 |
  8. |     Eric Gordon         Houston Rockets        218 |
  9. |   Stephen Curry   Golden State Warriors        212 |
 10. |      Joe Ingles               Utah Jazz        204 |
     +----------------------------------------------------+

Or I can check a player’s regular-season three-point percentage for the last five years.

. nba2stata playerstat "Dirk", stat(season) seasontype(reg) clear
27 observation(s) loaded

. gsort -playerage 

. list playername playerage threepointfieldgoalpercentage in 1/5

     +-------------------------------------+
     |    playername   playe~ge   three~ge |
     |-------------------------------------|
  1. | Dirk Nowitzki         40       .409 |
  2. | Dirk Nowitzki         39       .378 |
  3. | Dirk Nowitzki         38       .368 |
  4. | Dirk Nowitzki         37        .38 |
  5. | Dirk Nowitzki         36       .398 |
     +-------------------------------------+

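Or I can see how a team’s three-point percentage relates to its chance of winning. Here I keep only the Houston Rockets games in which the team shot better than 35% from three-point range.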

. nba2stata teamstats "HOU", season(2017) stat(game) seasontype(reg) clear
82 observation(s) loaded

. keep if threepointfieldgoalpercentage > .35
(35 observations deleted)

. tab winloss

 Win / loss |      Freq.     Percent        Cum.
------------+-----------------------------------
          L |          4        8.51        8.51
          W |         43       91.49      100.00
------------+-----------------------------------
      Total |         47      100.00
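
The tabulation above covers only the games that cleared the 35% mark, so it has no comparison group. A minimal sketch of one way to compare both groups in a single table follows, assuming the same variables that nba2stata returned above; the indicator name above35 is made up for this example and is not part of the command.

    * Reload all 82 games, because the keep command above dropped 35 of them
    nba2stata teamstats "HOU", season(2017) stat(game) seasontype(reg) clear

    * above35 (hypothetical name) is 1 when the team shot better than 35% from three
    generate byte above35 = threepointfieldgoalpercentage > .35 if !missing(threepointfieldgoalpercentage)

    * Row percentages give the win rate within each group
    tabulate above35 winloss, row

The row where above35 equals 1 should match the frequencies in the table above, and the row where it equals 0 shows how the team fared on colder shooting nights.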

nba2stata is great if you plan to analyze pro basketball data. Although its syntax looks much like that of nfl2stata, the command works quite differently.

Web scraping JSON

In our last blog post, we talked about web scraping https://www.nfl.com and extracting the data from the HTML pages. The NBA data are different. You can access the data via JSON objects from https://stats.nba.com. JSON is a lightweight data format that is easy to parse, so there is no separate scrape command for these data; nba2stata scrapes and loads them on the fly.

The NBA’s copyright terms are similar to those of the NFL: you can use a personal copy of the data on your own personal computer. If you “use, display or publish” anything using these data, you must include “a prominent attribution to http://www.nba.com”. Another difference from the NFL data is that the NBA data stored on http://stats.nba.com can go as far back as the 1960s, depending on the team.

Command

nba2stata has only four subcommands. We could have developed more, but Chris had to go back to school.

  • To scrape player statistics data into Stata, use
    nba2stata playerstats name_pattern [, playerstats_options]
    
  • To scrape player profile data into Stata, use
    nba2stata playerprofile name_pattern [, playerprofile_options]
    
  • To scrape team statistics data into Stata, use
    nba2stata teamstats team_adv [, teamstats_options]
    
  • To scrape team roster data into Stata, use
    nba2stata teamroster team_adv [, teamroster_options]
    

Just like with nfl2stata, you will need to use Stata commands like collapse, gsort, and merge to generate the statistics, sort the data, and merge two or more NBA datasets together to examine the data.
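
None of the examples in this post use merge, so here is a minimal sketch of the merge, collapse, and gsort pattern. The player names and numbers are made up for illustration, and the sketch assumes that both datasets share a playername variable to merge on; check the variables that nba2stata actually returns before relying on this.

    * Toy profile data (made-up values, not real NBA data)
    clear
    input str20 playername str12 school
    "Player A" "Kentucky"
    "Player B" "Duke"
    "Player C" "Kentucky"
    end
    tempfile profile
    save `profile'

    * Toy season statistics for the same made-up players
    clear
    input str20 playername threepointfieldgoalsmade
    "Player A" 120
    "Player B" 95
    "Player C" 60
    end

    * Attach each player's school, then total the three-pointers by school
    merge 1:1 playername using `profile', nogenerate
    collapse (sum) threepointfieldgoalsmade, by(school)
    gsort -threepointfieldgoalsmade
    list

With these toy values, Kentucky comes out on top with 180 made three-pointers, followed by Duke with 95.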

Examples

One thing I’m always curious about is which college teams produce the most NBA players. This is easy to find out using nba2stata, collapse, and gsort.

. nba2stata playerprofile "_all", clear
Processing x/4308 requests
.........x.........x.........x.........x.........50
.........x.........x.........x.........x.........100
.........x.........x.........x.........x.........150
<output omitted>
.........x.........x.........x.........x.........4250
.........x.........x.........x.........x.........4300
........
4308 observation(s) loaded

. save playerprofile, replace
(note: file playerprofile.dta not found)
file playerprofile.dta saved

. drop if school == ""
(114 observations deleted)

. gen ct = 1

. collapse (count) ct, by(school)

. gsort -ct

. list in 1/10

     +---------------------+
     |         school   ct |
     |---------------------|
  1. |       Kentucky   97 |
  2. |           UCLA   86 |
  3. | North Carolina   80 |
  4. |           Duke   70 |
  5. |         Kansas   69 |
     |---------------------|
  6. |        Indiana   57 |
  7. |     Notre Dame   55 |
  8. |     Louisville   53 |
  9. |        Arizona   51 |
 10. |       Syracuse   50 |
     +---------------------+

Fetching the player profile data takes some time (about an hour on my machine), so you might want to save the dataset once it has downloaded. The time depends largely on the amount of data that must be fetched. In the above case, it’s all the player profile data from the NBA.
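
One way to avoid repeating the long download is to fetch the data only when the saved file is missing. This is a sketch, not part of nba2stata itself, and it assumes that playerprofile.dta sits in the current working directory, as in the save command above.

    * Download only if the saved copy is not already on disk
    capture confirm file playerprofile.dta
    if _rc {
        nba2stata playerprofile "_all", clear
        save playerprofile, replace
    }
    else {
        use playerprofile, clear
    }

Because the saved copy is never refreshed automatically, delete playerprofile.dta whenever you want to pull newer data.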

Another interesting example would be to find the oldest and youngest teams in the NBA. You can use the team roster to do this.

. nba2stata teamroster _all, season(2017) clear
Processing x/30 requests
.........x.........x.........x
502 observation(s) loaded

. collapse (mean) age, by(teamname)

. sort age

. list teamname age in 1/5

     +---------------------------------+
     |              teamname       age |
     |---------------------------------|
  1. |          Phoenix Suns   24.4706 |
  2. | Portland Trailblazers   24.8125 |
  3. |         Chicago Bulls   24.8889 |
  4. |         Atlanta Hawks   25.2222 |
  5. |         Brooklyn Nets   25.3529 |
     +---------------------------------+

. list teamname age in -5/l

     +---------------------------------+
     |              teamname       age |
     |---------------------------------|
 26. |    Washington Wizards     27.75 |
 27. |     San Antonio Spurs   28.3529 |
 28. | Golden State Warriors   28.6667 |
 29. |   Cleveland Cavaliers        29 |
 30. |       Houston Rockets   29.1765 |
     +---------------------------------+

Implementation

Again, Chris used Stata’s Java plugins and Gson to write the majority of the command.
