Web scraping NBA data into Stata
As of November 2019, this command no longer works because of https://stats.nba.com restrictions.
Since our intern, Chris Hassell, finished nfl2stata earlier than expected, he went ahead and created another command to scrape https://stats.nba.com for data on the NBA. The command is nba2stata. To install the command, type
net install http://www.stata.com/users/kcrow/nba2stata, replace
When Chris first wrote the command, I knew I wanted to look at how the three-point shot has changed the way the game is played. For example, I can find the best three-point shooter from last season.
. nba2stata playerstats _all, season(2017) seasontype(reg) stat(season) clear
Processing x/543 requests
.........x.........x.........x.........x.........50
.........x.........x.........x.........x.........100
.........x.........x.........x.........x.........150
.........x.........x.........x.........x.........200
.........x.........x.........x.........x.........250
.........x.........x.........x.........x.........300
.........x.........x.........x.........x.........350
.........x.........x.........x.........x.........400
.........x.........x.........x.........x.........450
.........x.........x.........x.........x.........500
.........x.........x.........x.........x...
660 observation(s) loaded

. gsort -threepointfieldgoalsmade

. list playername teamname threepointfieldgoalsmade in 1/10

     +----------------------------------------------------+
     |      playername                teamname   three~de |
     |----------------------------------------------------|
  1. |    James Harden         Houston Rockets        265 |
  2. |     Paul George   Oklahoma City Thunder        244 |
  3. |      Kyle Lowry         Toronto Raptors        238 |
  4. |    Kemba Walker       Charlotte Hornets        231 |
  5. |   Klay Thompson   Golden State Warriors        229 |
     |----------------------------------------------------|
  6. | Wayne Ellington              Miami Heat        227 |
  7. |  Damian Lillard   Portland Trailblazers        227 |
  8. |     Eric Gordon         Houston Rockets        218 |
  9. |   Stephen Curry   Golden State Warriors        212 |
 10. |      Joe Ingles               Utah Jazz        204 |
     +----------------------------------------------------+
Or I can check a player’s regular-season three-point percentage for the last five years.
. nba2stata playerstat "Dirk", stat(season) seasontype(reg) clear
27 observation(s) loaded

. gsort -playerage

. list playername playerage threepointfieldgoalpercentage in 1/5

     +-------------------------------------+
     |    playername   playe~ge   three~ge |
     |-------------------------------------|
  1. | Dirk Nowitzki         40       .409 |
  2. | Dirk Nowitzki         39       .378 |
  3. | Dirk Nowitzki         38       .368 |
  4. | Dirk Nowitzki         37        .38 |
  5. | Dirk Nowitzki         36       .398 |
     +-------------------------------------+
Or I can see how three-point percentage affects your favorite team’s chance of winning.
. nba2stata teamstats "HOU", season(2017) stat(game) seasontype(reg) clear
82 observation(s) loaded

. keep if threepointfieldgoalpercentage > .35
(35 observations deleted)

. tab winloss

 Win / loss |      Freq.     Percent        Cum.
------------+-----------------------------------
          L |          4        8.51        8.51
          W |         43       91.49      100.00
------------+-----------------------------------
      Total |         47      100.00
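The keep if step above drops the games in which the team shot .35 or worse, so the table shows only one side of the comparison. A minimal sketch of a side-by-side version follows. It assumes the same variable names that nba2stata returns (threepointfieldgoalpercentage and winloss); the indicator name above35 is my own, and the six games are made-up placeholders so that the block runs without the download, not real Rockets results.

```stata
* Sketch only: six MADE-UP games standing in for the game-level data.
* These are placeholders for illustration, not real results.
clear
input float threepointfieldgoalpercentage str1 winloss
.41 "W"
.38 "W"
.36 "L"
.33 "W"
.30 "L"
.28 "L"
end

* Flag games above the .35 cutoff instead of dropping the others
* (above35 is a name chosen for this sketch)
generate byte above35 = threepointfieldgoalpercentage > .35 if !missing(threepointfieldgoalpercentage)
label define above35 0 "<= .35" 1 "> .35"
label values above35 above35

* Row percentages give the share of wins and losses within each group
tabulate above35 winloss, row
```

With the real game-level data in memory, only the generate, label, and tabulate lines would be needed.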
nba2stata is great if you are planning on doing pro basketball analysis. Although this command looks identical to nfl2stata, it is not. The command works quite differently.
Web scraping JSON
In our last blog post, we talked about scraping https://www.nfl.com and extracting the data from its HTML pages. The NBA data are different: https://stats.nba.com serves them as JSON objects. JSON is a lightweight data format that is easy to parse, so there is no separate scrape command for these data; nba2stata fetches and loads them on the fly.
The NBA’s copyright is similar to that of the NFL; you can use a personal copy of the data on your own personal computer. If you “use, display or publish” anything using these data, you must include “a prominent attribution to http://www.nba.com”. Another difference is that the NBA data stored on http://stats.nba.com can go as far back as the 1960s, depending on the team.
Command
There are only four subcommands to nba2stata, though we could have developed more. Chris had to go back to school.
- To scrape player statistics data into Stata, use
nba2stata playerstats name_pattern [, playerstats_options]
- To scrape player profile data into Stata, use
nba2stata playerprofile name_pattern [, playerprofile_options]
- To scrape team statistics data into Stata, use
nba2stata teamstats team_adv [, teamstats_options]
- To scrape team roster data into Stata, use
nba2stata teamroster team_adv [, teamroster_options]
Just like with nfl2stata, you will need to use Stata commands like collapse, gsort, and merge to generate the statistics, sort the data, and merge two or more NBA datasets together to examine the data.
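To make that workflow concrete, here is a minimal sketch that combines collapse, merge, and gsort. So that it runs without contacting stats.nba.com, the rows are typed in by hand from two listings in this post: the top-10 three-point shooters and three of the average team ages from the roster example. The names threes, players, and byteam are my own choices for the sketch, not anything nba2stata creates.

```stata
* Sketch only: rows typed in by hand from the listings in this post.
clear
input str15 playername str21 teamname int threepointfieldgoalsmade
"James Harden"    "Houston Rockets"       265
"Paul George"     "Oklahoma City Thunder" 244
"Kyle Lowry"      "Toronto Raptors"       238
"Kemba Walker"    "Charlotte Hornets"     231
"Klay Thompson"   "Golden State Warriors" 229
"Wayne Ellington" "Miami Heat"            227
"Damian Lillard"  "Portland Trailblazers" 227
"Eric Gordon"     "Houston Rockets"       218
"Stephen Curry"   "Golden State Warriors" 212
"Joe Ingles"      "Utah Jazz"             204
end

* collapse: one row per team, with the total threes made and the number
* of listed players (threes and players are names chosen for this sketch)
collapse (sum) threes = threepointfieldgoalsmade (count) players = threepointfieldgoalsmade, by(teamname)
tempfile byteam
save `byteam'

* Average roster ages for three teams, copied from the roster example
clear
input str21 teamname float age
"Houston Rockets"       29.1765
"Golden State Warriors" 28.6667
"Portland Trailblazers" 24.8125
end

* merge: attach the team totals; keep only teams present in both datasets
merge 1:1 teamname using `byteam', keep(match) nogenerate

* gsort: most threes first
gsort -threes
list teamname age threes players, noobs
```

Because teamname identifies one row in each dataset after the collapse, a 1:1 merge is appropriate here. Merging player-level rows onto team-level rows would call for m:1 instead.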
Examples
One thing I’m always curious about is which college teams produce the most NBA players. This is easy to find out using nba2stata, collapse, and gsort.
. nba2stata playerprofile "_all", clear
Processing x/4308 requests
.........x.........x.........x.........x.........50
.........x.........x.........x.........x.........100
.........x.........x.........x.........x.........150
<output omitted>
.........x.........x.........x.........x.........4250
.........x.........x.........x.........x.........4300
........
4308 observation(s) loaded

. save playerprofile, replace
(note: file playerprofile.dta not found)
file playerprofile.dta saved

. drop if school == ""
(114 observations deleted)

. gen ct = 1

. collapse (count) ct, by(school)

. gsort -ct

. list in 1/10

     +---------------------+
     |         school   ct |
     |---------------------|
  1. |       Kentucky   97 |
  2. |           UCLA   86 |
  3. | North Carolina   80 |
  4. |           Duke   70 |
  5. |         Kansas   69 |
     |---------------------|
  6. |        Indiana   57 |
  7. |     Notre Dame   55 |
  8. |     Louisville   53 |
  9. |        Arizona   51 |
 10. |       Syracuse   50 |
     +---------------------+
Fetching every player profile takes a while (about an hour on my machine), so you will probably want to save the data once they are downloaded. The time depends largely on how much data must be fetched; in the example above, that is every player profile the NBA has.
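One way to avoid repeating the download is to check for the saved file first. The following is a minimal sketch, assuming the playerprofile.dta filename used in the example above and that nba2stata is installed; it is one possible pattern, not something built into the command.

```stata
* Sketch only: reuse the saved copy when it exists; otherwise fetch and save.
capture confirm file "playerprofile.dta"
if _rc == 0 {
    * A saved copy was found in the current directory, so load it
    use "playerprofile.dta", clear
}
else {
    * No saved copy yet: download every profile (slow), then save it
    nba2stata playerprofile "_all", clear
    save "playerprofile.dta"
}
```

capture suppresses the error that confirm file raises when the file is missing and leaves the return code in _rc, which is zero only when the file exists.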
Another interesting example would be to find the oldest and youngest teams in the NBA. You can use the team roster to do this.
. nba2stata teamroster _all, season(2017) clear
Processing x/30 requests
.........x.........x.........x
502 observation(s) loaded

. collapse (mean) age, by(teamname)

. sort age

. list teamname age in 1/5

     +---------------------------------+
     |              teamname       age |
     |---------------------------------|
  1. |          Phoenix Suns   24.4706 |
  2. | Portland Trailblazers   24.8125 |
  3. |         Chicago Bulls   24.8889 |
  4. |         Atlanta Hawks   25.2222 |
  5. |         Brooklyn Nets   25.3529 |
     +---------------------------------+

. list teamname age in -5/l

     +---------------------------------+
     |              teamname       age |
     |---------------------------------|
 26. |    Washington Wizards     27.75 |
 27. |     San Antonio Spurs   28.3529 |
 28. | Golden State Warriors   28.6667 |
 29. |   Cleveland Cavaliers        29 |
 30. |       Houston Rockets   29.1765 |
     +---------------------------------+
Implementation
Again, Chris used Stata’s Java plugins and Gson to write the majority of the command.