Basketball is a multimillion dollar sports that fascinates millions of people all year round peaking in March Madness each year. Here, we are using play-by-play data on all NBA games on two days in Oct 2017.
Read the data from sample-combined-pbp-stats.csv without downloading the file locally. Each line of the file describes one play in a game.
library(tidyverse)
pbp <- read.csv("https://raw.githubusercontent.com/Stat480-at-ISU/Stat480-at-ISU.github.io/master/exams/sample-combined-pbp-stats.csv")
dim(pbp)
## [1] 2386 44
names(pbp)
## [1] "game_id" "data_set" "date" "a1"
## [5] "a2" "a3" "a4" "a5"
## [9] "h1" "h2" "h3" "h4"
## [13] "h5" "period" "away_score" "home_score"
## [17] "remaining_time" "elapsed" "play_length" "play_id"
## [21] "team" "event_type" "assist" "away"
## [25] "home" "block" "entered" "left"
## [29] "num" "opponent" "outof" "player"
## [33] "points" "possession" "reason" "result"
## [37] "steal" "type" "shot_distance" "original_x"
## [41] "original_y" "converted_x" "converted_y" "description"
game_id
. How many games are recorded in the data and how many plays (number of rows) does a game have on average?length(unique(pbp$game_id))
## [1] 5
pbp %>% group_by(game_id) %>% tally() %>% summary()
## game_id n
## Min. :21700001 Min. :440.0
## 1st Qu.:21700002 1st Qu.:457.0
## Median :21700003 Median :477.0
## Mean :21700003 Mean :477.2
## 3rd Qu.:21700004 3rd Qu.:488.0
## Max. :21700005 Max. :524.0
Basic fact-finding questions: make sure to actually write answers to each question, i.e. there are 49 variables and 2,386 plays overall…
For each game, the variable points
keeps track of the number of points attempted in each play.
dplyr
package to work out the number of points each team scored (check result
). Filter out all plays that are not associated with any of the teams.scores <- pbp %>%
group_by(game_id, team) %>%
filter(result == "made") %>%
summarise(score = sum(points)) %>%
ungroup(team) %>%
mutate(team = forcats::fct_reorder(team, score, mean, na.rm=TRUE))
event_type
and period
)pbp %>%
filter(event_type=="end of period", period == 4) %>%
select(home_score, away_score, game_id)
## home_score away_score game_id
## 1 102 99 21700001
## 2 121 122 21700002
## 3 102 90 21700003
## 4 140 131 21700004
## 5 116 109 21700005
ggplot2
, plot scores by team. Sort teams by their median scores.ggplot(scores, aes(team, score)) +
geom_point() +
coord_flip()
which.max
and which.min
are doing).winners <- scores %>%
group_by(game_id) %>%
mutate(winner = team[which.max(score)],
loser = team[which.min(score)])
player
keeps track of which player makes a shot. For each game, identify which player in each team scored the most points.players <- pbp %>%
group_by(game_id, team, player) %>%
filter(result == "made") %>%
summarise(score = sum(points)) %>%
ungroup(player) %>%
group_by(game_id, team) %>%
mutate(mvp = rank(-score)) %>%
filter(mvp == 1)
players
## # A tibble: 10 x 5
## # Groups: game_id, team [10]
## game_id team player score mvp
## <int> <fct> <fct> <int> <dbl>
## 1 21700001 BOS Jaylen Brown 25 1
## 2 21700001 CLE LeBron James 29 1
## 3 21700002 GSW Nick Young 23 1
## 4 21700002 HOU James Harden 27 1
## 5 21700003 CHA Kemba Walker 24 1
## 6 21700003 DET Tobias Harris 27 1
## 7 21700004 BKN D'Angelo Russell 30 1
## 8 21700004 IND Victor Oladipo 22 1
## 9 21700005 MIA Hassan Whiteside 26 1
## 10 21700005 ORL Evan Fournier 23 1
James LeBron has only the second highest score in these four games.
The variables converted_x
and converted_y
give the location of the acting player on the court. Check with the variable event_type
to see, for which types of play we have geographic information.
pbp %>%
mutate(has_x = !is.na(converted_x),
has_y = !is.na(converted_y)
) %>%
group_by(event_type, has_x, has_y) %>%
tally()
## # A tibble: 13 x 4
## # Groups: event_type, has_x [13]
## event_type has_x has_y n
## <fct> <lgl> <lgl> <int>
## 1 end of period FALSE FALSE 20
## 2 foul FALSE FALSE 221
## 3 free throw FALSE FALSE 241
## 4 jump ball FALSE FALSE 5
## 5 miss TRUE TRUE 486
## 6 rebound FALSE FALSE 532
## 7 shot TRUE TRUE 419
## 8 start of period FALSE FALSE 20
## 9 sub FALSE FALSE 221
## 10 timeout FALSE FALSE 55
## 11 turnover FALSE FALSE 147
## 12 unknown FALSE FALSE 7
## 13 violation FALSE FALSE 12
Shots and mises have geographic information
For game number 21700003
plot the geographic location of each play on the court, colour by team and incorporate visually the play’s outcome (variable result
). Describe the result.
pbp %>%
filter(game_id==21700003) %>%
ggplot(aes(x = converted_x, y = converted_y, colour=team, shape=result)) +
geom_point() +
coord_equal()
We see a clear pattern in the shots made and attempted. There are two tight clusters of shots attempted at the baskets, which spread out as distance increases. At the 3 point lines we can see another set of attempted shots.
Is there any evidence that shots closer to the basket are successful more often? For that, - introduce a new variable d
into the data that captures the distance of a player from the basket (basket is in [0,0] for variables original_x
and original_y
). - draw side-by-side boxplots of distance by result and team (using ggplot2). Interpret.
pbp$d <- with(pbp, sqrt(original_x^2 + original_y^2))
pbp %>%
filter(team != "") %>%
ggplot(aes(x = result, y = d)) +
facet_grid(.~team) +
geom_boxplot()
there is a clear difference in the distance associated with shots made and shots missed. Distance does hurt. Shots that were made tend to be closer to the basket.
The variable elapsed
is recorded in hour-minute-second format. Convert the information into seconds (hint: you could introduce helper variables for the conversion). The elapsed
time variable starts over in each period (a period has 12 mins). Introduce a new variable called time_played
that keeps track of the time (in seconds) from the beginning of a game to the end.
library(lubridate)
pbp <- pbp %>%
separate(elapsed, into=c("hour", "minute", "second"), sep=":") %>%
mutate(
elapsed = as.numeric(minute)*60+as.numeric(second),
time_played = elapsed + 12*60*(period-1)
)
Plot the timeline (time_played
) of events (event_type
) for game with the id 21700002
in a scatterplot. Colour by team. Comment on the result.
pbp %>%
filter(game_id=="21700002") %>%
ggplot(aes(x = elapsed, y = event_type, colour=team)) +
geom_point() +
facet_grid(period~.)
There should be clear changes between home and away team showing teams’ plays, but on this scale this is hard to see.
Variables h1
through h5
and a1
through a5
are the five players of the home team and the away team on the field at that moment in positions 1 through 5. Reshape the data set to combine all player variables. Extract position numbers. For each game identify how many players were playing in each position.
Draw side-by-side boxplots of the number of players by position. Comment.
players <- pbp %>%
pivot_longer(names_to = "teamposition", values_to = "players", cols = a1:h5) %>%
mutate(position = parse_number(teamposition)) %>%
group_by(position, game_id) %>%
summarise(num_players = length(unique(players)))
players %>%
ggplot(aes(x = factor(position), y=num_players)) +
geom_boxplot()
Quite a few players are exchanged in each position throughout a game (probably about half are from each of the teams - ideally we should distinguish between home and away team). Position #1 has the fewest players, followed by position #2. The other positions see more changes.