For this YOUR TURN use the HallOfFame
and Master
data from the Lahman
package
Master
data for their player IDs.Load the package and take a look at the metadata for the 24 included data sets. You can also take a look at the first couple of observations in a few of the data sets.
library(tidyverse)
library(Lahman)
LahmanData
## file class nobs nvar title
## 1 AllstarFull data.frame 4993 8 AllstarFull table
## 2 Appearances data.frame 99466 21 Appearances table
## 3 AwardsManagers data.frame 171 6 AwardsManagers table
## 4 AwardsPlayers data.frame 6026 6 AwardsPlayers table
## 5 AwardsShareManagers data.frame 401 7 AwardsShareManagers table
## 6 AwardsSharePlayers data.frame 6705 7 AwardsSharePlayers table
## 7 Batting data.frame 99846 22 Batting table
## 8 BattingPost data.frame 11294 22 BattingPost table
## 9 CollegePlaying data.frame 17350 3 CollegePlaying table
## 10 Fielding data.frame 167938 18 Fielding table
## 11 FieldingOF data.frame 12028 6 FieldingOF table
## 12 FieldingPost data.frame 11924 17 FieldingPost data
## 13 HallOfFame data.frame 4088 9 Hall of Fame Voting Data
## 14 Managers data.frame 3370 10 Managers table
## 15 ManagersHalf data.frame 93 10 ManagersHalf table
## 16 Master data.frame 18589 26 Master table
## 17 Pitching data.frame 43330 30 Pitching table
## 18 PitchingPost data.frame 4945 30 PitchingPost table
## 19 Salaries data.frame 24758 5 Salaries table
## 20 Schools data.frame 1207 5 Schools table
## 21 SeriesPost data.frame 298 9 SeriesPost table
## 22 Teams data.frame 2775 48 Teams table
## 23 TeamsFranchises data.frame 120 4 TeamFranchises table
## 24 TeamsHalf data.frame 52 10 TeamsHalf table
AllstarFull %>% head()
## playerID yearID gameNum gameID teamID lgID GP startingPos
## 1 gomezle01 1933 0 ALS193307060 NYA AL 1 1
## 2 ferreri01 1933 0 ALS193307060 BOS AL 1 2
## 3 gehrilo01 1933 0 ALS193307060 NYA AL 1 3
## 4 gehrich01 1933 0 ALS193307060 DET AL 1 4
## 5 dykesji01 1933 0 ALS193307060 CHA AL 1 5
## 6 cronijo01 1933 0 ALS193307060 WS1 AL 1 6
Master %>% head()
## playerID birthYear birthMonth birthDay birthCountry birthState birthCity
## 1 aardsda01 1981 12 27 USA CO Denver
## 2 aaronha01 1934 2 5 USA AL Mobile
## 3 aaronto01 1939 8 5 USA AL Mobile
## 4 aasedo01 1954 9 8 USA CA Orange
## 5 abadan01 1972 8 25 USA FL Palm Beach
## 6 abadfe01 1985 12 17 D.R. La Romana La Romana
## deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1 NA NA NA <NA> <NA> <NA> David
## 2 NA NA NA <NA> <NA> <NA> Hank
## 3 1984 8 16 USA GA Atlanta Tommie
## 4 NA NA NA <NA> <NA> <NA> Don
## 5 NA NA NA <NA> <NA> <NA> Andy
## 6 NA NA NA <NA> <NA> <NA> Fernando
## nameLast nameGiven weight height bats throws debut finalGame
## 1 Aardsma David Allan 215 75 R R 2004-04-06 2015-08-23
## 2 Aaron Henry Louis 180 72 R R 1954-04-13 1976-10-03
## 3 Aaron Tommie Lee 190 75 R R 1962-04-10 1971-09-26
## 4 Aase Donald William 190 75 R R 1977-07-26 1990-10-03
## 5 Abad Fausto Andres 184 73 L L 2001-09-10 2006-04-13
## 6 Abad Fernando Antonio 220 73 L L 2010-07-28 2017-10-01
## retroID bbrefID deathDate birthDate
## 1 aardd001 aardsda01 <NA> 1981-12-27
## 2 aaroh101 aaronha01 <NA> 1934-02-05
## 3 aarot101 aaronto01 1984-08-16 1939-08-05
## 4 aased001 aasedo01 <NA> 1954-09-08
## 5 abada001 abadan01 <NA> 1972-08-25
## 6 abadf001 abadfe01 <NA> 1985-12-17
To answer this question we will use dplyr
verbs to obtain a vector of all of the players inducted into the Hall of Fame in 2017 according to their player ids:
player_ids <- HallOfFame %>% filter(yearID == 2017) %>% pull(playerID)
head(player_ids)
## [1] "bagweje01" "raineti01" "rodriiv01" "hoffmtr01" "guerrvl01" "martied01"
We can then use that vector to filter the Master
data by the player ids:
Master %>% filter(playerID %in% player_ids) %>% str()
## 'data.frame': 36 obs. of 26 variables:
## $ playerID : chr "bagweje01" "blakeca01" "bondsba01" "burrepa01" ...
## $ birthYear : int 1968 1973 1964 1976 1974 1973 1962 1975 1975 1975 ...
## $ birthMonth : int 5 8 7 10 11 1 8 11 2 9 ...
## $ birthDay : int 27 23 24 10 2 8 4 20 9 30 ...
## $ birthCountry: chr "USA" "USA" "USA" "USA" ...
## $ birthState : chr "MA" "IA" "CA" "AR" ...
## $ birthCity : chr "Boston" "Des Moines" "Riverside" "Eureka Springs" ...
## $ deathYear : int NA NA NA NA NA NA NA NA NA NA ...
## $ deathMonth : int NA NA NA NA NA NA NA NA NA NA ...
## $ deathDay : int NA NA NA NA NA NA NA NA NA NA ...
## $ deathCountry: chr NA NA NA NA ...
## $ deathState : chr NA NA NA NA ...
## $ deathCity : chr NA NA NA NA ...
## $ nameFirst : chr "Jeff" "Casey" "Barry" "Pat" ...
## $ nameLast : chr "Bagwell" "Blake" "Bonds" "Burrell" ...
## $ nameGiven : chr "Jeffrey Robert" "William Casey" "Barry Lamar" "Patrick Brian" ...
## $ weight : int 195 205 185 235 195 210 205 200 235 215 ...
## $ height : int 72 74 73 76 70 74 76 73 75 73 ...
## $ bats : Factor w/ 3 levels "B","L","R": 3 3 2 3 3 3 3 2 3 1 ...
## $ throws : Factor w/ 3 levels "L","R","S": 2 2 1 2 2 2 2 2 2 2 ...
## $ debut : chr "1991-04-08" "1999-08-14" "1986-05-30" "2000-05-24" ...
## $ finalGame : chr "2005-10-02" "2011-08-31" "2007-09-26" "2011-09-28" ...
## $ retroID : chr "bagwj001" "blakc001" "bondb001" "burrp001" ...
## $ bbrefID : chr "bagweje01" "blakeca01" "bondsba01" "burrepa01" ...
## $ deathDate : Date, format: NA NA ...
## $ birthDate : Date, format: "1968-05-27" "1973-08-23" ...
Master
data set and the HallOfFame
data.The trick here is to recognize that we want to add the relevant materials from the Master
data to the HallOfFame
data.
dim(HallOfFame)
## [1] 4191 9
fame_master <- HallOfFame %>%
left_join(Master, by="playerID")
dim(fame_master)
## [1] 4191 34
What would happen if we had (incorrectly) reversed this?
master_fame <- Master %>%
left_join(HallOfFame, by="playerID")
dim(master_fame)
## [1] 22529 34
fame_master %>%
filter(inducted=="Y", is.na(deathYear)) %>%
select(nameFirst, nameLast) %>%
arrange(nameLast) %>%
head()
## nameFirst nameLast
## 1 Hank Aaron
## 2 Roberto Alomar
## 3 Luis Aparicio
## 4 Jeff Bagwell
## 5 Johnny Bench
## 6 Craig Biggio
fame_master %>% filter(nameLast == "Sosa")
## playerID yearID votedBy ballots needed votes inducted category needed_note
## 1 sosasa01 2013 BBWAA 569 427 71 N Player <NA>
## 2 sosasa01 2014 BBWAA 571 429 41 N Player <NA>
## 3 sosasa01 2015 BBWAA 549 412 36 N Player <NA>
## 4 sosasa01 2016 BBWAA 440 330 31 N Player <NA>
## 5 sosasa01 2017 BBWAA 442 332 38 N Player <NA>
## 6 sosasa01 2018 BBWAA 422 317 33 N Player <NA>
## birthYear birthMonth birthDay birthCountry birthState
## 1 1968 11 12 D.R. San Pedro de Macoris
## 2 1968 11 12 D.R. San Pedro de Macoris
## 3 1968 11 12 D.R. San Pedro de Macoris
## 4 1968 11 12 D.R. San Pedro de Macoris
## 5 1968 11 12 D.R. San Pedro de Macoris
## 6 1968 11 12 D.R. San Pedro de Macoris
## birthCity deathYear deathMonth deathDay deathCountry deathState
## 1 San Pedro de Macoris NA NA NA <NA> <NA>
## 2 San Pedro de Macoris NA NA NA <NA> <NA>
## 3 San Pedro de Macoris NA NA NA <NA> <NA>
## 4 San Pedro de Macoris NA NA NA <NA> <NA>
## 5 San Pedro de Macoris NA NA NA <NA> <NA>
## 6 San Pedro de Macoris NA NA NA <NA> <NA>
## deathCity nameFirst nameLast nameGiven weight height bats throws
## 1 <NA> Sammy Sosa Samuel Peralta 165 72 R R
## 2 <NA> Sammy Sosa Samuel Peralta 165 72 R R
## 3 <NA> Sammy Sosa Samuel Peralta 165 72 R R
## 4 <NA> Sammy Sosa Samuel Peralta 165 72 R R
## 5 <NA> Sammy Sosa Samuel Peralta 165 72 R R
## 6 <NA> Sammy Sosa Samuel Peralta 165 72 R R
## debut finalGame retroID bbrefID deathDate birthDate
## 1 1989-06-16 2007-09-29 sosas001 sosasa01 <NA> 1968-11-12
## 2 1989-06-16 2007-09-29 sosas001 sosasa01 <NA> 1968-11-12
## 3 1989-06-16 2007-09-29 sosas001 sosasa01 <NA> 1968-11-12
## 4 1989-06-16 2007-09-29 sosas001 sosasa01 <NA> 1968-11-12
## 5 1989-06-16 2007-09-29 sosas001 sosasa01 <NA> 1968-11-12
## 6 1989-06-16 2007-09-29 sosas001 sosasa01 <NA> 1968-11-12
For this exercise, use the data from the classdata
package
classdata
package into your R session.You will need to install the most current version of the package in order to complete this exercise.
# devtools::install_github("haleyjeppson/classdata", force = TRUE)
library(classdata)
box
and budget
.First, load the data:
data(box)
data(budget)
Take a first look at the datasets:
str(box)
## 'data.frame': 32885 obs. of 11 variables:
## $ Rank : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Rank.Last.Week: num 1 NA NA 2 3 4 6 5 NA 7 ...
## ..- attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 1350 obs. of 4 variables:
## .. ..$ row : int 2 3 9 13 16 25 33 49 55 57 ...
## .. ..$ col : int NA NA NA NA NA NA NA NA NA NA ...
## .. ..$ expected: chr "a number" "a number" "a number" "a number" ...
## .. ..$ actual : chr "new" "new" "new" "new" ...
## $ Movie : chr "Joker" "The Addams Family" "Gemini Man" "Abominable" ...
## $ Distributor : chr "Warner Bros." "United Artists" "Paramount Pictures" "Universal" ...
## $ Gross : num 55861403 30300007 20552372 6072235 4881075 ...
## $ Change : num -42 NA NA -49 -39 -39 -30 -41 NA -55 ...
## ..- attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 30 obs. of 4 variables:
## .. ..$ row : int 12 569 583 591 736 882 954 955 1029 1114 ...
## .. ..$ col : int NA NA NA NA NA NA NA NA NA NA ...
## .. ..$ expected: chr "a number" "a number" "a number" "a number" ...
## .. ..$ actual : chr "n/c" "n/c" "n/c" "n/c" ...
## $ Thtrs. : num 4374 4007 3642 3496 3019 ...
## $ Per.Thtr. : num 12771 7562 5643 1737 1617 ...
## $ Total.Gross : num 1.94e+08 3.03e+07 2.06e+07 4.79e+07 8.27e+07 ...
## $ Week : num 2 1 1 3 4 5 3 6 1 4 ...
## $ Date : Date, format: "2019-10-11" "2019-10-11" ...
str(budget)
## 'data.frame': 5855 obs. of 5 variables:
## $ ReleaseDate : Date, format: "2009-12-17" "2011-05-20" ...
## $ Movie : chr "Avatar" "Pirates of the Caribbean: On Stranger Tides" "Avengers: Endgame" "Avengers: Age of Ultron" ...
## $ ProductionBudget: num 4.25e+08 4.11e+08 4.00e+08 3.31e+08 3.17e+08 ...
## $ DomesticGross : num 7.61e+08 2.41e+08 8.58e+08 4.59e+08 6.20e+08 ...
## $ WorldwideGross : num 2.79e+09 1.05e+09 2.80e+09 1.40e+09 1.32e+09 ...
box
data.new_box <- box %>% left_join(budget, by = "Movie")
dim(box)
## [1] 32885 11
dim(new_box)
## [1] 33420 15
There are more observation in new_box
than there were in box
– this means we have introduced duplicates!
budget %>% filter(Movie=="Beauty and the Beast")
## ReleaseDate Movie ProductionBudget DomesticGross
## 1 2017-03-16 Beauty and the Beast 1.6e+08 504014165
## 2 1991-11-13 Beauty and the Beast 2.0e+07 206333165
## WorldwideGross
## 1 1259199706
## 2 438707031
anti_join
to detect problematic cases.First, we can look at the movies where we don’t know - for whatever reason - the budget:
anti1 <- box %>% anti_join(budget, by="Movie")
head(anti1)
## Rank Rank.Last.Week Movie Distributor Gross Change
## 1 3 NA Gemini Man Paramount Pictures 20552372 NA
## 2 4 2 Abominable Universal 6072235 -49
## 3 5 3 Downton Abbey Focus Features 4881075 -39
## 4 7 6 Judy LD Entertainment 3217960 -30
## 5 9 NA Jexi Lionsgate 3106730 NA
## 6 11 8 Rambo: Last Blood Lionsgate 1499971 -58
## Thtrs. Per.Thtr. Total.Gross Week Date
## 1 3642 5643 20552372 1 2019-10-11
## 2 3496 1737 47873585 3 2019-10-11
## 3 3019 1617 82668665 4 2019-10-11
## 4 1627 1978 14936811 3 2019-10-11
## 5 2332 1332 3106730 1 2019-10-11
## 6 1831 819 42860557 4 2019-10-11
We can also look at the movies that we have budget information for, but no box office data:
anti2 <- budget %>% anti_join(box, by="Movie")
head(anti2)
## ReleaseDate Movie ProductionBudget
## 1 2009-12-17 Avatar 425000000
## 2 2011-05-20 Pirates of the Caribbean: On Stranger Tides 410600000
## 3 2015-04-22 Avengers: Age of Ultron 330600000
## 4 2017-12-13 Star Wars Ep. VIII: The Last Jedi 317000000
## 5 2015-12-16 Star Wars Ep. VII: The Force Awakens 306000000
## 6 2007-05-24 Pirates of the Caribbean: At World's End 300000000
## DomesticGross WorldwideGross
## 1 760507625 2789705275
## 2 241063875 1045663875
## 3 459005868 1403013963
## 4 620181382 1316721747
## 5 936662225 2053311220
## 6 309420425 963420425
For this exercise, use the data from the nycflight13
package.
library(nycflights13)
lat
and lon
) from the airports
data to the flights
dataYou can perform one join after another. If duplicate variables are found, by default, dplyr will distinguish the two by adding .x
, and .y
to the ends of the variable names to solve naming conflicts. However, we can use the suffix
argument to override this default behavior. Since is always good practice to have clear variable names, I will use the suffixes "_dest"
and "_origin"
to specify whether the column refers to the destination or origin airport.
airport_locations <- airports %>%
select(faa, lat, lon)
flights %>%
select(year:day, hour, origin, dest) %>%
left_join(
airport_locations,
by = c("origin" = "faa")
) %>%
left_join(
airport_locations,
by = c("dest" = "faa"),
suffix = c("_origin", "_dest")
# existing lat and lon variables in tibble gain the "_origin" suffix
# new lat and lon variables are given "_dest" suffix
)
## # A tibble: 336,776 x 10
## year month day hour origin dest lat_origin lon_origin lat_dest lon_dest
## <int> <int> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH 40.7 -74.2 30.0 -95.3
## 2 2013 1 1 5 LGA IAH 40.8 -73.9 30.0 -95.3
## 3 2013 1 1 5 JFK MIA 40.6 -73.8 25.8 -80.3
## 4 2013 1 1 5 JFK BQN 40.6 -73.8 NA NA
## 5 2013 1 1 6 LGA ATL 40.8 -73.9 33.6 -84.4
## 6 2013 1 1 5 EWR ORD 40.7 -74.2 42.0 -87.9
## 7 2013 1 1 6 EWR FLL 40.7 -74.2 26.1 -80.2
## 8 2013 1 1 6 LGA IAD 40.8 -73.9 38.9 -77.5
## 9 2013 1 1 6 JFK MCO 40.6 -73.8 28.4 -81.3
## 10 2013 1 1 6 LGA ORD 40.8 -73.9 42.0 -87.9
## # … with 336,766 more rows
To answer this question, we need to take the follwing steps:
- join the tables flights
and planes
- calculate the average arrival delay for each age of a flight - since there are few planes older than 25 years, truncate age at 25 years
- plot age against the average arrival delay
plane_ages <- planes %>%
select(tailnum, plane_year = year) # rename year to be plane_year for clarity
flights %>%
inner_join(plane_ages, by = "tailnum") %>%
mutate(plane_age = year - plane_year) %>%
filter(!is.na(plane_age)) %>%
mutate(plane_age = if_else(plane_age > 25, 25L, plane_age)) %>%
group_by(plane_age) %>%
summarise(
arr_delay_mean = mean(arr_delay, na.rm = TRUE),
n_arr_delay = sum(!is.na(arr_delay))
) %>%
ggplot(aes(x = plane_age, y = arr_delay_mean)) +
geom_point() +
labs(x = "Age of plane (years)", y = "Mean Arrival Delay (minutes)")
To answer this question, we need to take the follwing steps:
- join the tables flights
and weather
- calculate the mean departure delay for each amount of precipitation
- plot the results
flight_weather <-
flights %>%
inner_join(weather, by = c(
"origin" = "origin",
"year" = "year",
"month" = "month",
"day" = "day",
"hour" = "hour"
))
flight_weather %>%
group_by(precip) %>%
summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(x = precip, y = delay)) +
geom_line() + geom_point()
Almost any amount of precipitation is associated with a delay. However, there is not a strong a trend above 0.02 in. of precipitation.
anti_join(flights, airports, by = c("dest" = "faa"))
tell you?The expression anti_join(flights, airports, by = c("dest" = "faa"))
returns the flights that went to an airport that is not in the FAA list of destinations. Since the FAA list only contains domestic airports, these are likely foreign flights.
anti_join(airports, flights, by = c("faa" = "dest"))
tell you?The expression anti_join(airports, flights, by = c("faa" = "dest"))
returns the US airports that were not the destination of any flight in the data. Since the data contains all flights from New York City airports, this is also the list of US airports that did not have a nonstop flight from New York City in 2013.