JOINING DATA: MESSY 4

YOUR TURN #1

For this YOUR TURN use the HallOfFame and Master data from the Lahman package

Q: Identify all players who were inducted in the Hall of Fame in 2017, by filtering the `Master` data for their player IDs.

Load the package and take a look at the metadata for the 24 included data sets. You can also take a look at the first couple of observations in a few of the data sets.

library(tidyverse)
library(Lahman)
LahmanData

##                   file      class   nobs nvar                     title
## 1          AllstarFull data.frame   4993    8         AllstarFull table
## 2          Appearances data.frame  99466   21         Appearances table
## 3       AwardsManagers data.frame    171    6      AwardsManagers table
## 4        AwardsPlayers data.frame   6026    6       AwardsPlayers table
## 5  AwardsShareManagers data.frame    401    7 AwardsShareManagers table
## 6   AwardsSharePlayers data.frame   6705    7  AwardsSharePlayers table
## 7              Batting data.frame  99846   22             Batting table
## 8          BattingPost data.frame  11294   22         BattingPost table
## 9       CollegePlaying data.frame  17350    3      CollegePlaying table
## 10            Fielding data.frame 167938   18            Fielding table
## 11          FieldingOF data.frame  12028    6          FieldingOF table
## 12        FieldingPost data.frame  11924   17         FieldingPost data
## 13          HallOfFame data.frame   4088    9  Hall of Fame Voting Data
## 14            Managers data.frame   3370   10            Managers table
## 15        ManagersHalf data.frame     93   10        ManagersHalf table
## 16              Master data.frame  18589   26              Master table
## 17            Pitching data.frame  43330   30            Pitching table
## 18        PitchingPost data.frame   4945   30        PitchingPost table
## 19            Salaries data.frame  24758    5            Salaries table
## 20             Schools data.frame   1207    5             Schools table
## 21          SeriesPost data.frame    298    9          SeriesPost table
## 22               Teams data.frame   2775   48               Teams table
## 23     TeamsFranchises data.frame    120    4      TeamFranchises table
## 24           TeamsHalf data.frame     52   10           TeamsHalf table

AllstarFull %>%  head()

##    playerID yearID gameNum       gameID teamID lgID GP startingPos
## 1 gomezle01   1933       0 ALS193307060    NYA   AL  1           1
## 2 ferreri01   1933       0 ALS193307060    BOS   AL  1           2
## 3 gehrilo01   1933       0 ALS193307060    NYA   AL  1           3
## 4 gehrich01   1933       0 ALS193307060    DET   AL  1           4
## 5 dykesji01   1933       0 ALS193307060    CHA   AL  1           5
## 6 cronijo01   1933       0 ALS193307060    WS1   AL  1           6

Master %>% head()

##    playerID birthYear birthMonth birthDay birthCountry birthState  birthCity
## 1 aardsda01      1981         12       27          USA         CO     Denver
## 2 aaronha01      1934          2        5          USA         AL     Mobile
## 3 aaronto01      1939          8        5          USA         AL     Mobile
## 4  aasedo01      1954          9        8          USA         CA     Orange
## 5  abadan01      1972          8       25          USA         FL Palm Beach
## 6  abadfe01      1985         12       17         D.R.  La Romana  La Romana
##   deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1        NA         NA       NA         <NA>       <NA>      <NA>     David
## 2        NA         NA       NA         <NA>       <NA>      <NA>      Hank
## 3      1984          8       16          USA         GA   Atlanta    Tommie
## 4        NA         NA       NA         <NA>       <NA>      <NA>       Don
## 5        NA         NA       NA         <NA>       <NA>      <NA>      Andy
## 6        NA         NA       NA         <NA>       <NA>      <NA>  Fernando
##   nameLast        nameGiven weight height bats throws      debut  finalGame
## 1  Aardsma      David Allan    215     75    R      R 2004-04-06 2015-08-23
## 2    Aaron      Henry Louis    180     72    R      R 1954-04-13 1976-10-03
## 3    Aaron       Tommie Lee    190     75    R      R 1962-04-10 1971-09-26
## 4     Aase   Donald William    190     75    R      R 1977-07-26 1990-10-03
## 5     Abad    Fausto Andres    184     73    L      L 2001-09-10 2006-04-13
## 6     Abad Fernando Antonio    220     73    L      L 2010-07-28 2017-10-01
##    retroID   bbrefID  deathDate  birthDate
## 1 aardd001 aardsda01       <NA> 1981-12-27
## 2 aaroh101 aaronha01       <NA> 1934-02-05
## 3 aarot101 aaronto01 1984-08-16 1939-08-05
## 4 aased001  aasedo01       <NA> 1954-09-08
## 5 abada001  abadan01       <NA> 1972-08-25
## 6 abadf001  abadfe01       <NA> 1985-12-17

To answer this question we will use dplyr verbs to obtain a vector of all of the players inducted into the Hall of Fame in 2017 according to their player ids:

player_ids <- HallOfFame %>% filter(yearID == 2017) %>% pull(playerID)
head(player_ids)

## [1] "bagweje01" "raineti01" "rodriiv01" "hoffmtr01" "guerrvl01" "martied01"

We can then use that vector to filter the Master data by the player ids:

Master %>% filter(playerID %in% player_ids) %>% str()

## 'data.frame':    36 obs. of  26 variables:
##  $ playerID    : chr  "bagweje01" "blakeca01" "bondsba01" "burrepa01" ...
##  $ birthYear   : int  1968 1973 1964 1976 1974 1973 1962 1975 1975 1975 ...
##  $ birthMonth  : int  5 8 7 10 11 1 8 11 2 9 ...
##  $ birthDay    : int  27 23 24 10 2 8 4 20 9 30 ...
##  $ birthCountry: chr  "USA" "USA" "USA" "USA" ...
##  $ birthState  : chr  "MA" "IA" "CA" "AR" ...
##  $ birthCity   : chr  "Boston" "Des Moines" "Riverside" "Eureka Springs" ...
##  $ deathYear   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ deathMonth  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ deathDay    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ deathCountry: chr  NA NA NA NA ...
##  $ deathState  : chr  NA NA NA NA ...
##  $ deathCity   : chr  NA NA NA NA ...
##  $ nameFirst   : chr  "Jeff" "Casey" "Barry" "Pat" ...
##  $ nameLast    : chr  "Bagwell" "Blake" "Bonds" "Burrell" ...
##  $ nameGiven   : chr  "Jeffrey Robert" "William Casey" "Barry Lamar" "Patrick Brian" ...
##  $ weight      : int  195 205 185 235 195 210 205 200 235 215 ...
##  $ height      : int  72 74 73 76 70 74 76 73 75 73 ...
##  $ bats        : Factor w/ 3 levels "B","L","R": 3 3 2 3 3 3 3 2 3 1 ...
##  $ throws      : Factor w/ 3 levels "L","R","S": 2 2 1 2 2 2 2 2 2 2 ...
##  $ debut       : chr  "1991-04-08" "1999-08-14" "1986-05-30" "2000-05-24" ...
##  $ finalGame   : chr  "2005-10-02" "2011-08-31" "2007-09-26" "2011-09-28" ...
##  $ retroID     : chr  "bagwj001" "blakc001" "bondb001" "burrp001" ...
##  $ bbrefID     : chr  "bagweje01" "blakeca01" "bondsba01" "burrepa01" ...
##  $ deathDate   : Date, format: NA NA ...
##  $ birthDate   : Date, format: "1968-05-27" "1973-08-23" ...

YOUR TURN #2

Q: Join (relevant pieces of) the `Master` data set and the `HallOfFame` data.

The trick here is to recognize that we want to add the relevant materials from the Master data to the HallOfFame data.

dim(HallOfFame)

## [1] 4191    9

fame_master <- HallOfFame %>% 
  left_join(Master, by="playerID")
dim(fame_master)

## [1] 4191   34

What would happen if we had (incorrectly) reversed this?

master_fame <- Master %>% 
  left_join(HallOfFame, by="playerID")
dim(master_fame)

## [1] 22529    34

Q: Find all Hall of Famers who were alive as of 2019. (use the data resulted from the joining done in question 1)

fame_master %>% 
  filter(inducted=="Y", is.na(deathYear)) %>% 
  select(nameFirst, nameLast) %>% 
  arrange(nameLast) %>% 
  head()

##   nameFirst nameLast
## 1      Hank    Aaron
## 2   Roberto   Alomar
## 3      Luis Aparicio
## 4      Jeff  Bagwell
## 5    Johnny    Bench
## 6     Craig   Biggio

Q: How many attempts at being inducted to the HoF does Sammy Sosa have already?

fame_master %>% filter(nameLast == "Sosa")

##   playerID yearID votedBy ballots needed votes inducted category needed_note
## 1 sosasa01   2013   BBWAA     569    427    71        N   Player        <NA>
## 2 sosasa01   2014   BBWAA     571    429    41        N   Player        <NA>
## 3 sosasa01   2015   BBWAA     549    412    36        N   Player        <NA>
## 4 sosasa01   2016   BBWAA     440    330    31        N   Player        <NA>
## 5 sosasa01   2017   BBWAA     442    332    38        N   Player        <NA>
## 6 sosasa01   2018   BBWAA     422    317    33        N   Player        <NA>
##   birthYear birthMonth birthDay birthCountry           birthState
## 1      1968         11       12         D.R. San Pedro de Macoris
## 2      1968         11       12         D.R. San Pedro de Macoris
## 3      1968         11       12         D.R. San Pedro de Macoris
## 4      1968         11       12         D.R. San Pedro de Macoris
## 5      1968         11       12         D.R. San Pedro de Macoris
## 6      1968         11       12         D.R. San Pedro de Macoris
##              birthCity deathYear deathMonth deathDay deathCountry deathState
## 1 San Pedro de Macoris        NA         NA       NA         <NA>       <NA>
## 2 San Pedro de Macoris        NA         NA       NA         <NA>       <NA>
## 3 San Pedro de Macoris        NA         NA       NA         <NA>       <NA>
## 4 San Pedro de Macoris        NA         NA       NA         <NA>       <NA>
## 5 San Pedro de Macoris        NA         NA       NA         <NA>       <NA>
## 6 San Pedro de Macoris        NA         NA       NA         <NA>       <NA>
##   deathCity nameFirst nameLast      nameGiven weight height bats throws
## 1      <NA>     Sammy     Sosa Samuel Peralta    165     72    R      R
## 2      <NA>     Sammy     Sosa Samuel Peralta    165     72    R      R
## 3      <NA>     Sammy     Sosa Samuel Peralta    165     72    R      R
## 4      <NA>     Sammy     Sosa Samuel Peralta    165     72    R      R
## 5      <NA>     Sammy     Sosa Samuel Peralta    165     72    R      R
## 6      <NA>     Sammy     Sosa Samuel Peralta    165     72    R      R
##        debut  finalGame  retroID  bbrefID deathDate  birthDate
## 1 1989-06-16 2007-09-29 sosas001 sosasa01      <NA> 1968-11-12
## 2 1989-06-16 2007-09-29 sosas001 sosasa01      <NA> 1968-11-12
## 3 1989-06-16 2007-09-29 sosas001 sosasa01      <NA> 1968-11-12
## 4 1989-06-16 2007-09-29 sosas001 sosasa01      <NA> 1968-11-12
## 5 1989-06-16 2007-09-29 sosas001 sosasa01      <NA> 1968-11-12
## 6 1989-06-16 2007-09-29 sosas001 sosasa01      <NA> 1968-11-12

YOUR TURN #3

For this exercise, use the data from the classdata package

Q: Load the `classdata` package into your R session.

You will need to install the most current version of the package in order to complete this exercise.

# devtools::install_github("haleyjeppson/classdata", force = TRUE)
library(classdata)

Q: Investigate data sets `box` and `budget`.

First, load the data:

data(box)
data(budget)

Take a first look at the datasets:

str(box)

## 'data.frame':    32885 obs. of  11 variables:
##  $ Rank          : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Rank.Last.Week: num  1 NA NA 2 3 4 6 5 NA 7 ...
##   ..- attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame':  1350 obs. of  4 variables:
##   .. ..$ row     : int  2 3 9 13 16 25 33 49 55 57 ...
##   .. ..$ col     : int  NA NA NA NA NA NA NA NA NA NA ...
##   .. ..$ expected: chr  "a number" "a number" "a number" "a number" ...
##   .. ..$ actual  : chr  "new" "new" "new" "new" ...
##  $ Movie         : chr  "Joker" "The Addams Family" "Gemini Man" "Abominable" ...
##  $ Distributor   : chr  "Warner Bros." "United Artists" "Paramount Pictures" "Universal" ...
##  $ Gross         : num  55861403 30300007 20552372 6072235 4881075 ...
##  $ Change        : num  -42 NA NA -49 -39 -39 -30 -41 NA -55 ...
##   ..- attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame':  30 obs. of  4 variables:
##   .. ..$ row     : int  12 569 583 591 736 882 954 955 1029 1114 ...
##   .. ..$ col     : int  NA NA NA NA NA NA NA NA NA NA ...
##   .. ..$ expected: chr  "a number" "a number" "a number" "a number" ...
##   .. ..$ actual  : chr  "n/c" "n/c" "n/c" "n/c" ...
##  $ Thtrs.        : num  4374 4007 3642 3496 3019 ...
##  $ Per.Thtr.     : num  12771 7562 5643 1737 1617 ...
##  $ Total.Gross   : num  1.94e+08 3.03e+07 2.06e+07 4.79e+07 8.27e+07 ...
##  $ Week          : num  2 1 1 3 4 5 3 6 1 4 ...
##  $ Date          : Date, format: "2019-10-11" "2019-10-11" ...

str(budget)

## 'data.frame':    5855 obs. of  5 variables:
##  $ ReleaseDate     : Date, format: "2009-12-17" "2011-05-20" ...
##  $ Movie           : chr  "Avatar" "Pirates of the Caribbean: On Stranger Tides" "Avengers: Endgame" "Avengers: Age of Ultron" ...
##  $ ProductionBudget: num  4.25e+08 4.11e+08 4.00e+08 3.31e+08 3.17e+08 ...
##  $ DomesticGross   : num  7.61e+08 2.41e+08 8.58e+08 4.59e+08 6.20e+08 ...
##  $ WorldwideGross  : num  2.79e+09 1.05e+09 2.80e+09 1.40e+09 1.32e+09 ...

Q: Join the two datasets to incorporate the release date of movies into the `box` data.

new_box <- box %>% left_join(budget, by = "Movie")

Q: Check on the dimensions of the data sets before and after the join. Where are potential problems?

dim(box)

## [1] 32885    11

dim(new_box)

## [1] 33420    15

There are more observation in new_box than there were in box – this means we have introduced duplicates!

budget %>% filter(Movie=="Beauty and the Beast")

##   ReleaseDate                Movie ProductionBudget DomesticGross
## 1  2017-03-16 Beauty and the Beast          1.6e+08     504014165
## 2  1991-11-13 Beauty and the Beast          2.0e+07     206333165
##   WorldwideGross
## 1     1259199706
## 2      438707031

Q: Use `anti_join` to detect problematic cases.

First, we can look at the movies where we don’t know - for whatever reason - the budget:

anti1 <- box %>% anti_join(budget, by="Movie")
head(anti1)

##   Rank Rank.Last.Week             Movie        Distributor    Gross Change
## 1    3             NA        Gemini Man Paramount Pictures 20552372     NA
## 2    4              2        Abominable          Universal  6072235    -49
## 3    5              3     Downton Abbey     Focus Features  4881075    -39
## 4    7              6              Judy   LD Entertainment  3217960    -30
## 5    9             NA              Jexi          Lionsgate  3106730     NA
## 6   11              8 Rambo: Last Blood          Lionsgate  1499971    -58
##   Thtrs. Per.Thtr. Total.Gross Week       Date
## 1   3642      5643    20552372    1 2019-10-11
## 2   3496      1737    47873585    3 2019-10-11
## 3   3019      1617    82668665    4 2019-10-11
## 4   1627      1978    14936811    3 2019-10-11
## 5   2332      1332     3106730    1 2019-10-11
## 6   1831       819    42860557    4 2019-10-11

We can also look at the movies that we have budget information for, but no box office data:

anti2 <- budget %>% anti_join(box, by="Movie")
head(anti2)

##   ReleaseDate                                       Movie ProductionBudget
## 1  2009-12-17                                      Avatar        425000000
## 2  2011-05-20 Pirates of the Caribbean: On Stranger Tides        410600000
## 3  2015-04-22                     Avengers: Age of Ultron        330600000
## 4  2017-12-13           Star Wars Ep. VIII: The Last Jedi        317000000
## 5  2015-12-16        Star Wars Ep. VII: The Force Awakens        306000000
## 6  2007-05-24    Pirates of the Caribbean: At World's End        300000000
##   DomesticGross WorldwideGross
## 1     760507625     2789705275
## 2     241063875     1045663875
## 3     459005868     1403013963
## 4     620181382     1316721747
## 5     936662225     2053311220
## 6     309420425      963420425

YOUR TURN #4

For this exercise, use the data from the nycflight13 package.

library(nycflights13)

Q: Add the location of the origin and destination (i.e. the `lat` and `lon`) from the `airports` data to the `flights` data

You can perform one join after another. If duplicate variables are found, by default, dplyr will distinguish the two by adding .x, and .y to the ends of the variable names to solve naming conflicts. However, we can use the suffix argument to override this default behavior. Since is always good practice to have clear variable names, I will use the suffixes "_dest" and "_origin" to specify whether the column refers to the destination or origin airport.

airport_locations <- airports %>%
  select(faa, lat, lon)

flights %>%
  select(year:day, hour, origin, dest) %>%
  left_join(
    airport_locations,
    by = c("origin" = "faa")
  ) %>%
  left_join(
    airport_locations,
    by = c("dest" = "faa"),
    suffix = c("_origin", "_dest")
    # existing lat and lon variables in tibble gain the "_origin" suffix
    # new lat and lon variables are given "_dest" suffix
  )

## # A tibble: 336,776 x 10
##     year month   day  hour origin dest  lat_origin lon_origin lat_dest lon_dest
##    <int> <int> <int> <dbl> <chr>  <chr>      <dbl>      <dbl>    <dbl>    <dbl>
##  1  2013     1     1     5 EWR    IAH         40.7      -74.2     30.0    -95.3
##  2  2013     1     1     5 LGA    IAH         40.8      -73.9     30.0    -95.3
##  3  2013     1     1     5 JFK    MIA         40.6      -73.8     25.8    -80.3
##  4  2013     1     1     5 JFK    BQN         40.6      -73.8     NA       NA  
##  5  2013     1     1     6 LGA    ATL         40.8      -73.9     33.6    -84.4
##  6  2013     1     1     5 EWR    ORD         40.7      -74.2     42.0    -87.9
##  7  2013     1     1     6 EWR    FLL         40.7      -74.2     26.1    -80.2
##  8  2013     1     1     6 LGA    IAD         40.8      -73.9     38.9    -77.5
##  9  2013     1     1     6 JFK    MCO         40.6      -73.8     28.4    -81.3
## 10  2013     1     1     6 LGA    ORD         40.8      -73.9     42.0    -87.9
## # … with 336,766 more rows

Q: Is there a relationship between the age of a plane and its arrival delays?

To answer this question, we need to take the follwing steps:
- join the tables flights and planes
- calculate the average arrival delay for each age of a flight - since there are few planes older than 25 years, truncate age at 25 years
- plot age against the average arrival delay

plane_ages <- planes %>% 
  select(tailnum, plane_year = year) # rename year to be plane_year for clarity

flights %>% 
  inner_join(plane_ages, by = "tailnum") %>%
  mutate(plane_age = year - plane_year) %>%
  filter(!is.na(plane_age)) %>%
  mutate(plane_age = if_else(plane_age > 25, 25L, plane_age)) %>%
  group_by(plane_age) %>%
  summarise(
    arr_delay_mean = mean(arr_delay, na.rm = TRUE),
    n_arr_delay = sum(!is.na(arr_delay))
    ) %>% 
  ggplot(aes(x = plane_age, y = arr_delay_mean)) +
  geom_point()  +
  labs(x = "Age of plane (years)", y = "Mean Arrival Delay (minutes)")

Q: What weather conditions make it more likely to see a departure delay?

To answer this question, we need to take the follwing steps:
- join the tables flights and weather
- calculate the mean departure delay for each amount of precipitation
- plot the results

flight_weather <-
  flights %>%
  inner_join(weather, by = c(
    "origin" = "origin",
    "year" = "year",
    "month" = "month",
    "day" = "day",
    "hour" = "hour"
  ))

flight_weather %>%
  group_by(precip) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = precip, y = delay)) +
  geom_line() + geom_point()

Almost any amount of precipitation is associated with a delay. However, there is not a strong a trend above 0.02 in. of precipitation.

YOUR TURN #5

Q: What does `anti_join(flights, airports, by = c("dest" = "faa"))` tell you?

The expression anti_join(flights, airports, by = c("dest" = "faa")) returns the flights that went to an airport that is not in the FAA list of destinations. Since the FAA list only contains domestic airports, these are likely foreign flights.

Q: What does `anti_join(airports, flights, by = c("faa" = "dest"))` tell you?

The expression anti_join(airports, flights, by = c("faa" = "dest")) returns the US airports that were not the destination of any flight in the data. Since the data contains all flights from New York City airports, this is also the list of US airports that did not have a nonstop flight from New York City in 2013.

JOINING DATA: MESSY 4

YOUR TURN #1

Q: Identify all players who were inducted in the Hall of Fame in 2017, by filtering the Master data for their player IDs.

YOUR TURN #2

Q: Join (relevant pieces of) the Master data set and the HallOfFame data.