Joining data: Messy 4

class: center, middle, inverse, title-slide

# Joining data: Messy 4

---

## Messy (4)

**Messy (4)**: Values for a single observational unit are stored across multiple tables.

After data are normalized by splitting, we want to de-normalize again by *joining* datasets.

The variables used to connect each pair of tables are the **keys**.

---
## Relational data

It is not uncommon for a data analysis to involve many tables of data and you must combine them to answer the questions that you’re interested in.

Collectively, multiple tables of data are called **relational data** because it is the relations, not just the individual datasets, that are important.

To work with relational data you need **verbs** that work with pairs of tables. There are three families of verbs designed to work with relational data:

1. **Mutating joins**, which add new variables to one data frame from matching observations in another.

2. **Filtering joins**, which filter observations from one data frame based on whether or not they match an observation in the other table.

3. **Set operations**, which treat observations as if they were set elements.

---
class: inverse, middle, center
# Joining data from different tables

---

## Example: Lahman package

Sean Lahman is a database journalist, who started databases of historical sports statistics, in particular, the Lahman database on baseball.

```r
library(Lahman)
LahmanData %>% head()
```

```
##                  file      class  nobs nvar                     title
## 1         AllstarFull data.frame  4993    8         AllstarFull table
## 2         Appearances data.frame 99466   21         Appearances table
## 3      AwardsManagers data.frame   171    6      AwardsManagers table
## 4       AwardsPlayers data.frame  6026    6       AwardsPlayers table
## 5 AwardsShareManagers data.frame   401    7 AwardsShareManagers table
## 6  AwardsSharePlayers data.frame  6705    7  AwardsSharePlayers table
```

---
## Lahman database

The Lahman database consists of 24 data frames that are linked by `playerID`.

This is clean, but not very readable.

The `Master` table includes names and other attributes for each player.

*Joining* multiple tables helps us to bring together (pieces of) information from multiple sources.

---
## Example: Hall of Fame

Who was inducted in the Hall of Fame in 2017?

```r
HallOfFame %>% dplyr::filter(yearID == 2017) %>% head()
```

```
## playerID yearID votedBy ballots needed votes inducted category needed_note
## 1 bagweje01 2017 BBWAA 442 332 381 Y Player <NA>
## 2 raineti01 2017 BBWAA 442 332 380 Y Player <NA>
## 3 rodriiv01 2017 BBWAA 442 332 336 Y Player <NA>
## 4 hoffmtr01 2017 BBWAA 442 332 327 N Player <NA>
## 5 guerrvl01 2017 BBWAA 442 332 317 N Player <NA>
## 6 martied01 2017 BBWAA 442 332 259 N Player <NA>
```

---
class: yourturn
# Your turn

For this your turn use the `HallOfFame` and `Master` data from the `Lahman` package

- Identify all players who were inducted in the Hall of Fame in 2017, by filtering the `Master` data for their player IDs.

---
## Joins: general idea

![](images/join.png)

---
## Joins: more specific idea

- Data sets are joined along values of variables.

- In `dplyr` there are several join functions: `left_join()`, `inner_join()`, `full_join()`, ...

- The differences between join functions are only visible if values in one set do not all have values in the other set.

---
## Example data

```r
df1 <- data.frame(id = 1:6, trt = rep(c("A", "B", "C"), rep=c(2,1,3)), value = c(5,3,7,1,2,3))
df1
```

```
## id trt value
## 1 1 A 5
## 2 2 B 3
## 3 3 C 7
## 4 4 A 1
## 5 5 B 2
## 6 6 C 3
```

```r
df2 <- data.frame(id=c(4,4,5,5,7,7), stress=rep(c(0,1), 3), bpm = c(65, 125, 74, 136, 48, 110))
df2
```

```
##   id stress bpm
## 1  4      0  65
## 2  4      1 125
## 3  5      0  74
## 4  5      1 136
## 5  7      0  48
## 6  7      1 110
```

---
## Mutating joins

A mutating join allows you to combine variables from two tables.

We first match observations by their keys, then copy across variables from one table to the other.

There are 4 mutating join functions:

- `left_join(x, y, by = "key")`: keeps all observations in `x`
  
- `right_join(x, y, by = "key")`: keeps all observations in `y`

- `full_join(x, y, by = "key")`: keeps all observations in `x` and `y`

- `inner_join(x, y, by = "key")`: keeps only observations that appear in both `x` and `y`

---
## `left_join(x, y, by = "key")`

.pull-left-small[

```r
left_join(df1, df2, by="id")
```

```
##   id trt value stress bpm
## 1  1   A     5     NA  NA
## 2  2   B     3     NA  NA
## 3  3   C     7     NA  NA
## 4  4   A     1      0  65
## 5  4   A     1      1 125
## 6  5   B     2      0  74
## 7  5   B     2      1 136
## 8  6   C     3     NA  NA
```
]
.pull-right-large[

- all elements in the *left* data set are kept

- non-matches are filled in by `NA`

- most commonly used; should be your default

- use this when you look up additional data from another table since it preserves the original observations even when there isn’t a match 
- `right_join()` works symmetric
]

---
## `inner_join(x, y, by = "key")`

.pull-left-small[

```r
inner_join(df1, df2, by = "id")
```

```
##   id trt value stress bpm
## 1  4   A     1      0  65
## 2  4   A     1      1 125
## 3  5   B     2      0  74
## 4  5   B     2      1 136
```
]
.pull-right-large[

- ONLY matches pairs of observations whenever their keys are equal

- output is a new data frame that contains the key, the x values, and the y values.

- **most important property**: unmatched rows are NOT included in the result.

- inner joins are usually not appropriate for use in analysis because it is too easy to lose observations.
  
]
---
## `full_join(x, y, by = "key")`
.pull-left-small[

```r
full_join(df1, df2, by = "id")
```

```
## id trt value stress bpm
## 1 1 A 5 NA NA
## 2 2 B 3 NA NA
## 3 3 C 7 NA NA
## 4 4 A 1 0 65
## 5 4 A 1 1 125
## 6 5 B 2 0 74
## 7 5 B 2 1 136
## 8 6 C 3 NA NA
## 9 7 <NA> NA 0 48
## 10 7 <NA> NA 1 110
```
]
.pull-right-large[

- all ids are kept, missings are filled in with `NA`

]
---
class: yourturn
# Your turn

For this exercise, use the data from the `Lahman` package.

- Join (relevant pieces of) the `Master` data set and the `HallOfFame` data.

- Find all Hall of Famers who were alive as of 2019. (use the data resulted from the joining done in question 1)

- How many attempts at being inducted to the HoF does Sammy Sosa have already? (use the data resulted from the joining done in question 1)

---
## Traps of joins

- sometimes we unexpectedly cannot match values: missing values, different spelling, ...

- join can be along multiple variables, e.g. `by = c("ID", "Date")`

- joining variable(s) can have different names, e.g. `by = c("State" = "Name")`

- always make sure to check dimensions of data before and after a join

- check on missing values:

- to get help with that, we can use `anti_join()`, a filtering join

---
## Filtering joins

Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables.

There are two types:

- `anti_join(x, y, by = "key")` drops all observations in `x` that have a match in `y`.

- `semi_join(x, y, by = "key")` keeps all observations in `x` that have a match in `y`.

---
## `anti_join(x, y, by = "key")`

.pull-left-small[

```r
anti_join(df1, df2, by="id") # no values for id in df2
```

```
##   id trt value
## 1  1   A     5
## 2  2   B     3
## 3  3   C     7
## 4  6   C     3
```

```r
anti_join(df2, df1, by="id") # no values for id in df1
```

```
##   id stress bpm
## 1  7      0  48
## 2  7      1 110
```
]
.pull-right-large[
- keeps rows the *don't* have a match

- be careful, it is NOT symmetric!

- useful for diagnosing join mismatches.

]
---
class: yourturn
# Your turn

For this exercise, use the data from the `classdata` package

- Load the `classdata` package into your R session.

- Investigate data sets `box` and `budget`.

- Join the two datasets to incorporate the release date of movies into the box office gross.

- Check on the dimensions of the data sets before and after the join. Where are potential problems?

- Use `anti_join` to detect problematic cases.

---
class: inverse, middle, center
# As requested, more examples!

---
## `nycflights13`

The `nycflights13` package contains 4 data sets related to the `flights` data that we have used previously.

- `airlines` lets you look up the full carrier name from its abbreviated code

- `airports` gives information about each airport, identified by the faa airport code:

- `planes` gives information about each plane, identified by its `tailnum`:

- `weather` gives the weather at each NYC airport for each hour:

---
- `airlines` lets you look up the full carrier name from its abbreviated code

```
## # A tibble: 2 x 2
## carrier name 
## <chr> <chr> 
## 1 9E Endeavor Air Inc. 
## 2 AA American Airlines Inc.
```

- `airports` gives information about each airport, identified by the faa airport code:

```
## # A tibble: 2 x 8
## faa name lat lon alt tz dst tzone 
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> 
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_Yo…
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/Chicago
```

- `planes` gives information about each plane, identified by its `tailnum`:

```
## # A tibble: 2 x 9
## tailnum year type manufacturer model engines seats speed engine 
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr> 
## 1 N10156 2004 Fixed wing mul… EMBRAER EMB-14… 2 55 NA Turbo-…
## 2 N102UW 1998 Fixed wing mul… AIRBUS INDUSTR… A320-2… 2 182 NA Turbo-…
```

- `weather` gives the weather at each NYC airport for each hour:

```
## # A tibble: 2 x 15
## origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
## # … with 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
## # time_hour <dttm>
```

---
background-image: url(https://d33wubrfki0l68.cloudfront.net/245292d1ea724f6c3fd8a92063dcd7bfb9758d02/5751b/diagrams/relational-nycflights.png)

One way to show the relationships between the different tables is with a drawing:

---
## `nycflights13` relations

- `flights` connects to `planes` via a single variable, `tailnum.`

- `flights` connects to `airlines` through the `carrier` variable.

- `flights` connects to `airports` in two ways: via the `origin` and `dest` variables.

- `flights` connects to `weather` via `origin` (the location), and `year`, `month`, `day` and `hour` (the time).

---
## defining the variables to join by

The default, `by = NULL`, uses all variables that appear in both tables.

For example, the `flights` and `weather` tables have the common vaiables `year`, `month`, `day`, `hour`, and `origin`, but we do not need to define this when calling `left_join()`:

```r
flights %>% left_join(weather)
```

```
## Joining, by = c("year", "month", "day", "origin", "hour", "time_hour")
```

```
## # A tibble: 336,776 x 28
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 20 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, temp <dbl>,
## # dewp <dbl>, humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>
```

---
## defining the variables to join by

Instead, we can request that only some on the common variables be used by using a character vector, `by = "x"`:

For example, `flights` and `planes` have `year` variables, but they mean different things, so we only want to join by `tailnum`:

(note the changes to the `year` variables)

```r
flights %>% left_join(planes, by = "tailnum")
```

```
## # A tibble: 336,776 x 27
## year.x month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 19 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, year.y <int>,
## # type <chr>, manufacturer <chr>, model <chr>, engines <int>, seats <int>,
## # speed <int>, engine <chr>
```

---
## defining the variables to join by

We can use a named character vector, `by = c("a" = "b")`, when the key is not named the same in the two table. This will match variable `a` in table `x` to variable `b` in table `y`. The variables from `x` will be used in the output.

We want to combine `flights` with `airports` which contains the location (`lat` and `lon`) of each airport, but each flight has an `origin` and `destination` airport, so we need to specify which one we want to join to:

```r
flights %>% left_join(airports, by = c("dest" = "faa"))
```

```
## # A tibble: 336,776 x 26
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 18 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, name <chr>,
## # lat <dbl>, lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>
```

---
class: yourturn
# YOUR TURN

For this exercise, use the data from the `nycflight13` package.

- Add the location of the origin and destination (i.e. the `lat` and `lon`) from the `airports` data to the `flights` data

- Is there a relationship between the age of a plane and its arrival delays?  
  - Join the tables `flights` and `planes` and calculate the average arrival delay for each age of a flight. Since there are few planes older than 25 years, truncate age at 25 years.
  - Plot age against the average arrival delay.

- What weather conditions make it more likely to see a departure delay?  
  - Join the tables `flights` and `weather`, calculate the mean departure delay for each amount of precipitation, and plot your results.

---
## Add'l filtering join examples

Recall that anti-joins are useful for diagnosing join mismatches.

For example, when connecting `flights` and `planes`, you might be interested to know that there are many `flights` that don’t have a match in `planes`:

```r
flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = TRUE)
```

```
## # A tibble: 722 x 2
## tailnum n
## <chr> <int>
## 1 <NA> 2512
## 2 N725MQ 575
## 3 N722MQ 513
## 4 N723MQ 507
## 5 N713MQ 483
## 6 N735MQ 396
## 7 N0EGMQ 371
## 8 N534MQ 364
## 9 N542MQ 363
## 10 N531MQ 349
## # … with 712 more rows
```

---
class: yourturn
# YOUR TURN

For this exercise, use the data from the `nycflight13` package.

What does `anti_join(flights, airports, by = c("dest" = "faa"))` tell you?

What does `anti_join(airports, flights, by = c("faa" = "dest"))` tell you?

---
## Resources

- reference/document: http://dplyr.tidyverse.org/reference/

- RStudio cheat sheet for [dplyr](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf)