dplyr functions

class: center, middle, inverse, title-slide

# dplyr functions

---

background-image: url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/dplyr.png)
background-size: 120px
background-position: 92% 5%

## the `dplyr` package

Functions are thought of as **verbs** that manipulate data frames

- `filter()`: pick rows by matching some criteria

- `slice()` pick rows using index(es)

- `select()`: select columns of a data frame by name

- `pull()`: grab a column as a vector

- `arrange()`: reorder the rows of a data frame

- `mutate()`: add new or change existing columns of the data frame (as functions of existing columns)

- `summarize()`: collapse many values down into a summary of the data frame

- ...

These can all be used with `group_by()` which changes the scope of function from entire dataset to group-by-group.

---
background-image: url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/dplyr.png)
background-size: 120px
background-position: 92% 5%
## the `dplyr` package

Rules for functions:

- First argument is always a data frame; e.g. `function(data, arg2, ...)`
  
  - Subsequent arguments say what to do with that data frame

- Always return a data frame

- Don't modify in place
 
 
The common structure makes it easy to chain together multiple simple steps using the pipe function (`%>%`)

-  e.g: `function(data, ...)` becomes `data %>% function(...)`

---
## Keywords: which function is it?

- `mutate()`: introduce, replace, reorder, ...

- `summarise()`: calculate, average, summary, ...

- `group_by()`: for each of,

- `filter()`: exclude, only consider, subset, ...

---
## Potential traps

- using the `$` notation in tidyverse can lead to strange behavior and error messages

- don't forget to save statements back into the dataset (`mutate()`, `arrange()`) or new data objects (`summarise()`, `filter()`)

- when using the pipe `%>%`: what is output from lhs, first parameter on rhs?

---
class: inverse, center, middle
# Examples

---
## Set-up

To explore the basic data manipulation verbs of dplyr, we’ll use the `flights` data found in the `nycflights13` package.

This data contains 336,776 flights that departed from NYC in 2013.

```r
library(dplyr)
library(nycflights13)
data(flights, package = "nycflights13")
flights
```

```
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # … with 336,766 more rows, and 12 more variables:
## # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
```

---
## Create a subset

### `filter()`

filter to select a subset of rows: `flights %>% filter(month == 1)`

filter for many conditions at once: `flights %>% filter(month == 1, day == 1)`

### `slice()`

slice for certain row numbers: `flights %>% slice(1:6)`

### `sample_n()` & `sample_frac()`

grab a random sample: `flights %>% sample_n(5, replace = FALSE)`

---
## Select columns

### `select()`

select to keep variables: `flights %>% select(year, month, day)`

select to exclude variables: `flights %>% select(-c(year, month, day))`

select a range of variables: `flights %>% select(year:day)`

### Helper functions to use within `select()`:

matches names that begin with "abc": `starts_with("abc")`

matches names that end with "xyz": `ends_with("xyz")`

matches names that contain "ijk": `contains("ijk")`

---
## & more ...

### `arrange()`

arrange a data set by the values in one or more variables:  
&nbsp;&nbsp;&nbsp;&nbsp; `flights %>% arrange(year, month, day)`

arrange in the reverse the order:  
&nbsp;&nbsp;&nbsp;&nbsp; `flights %>% arrange(desc(dep_delay))`

### `distinct()`

filter for unique rows:  
&nbsp;&nbsp;&nbsp;&nbsp;  `flights %>% select(dest) %>% distinct() %>% arrange(dest))`

### `pull()`

pull to extract a column as a vector:  
&nbsp;&nbsp;&nbsp;&nbsp;  `flights %>% slice(1:6) %>% pull(carrier)`

---
class: yourturn
# Your turn

- Find all flights that had an arrival delay of two or more hours.

- Find all flights that were delayed by at least an hour, but made up over 30 minutes in flight

- Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights.`

- What happens if you include the name of a variable multiple times in a `select()` call?

- Sort flights to find the most delayed flights. Find the flights that left earliest.

- Which flights traveled the longest? Which traveled the shortest?

---
## Add or transform variables

### `mutate()`

mutate to add new variables:

&nbsp;&nbsp;&nbsp;&nbsp; `flights %>% mutate(gain = dep_delay - arr_delay)`

refer to columns that you’ve just created:

&nbsp;&nbsp;&nbsp;&nbsp; `flights %>% `  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; `mutate(`  
&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; `gain = dep_delay - arr_delay,`  
&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; `hours = air_time / 60,`  
&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;  `gain_per_hour = gain / hours`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; `)`

transform existing variable:

&nbsp;&nbsp;&nbsp;&nbsp; `flights %>% mutate(origin = factor(origin))`

---
class: yourturn
# Your turn

- Calculate the air speed of each flight in miles per minute (create `mpm`) using `distance` and `air_time`.

- Add to the `mutate()` statement from above to calculate the air speed in miles per hour (create `mph`).

- Compute `arr_hour` and `arr_min` from `arr_time`. **Hint**: use modular arithmetic `%/%` for hour and `%%` for minute

- Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Check `?min_rank` vs. `?row_number` vs. `?dense_rank` Print the top 10 in descending order according to the amount of delay.

- Compare `dep_time`, `sched_dep_time`, and `dep_delay.` What do you expect to see? What do you see? What do you need to do to fix it?

---
## Grouping

### `group_by()`

Do calculations on groups.

When you use the dplyr verbs on a grouped data frame they'll be automatically applied "by group".

---
## Create summaries

### `summarise()`

summarise to reduce variables to a value:

&nbsp;&nbsp;&nbsp;&nbsp; `flights %>% `  
&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; `summarise(delay = mean(dep_delay, na.rm = TRUE))`

### `group_by()` & `summarise()`

summarise groups:

&nbsp;&nbsp;&nbsp;&nbsp; `flights %>% `  
&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;   `group_by(dest) %>% `  
&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; `summarise(`  
&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; `dist = mean(distance, na.rm = TRUE),`  
&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; `delay = mean(arr_delay, na.rm = TRUE)`  
&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; `)`

---
## Grouping helper functions

### `n()`

number of rows: `flights %>% group_by(dest) %>% summarise(count = n())`

### `tally()`

A combination of `summarise()` and `n()`: `flights %>% group_by(dest) %>% tally()`

### `count()`

A further shortcut of `group_by()` and `tally()`: `flights %>% count(dest)`

You can even add a weight variable: `flights %>% count(tailnum, wt = distance)`

---
## Grouped mutates & filters

Filter per group:

&nbsp;&nbsp;&nbsp;&nbsp; `flights %>% `  
&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; `group_by(dest) %>% `  
&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; `filter(n() >365)`

Compute per group metrics:

&nbsp;&nbsp;&nbsp;&nbsp; `flights %>%`  
&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; `  filter(arr_delay > 0) %>%`  
&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; `mutate(prop_delay = arr_delay / sum(arr_delay)) %>%`  
&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; `select(year:day, dest, arr_delay, prop_delay) %>%`  
&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; `arrange(desc(prop_delay))`

---
class: yourturn
# Your turn

- Calculate the average delay per date.

- What time of day should you fly if you want to avoid delays as much as possible?

- Explore the relationship between the distance and average delay for each destination. Also calculate the number of flights flown to each destination.

---
class: yourturn
# Your turn

- Which carrier has the worst delays?

- Rank airlines by the number of destinations that they fly to, considering only those airports that are flown to by two or more airlines.
  
- Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

---
## Grouping & ungrouping

### `ungroup()`

remove a grouping structure from a data set: 

&nbsp;&nbsp;&nbsp;&nbsp; `flights %>% group_by(carrier) %>% select(origin)`

&nbsp;&nbsp;&nbsp;&nbsp; `flights %>% group_by(carrier) %>% ungroup() %>% select(origin)`

&nbsp;&nbsp;&nbsp;&nbsp; `daily <- flights %>% group_by(year, month, day)`

&nbsp;&nbsp;&nbsp;&nbsp; `daily %>% summarise(flights = n())`

&nbsp;&nbsp;&nbsp;&nbsp; `daily %>% ungroup() %>% summarise(flights = n())`

---
## Resources

- reference/document: http://dplyr.tidyverse.org/reference/

- RStudio cheat sheet for [dplyr](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf)