class: center, middle, inverse, title-slide # Data wrangling --- class: inverse, center, top background-image: url(https://github.com/allisonhorst/stats-illustrations/blob/master/rstats-artwork/dplyr_wrangling.png?raw=true) background-size: 650px background-position: 50% 85% # the `dplyr` package --- background-image: url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/dplyr.png) background-size: 120px background-position: 92% 5% ## the `dplyr` package Functions are thought of as **verbs** that manipulate data frames - `filter()`: pick rows by matching some criteria - `slice()` pick rows using index(es) - `select()`: select columns of a data frame by name - `pull()`: grab a column as a vector - `arrange()`: reorder the rows of a data frame - `mutate()`: add new or change existing columns of the data frame (as functions of existing columns) - `summarize()`: collapse many values down into a summary of the data frame - ... These can all be used with `group_by()` which changes the scope of function from entire dataset to group-by-group. --- background-image: url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/dplyr.png) background-size: 120px background-position: 92% 5% ## the `dplyr` package Rules for functions: - First argument is always a data frame - Subsequent arguments say what to do with that data frame - Always return a data frame - Don't modify in place <br/> A note on piping and layering: - The `%>%` operator in dplyr functions is called the "pipe operator". - This means you "pipe" the output of the previous line of code as the first input of the next line of code. - The `+` operator in ggplot2 functions is used for "layering". - This means you create the plot in layers, separated by `+`. --- ## `filter()` select a subset of the observations (horizontal selection): `filter(.data, ...)` specify constraints (as logical expression) to data in ... all constraints are combined by logical and & .left-third[
] .right-two-thirds[ .pull-left[
] .pull-right[
]] .footnote[ Make sure to always call `library(dplyr)` before using `filter()` ] --- ## `filter()` EXAMPLE From the `fbi` data, extract all burglaries in 2014: ```r library(classdata) library(dplyr) fbi %>% dplyr::filter(Type == "Burglary", Year == 2014) %>% head() ``` ``` ## State Abb Year Population Type Count Violent.crime ## 1 Alabama AL 2014 4849377 Burglary 39715 FALSE ## 2 Alaska AK 2014 736732 Burglary 3150 FALSE ## 3 Arizona AZ 2014 6731484 Burglary 43562 FALSE ## 4 Arkansas AR 2014 2966369 Burglary 24790 FALSE ## 5 California CA 2014 38802500 Burglary 202670 FALSE ## 6 Colorado CO 2014 5355866 Burglary 23472 FALSE ``` --- ## `arrange()` `arrange()` sorts a data set by the values in one or more variables - successive variables break ties in previous ones - `desc()` stands for descending, otherwise rows are sorted from smallest to largest .left-third[
] .right-two-thirds[ .pull-left[
] .pull-right[
]] --- ## `arrange()` Example ```r fbi %>% arrange(desc(Year), Type, desc(Count)) %>% head() ``` ``` ## State Abb Year Population Type Count Violent.crime ## 1 California CA 2017 39536653 Aggravated.assault 104454 TRUE ## 2 Texas TX 2017 28304596 Aggravated.assault 76089 TRUE ## 3 Florida FL 2017 20984400 Aggravated.assault 58031 TRUE ## 4 New York NY 2017 19849399 Aggravated.assault 43819 TRUE ## 5 Tennessee TN 2017 6715984 Aggravated.assault 32432 TRUE ## 6 Illinois IL 2017 12802023 Aggravated.assault 32060 TRUE ``` --- ## `select()` Select specific variables of a data frame (vertical selection): `select(.data, ...)` specify all variables you want to keep Variables can be selected by index, e.g. `1:5`, by name (don't use quotes), or using a selector function, such as `starts_with()` Negative selection also works, e.g. `-1` (not the first variable) .left-third[
] .right-two-thirds[ .pull-left[
] .pull-right[
]] --- ## `select()` EXAMPLE Select `Type`, `Count`, `State`, and `Year` from the `fbi` data: ```r fbi %>% arrange(desc(Year), Type, desc(Count)) %>% select(Type, Count, State, Year) %>% head() ``` ``` ## Type Count State Year ## 1 Aggravated.assault 104454 California 2017 ## 2 Aggravated.assault 76089 Texas 2017 ## 3 Aggravated.assault 58031 Florida 2017 ## 4 Aggravated.assault 43819 New York 2017 ## 5 Aggravated.assault 32432 Tennessee 2017 ## 6 Aggravated.assault 32060 Illinois 2017 ``` --- background-image: url(https://github.com/allisonhorst/stats-illustrations/raw/master/rstats-artwork/dplyr_mutate.png) background-size: 400px background-position: 50% 100% ## `mutate()` `mutate(.data, ...)` Introduce new variables into the data set or transform/update old variables multiple variables can be changed/introduced `mutate()` works sequentially: variables introduced become available in following changes --- ## `mutate()` EXAMPLE Introduce a variable Rate into the fbi data: ```r fbi %>% mutate(Rate = Count/Population*70000) %>% head() ``` ``` ## State Abb Year Population Type Count Violent.crime Rate ## 1 Alabama AL 1961 3302000 Murder.and.nonnegligent.Manslaughter 427 TRUE 9.052090 ## 2 Alabama AL 1962 3358000 Murder.and.nonnegligent.Manslaughter 316 TRUE 6.587254 ## 3 Alabama AL 1963 3347000 Murder.and.nonnegligent.Manslaughter 340 TRUE 7.110846 ## 4 Alabama AL 1964 3407000 Murder.and.nonnegligent.Manslaughter 316 TRUE 6.492515 ## 5 Alabama AL 1965 3462000 Murder.and.nonnegligent.Manslaughter 395 TRUE 7.986713 ## 6 Alabama AL 1966 3517000 Murder.and.nonnegligent.Manslaughter 384 TRUE 7.642877 ``` --- ## `summarise()` `summarize(.data, ...)` summarize observations into a (set of) one-number statistic(s): Creates a new dataset with 1 row and one column for each of the summary statistics .pull-left[
] .pull-left[
] --- ## `summarise()` EXAMPLE Calculate the mean and standard deviation of Crime rates in the `fbi` data ```r fbi %>% summarise(mean_rate = mean(Count/Population*70000, na.rm=TRUE), sd_rate = sd(Count/Population*70000, na.rm = TRUE)) ``` ``` ## mean_rate sd_rate ## 1 397.9617 612.8635 ``` --- ## `group_by()` & `summarise()` Power combo! for each combination of group levels, create one row of a (set of) one-number statistic(s) The new dataset has one column for each of the summary statistics, and one row for each combination of grouping levels (multiplicative) .left-third[
] .right-two-thirds[ .pull-left[
] .pull-right[
]] --- ## `group_by()` & `summarise()` example For each type of crime, calculate average crime rate and standard deviation. ```r fbi %>% group_by(Type) %>% summarise(mean_rate = mean(Count/Population*70000, na.rm=TRUE), sd_rate = sd(Count/Population*70000, na.rm = TRUE)) ``` ``` ## # A tibble: 8 x 3 ## Type mean_rate sd_rate ## <fct> <dbl> <dbl> ## 1 Aggravated.assault 167. 113. ## 2 Burglary 630. 311. ## 3 Larceny.theft 1657. 626. ## 4 Legacy.rape 20.2 10.8 ## 5 Motor.vehicle.theft 244. 158. ## 6 Murder.and.nonnegligent.Manslaughter 4.60 4.23 ## 7 Rape 29.6 12.3 ## 8 Robbery 88.5 103. ``` --- class: inverse, center, middle # Let's put these <br>tools to use --- ## French fries data ```r data(french_fries, package="reshape2") ``` - data from sensory experiment conducted at Iowa State University in 2004 - investigators were interested in comparing effects of three different fryer oils on taste of fries ```r french_fries %>% head() ``` ``` ## time treatment subject rep potato buttery grassy rancid painty ## 61 1 1 3 1 2.9 0.0 0.0 0.0 5.5 ## 25 1 1 3 2 14.0 0.0 0.0 1.1 0.0 ## 62 1 1 10 1 11.0 6.4 0.0 0.0 0.0 ## 26 1 1 10 2 9.9 5.9 2.9 2.2 0.0 ## 63 1 1 15 1 1.2 0.1 0.0 1.1 5.1 ## 27 1 1 15 2 8.8 3.0 3.6 1.5 2.3 ``` --- ## Did the french fries taste worse over time? How do we define 'tasting well'? - Higher values of potato-y and buttery, lower values of grassy, rancid, painty - Compute averages of these scales for each time point - combination of `group_by` and `summarise` --- ## Did the french fries taste worse over time? (2) ```r french_fries %>% group_by(time) %>% summarise( m.potato = mean(potato, na.rm=TRUE), m.buttery = mean(buttery, na.rm=TRUE), m.grassy = mean(grassy, na.rm=TRUE), m.rancid = mean(rancid, na.rm=TRUE), m.painty = mean(painty, na.rm=TRUE) ) %>% head() ``` ``` ## # A tibble: 6 x 6 ## time m.potato m.buttery m.grassy m.rancid m.painty ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 8.56 2.24 0.942 2.36 1.65 ## 2 2 8.06 2.72 1.18 2.85 1.44 ## 3 3 7.80 2.10 0.75 3.72 1.31 ## 4 4 7.71 1.80 0.742 3.60 1.37 ## 5 5 7.33 1.64 0.635 3.53 2.02 ## 6 6 6.67 1.75 0.674 4.08 2.34 ``` --- ## Did the french fries taste worse over time? (3) ```r avgs %>% ggplot(aes(x = time)) + geom_point(aes(y = m.potato)) + geom_point(shape=2, aes(y=m.rancid)) + ylab("Average") ``` ![](01_dplyr_files/figure-html/unnamed-chunk-35-1.png)<!-- --> --- class: yourturn # Your turn For this your turn use the `french_fries` data from the `reshape2` package: `data(french_fries, package="reshape2")` - Do ratings of potato-y show a difference between the different oils over time? - Draw a plot of the average potato-y rating by time, color by treatment. - How does this plot look like for the rancid rating? --- class: yourturn # Your turn For this your turn use the `french_fries` data from the `reshape2` package: `data(french_fries, package="reshape2")` - How much consistency do we see between ratings? For buttery and rancid ratings find the mean and the absolute difference between the two replicates (for each subject, each treatment and each time point). Use `dplyr` functions to find this summary dataset. - Are ratings more different between the two scales? What would you need to draw a side-by-side boxplot of the two ratings? Describe in words. - Are some subjects in the study more consistent than others? Show the variability in the potato-y ratings by subjects. Order subjects from least variability to most. --- class: yourturn # Your turn For this your turn use the `french_fries` data from the `reshape2` package: `data(french_fries, package="reshape2")` For each subject, determine how many records are missing (either absent or NA). Expand on the above summary of missing values by taking time into account. Plot the result: plot number of missing values by week, facet by subject. Is the result surprising? --- ## Resources - reference/document: http://dplyr.tidyverse.org/reference/ - RStudio cheat sheet for [dplyr](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf) - Artwork by [@allison_horst](https://twitter.com/allison_horst?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor) - https://datasciencebox.org/