class: center, middle, inverse, title-slide # Reshaping data
with
tidyr
--- class: inverse, center, top background-image: url(https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/tidyr_spread_gather.png) background-size: 450px background-position: 50% 70% # The `tidyr` package --- background-image: url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/tidyr.png) background-size: 120px background-position: 92% 5% ## Tidy data <br><br> ### Happy families are all alike; every unhappy </br> family is unhappy in its own way. <br></br> - Leo Tolstoy <br><br> **Resource**: follow along with the tidyr vignette - available as `vignette("tidy-data", package="tidyr")` - vignette is version of the [tidy data paper](https://www.jstatsoft.org/article/view/v059i10) with updated code --- ## Outline: <br> ### - What is tidy data? ### - Different sources of messiness ### - Key-Value pairs ### - `tidyr` functions: <br><br> - `pivot_longer()` & `pivot_wider()` --- ## Data is usually in a spreadsheet format, but ... There are different ways to encode the same information: .left-third[ .center[ <br> Option #1 ] ] .right-two-thirds[ ``` ## name treatmenta treatmentb ## 1 John Smith NA 18 ## 2 Jane Doe 4 1 ## 3 Mary Johnson 6 7 ``` ] .left-third[ .center[ <br><br> Option #2 <br> ] ] .right-two-thirds[ ``` ## treatment John.Smith Jane.Doe Mary.Johnson ## 1 a NA 4 6 ## 2 b 18 1 7 ``` ] .full[ Neither #1 nor #2 are "clean" versions of the data: observed information is part of the data structure; some implicit information is assumed ] --- background-image: url("images/r4ds_data-science-tidyverse.png") background-size: 600px background-position: 30% 75% ## Why does this matter? A "tidy" dataset, will be much easier to work with inside the tidyverse. --- background-image: url(https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png) background-size: 900px background-position: 50% 65% ## What is tidy data? 1. Each **variable** is one column. 2. Each **observation** is one row. 3. Each **value** must have its own cell. ??? A variable is a characteristic that is being measured, counted or described with data. Like: car type, salinity, year, population, or whale mass. An observation is a single “data point” for which the measure, count or description of one or more variables is recorded. For example, if you are recording variables height, mass, and color of dragons, then each dragon is an observation. A value is the recorded measure, count or description of a variable. --- ## Sources of Messiness 1. Column headers are values, not variable names.<br> e.g. *treatmenta, treatmentb* <br><br> 2. Multiple variables are stored in one column.<br> e.g. *Fall 2015, Spring 2016* or *"1301 8th St SE, Orange City, Iowa 51041 (42.99755, -96.04149)", "2102 Durant, Harlan, Iowa 51537 (41.65672, -95.33780)"* <br><br> 3. Multiple observational units are stored in the same table. <br><br> 4. A single observational unit is stored in multiple tables. --- ## Clean version of the example .left-third[ .center[ <br> Option #1 ] ] .right-two-thirds[ ``` ## name treatmenta treatmentb ## 1 John Smith NA 18 ## 2 Jane Doe 4 1 ## 3 Mary Johnson 6 7 ``` ] .left-third[ .center[ <br><br> Option #2 ] ] .right-two-thirds[ ``` ## treatment John.Smith Jane.Doe Mary.Johnson ## 1 a NA 4 6 ## 2 b 18 1 7 ``` ] .left-third[ .center[ <br><br> "tidy" <br> ] ] .right-two-thirds[ ``` ## treatment patient score ## 1 a John Smith NA ## 2 b John Smith 18 ## 3 a Jane Doe 4 ## 4 b Jane Doe 1 ## 5 a Mary Johnson 6 ## 6 b Mary Johnson 7 ``` ] --- ## Clean version of the example .left-third[ .center[ <br><br> "tidy" <br> ] ] .right-two-thirds[ ``` ## treatment patient score ## 1 a John Smith NA ## 2 b John Smith 18 ## 3 a Jane Doe 4 ## 4 b Jane Doe 1 ## 5 a Mary Johnson 6 ## 6 b Mary Johnson 7 ``` ] .full[ </br> - `treatment` and `patient` uniquely describe a single row in the dataset. - `treatment` and `patient` are **key variables**, - `score` is a **measurement variable** - this makes `treatment-patient` and `score` a **key-value pair** ] --- ## Key-value pairs (KVP) **Key-Value pairs** (KVP) - also *attribute-value*, *field-value*, *name-value*: abstract data representation that allows a lot of flexibility One way of telling whether a data set is tidy is to check that all keys for a value are aligned in one row: .pull-left[ .center[ ### Untidy data <img src="images/kvp-unhappy.png" width=150> ] ] .pull-right[ ### Tidy data <img src="images/kvp-happy.png" width=150> ] --- ## Tidying data: Plan of attack Very few functions are needed for tidying data: **Messy (1)**: `tidyr` functions `pivot_longer()` and `pivot_wider()`. - `pivot_longer(data, cols, names_to = "key", values_to = "value")`: take multiple columns and collapse into key-value pairs - `pivot_wider(data, names_from = name, values_from = key)`: spread a key-value pair across multiple columns. **Messy (2)**: `tidyr` function `separate(data, col, into, sep = "[^[:alnum:]]+")`: - separate one column into multiple columns **Messy (3)**: `dplyr` - some combination of the functions discussed previously **Messy (4)**: `dplyr` functionality `join` (and friends) to combine multiple data sets --- ## French fries data ```r data(french_fries, package="reshape2") ``` - data from sensory experiment conducted at Iowa State University in 2004 - investigators were interested in comparing effects of three different fryer oils on taste of fries ```r french_fries %>% head() ``` ``` ## time treatment subject rep potato buttery grassy rancid painty ## 61 1 1 3 1 2.9 0.0 0.0 0.0 5.5 ## 25 1 1 3 2 14.0 0.0 0.0 1.1 0.0 ## 62 1 1 10 1 11.0 6.4 0.0 0.0 0.0 ## 26 1 1 10 2 9.9 5.9 2.9 2.2 0.0 ## 63 1 1 15 1 1.2 0.1 0.0 1.1 5.1 ## 27 1 1 15 2 8.8 3.0 3.6 1.5 2.3 ``` --- ## wide data `\(\rightarrow\)` long data ```r french_fries %>% head() ``` ``` ## time treatment subject rep potato buttery grassy rancid painty ## 61 1 1 3 1 2.9 0.0 0.0 0.0 5.5 ## 25 1 1 3 2 14.0 0.0 0.0 1.1 0.0 ## 62 1 1 10 1 11.0 6.4 0.0 0.0 0.0 ## 26 1 1 10 2 9.9 5.9 2.9 2.2 0.0 ## 63 1 1 15 1 1.2 0.1 0.0 1.1 5.1 ## 27 1 1 15 2 8.8 3.0 3.6 1.5 2.3 ``` </br> To tidy a dataset like this, we need to pivot the offending columns into a new pair of variables. To describe that operation we need three parameters: 1. The **set of columns** whose names are values, not variables. In this example, those are the columns `potato`, `buttery`, `grassy`, `rancid`, and `painty`. 2. The name of the variable to move the **column names** to. Here it is `scale`. 3. The name of the variable to move the **column values** to. Here it is `score` Together those parameters generate the call to `pivot_longer()`: --- background-image: url(images/pivot-longer-charts.png) background-size: 900px background-position: 30% 45% ## wide data `\(\rightarrow\)` long data ```r pivot_longer(data, cols = c("a1", "a2", "a3"), names_to = "key", values_to = "value") ``` .footnote[ see full graphic [here](https://swcarpentry.github.io/r-novice-gapminder/14-tidyr/) ] --- ## Tidying the french fries data ```r ff_long <- french_fries %>% pivot_longer(cols = potato:painty, names_to = "scale", values_to = "score") ff_long ``` ``` ## # A tibble: 3,480 x 6 ## time treatment subject rep scale score ## <fct> <fct> <fct> <dbl> <chr> <dbl> ## 1 1 1 3 1 potato 2.9 ## 2 1 1 3 1 buttery 0 ## 3 1 1 3 1 grassy 0 ## 4 1 1 3 1 rancid 0 ## 5 1 1 3 1 painty 5.5 ## 6 1 1 3 2 potato 14 ## 7 1 1 3 2 buttery 0 ## 8 1 1 3 2 grassy 0 ## 9 1 1 3 2 rancid 1.1 ## 10 1 1 3 2 painty 0 ## # … with 3,470 more rows ``` --- ## long data `\(\rightarrow\)` wide data ```r ff_long %>% head() ``` ``` ## # A tibble: 6 x 6 ## time treatment subject rep scale score ## <fct> <fct> <fct> <dbl> <chr> <dbl> ## 1 1 1 3 1 potato 2.9 ## 2 1 1 3 1 buttery 0 ## 3 1 1 3 1 grassy 0 ## 4 1 1 3 1 rancid 0 ## 5 1 1 3 1 painty 5.5 ## 6 1 1 3 2 potato 14 ``` Sometimes, we’ll have data that we want to spread over multiple columns. For our data, we’ll use `pivot_wider()` to spread the replicates across multiple columns `pivot_wider()` is the opposite of `pivot_longer()`. This time we need two parameters: - The column to take variable names from. Here, it's `rep`. - The column to take values from. Here it's `score`. --- background-image: url(images/pivot-wider-chart.png) background-size: 900px background-position: 45% 50% ## long data `\(\rightarrow\)` wide data ```r pivot_wider(data, values_from = value, names_from = key) ``` --- ## Widening the french fries data ```r ff_long %>% pivot_wider(names_from = rep, values_from = score) %>% head() ``` ``` ## # A tibble: 6 x 6 ## time treatment subject scale `1` `2` ## <fct> <fct> <fct> <chr> <dbl> <dbl> ## 1 1 1 3 potato 2.9 14 ## 2 1 1 3 buttery 0 0 ## 3 1 1 3 grassy 0 0 ## 4 1 1 3 rancid 0 1.1 ## 5 1 1 3 painty 5.5 0 ## 6 1 1 10 potato 11 9.9 ``` --- ## Widening the french fries data ```r ff_long %>% pivot_wider(names_from = rep, values_from = score) %>% ggplot(aes(x = `1`, y = `2`)) + geom_point() + geom_abline(colour = "grey50") + facet_wrap(~scale) ``` ![](05_tidyr_files/figure-html/unnamed-chunk-15-1.png)<!-- --> --- class: yourturn # your turn For this your turn use the `french_fries` data from the `reshape2` package: `data("french_fries", package="reshape2")` - Use `pivot_longer()` from the `tidyr` package to combine the different scales for assessing french fries into a single variable. Call the key-value pair "scale" and "score". - Use `pivot_wider()` from the `tidyr` package to get a format in which you can directly compare values from week 1 to week 10. Plot a scatterplot of values in week 1 against week 10. Facet by treatment and scale, color by individuals and use different shapes for the replicates. Is there a pattern visible? --- class: yourturn # Your turn For this your turn use the `fbiwide` data from the `classdata` package: `data("fbiwide", package="classdata")` - Use `pivot_longer()` from the `tidyr` package to combine the variables for the different types of crimes into one variable. Call the key-value pair "Type" and "Incidences". Compute a crime rate - Only consider crime rates for Iowa and Minnesota. Use `pivot_wider()` to create incidence columns for each of these states. Plot crimes in Iowa against crimes in Minnesota, colour by type of crime. Note: you need to exclude some variables. --- background-image: url(images/inverse.png) background-size: 900px background-position: 45% 70% ## `pivot_wider()` works (*almost*) as an inverse of `pivot_longer()` --- background-image: url(images/inverse-issue.png) background-size: 900px background-position: 45% 70% ## Front part is not identical -<br> not collapsible --- ## Where `pivot_wider()` is tricky - `pivot_wider()` is the (almost) inverse operation of `pivot_longer()` - when we use `pivot_longer()` the key information is duplicated - when using `pivot_wider()` we need to make sure that the information outside the key and value is identical for all levels (and combinations of levels) of the `key` variable. - If the info is not identical across levels, missing values are introduced into the new data set. --- ## `pivot_wider()` in the example ```r fbi %>% dplyr::filter(Year==2014, State %in% c("Iowa", "Minnesota"), Type=="Burglary") ``` ``` ## State Abb Year Population Type Count Violent.crime ## 1 Iowa IA 2014 3107126 Burglary 14428 FALSE ## 2 Minnesota MN 2014 5457173 Burglary 20773 FALSE ``` Other state specific values don't allow spread to collapse the front part and introduce missing values: ```r fbi %>% dplyr::filter(Year==2014, State %in% c("Iowa", "Minnesota"), Type=="Burglary") %>% pivot_wider(names_from=State, values_from=Count) ``` ``` ## # A tibble: 2 x 7 ## Abb Year Population Type Violent.crime Iowa Minnesota ## <chr> <int> <int> <fct> <lgl> <int> <int> ## 1 IA 2014 3107126 Burglary FALSE 14428 NA ## 2 MN 2014 5457173 Burglary FALSE NA 20773 ``` --- ## fixing `pivot_wider()` in the ex. Negative selection of all state specific variables gets the example to "work": ```r fbi %>% dplyr::filter(Year==2014, State %in% c("Iowa", "Minnesota"), Type=="Burglary") %>% select(-Abb, -Population) %>% pivot_wider(names_from = State, values_from = Count) ``` ``` ## # A tibble: 1 x 5 ## Year Type Violent.crime Iowa Minnesota ## <int> <fct> <lgl> <int> <int> ## 1 2014 Burglary FALSE 14428 20773 ``` --- class: yourturn # Your turn - Why are `pivot_longer()` and `pivot_wider()` not perfectly symmetrical? Carefully consider the following example: ```r stocks <- tibble( year = c(2015, 2015, 2016, 2016), half = c( 1, 2, 1, 2), return = c(1.88, 0.59, 0.92, 0.17) ) stocks %>% pivot_wider(names_from = year, values_from = return) %>% pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return") ``` </br> **Hint**: look at the variable types and think about column names **Hint**: `pivot_longer()` has a `names_ptype` argument, e.g. `names_ptype = list(year = double())`. What does it do? --- ## Resources - reference/document: http://tidyr.tidyverse.org/reference/ - RStudio cheat sheet for [tidyr](https://github.com/rstudio/cheatsheets/blob/master/data-import.pdf) - Artwork by [@allison_horst](https://twitter.com/allison_horst?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor) - Chart infrographics adapted from [Software Carpentry](https://swcarpentry.github.io/r-novice-gapminder/14-tidyr/)