class: center, middle, inverse, title-slide # Tidying data --- ## Outline: <br> ### - Dealing with messy (2) ### - `tidyr` functions: <br><br> - `separate()` & `unite()` ### - `readr` functions: <br><br> - `parse_number()` & others... --- class: inverse, center, middle # Messy (2): ## Multiple variables are <br>stored in one column --- ## one col `\(\rightarrow\)` multiple cols Use `separate()` to split values wherever a separator character appears `data %>%` <br> `separate(combined_var, into = c("var1", "var2"))` <br> `separate()` arguments: 1. the name of the column to separate 2. the names of the columns to separate into <br> Optional `separate()` arguments: - `sep` : a specific character to use to separate a column (interpreted as a regular expression) OR a vector of integers (positions to split at) - `convert = TRUE` : attempt to convert variable types to something more suitable than the existing type. - `remove = FALSE` : remove input column from output data frame. --- ```r df <- data_frame(x = c(NA, "a.1", "a.4", "b.3")) df ``` ``` ## # A tibble: 4 x 1 ## x ## <chr> ## 1 <NA> ## 2 a.1 ## 3 a.4 ## 4 b.3 ``` ```r df %>% separate(x, into = c("A", "B")) ``` ``` ## # A tibble: 4 x 2 ## A B ## <chr> <chr> ## 1 <NA> <NA> ## 2 a 1 ## 3 a 4 ## 4 b 3 ``` ```r df %>% separate(x, into = c("A", "B"), sep = "([.])", convert = TRUE, remove = FALSE) ``` ``` ## # A tibble: 4 x 3 ## x A B ## <chr> <chr> <int> ## 1 <NA> <NA> NA ## 2 a.1 a 1 ## 3 a.4 a 4 ## 4 b.3 b 3 ``` --- class: yourturn # Your Turn The Iowa Data Portal is a wealth of information on and about the State of Iowa. The website [Liquor Sales](https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy) provides data on every liquor sale in a licensed store in Iowa. The code below reads (part of) the data into an R session. ``` url <- "https://github.com/Stat579-at-ISU/materials/blob/master/03_tidyverse/data/Iowa_Liquor_Sales.csv.zip?raw=TRUE" download.file(url, "iowa.zip", mode="wb") iowa <- readr::read_csv("iowa.zip") ``` Assess the 'messiness' of the data. List issues that prevent us from working with the data directly. Which of these issues are of type (1) or (2) of messiness? --- ## Problems with the data: No **messy 1**? - problems of type **Messy 1** are typically hard to detect and often up to interpretation/dependent on the analysis to be done. <br> **messy 2**? - `Date` is text, in the format of Month/Day/Year - store location is a textual expression of form POINT (...) and geographic latitude and longitude. <br> **Other messiness**? --- background-image: url(https://github.com/allisonhorst/stats-illustrations/raw/master/rstats-artwork/parse_number.png) background-size: 400px background-position: 50% 90% ## Some extra tidying Use `parse_number()` from the `readr` package to drop any non-numeric characters before or after the first number. Other parsers: `col_skip()`, `cols_condense()`, `cols()`, `parse_datetime()`, `parse_factor()`, `parse_guess()`, `parse_logical()`, `parse_vector()` --- class: yourturn # Your Turn - Check the help for the function `parse_number()` in the `readr` package and use it on store location. What result did you get? - Use `separate()` to split the date variable into year, month and day. - Use `separate()` again to extract geographic latitude and longitude <br> **Hint**: this will take multiple steps --- class: yourturn # Your Turn - exploration Use `dplyr` functionality to answer the following questions: 1. What is the total amount spent on Liquor Sales? 2. What is the single largest sale (in volume/in dollar amount)? 3. Plot geographic longitude and latitude. Where are liquor sales in Ames happening? --- ## Resources - reference/document: http://tidyr.tidyverse.org/reference/ - RStudio cheat sheet for [tidyr](https://github.com/rstudio/cheatsheets/blob/master/data-import.pdf) - Artwork by [@allison_horst](https://twitter.com/allison_horst?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor)