Tidying data

class: center, middle, inverse, title-slide

# Tidying data

---

## Outline: 
### - Dealing with messy (2)

### - `tidyr` functions: &nbsp;&nbsp;&nbsp;&nbsp; - `separate()` & `unite()`

### - `readr` functions: &nbsp;&nbsp;&nbsp;&nbsp; - `parse_number()` & others...

---
class: inverse, center, middle
# Messy (2): 
## Multiple variables are stored in one column

---
## one col `\(\rightarrow\)` multiple cols

Use `separate()` to split values wherever a separator character appears

&nbsp;&nbsp;&nbsp;&nbsp; `data %>%` &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; `separate(combined_var, into = c("var1", "var2"))`

`separate()` arguments:

1. the name of the column to separate 
 
 2. the names of the columns to separate into
 
 
Optional `separate()` arguments:

- `sep` : a specific character to use to separate a column (interpreted as a regular expression) OR a vector of integers (positions to split at)

- `convert = TRUE` : attempt to convert variable types to something more suitable than the existing type.

- `remove = FALSE` : remove input column from output data frame.
---

```r
df <- data_frame(x = c(NA, "a.1", "a.4", "b.3"))
df 
```

```
## # A tibble: 4 x 1
## x 
## <chr>
## 1 <NA> 
## 2 a.1 
## 3 a.4 
## 4 b.3
```

```r
df %>% separate(x, into = c("A", "B")) 
```

```
## # A tibble: 4 x 2
## A B 
## <chr> <chr>
## 1 <NA> <NA> 
## 2 a 1 
## 3 a 4 
## 4 b 3
```

```r
df %>% separate(x, into = c("A", "B"), sep = "([.])", convert = TRUE, remove = FALSE)
```

```
## # A tibble: 4 x 3
## x A B
## <chr> <chr> <int>
## 1 <NA> <NA> NA
## 2 a.1 a 1
## 3 a.4 a 4
## 4 b.3 b 3
```

---
class: yourturn
# Your Turn

The Iowa Data Portal is a wealth of information on and about the State of Iowa.

The website [Liquor Sales](https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy) provides data on every liquor sale in a licensed store in Iowa. The code below reads (part of) the data into an R session.

```
url <- "https://github.com/Stat579-at-ISU/materials/blob/master/03_tidyverse/data/Iowa_Liquor_Sales.csv.zip?raw=TRUE"
download.file(url, "iowa.zip", mode="wb")
iowa <- readr::read_csv("iowa.zip")
```

Assess the 'messiness' of the data. List issues that prevent us from working with the data directly. Which of these issues are of type (1) or (2) of messiness?

---
## Problems with the data:

No **messy 1**?  
- problems of type **Messy 1** are typically hard to detect and often up to interpretation/dependent on the analysis to be done.

**messy 2**? 
- `Date` is text, in the format of Month/Day/Year

- store location is a textual expression of form POINT (...) and geographic latitude and longitude.

**Other messiness**?

---
background-image: url(https://github.com/allisonhorst/stats-illustrations/raw/master/rstats-artwork/parse_number.png)
background-size: 400px
background-position: 50% 90%
## Some extra tidying

Use `parse_number()` from the `readr` package to drop any non-numeric characters before or after the first number.

Other parsers: `col_skip()`, `cols_condense()`, `cols()`, `parse_datetime()`, `parse_factor()`, `parse_guess()`, `parse_logical()`, `parse_vector()`

---
class: yourturn
# Your Turn

-  Check the help for the function `parse_number()` in the `readr` package and use it on store location. What result did you get?

- Use `separate()` to split the date variable into year, month and day.

- Use `separate()` again to extract geographic latitude and longitude **Hint**: this will take multiple steps

---
class: yourturn
# Your Turn - exploration

Use `dplyr` functionality to answer the following questions:

1. What is the total amount spent on Liquor Sales?

2. What is the single largest sale (in volume/in dollar amount)?

3. Plot geographic longitude and latitude. Where are liquor sales in Ames happening?

---
## Resources

- reference/document: http://tidyr.tidyverse.org/reference/

- RStudio cheat sheet for [tidyr](https://github.com/rstudio/cheatsheets/blob/master/data-import.pdf)

- Artwork by [@allison_horst](https://twitter.com/allison_horst?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor)