Cleaning data: messy 3 & 4

class: center, middle, inverse, title-slide

# Cleaning data: messy 3 & 4

---

class: inverse, middle, center

# Messy (3): 
## Multiple observational units are stored in the same table.

---
class: inverse, center, middle
# Keys & Measurements

---
## Finding your keys - Example (1)

100 patients are randomly assigned to a treatment for heart attack, measured 5 different clinical outcomes.

- **key**: patient ID

- **factor variable (design)**: treatment

- **measured variables**: 5 clinical outcomes

---
## Finding your keys - Example (2)

Randomized complete block trial with four fields, four different types of fertilizer, over four years.

Recorded total corn yield, and fertilizer run off.

- **key**: fields, types of fertilizer, year

- **measurement**: total corn yield, fertilizer run off

---
## Finding your keys - Example (3)

Cluster sample of twenty students in thirty different schools.  For each school, recorded distance from ice rink.

For each student, asked how often they go ice skating, and whether or not their parents like ice skating

- **key**: student ID, school ID

- **measurement**: distance to rink, # of times ice skating, parents' preference

---
## Finding your keys - Example (4)

For each person, recorded age, sex, height and target weight, and then at multiple times recorded their weight

- **key**: *patient ID*, date

- **measurement**: *age, sex, height, target weight*, current weight

&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; *only patient ID is needed for variables in italics*

---
## Messy (3)

**Messy (3)**: Multiple observational units are stored in the same table.

**What does that mean?** The key is split,  
- i.e. for some values all key variables are necessary, while other values only need some key variables.

![](images/normal-not-2.png)

---
## Why do we need to take care of split keys?

- Data redundancy introduces potential problems  
  - same student *should* have the same student ID

- to check data consistency, we split data set into parts  
  - this process is called **normalizing**

- normalization reduces overall data size

- useful way of thinking about objects under study

---
## Tidying Messy (3)

Splitting into separate datasets:

![](images/normal-split.png)

---
## Example: Box office gross

The-Numbers website publishes [weekly charts](http://www.the-numbers.com/weekend-box-office-chart) of the gross income of all movies playing across the US. A set of cleaned data called `box` with movies for the last five years is available in the `classdata` package.

```r
# devtools::install_github("haleyjeppson/classdata")
data(box, package = "classdata")
head(box, 4)
```

```
##   Rank Rank.Last.Week             Movie        Distributor    Gross Change Thtrs. Per.Thtr.
## 1    1              1             Joker       Warner Bros. 55861403    -42   4374     12771
## 2    2             NA The Addams Family     United Artists 30300007     NA   4007      7562
## 3    3             NA        Gemini Man Paramount Pictures 20552372     NA   3642      5643
## 4    4              2        Abominable          Universal  6072235    -49   3496      1737
##   Total.Gross Week       Date
## 1   193590190    2 2019-10-11
## 2    30300007    1 2019-10-11
## 3    20552372    1 2019-10-11
## 4    47873585    3 2019-10-11
```

What are the key variables? Why is the key split?

---
## Keys and measurements

- **Key variables**: `Movie` name,  `Date` and `Distributor`.

- **Measurement variables**: `Gross`, `Thtrs.`

- All other variables are derived from these variables

- **Good practice**: re-calculate the derived variabes to check for consistency.

---
## Taking care of the split key

**Plan**:  
- separate movie information from box office information

**Idea for separation**:  
- we want to get a set of movies together with their distributor and *ideally* with their release date (which we do not have)

- instead of release date, we can get the date of the first time that we see a movie in the boxoffice

- let's also keep track of how many weeks a movie has been released at that time (should be 1 - when will it be different?)

---
class: yourturn
# Your turn

For this your turn use the `box` data from the `classdata` package.

- **Big goal**: we want to create a new dataset `movie` that consists of movie, distributor, date of first time the movie shows up in the box office, and the number of weeks the movie has been released at that time.

- What are the key variables for the new dataset?

- For the key variable(s), use `summarize()` to find the first time a movie shows up in the box office and find the related number of weeks.

---
## Key variables

Does `Movie` uniquely describe a movie?

```r
movies <- box %>% select(Movie, Distributor) %>% distinct()
```

Does that make a movie unique?

```r
movies %>% count(Movie) %>% arrange(desc(n))
```

```
## # A tibble: 4,144 x 2
## Movie n
## <chr> <int>
## 1 Beauty and the Beast 2
## 2 Breathe 2
## 3 Champion 2
## 4 Concussion 2
## 5 Entertainment 2
## 6 Girlhood 2
## 7 Gold 2
## 8 Holy Hell 2
## 9 Iceman 2
## 10 Leviathan 2
## # … with 4,134 more rows
```

---
## Movie data - take 2

Get the week info for the first time we see each movie and distributor combo:

```r
movies <- box %>% 
 group_by(Movie, Distributor) %>% 
 summarise(
 firstDate = Date[which.min(Week)], 
 firstWeek = min(Week, na.rm=TRUE))

head(movies)
```

```
## # A tibble: 6 x 4
## # Groups: Movie [6]
## Movie Distributor firstDate firstWeek
## <chr> <chr> <date> <dbl>
## 1 '71 Roadside Attractions 2015-02-27 1
## 2 [REC] 4: Apocalypse Magnolia Pictures 2015-01-02 1
## 3 1,000 Rupee Note Kino Lorber 2016-09-23 1
## 4 1,000 Times Goodnight Film Movement 2014-10-24 1
## 5 10 Cloverfield Lane Paramount Pictures 2016-03-11 1
## 6 10 Days in a Madhouse Cafe Pictures 2015-11-20 2
```

---
## Looking into inconsistencies

```r
movies %>% group_by(Movie) %>% mutate(n = n()) %>% arrange(desc(n))
```

```
## # A tibble: 4,172 x 5
## # Groups: Movie [4,144]
## Movie Distributor firstDate firstWeek n
## <chr> <chr> <date> <dbl> <int>
## 1 Beauty and the Beast Lopert Pictures Cor... 2016-02-12 1 2
## 2 Beauty and the Beast Walt Disney 2017-03-17 1 2
## 3 Breathe Bleecker Street 2017-10-13 1 2
## 4 Breathe Film Movement 2015-09-11 1 2
## 5 Champion ArtAffects 2017-05-26 2 2
## 6 Champion Well Go USA 2018-05-11 1 2
## 7 Concussion Radius 2013-10-04 1 2
## 8 Concussion Sony Pictures 2015-12-25 1 2
## 9 Entertainment B4U Movies 2014-08-08 1 2
## 10 Entertainment Magnolia Pictures 2015-11-13 1 2
## # … with 4,162 more rows
```

---
## Using IMDb: Beauty and the Beast

According to IMDb there are at least four movies released with the name Beauty and the Beast: 2017 (Walt Disney Pictures), 2014 (Spanish Production), 1991 (Walt Disner Pictures), and 1946 (French Production, released by Lopert Films in the US).

Can't distinguish between the 2017 and the 1991 movie. The Lopert produced movie was only shown for two weeks in at most 3 theaters.

```r
box %>% dplyr::filter(Movie=="Beauty and the Beast") %>% tail(6)
```

```
##    Rank Rank.Last.Week                Movie            Distributor     Gross Change Thtrs. Per.Thtr.
## 14    2              2 Beauty and the Beast            Walt Disney  23652605    -48   3969      5959
## 15    2              1 Beauty and the Beast            Walt Disney  45420743    -50   4210     10789
## 16    1              1 Beauty and the Beast            Walt Disney  90426717    -48   4210     21479
## 17    1             NA Beauty and the Beast            Walt Disney 174750616     NA   4210     41508
## 18   72             65 Beauty and the Beast Lopert Pictures Cor...      4718    -43      3      1573
## 19   65             NA Beauty and the Beast Lopert Pictures Cor...      8264     NA      1      8264
##    Total.Gross Week       Date
## 14   430946639    4 2017-04-07
## 15   393337585    3 2017-03-31
## 16   319032604    2 2017-03-24
## 17   174750616    1 2017-03-17
## 18       20324    2 2016-02-19
## 19        8264    1 2016-02-12
```

---
## Using IMDb (2)

- Girlhood is the name of two movies - one that was released in 2003, one in 2014; most likely the Oct 4 boxoffice mention is only mistakenly referring to the 2003 movie

```r
box %>% dplyr::filter(Movie=="Girlhood") %>% head(6)
```

```
##   Rank Rank.Last.Week    Movie Distributor Gross Change Thtrs. Per.Thtr. Total.Gross Week       Date
## 1   80             88 Girlhood      Strand  1763    278      1      1763       60019   15 2015-05-08
## 2   88             99 Girlhood      Strand   467    -67      1       467       58148   14 2015-05-01
## 3   99            100 Girlhood      Strand  1431    343      3       477       56566   13 2015-04-24
## 4  100            102 Girlhood      Strand   323     55      1       323       54712   12 2015-04-17
## 5  101             NA Girlhood               209     NA      1       209       53938  598 2015-04-10
## 6  102             81 Girlhood      Strand   209    -90      5        42       52907   11 2015-04-10
```

---
## Using IMDb:

- Mama Africa refers to two movies, one released in 2002, one in 2011; likely the duplicate  on Jan 19 is erroneous, but we still don't know which of the two movies is showing (in 1 theater)

```r
box %>% dplyr::filter(Movie=="Mama Africa")
```

```
##   Rank Rank.Last.Week       Movie           Distributor Gross Change Thtrs. Per.Thtr. Total.Gross
## 1   88             79 Mama Africa ArtMattan Productions   481    -72      2       241        9808
## 2   79             67 Mama Africa ArtMattan Productions  1689    -52      2       845        8123
## 3   67             NA Mama Africa ArtMattan Productions  3495     NA      1      3495        3495
## 4   69             NA Mama Africa                        3187     NA      1      3187        3187
##   Week       Date
## 1    3 2018-02-02
## 2    2 2018-01-26
## 3    1 2018-01-19
## 4  816 2018-01-19
```

---
## why do we normalize?

- Normalization helps identify inconsistencies in data

- Checking up on inconsistencies is a lot of manual labor

---
## Resources

- reference/document: http://tidyr.tidyverse.org/reference/

- RStudio cheat sheet for [tidyr](https://github.com/rstudio/cheatsheets/blob/master/data-import.pdf)