Due date: the homework is due before class on Thursday.
Submission process: submit both the R Markdown file and the corresponding html file on canvas. Please submit both the .Rmd
and the .html
files separately and do not zip the two files together.
Download the RMarkdown file with these homework instructions to use as a template for your work. Make sure to replace “Your Name” in the YAML with your name.
For this homework we use the data set box
from the classdata
package, which consists of weekly box office gross for movies of the last five year.
# devtools::install_github("haleyjeppson/classdata")
library(classdata)
library(tidyverse)
head(box)
## Rank Rank.Last.Week Movie Distributor Gross Change
## 1 1 1 Joker Warner Bros. 55861403 -42
## 2 2 NA The Addams Family United Artists 30300007 NA
## 3 3 NA Gemini Man Paramount Pictures 20552372 NA
## 4 4 2 Abominable Universal 6072235 -49
## 5 5 3 Downton Abbey Focus Features 4881075 -39
## 6 6 4 Hustlers STX Entertainment 3887018 -39
## Thtrs. Per.Thtr. Total.Gross Week Date
## 1 4374 12771 193590190 2 2019-10-11
## 2 4007 7562 30300007 1 2019-10-11
## 3 3642 5643 20552372 1 2019-10-11
## 4 3496 1737 47873585 3 2019-10-11
## 5 3019 1617 82668665 4 2019-10-11
## 6 2357 1649 98052357 5 2019-10-11
movies <- box %>%
group_by(Movie, Distributor) %>%
summarise(
firstDate = Date[which.min(Week)],
firstWeek = min(Week, na.rm=TRUE))
movies %>%
group_by(Movie) %>%
mutate(n = n()) %>%
filter(n>1) %>%
arrange(desc(n))
## # A tibble: 56 x 5
## # Groups: Movie [28]
## Movie Distributor firstDate firstWeek n
## <chr> <chr> <date> <dbl> <int>
## 1 Beauty and the Beast Lopert Pictures Cor... 2016-02-12 1 2
## 2 Beauty and the Beast Walt Disney 2017-03-17 1 2
## 3 Breathe Bleecker Street 2017-10-13 1 2
## 4 Breathe Film Movement 2015-09-11 1 2
## 5 Champion ArtAffects 2017-05-26 2 2
## 6 Champion Well Go USA 2018-05-11 1 2
## 7 Concussion Radius 2013-10-04 1 2
## 8 Concussion Sony Pictures 2015-12-25 1 2
## 9 Entertainment B4U Movies 2014-08-08 1 2
## 10 Entertainment Magnolia Pictures 2015-11-13 1 2
## # … with 46 more rows
box %>% dplyr::filter(Movie=="Concussion")
## Rank Rank.Last.Week Movie Distributor Gross Change Thtrs.
## 1 56 48 Concussion Sony Pictures 25212 -40 53
## 2 48 64 Concussion Sony Pictures 42106 252 93
## 3 64 56 Concussion Sony Pictures 11945 -59 24
## 4 56 44 Concussion Sony Pictures 29305 -46 45
## 5 44 39 Concussion Sony Pictures 54256 -40 56
## 6 39 36 Concussion Sony Pictures 90283 -44 116
## 7 36 28 Concussion Sony Pictures 160954 -38 315
## 8 28 19 Concussion Sony Pictures 258401 -73 315
## 9 19 10 Concussion Sony Pictures 947443 -69 1041
## 10 10 8 Concussion Sony Pictures 3029367 -61 2056
## 11 8 7 Concussion Sony Pictures 7846281 -25 2841
## 12 7 NA Concussion Sony Pictures 10513749 NA 2841
## 13 67 NA Concussion Radius 13897 NA 25
## 14 72 NA Concussion Radius 8216 NA 2
## Per.Thtr. Total.Gross Week Date
## 1 476 34531832 12 2016-03-11
## 2 453 34485471 11 2016-03-04
## 3 498 34436742 10 2016-02-26
## 4 651 34411070 9 2016-02-19
## 5 969 34354922 8 2016-02-12
## 6 778 34255169 7 2016-02-05
## 7 511 34099541 6 2016-01-29
## 8 820 33806210 5 2016-01-22
## 9 910 32991673 4 2016-01-15
## 10 1473 30947645 3 2016-01-08
## 11 2762 25266650 2 2016-01-01
## 12 3701 10513749 1 2015-12-25
## 13 556 38361 3 2013-10-18
## 14 4108 8216 1 2013-10-04
Two different movies, both titled ‘Concussion’ have been released. One was released in 2013, and the other in 2015. Below are the IMBd links to each respective movie:
2013 version: https://www.imdb.com/title/tt2296697/
2015 version: https://www.imdb.com/title/tt3322364/
The strategy I used to find this movie was the same one we went over in class. First, in line 44 I created a dataset called ‘movies’ that filters the ‘box’ data for distinct combinations of Movie and Distributor. In line 45, I count how many times every movie title appears in the ‘movies’ data set. Thus, movies with the same title but different distributors are counted under the same movie name. So, for the movie “Concussion”, its count was 2. This count of 2 reflects that the movie was released under two different distributors (different movie, same name). The 2013 version’s distributor is Radius, while the 2015 version’s distributor is Sony Pictures. To get the specifics on these movies, I filtered the original ‘box’ data set to look at only where the title ‘Concussion’ shows up. I did this in line 46.
Change
(percent change in gross income from last week), Rank.Last.Week
, Per.Thtr.
(as gross per theater), and Total.Gross
(as the cumulative sum of weekly gross).new_box <- box %>%
group_by(Movie, Distributor) %>%
arrange(Movie, Week) %>%
mutate(per_theater = round(Gross/Thtrs., 0),
total_gross = cumsum(Gross),
rank_last_week = lag(Rank, order_by = Week),
week_test = lag(Week, order_by = Week),
rank_last_week = ifelse(week_test == Week - 1, rank_last_week, NA),
change = round((Gross/lag(Gross, order_by = Week) - 1)*100,0),
change = ifelse(week_test == Week - 1, change, NA)) %>%
select(-week_test)
Per.Thtr.
and Change
compare the original variables and the newly derived ones. Are there differences? Where? Try to describe patterns you find.new_box %>%
group_by(Movie, Distributor) %>%
mutate(per_theater_diffs = Per.Thtr. - per_theater) %>%
summarise(max_diff = max(per_theater_diffs, na.rm = TRUE)) %>%
arrange(desc(max_diff))
## # A tibble: 4,172 x 3
## # Groups: Movie [4,144]
## Movie Distributor max_diff
## <chr> <chr> <dbl>
## 1 1,000 Times Goodnight Film Movement 1
## 2 1945 Menemsha Entertainment 1
## 3 20 Feet From Stardom RADiUS-TWC 1
## 4 20,000 Days on Earth Drafthouse Films 1
## 5 2019 Oscar Shorts ShortsHD 1
## 6 3 Faces Kino Lorber 1
## 7 30 Years of Garbage: The Ga... Indican Pictures 1
## 8 50 to 1 Ten Furlongs 1
## 9 56 Up First Run Features 1
## 10 7 Witches Indican Pictures 1
## # … with 4,162 more rows
The only differences in the original and newly derived Per.Thtr.
variables look to be rounding errors.
new_box %>%
group_by(Movie, Distributor) %>%
mutate(change_diffs = Change - change) %>%
summarise(max_diff = max(change_diffs, na.rm = TRUE)) %>%
arrange(desc(max_diff))
## # A tibble: 4,172 x 3
## # Groups: Movie [4,144]
## Movie Distributor max_diff
## <chr> <chr> <dbl>
## 1 Tracks Weinstein Co. 26755
## 2 Not Today Freestyle Releasing 813
## 3 Phoenix, Oregon Joma Films 648
## 4 K2: Siren of the Himalayas First Run Features 248
## 5 The 100-Year-Old Man Who Cl... Music Box Films 151
## 6 Inequality for All RADiUS-TWC 138
## 7 Watermark Entertainment One 129
## 8 The Hunger Games: Mockingja... Lionsgate 115
## 9 Boy Meets Girl Bramacharia LLC 108
## 10 Gemma Bovery Music Box Films 76
## # … with 4,162 more rows
new_box %>%
filter(Movie == "Tracks") %>%
arrange(Week) %>%
select(Date, Week, Rank, Rank.Last.Week, rank_last_week, Gross, Change, change, Per.Thtr., per_theater, Total.Gross, total_gross)
## # A tibble: 22 x 14
## # Groups: Movie, Distributor [1]
## Movie Distributor Date Week Rank Rank.Last.Week rank_last_week Gross
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Trac… Weinstein … 2014-09-19 1 NA NA NA 260
## 2 Trac… Weinstein … 2014-08-29 1 86 NA NA 1213
## 3 Trac… Weinstein … 2014-06-06 1 62 NA NA 13285
## 4 Trac… Weinstein … 2014-09-26 2 40 NA 62 70951
## 5 Trac… Weinstein … 2014-06-13 2 59 62 NA 15995
## 6 Trac… Weinstein … 2014-10-03 3 36 40 59 92899
## 7 Trac… Weinstein … 2014-06-20 3 54 59 NA 17049
## 8 Trac… Weinstein … 2014-10-10 4 45 36 54 64381
## 9 Trac… Weinstein … 2014-06-27 4 55 54 NA 22670
## 10 Trac… Weinstein … 2014-10-17 5 52 45 55 34563
## # … with 12 more rows, and 6 more variables: Change <dbl>, change <dbl>,
## # Per.Thtr. <dbl>, per_theater <dbl>, Total.Gross <dbl>, total_gross <dbl>
There are some issues in how the variableWeek
is derived. It looks like for some movies, if the movie falls off the charts and returns to the charts again, the Week
count starts back at 1 which would be incorrect.
Total.Gross
strictly increasing?ggplot(new_box, aes(x = Week, group = interaction(Movie, Distributor))) +
geom_line(aes(y = Gross), alpha= .4, color = "grey20") + xlim(0, 50)
ggplot(new_box, aes(x = Date, group = interaction(Movie, Distributor))) +
geom_line(aes(y = Total.Gross), alpha= .4, color = "grey20")
No, there look to be some issues with the data as it should be strictly increasing.
Date
variable.new_box %>%
separate(Date, into = c("year", "month", "date"), remove = FALSE) %>%
group_by(year, Movie) %>%
summarise(Total.Gross.Max = max(Total.Gross)) %>%
arrange(desc(year), desc(Total.Gross.Max)) %>%
group_by(year) %>%
mutate(rank = min_rank(desc(Total.Gross.Max))) %>%
arrange(desc(year), rank) %>%
filter(rank < 4)
## # A tibble: 21 x 4
## # Groups: year [7]
## year Movie Total.Gross.Max rank
## <chr> <chr> <dbl> <int>
## 1 2019 Avengers: Endgame 858366365 1
## 2 2019 The Lion King 542287663 2
## 3 2019 Toy Story 4 433560473 3
## 4 2018 Black Panther 700006415 1
## 5 2018 Avengers: Infinity War 678809000 2
## 6 2018 Star Wars Ep. VIII: The Las... 620178595 3
## 7 2017 Titanic 659328801 1
## 8 2017 Rogue One: A Star Wars Story 532171696 2
## 9 2017 Star Wars Ep. VIII: The Las... 517218368 3
## 10 2016 Star Wars Ep. VII: The Forc... 936658640 1
## # … with 11 more rows