Submission Details

Due date: the homework is due before class on Thursday.

Submission process: submit both the R Markdown file and the corresponding html file on canvas. Please submit both the .Rmd and the .html files separately and do not zip the two files together.


Weekly box office data

  1. Download the RMarkdown file with these homework instructions to use as a template for your work. Make sure to replace “Your Name” in the YAML with your name.

  2. For this homework we use the data set box from the classdata package, which consists of weekly box office gross for movies of the last five year.

# devtools::install_github("haleyjeppson/classdata")
library(classdata)
library(tidyverse)
head(box)
##   Rank Rank.Last.Week             Movie        Distributor    Gross Change
## 1    1              1             Joker       Warner Bros. 55861403    -42
## 2    2             NA The Addams Family     United Artists 30300007     NA
## 3    3             NA        Gemini Man Paramount Pictures 20552372     NA
## 4    4              2        Abominable          Universal  6072235    -49
## 5    5              3     Downton Abbey     Focus Features  4881075    -39
## 6    6              4          Hustlers  STX Entertainment  3887018    -39
##   Thtrs. Per.Thtr. Total.Gross Week       Date
## 1   4374     12771   193590190    2 2019-10-11
## 2   4007      7562    30300007    1 2019-10-11
## 3   3642      5643    20552372    1 2019-10-11
## 4   3496      1737    47873585    3 2019-10-11
## 5   3019      1617    82668665    4 2019-10-11
## 6   2357      1649    98052357    5 2019-10-11
  1. In class we discussed two instances where a movie was released under the same name as a different movie previously. Identify at least one more instance of a movie where that happened. Report the name of the movie, search online for additional information. Describe the strategy you used to identify this movie, report the code involved.
movies <- box %>% 
  group_by(Movie, Distributor) %>% 
  summarise(
    firstDate = Date[which.min(Week)], 
    firstWeek = min(Week, na.rm=TRUE))

movies %>% 
  group_by(Movie) %>% 
  mutate(n = n()) %>% 
  filter(n>1) %>% 
  arrange(desc(n))
## # A tibble: 56 x 5
## # Groups:   Movie [28]
##    Movie                Distributor            firstDate  firstWeek     n
##    <chr>                <chr>                  <date>         <dbl> <int>
##  1 Beauty and the Beast Lopert Pictures Cor... 2016-02-12         1     2
##  2 Beauty and the Beast Walt Disney            2017-03-17         1     2
##  3 Breathe              Bleecker Street        2017-10-13         1     2
##  4 Breathe              Film Movement          2015-09-11         1     2
##  5 Champion             ArtAffects             2017-05-26         2     2
##  6 Champion             Well Go USA            2018-05-11         1     2
##  7 Concussion           Radius                 2013-10-04         1     2
##  8 Concussion           Sony Pictures          2015-12-25         1     2
##  9 Entertainment        B4U Movies             2014-08-08         1     2
## 10 Entertainment        Magnolia Pictures      2015-11-13         1     2
## # … with 46 more rows
box %>% dplyr::filter(Movie=="Concussion")
##    Rank Rank.Last.Week      Movie   Distributor    Gross Change Thtrs.
## 1    56             48 Concussion Sony Pictures    25212    -40     53
## 2    48             64 Concussion Sony Pictures    42106    252     93
## 3    64             56 Concussion Sony Pictures    11945    -59     24
## 4    56             44 Concussion Sony Pictures    29305    -46     45
## 5    44             39 Concussion Sony Pictures    54256    -40     56
## 6    39             36 Concussion Sony Pictures    90283    -44    116
## 7    36             28 Concussion Sony Pictures   160954    -38    315
## 8    28             19 Concussion Sony Pictures   258401    -73    315
## 9    19             10 Concussion Sony Pictures   947443    -69   1041
## 10   10              8 Concussion Sony Pictures  3029367    -61   2056
## 11    8              7 Concussion Sony Pictures  7846281    -25   2841
## 12    7             NA Concussion Sony Pictures 10513749     NA   2841
## 13   67             NA Concussion        Radius    13897     NA     25
## 14   72             NA Concussion        Radius     8216     NA      2
##    Per.Thtr. Total.Gross Week       Date
## 1        476    34531832   12 2016-03-11
## 2        453    34485471   11 2016-03-04
## 3        498    34436742   10 2016-02-26
## 4        651    34411070    9 2016-02-19
## 5        969    34354922    8 2016-02-12
## 6        778    34255169    7 2016-02-05
## 7        511    34099541    6 2016-01-29
## 8        820    33806210    5 2016-01-22
## 9        910    32991673    4 2016-01-15
## 10      1473    30947645    3 2016-01-08
## 11      2762    25266650    2 2016-01-01
## 12      3701    10513749    1 2015-12-25
## 13       556       38361    3 2013-10-18
## 14      4108        8216    1 2013-10-04

Two different movies, both titled ‘Concussion’ have been released. One was released in 2013, and the other in 2015. Below are the IMBd links to each respective movie:

2013 version: https://www.imdb.com/title/tt2296697/

2015 version: https://www.imdb.com/title/tt3322364/

The strategy I used to find this movie was the same one we went over in class. First, in line 44 I created a dataset called ‘movies’ that filters the ‘box’ data for distinct combinations of Movie and Distributor. In line 45, I count how many times every movie title appears in the ‘movies’ data set. Thus, movies with the same title but different distributors are counted under the same movie name. So, for the movie “Concussion”, its count was 2. This count of 2 reflects that the movie was released under two different distributors (different movie, same name). The 2013 version’s distributor is Radius, while the 2015 version’s distributor is Sony Pictures. To get the specifics on these movies, I filtered the original ‘box’ data set to look at only where the title ‘Concussion’ shows up. I did this in line 46.

  1. Re-derive variables: Change (percent change in gross income from last week), Rank.Last.Week, Per.Thtr. (as gross per theater), and Total.Gross (as the cumulative sum of weekly gross).
new_box <- box %>% 
  group_by(Movie, Distributor)  %>%  
  arrange(Movie, Week) %>% 
  mutate(per_theater = round(Gross/Thtrs., 0),
           total_gross = cumsum(Gross),
         rank_last_week = lag(Rank, order_by = Week),
         week_test = lag(Week, order_by = Week),
         rank_last_week = ifelse(week_test == Week - 1, rank_last_week, NA),
         change = round((Gross/lag(Gross, order_by = Week) - 1)*100,0),
         change = ifelse(week_test == Week - 1, change, NA)) %>%
  select(-week_test)
  1. For the variables Per.Thtr. and Change compare the original variables and the newly derived ones. Are there differences? Where? Try to describe patterns you find.
new_box %>% 
  group_by(Movie, Distributor)  %>%  
  mutate(per_theater_diffs = Per.Thtr. - per_theater) %>% 
  summarise(max_diff = max(per_theater_diffs, na.rm = TRUE)) %>% 
  arrange(desc(max_diff))
## # A tibble: 4,172 x 3
## # Groups:   Movie [4,144]
##    Movie                          Distributor            max_diff
##    <chr>                          <chr>                     <dbl>
##  1 1,000 Times Goodnight          Film Movement                 1
##  2 1945                           Menemsha Entertainment        1
##  3 20 Feet From Stardom           RADiUS-TWC                    1
##  4 20,000 Days on Earth           Drafthouse Films              1
##  5 2019 Oscar Shorts              ShortsHD                      1
##  6 3 Faces                        Kino Lorber                   1
##  7 30 Years of Garbage: The Ga... Indican Pictures              1
##  8 50 to 1                        Ten Furlongs                  1
##  9 56 Up                          First Run Features            1
## 10 7 Witches                      Indican Pictures              1
## # … with 4,162 more rows

The only differences in the original and newly derived Per.Thtr. variables look to be rounding errors.

new_box %>% 
  group_by(Movie, Distributor)  %>%  
  mutate(change_diffs = Change - change) %>% 
  summarise(max_diff = max(change_diffs, na.rm = TRUE)) %>% 
  arrange(desc(max_diff))
## # A tibble: 4,172 x 3
## # Groups:   Movie [4,144]
##    Movie                          Distributor         max_diff
##    <chr>                          <chr>                  <dbl>
##  1 Tracks                         Weinstein Co.          26755
##  2 Not Today                      Freestyle Releasing      813
##  3 Phoenix, Oregon                Joma Films               648
##  4 K2: Siren of the Himalayas     First Run Features       248
##  5 The 100-Year-Old Man Who Cl... Music Box Films          151
##  6 Inequality for All             RADiUS-TWC               138
##  7 Watermark                      Entertainment One        129
##  8 The Hunger Games: Mockingja... Lionsgate                115
##  9 Boy Meets Girl                 Bramacharia LLC          108
## 10 Gemma Bovery                   Music Box Films           76
## # … with 4,162 more rows
new_box %>% 
  filter(Movie == "Tracks") %>% 
  arrange(Week) %>% 
  select(Date, Week, Rank, Rank.Last.Week, rank_last_week, Gross, Change, change, Per.Thtr., per_theater, Total.Gross, total_gross)
## # A tibble: 22 x 14
## # Groups:   Movie, Distributor [1]
##    Movie Distributor Date        Week  Rank Rank.Last.Week rank_last_week Gross
##    <chr> <chr>       <date>     <dbl> <dbl>          <dbl>          <dbl> <dbl>
##  1 Trac… Weinstein … 2014-09-19     1    NA             NA             NA   260
##  2 Trac… Weinstein … 2014-08-29     1    86             NA             NA  1213
##  3 Trac… Weinstein … 2014-06-06     1    62             NA             NA 13285
##  4 Trac… Weinstein … 2014-09-26     2    40             NA             62 70951
##  5 Trac… Weinstein … 2014-06-13     2    59             62             NA 15995
##  6 Trac… Weinstein … 2014-10-03     3    36             40             59 92899
##  7 Trac… Weinstein … 2014-06-20     3    54             59             NA 17049
##  8 Trac… Weinstein … 2014-10-10     4    45             36             54 64381
##  9 Trac… Weinstein … 2014-06-27     4    55             54             NA 22670
## 10 Trac… Weinstein … 2014-10-17     5    52             45             55 34563
## # … with 12 more rows, and 6 more variables: Change <dbl>, change <dbl>,
## #   Per.Thtr. <dbl>, per_theater <dbl>, Total.Gross <dbl>, total_gross <dbl>

There are some issues in how the variableWeek is derived. It looks like for some movies, if the movie falls off the charts and returns to the charts again, the Week count starts back at 1 which would be incorrect.

  1. Is the original variable Total.Gross strictly increasing?
ggplot(new_box, aes(x = Week, group = interaction(Movie, Distributor))) + 
  geom_line(aes(y = Gross), alpha= .4, color = "grey20") + xlim(0, 50)

ggplot(new_box, aes(x = Date, group = interaction(Movie, Distributor))) + 
  geom_line(aes(y = Total.Gross), alpha= .4, color = "grey20")

No, there look to be some issues with the data as it should be strictly increasing.

  1. Identify the three top grossing movies for each year. Plan of attack:
    • Extract the year from the Date variable.
    • Summarize the total gross for each movie and each year.
    • Find the rank of movies by total gross in each year.
    • Filter the top three movies.
new_box %>% 
  separate(Date, into = c("year", "month", "date"), remove = FALSE) %>% 
  group_by(year, Movie) %>% 
  summarise(Total.Gross.Max = max(Total.Gross)) %>% 
  arrange(desc(year), desc(Total.Gross.Max)) %>% 
  group_by(year) %>% 
  mutate(rank = min_rank(desc(Total.Gross.Max))) %>% 
  arrange(desc(year), rank) %>% 
  filter(rank < 4)
## # A tibble: 21 x 4
## # Groups:   year [7]
##    year  Movie                          Total.Gross.Max  rank
##    <chr> <chr>                                    <dbl> <int>
##  1 2019  Avengers: Endgame                    858366365     1
##  2 2019  The Lion King                        542287663     2
##  3 2019  Toy Story 4                          433560473     3
##  4 2018  Black Panther                        700006415     1
##  5 2018  Avengers: Infinity War               678809000     2
##  6 2018  Star Wars Ep. VIII: The Las...       620178595     3
##  7 2017  Titanic                              659328801     1
##  8 2017  Rogue One: A Star Wars Story         532171696     2
##  9 2017  Star Wars Ep. VIII: The Las...       517218368     3
## 10 2016  Star Wars Ep. VII: The Forc...       936658640     1
## # … with 11 more rows