Submission Details

Due date: the homework is due before class on Thursday.

Submission process: submit both the R Markdown (.Rmd) file and the corresponding .html file on Canvas. Upload the two files separately; do not zip them together.


Spotify data

  1. Download the R Markdown file with these homework instructions to use as a template for your work. Make sure to replace “Your Name” in the YAML with your name.

  2. Using the command below, read the Spotify data set into your R session.

spotify <- read.csv("https://raw.githubusercontent.com/Stat480-at-ISU/Stat480-at-ISU.github.io/master/homework/data/spotify.csv")
  3. Use one of our object-inspecting functions and interpret what you see in the result.
str(spotify)
## 'data.frame':    10000 obs. of  13 variables:
##  $ track_artist     : Factor w/ 5041 levels "!!!","!deladap",..: 3969 387 222 1565 4562 366 3192 3756 3364 2433 ...
##  $ track_popularity : int  30 20 2 44 54 67 54 58 13 56 ...
##  $ playlist_genre   : Factor w/ 6 levels "edm","latin",..: 6 3 3 5 6 5 2 1 5 2 ...
##  $ playlist_subgenre: Factor w/ 24 levels "album rock","big room",..: 8 11 11 9 8 9 20 5 7 12 ...
##  $ release_date     : Factor w/ 2679 levels "1957-01-01","1963-05-27",..: 2587 1815 1724 2507 814 2561 2170 2465 1615 2298 ...
##  $ duration_min     : num  1.92 5.27 4.44 3.49 3.16 ...
##  $ danceability     : num  0.297 0.546 0.589 0.74 0.451 0.652 0.701 0.698 0.625 0.863 ...
##  $ energy           : num  0.974 0.591 0.846 0.721 0.884 0.862 0.772 0.895 0.805 0.627 ...
##  $ loudness         : num  -4.1 -5.65 -5.46 -6.32 -3.45 ...
##  $ speechiness      : num  0.135 0.0284 0.0377 0.177 0.0335 0.206 0.184 0.101 0.388 0.206 ...
##  $ acousticness     : num  0.00366 0.0616 0.069 0.762 0.00177 0.161 0.0515 0.00764 0.552 0.0485 ...
##  $ liveness         : num  0.113 0.275 0.0904 0.0922 0.193 0.203 0.0959 0.217 0.108 0.0968 ...
##  $ tempo            : num  131.3 120 122 120 97.5 ...

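This data set contains 10,000 observations of 13 variables. Of the 13 variables, 4 are factors (track_artist, playlist_genre, playlist_subgenre, and release_date), 1 is an integer (track_popularity), and the remaining 8 are numeric. Note that release_date is stored as a factor rather than as a date.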
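The type counts above can also be checked programmatically rather than by reading the str() output. The following is a minimal sketch, not part of the original solution; the names spotify_url and column_classes are introduced here for illustration. It passes stringsAsFactors = TRUE on the assumption that the output above came from an R version before 4.0.0, because from R 4.0.0 onward read.csv() reads text columns as character by default.

```r
# Same URL as in step 2 of the assignment (variable name is illustrative).
spotify_url <- "https://raw.githubusercontent.com/Stat480-at-ISU/Stat480-at-ISU.github.io/master/homework/data/spotify.csv"

# stringsAsFactors = TRUE reproduces the factor columns shown in the str()
# output; R >= 4.0.0 would otherwise read text columns as character.
spotify <- read.csv(spotify_url, stringsAsFactors = TRUE)

# First class of each column, returned as a named character vector.
column_classes <- vapply(spotify, function(col) class(col)[1], character(1))

# Number of columns of each type.
table(column_classes)
```

If the counts differ from the summary above, the most likely cause is the stringsAsFactors setting rather than the data itself.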

  4. Use the package ggplot2 to draw a bar chart of the genres. In addition, map the genre categories to the fill color of the bar chart.
library(ggplot2)
ggplot(spotify, aes(x = playlist_genre, fill = playlist_genre)) + geom_bar()

This is a bar chart with the categorical variable playlist_genre mapped to both the x axis and the fill color. The height of each bar represents the number of songs in that genre. From this plot we gain the impression that each genre contains roughly the same number of songs in this data set.
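The visual impression of roughly equal bar heights can be backed up with the underlying counts. This is a small sketch under the assumption that the data are read from the same URL as in step 2; spotify_url and genre_counts are names introduced here, not part of the original solution.

```r
# Same URL as in step 2 of the assignment (variable name is illustrative).
spotify_url <- "https://raw.githubusercontent.com/Stat480-at-ISU/Stat480-at-ISU.github.io/master/homework/data/spotify.csv"
spotify <- read.csv(spotify_url)

# Number of songs per genre: the same quantity geom_bar() draws as bar height.
genre_counts <- table(spotify$playlist_genre)
genre_counts

# Share of songs per genre, rounded to three decimals.
round(prop.table(genre_counts), 3)
```

Comparing the shares makes it easier to judge how close to equal the genres really are than comparing bar heights by eye.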

  5. Use the package ggplot2 to draw a histogram of one of the continuous variables in the dataset. Use fill color to show the genre categories and adjust the binwidth if necessary. Use facet_wrap() to create a histogram for each of the genre categories.
ggplot(spotify, aes(x = duration_min, fill = playlist_genre)) + geom_histogram()

ggplot(spotify, aes(x = duration_min, fill = playlist_genre)) + geom_histogram() + facet_wrap(~playlist_genre)

This is a histogram with the quantitative variable duration_min mapped to the x axis and the categorical variable playlist_genre mapped to the fill color. The height of each bar represents the number of songs whose duration falls within that bin. From this plot we gain the impression that most songs are 3-4 minutes long, with only a few songs shorter than a minute and only a few longer than 7 minutes. The distributions do not differ much between genres.
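The histograms above rely on the default of 30 bins, for which ggplot2 prints a message suggesting an explicit binwidth. One possible way to set it is sketched below; the value 0.25 (15-second bins) is an illustrative assumption, not a choice made in the original solution, and spotify_url and duration_hist are names introduced here.

```r
library(ggplot2)

# Same URL as in step 2 of the assignment (variable name is illustrative).
spotify_url <- "https://raw.githubusercontent.com/Stat480-at-ISU/Stat480-at-ISU.github.io/master/homework/data/spotify.csv"
spotify <- read.csv(spotify_url)

# binwidth is in the units of the x variable (minutes), so 0.25 means
# each bar covers 15 seconds. This value is an assumption for illustration.
duration_hist <- ggplot(spotify, aes(x = duration_min, fill = playlist_genre)) +
  geom_histogram(binwidth = 0.25) +
  facet_wrap(~playlist_genre)

# Printing the object draws the plot.
duration_hist
```

Trying a few values, for example 0.1 or 0.5, is a quick way to see whether the apparent peak at 3-4 minutes depends on the bin choice.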

  6. Use the package ggplot2 to draw a scatterplot to compare the length of the song with the energy measure. Use color to show the genre categories.
ggplot(spotify, aes(y = energy, x = duration_min, color = playlist_genre)) + geom_point()

This is a scatterplot with the quantitative variables duration_min and energy mapped to the x and y axes, respectively. In addition, the categorical variable playlist_genre is mapped to the color of the points. From this plot we again gain the impression that most songs are 3-4 minutes long, with only a few songs shorter than a minute and only a few longer than 7 minutes. There does not appear to be a relationship between duration_min and energy, nor is there a large difference between the genres.
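The claim that duration_min and energy are unrelated can be given a rough numerical check with a correlation coefficient. This is only a sketch, assuming the data are read from the same URL as in step 2; spotify_url and duration_energy_cor are names introduced here. Pearson correlation measures linear association only, so a value near zero supports the impression from the plot but does not rule out a nonlinear pattern.

```r
# Same URL as in step 2 of the assignment (variable name is illustrative).
spotify_url <- "https://raw.githubusercontent.com/Stat480-at-ISU/Stat480-at-ISU.github.io/master/homework/data/spotify.csv"
spotify <- read.csv(spotify_url)

# Pearson correlation between song length and energy;
# use = "complete.obs" drops rows where either value is missing.
duration_energy_cor <- cor(spotify$duration_min, spotify$energy,
                           use = "complete.obs")
duration_energy_cor
```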

  7. For each of the three figures above, write a two- to three-sentence summary, describing the
    1. structure of the plot: what type of plot is it? Which variables are mapped to x, to y, and to the (fill) color?
    2. main message of the plot: what is your main finding, i.e. what do you want viewers to learn from the plot?
    3. additional message: point out anomalies or outliers, if there are any.