Academic Honesty Statement

THIS IS AN INDIVIDUAL ASSESSMENT. THIS DOCUMENT AND YOUR ANSWERS ARE FOR YOUR EYES ONLY. ANY VIOLATION OF THIS POLICY WILL BE IMMEDIATELY REPORTED.

Replace the underscores below with your name to acknowledge that you have read and understood your institution’s academic misconduct policy.

I, ____________, hereby state that I have not communicated with or gained information in any way from my classmates or anyone other than the Professor or TA during this exam, and that all work is my own.


Tracking the Global Outbreak of COVID-19

The coronavirus pandemic has sickened more than 1.4 million people, according to official counts. Here, we will explore both the global and local growth of COVID-19 using data sourced on April 8th, 2020.

Part I: Recovery data

This data set contains information on some of the first fully recovered cases of COVID-19. We will look at the time it took these patients to recover, defined as the number of days between a confirmed test and an official discharge date. The data is available at https://raw.githubusercontent.com/Stat480-at-ISU/Stat480-at-ISU.github.io/master/exams/data/covid19-recovered.csv

Question #1: An overview

  1. Read the data without downloading the file locally.
recovery_data <- readr::read_csv("https://raw.githubusercontent.com/Stat480-at-ISU/Stat480-at-ISU.github.io/master/exams/data/covid19-recovered.csv")
  1. A first look:
    • What are the dimensions of the data?
    • What variables are included and what are their types?
## your answer here

Question #2: Some wrangling

In order to continue with an analysis of this data, we should make some modifications to it.

  1. Use functions from the tidyverse package to make the following modifications:
    • Convert the variables confirmed and discharged into variables of type “date”.
    • Extract the numeric value from the variable recovery.
    • Re-derive the variable recovery as the number of days between confirmed and discharged and save as recovery_days.
    • Convert the variable category from type character to type factor.
    • Save this data as recovered and use this data for the remaining questions in part I.
## your answer here
  1. Look at a summary of the variables:
## your answer here
  1. What was the longest amount of time someone represented in this data took to recover from COVID-19? Which observation was this? Use indexing to print this row of the data frame.
## your answer here
  1. When was the first confirmed case in this data? Which observation is this? Use indexing to print this row of the data frame.
## your answer here
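
Hint: the modifications listed in Question #2 can be sketched on a small made-up table. The column formats below (ISO date strings, a recovery column such as "12 days") and the object names toy and toy_recovered are assumptions for illustration and may not match the real file, so check the actual columns first.

```r
library(dplyr)
library(readr)
library(lubridate)

# Hypothetical toy rows; the formats are assumptions, not the real file layout.
toy <- tibble(
  confirmed  = c("2020-01-23", "2020-01-25", "2020-02-01"),
  discharged = c("2020-02-04", "2020-02-10", "2020-02-20"),
  recovery   = c("12 days", "16 days", "19 days"),
  category   = c("imported", "local", "local")
)

toy_recovered <- toy %>%
  mutate(
    confirmed     = ymd(confirmed),                      # character -> Date
    discharged    = ymd(discharged),
    recovery      = parse_number(recovery),              # "12 days" -> 12
    recovery_days = as.numeric(discharged - confirmed),  # Date difference in days
    category      = factor(category)                     # character -> factor
  )

summary(toy_recovered)

# Index the row with the longest recovery, then the row with the earliest confirmation.
toy_recovered[which.max(toy_recovered$recovery_days), ]
toy_recovered[toy_recovered$confirmed == min(toy_recovered$confirmed), ]
```

If the dates in the real file are written in another order (for example month/day/year), a different lubridate parser such as mdy() would be needed in place of ymd().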

Question #3: Time to recovery

If indeed infected, how long would it take for you to be free of the novel coronavirus?

  1. Use ggplot2 to look at the distribution of the variable recovery (you may need to adjust the size of the bins).
## your answer here
  1. Is there a difference in the time it took to recover for different ages?
    • Create a new variable age_blks from age that groups the patients’ ages into intervals: <10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, and >80 (see ?cut).
    • Create side-by-side boxplots of the number of days to recovery for the different age groups.
    • Flip the coordinates and map the variable age_blks to the fill aesthetic.
## your answer here
  1. Is there a difference between the genders in the time it took to recover for any of the groups?
    • Use the age blocks created in the last question.
    • Create side-by-side boxplots for males and females (1’s and 0’s, respectively) for each of the age groups.
    • Fill your boxplots by mapping the variable gender to the aesthetic fill.
## your answer here
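
Hint: a minimal sketch of binning with cut() and drawing grouped boxplots, using invented data. The toy values, the plot object names p_age and p_gender, and the choice of left-closed intervals (right = FALSE, so an age of exactly 10 lands in "10-20" and an age of exactly 80 lands in ">80") are all assumptions; the question does not say how boundary ages should be assigned.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; the values are invented for illustration only.
toy <- tibble(
  age           = c(5, 15, 25, 35, 45, 55, 65, 75, 85, 32),
  gender        = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1),
  recovery_days = c(10, 12, 9, 14, 15, 18, 20, 22, 25, 11)
)

toy <- toy %>%
  mutate(
    age_blks = cut(
      age,
      breaks = c(0, 10, 20, 30, 40, 50, 60, 70, 80, Inf),
      labels = c("<10", "10-20", "20-30", "30-40", "40-50",
                 "50-60", "60-70", "70-80", ">80"),
      right  = FALSE  # intervals are [lower, upper)
    )
  )

# Boxplots per age block, flipped, filled by age block.
p_age <- ggplot(toy, aes(x = age_blks, y = recovery_days, fill = age_blks)) +
  geom_boxplot() +
  coord_flip()

# Boxplots per age block, split by gender; factor() makes the fill discrete.
p_gender <- ggplot(toy, aes(x = age_blks, y = recovery_days, fill = factor(gender))) +
  geom_boxplot()
```

Printing p_age or p_gender draws the plot. Converting gender with factor() matters because a numeric 0/1 column mapped to fill would otherwise be treated as continuous and would not split the boxes.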

Part II: Global Data

Question #1: First Overview

  1. Read the data from https://raw.githubusercontent.com/Stat480-at-ISU/Stat480-at-ISU.github.io/master/exams/data/covid19-global.csv without downloading the file locally. Each line of the file contains daily counts for one Province/State-Country/Region pair.
## your answer here
  1. How many rows and columns does the data have?
## your answer here
  1. What are the variables called?
## your answer here
  1. Rename the variables Province/State, Country/Region, Lat, and Long to be province, country, lat, and long, respectively.
## your answer here
  1. Each row contains data for one province-country pair. How many countries are represented in this data set?
## your answer here
  1. For each country represented, how many provinces are recorded? Print a table for the five countries with the largest number of provinces recorded.
## your answer here
  1. How many countries do not have any provinces recorded in this data?
## your answer here
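
Hint: renaming columns whose names contain a slash requires backticks. The sketch below uses a made-up table with the column names quoted in the question; the rows, the single date column, and the assumption that a missing province is stored as NA are all hypothetical.

```r
library(dplyr)

# Hypothetical toy table; only the first four column names come from the question.
toy <- tibble(
  `Province/State` = c("Hubei", "Beijing", NA, "Ontario", NA),
  `Country/Region` = c("China", "China", "Italy", "Canada", "Iran"),
  Lat              = c(31, 40, 43, 51, 32),
  Long             = c(112, 116, 12, -85, 53),
  `4/7/20`         = c(100, 50, 80, 20, 60)
)

toy <- toy %>%
  rename(
    province = `Province/State`,  # new_name = old_name
    country  = `Country/Region`,
    lat      = Lat,
    long     = Long
  )

# Number of distinct countries.
n_countries <- n_distinct(toy$country)

# Provinces recorded per country (assuming "no province" is NA).
provinces <- toy %>%
  group_by(country) %>%
  summarise(n_provinces = sum(!is.na(province))) %>%
  arrange(desc(n_provinces))

head(provinces, 5)                               # five countries with the most provinces
n_without <- sum(provinces$n_provinces == 0)     # countries with no province recorded
```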

Question #2: Data wrangling

In order to continue with an analysis of this data, we should reshape it.

  1. Use functions from the tidyverse package to modify the shape and form of the data:
    • Use a function from dplyr to remove the lat and long variables from the cases data.
    • Then use a function from the tidyr package to move from wide format into long format where each row represents the number of confirmed cases on a particular date for each country-province pair.
    • Lastly, use a function from lubridate to convert the variable date from a string into an object of type date.
    • Save the resulting data frame as cases_long.
## your answer here
  1. Identify the nine countries with the largest number of confirmed cases and save these in a data frame named cases_by_country. Plan of attack:
    • Begin with the data frame cases_long.
    • Calculate the number of confirmed cases for each country on each date.
    • Find the rank of the countries by current number of confirmed cases for each country.
    • Filter the top nine countries.
    • Save this data frame as cases_by_country.
## your answer here
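
Hint: the reshaping and ranking plan above can be sketched on a small invented table. The month/day/year column names (for example `4/7/20`), the toy counts, the object names, and the use of a top-two cutoff instead of nine are assumptions made to keep the example short.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical wide table: one row per province-country pair, one column per date.
toy <- tibble(
  province = c("Hubei", "Beijing", NA, NA),
  country  = c("China", "China", "Italy", "Iran"),
  lat      = c(31, 40, 43, 32),
  long     = c(112, 116, 12, 53),
  `4/6/20` = c(100, 40, 120, 60),
  `4/7/20` = c(110, 50, 130, 65)
)

toy_long <- toy %>%
  select(-lat, -long) %>%                        # drop coordinates
  pivot_longer(
    cols      = -c(province, country),           # every remaining column is a date
    names_to  = "date",
    values_to = "confirmed"
  ) %>%
  mutate(date = mdy(date))                       # "4/7/20" -> 2020-04-07

# Total confirmed cases per country per date.
toy_by_country <- toy_long %>%
  group_by(country, date) %>%
  summarise(confirmed = sum(confirmed), .groups = "drop")

# Rank countries by their count on the most recent date and keep the top two.
top_countries <- toy_by_country %>%
  filter(date == max(date)) %>%
  mutate(rank = min_rank(desc(confirmed))) %>%
  filter(rank <= 2) %>%
  pull(country)

toy_top <- toy_by_country %>%
  filter(country %in% top_countries)
```

Ranking on the latest date and then filtering the full time series keeps every date for the selected countries, which is what a plot over time needs.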

Question #3: Growth over time

  1. Let’s look at how the number of confirmed cases for these nine countries grew over time.
    • Start with the data frame cases_by_country.
    • Use ggplot2 to plot the number of confirmed cases for each of the nine countries over time.
    • Map the variable country to color and use the function fct_reorder2() from the forcats package to align the colors of the lines with the colors in the legend.
    • Optional: to make the y-axis labels more readable, add the layer + scale_y_continuous(labels = scales::comma).
## your answer here
  1. Let’s next look at the difference the last week of March made (Mar 24 vs. Mar 31).
    • Use ggplot2 to create a bar chart of the number of cases for the top nine countries on the two dates, sorted according to the total number of cases in that country.
    • Make sure the labels of the bars are readable and fill by country.
## your answer here
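
Hint: fct_reorder2() orders the levels of a factor by the y value at the largest x value, so the legend order follows the right-hand ends of the lines. The sketch below uses three invented countries; the data, the object names, and the faceted bar-chart layout are assumptions, and other layouts would satisfy the question equally well.

```r
library(dplyr)
library(ggplot2)
library(forcats)

# Hypothetical toy counts for three made-up countries.
toy <- tibble(
  country   = rep(c("A", "B", "C"), each = 3),
  date      = rep(as.Date(c("2020-03-24", "2020-03-28", "2020-03-31")), times = 3),
  confirmed = c(10, 20, 30, 50, 80, 120, 5, 40, 70)
)

# Levels ordered by the count on the latest date, largest first.
legend_order <- levels(fct_reorder2(toy$country, toy$date, toy$confirmed))

p_lines <- ggplot(toy, aes(x = date, y = confirmed,
                           color = fct_reorder2(country, date, confirmed))) +
  geom_line() +
  scale_y_continuous(labels = scales::comma) +
  labs(color = "country")

# One possible bar chart comparing the two dates, with bars sorted by count.
p_bars <- toy %>%
  filter(date %in% as.Date(c("2020-03-24", "2020-03-31"))) %>%
  ggplot(aes(x = fct_reorder(country, confirmed), y = confirmed, fill = country)) +
  geom_col() +
  facet_wrap(~ date) +
  coord_flip() +
  labs(x = "country")
```

Flipping the coordinates puts the country names on the vertical axis, which is one simple way to keep bar labels readable.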

Question #4: Some summaries

  1. How many days did it take for each of the nine countries to go from their 500th case to their 20,000th case?
## your answer here
  1. Let’s take another look at how the number of cases has grown. This time, though, let’s look at the growth for each country starting at their 100th case.
    • For each country, calculate the first date that the country had 100 or more cases.
    • Introduce a new variable that transforms the date variable into the number of days since the 100th case.
    • Save this data frame as cases100.
    • Create a subset of cases100 that contains only the last date and save as cases100_last.
    • Extra credit: Using cases100 and cases100_last, recreate the visualization below.

## your answer here
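
Hint: both summaries in Question #4 rest on finding, per country, the first date on which a count reaches a threshold. The sketch below shows the idea on invented data with scaled-down thresholds (500 and 2,000 rather than 500 and 20,000). The toy counts, the object names, and the decision to drop dates before the 100th case are assumptions.

```r
library(dplyr)

# Hypothetical cumulative counts for two made-up countries over five days.
toy <- tibble(
  country   = rep(c("A", "B"), each = 5),
  date      = rep(as.Date("2020-03-01") + 0:4, times = 2),
  confirmed = c(50, 120, 600, 900, 2500,
                90, 100, 150, 700, 2100)
)

# Days between the first date at or above each threshold.
toy_span <- toy %>%
  group_by(country) %>%
  summarise(
    first_low  = min(date[confirmed >= 500]),
    first_high = min(date[confirmed >= 2000]),
    days       = as.numeric(first_high - first_low)
  )

# Days since the first date with 100 or more cases.
toy100 <- toy %>%
  group_by(country) %>%
  mutate(
    date100        = min(date[confirmed >= 100]),
    days_since_100 = as.numeric(date - date100)
  ) %>%
  filter(days_since_100 >= 0) %>%   # keep only dates from the 100th case onward
  ungroup()

# Rows for the most recent date only, e.g. for labelling line ends in a plot.
toy100_last <- toy100 %>%
  filter(date == max(date))
```

Note that min() on an empty selection warns and returns an infinite value, so a country that never reaches a threshold would need separate handling.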