The coronavirus pandemic has sickened more than 1.4 million people, according to official counts. Here, we will explore both the global and local growth of COVID-19 using data sourced on April 8th, 2020.
This data set contains information on some of the first fully recovered cases of COVID-19. We will look at the time it took these patients to recover, defined as the number of days between a confirmed test and an official discharge date. The data is available at https://raw.githubusercontent.com/Stat480-at-ISU/Stat480-at-ISU.github.io/master/exams/data/covid19-recovered.csv
recovery_data <- readr::read_csv("https://raw.githubusercontent.com/Stat480-at-ISU/Stat480-at-ISU.github.io/master/exams/data/covid19-recovered.csv")
recovery_data %>% glimpse()
## Rows: 100
## Columns: 6
## $ confirmed <chr> "1/23/2020", "1/24/2020", "1/24/2020", "1/25/2020", "1/27/…
## $ discharged <chr> "2020/2/19", "2020/2/7", "2020/2/21", "2020/2/12", "2020/2…
## $ recovery <chr> "27 days", "14 days", "28 days", "18 days", "22 days", "24…
## $ category <chr> "Imported", "Imported", "Imported", "Imported", "Imported"…
## $ age <dbl> 66, 53, 37, 36, 56, 56, 35, 56, 56, 56, 31, 37, 73, 31, 47…
## $ gender <dbl> 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0…
There are 100 observations of 6 variables. The variables included consist of 4 character variables (confirmed
, discharged
, recovery
, and category
), and 2 numeric variables (age
and gender
).
In order to continue with an analysis of this data, we should make some modifications to it.
tidyverse
package to make the following modifications:
confirmed
and discharged
into variables of type “date”.recovery
.recovery
as the number of days between confirmed
and discharged
and save as recovery_days
.category
from type character
to type factor
.recovered
and use this data for the remaining questions in part I.recovered <- recovery_data %>%
mutate(confirmed = mdy(confirmed),
discharged = ymd(discharged),
recovery = parse_number(recovery),
category = factor(category),
recovery_days = discharged - confirmed)
recovered %>% summary()
## confirmed discharged recovery category
## Min. :2020-01-23 Min. :2020-02-04 Min. : 1.00 Imported:23
## 1st Qu.:2020-02-04 1st Qu.:2020-02-18 1st Qu.: 7.00 Local :77
## Median :2020-02-12 Median :2020-02-23 Median :11.00
## Mean :2020-02-11 Mean :2020-02-24 Mean :12.29
## 3rd Qu.:2020-02-17 3rd Qu.:2020-03-02 3rd Qu.:16.25
## Max. :2020-03-08 Max. :2020-03-14 Max. :31.00
## age gender recovery_days
## Min. : 0.50 Min. :0.0 Length:100
## 1st Qu.:34.75 1st Qu.:0.0 Class :difftime
## Median :41.50 Median :1.0 Mode :numeric
## Mean :42.53 Mean :0.6
## 3rd Qu.:54.00 3rd Qu.:1.0
## Max. :79.00 Max. :1.0
max(recovered$recovery)
## [1] 31
which.max(recovered$recovery)
## [1] 50
recovered[which.max(recovered$recovery), ]
## # A tibble: 1 x 7
## confirmed discharged recovery category age gender recovery_days
## <date> <date> <dbl> <fct> <dbl> <dbl> <drtn>
## 1 2020-02-12 2020-03-14 31 Local 54 1 31 days
The longest amount of time someone represented in this data took to recover from COVID-19 was 31 days. This observation is found in row 50.
min(recovered$confirmed)
## [1] "2020-01-23"
which.min(recovered$confirmed)
## [1] 1
recovered[which.min(recovered$confirmed), ]
## # A tibble: 1 x 7
## confirmed discharged recovery category age gender recovery_days
## <date> <date> <dbl> <fct> <dbl> <dbl> <drtn>
## 1 2020-01-23 2020-02-19 27 Imported 66 1 27 days
The first confirmed case in this data was on 2020-01-23. This observation is found in row 1.
If indeed infected, how long would it take for you to be free of the novel coronavirus?
ggplot2
to look at the distribution of the variable recovery
(you may need to adjust the size of the bins).ggplot(recovered) + geom_histogram(aes(recovery_days), binwidth = 2)
Most of the cases in this data recovered within 20 days of their confirmed test. A few of the cases in this data took close to 30 days to recover.
age_blks
from age
that introduces age categories that groups the ages of the patients into intervals: < 10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, and >80. (see ?cut).age_blks
to the fill aesthetic.recovered_ages <- recovered %>%
mutate(age_blks = cut(age, c(0, 10, 20, 30, 40, 50, 60, 70, 80, 100),
labels = c("< 10", "[10-20)", "[20-30)", "[30-40)", "[40-50)", "[50-60)", "[60-70)", "[70-80)", ">= 80"), right = FALSE))
recovered_ages %>%
ggplot() +
geom_boxplot(aes(x = age_blks, y = recovery, fill = age_blks), show.legend = FALSE) +
coord_flip()
While it is difficult to say much with any certainty given the small data set, there does not appear to be a large difference between the groups. The age group with the largest variability in the number of days it took to recover was the [50-60) age group. The 60-70 and < 10 age groups had the lowest median recovery time of around 5 days. The remaining groups had a median recovery time of around 11-13 days.
gender
to the aesthetic fill
.recovered_ages %>%
mutate(gender = factor(gender)) %>%
ggplot() +
geom_boxplot(aes(x = gender, y = recovery, fill = gender), show.legend = FALSE) +
facet_grid(~age_blks)
The only age groups with much of a difference between genders are the <10 and 60-70 age groups. However, this difference is more likely due to a very small sample size rather than there being a true difference.
cases <- readr::read_csv("https://raw.githubusercontent.com/Stat480-at-ISU/Stat480-at-ISU.github.io/master/exams/data/covid19-global.csv")
dim(cases)
## [1] 263 81
This data contains 263 observations and 81 variables.
#names(cases)
The variables are named Province/State
, Country/Region
, Lat
, Long
, and the dates 1/22/20
through 4/7/20
Province/State
, Country/Region
, Lat
, and Long
to be province
, country
, lat
, and long
, respectively.names(cases)[1:4] <- c("province", "country", "lat", "long")
length(unique(cases$country))
## [1] 184
There are 184 countries represented in this data.
cases %>%
group_by(country) %>%
tally() %>%
arrange(desc(n)) %>%
slice(1:5)
## # A tibble: 5 x 2
## country n
## <chr> <int>
## 1 China 33
## 2 Canada 15
## 3 France 11
## 4 United Kingdom 11
## 5 Australia 8
China has the largest number of provinces recorded in this data with 33 provinces.
cases %>%
count(country) %>%
filter(n == 1) %>%
tally()
## # A tibble: 1 x 1
## n
## <int>
## 1 177
In this instance we want to include the NA
’s. If you look at the countries that contain provinces and an NA
(there are 4 of them), the NA
is referring to the mainland. Thus, in this data, there are 177 countries that do not have a province recorded here. In addition, if remove Diamond Princess and MS Zaandam cruise, there are actually 175 countries.
In order to continue with an analysis of this data, we should reshape it.
tidyverse
package to modify the shape and form of the data:
dplyr
to remove the lat
and long
variables from the cases
data.tidyr
package to move from wide format into long format where each row represents the number of confirmed cases on a particular date for each country-province pair.lubridate
to convert the variable date
from a string into an object of type date
.cases_long
.cases_long <- cases %>%
select(-lat, -long) %>%
pivot_longer(cols = -c(province, country),
names_to = "date",
values_to = "confirmed") %>%
mutate(date = lubridate::mdy(date))
cases_by_country
. Plan of attack:
cases_long
.cases_by_country
.cases_by_country <- cases_long %>%
group_by(date, country) %>%
summarise(confirmed = sum(confirmed)) %>%
group_by(country) %>%
mutate(total = max(confirmed)) %>%
ungroup() %>%
mutate(rank = dense_rank(desc(total))) %>%
filter(rank < 10)
cases_by_country %>% arrange(desc(date), desc(total)) %>% slice(1:9) %>% select(rank, country, total)
## # A tibble: 9 x 3
## rank country total
## <int> <chr> <dbl>
## 1 1 US 396223
## 2 2 Spain 141942
## 3 3 Italy 135586
## 4 4 France 110065
## 5 5 Germany 107663
## 6 6 China 82718
## 7 7 Iran 62589
## 8 8 United Kingdom 55949
## 9 9 Turkey 34109
The nine countries with the greatest number of total cases: US, Spain, Italy, France, Germany, China, Iran, United Kingdom, Turkey.
cases_by_country.
ggplot2
to plot the number of confirmed cases for each of the nine countries over time.country
to color and use the function fct_reorder2()
from the forcats
package to align the colors of the lines with the colors in the legend.+ scale_y_continuous(labels = scales::comma)
.ggplot(cases_by_country) +
geom_line(aes(x = date, y = confirmed, color = forcats::fct_reorder2(country, date, confirmed))) +
scale_y_continuous(labels = scales::comma)
Most countries have seen a considerable increase in the number of cases since mid-March - the exception to this case is China whose number of cases leveled off mid-February. The US has by far the largest number of cases and has yet to begin to level off like most of the other countries have.
ggplot2
to create a barchart of the number of cases for the top nine countries for the two dates, sorted according to the total number of cases in that country.cases_by_country %>%
filter(date == "2020-03-31" | date == "2020-03-24") %>%
mutate(country = forcats::fct_reorder(country, confirmed, .fun = max)) %>%
ggplot() +
geom_bar(aes(x = country, weight = confirmed, fill = country), show.legend = FALSE) +
facet_grid(date~.) +
coord_flip() +
labs(x="", y="")
The number of cases in the US increased dramatically over the last week in March. Other countries that saw a substantial increase include Spain, Germany, Italy, and France.
cases_by_country %>%
filter(confirmed >= 500) %>%
group_by(country) %>%
arrange(date) %>%
summarise(min_value = confirmed[which.min(confirmed)],
max_value = first(confirmed[confirmed>=20000]),
date_first=date[which.min(confirmed)],
date20k=date[first(which(confirmed>=20000))],
days_to_20k = date20k - date_first) %>%
select(country, days_to_20k) %>%
arrange(days_to_20k)
## # A tibble: 9 x 2
## country days_to_20k
## <chr> <drtn>
## 1 China 13 days
## 2 Spain 13 days
## 3 Turkey 13 days
## 4 US 13 days
## 5 Germany 15 days
## 6 Italy 16 days
## 7 France 17 days
## 8 United Kingdom 17 days
## 9 Iran 21 days
Spain, China, Turkey, and the US went from 500 to 20,000 cases in the shortest amount of time – 13 days. Germany took 15 days, Italy 16 days, France and the UK 17 days, and Iran 21 days.
cases100
.cases100
that contains only the last date and save as cases100_last
.cases100
and cases100_last
, recreate the visualization below.Create the 2 data frames:
cases100 <- cases_by_country %>%
filter(confirmed >= 100) %>%
group_by(country) %>%
arrange(date) %>%
mutate(date100 = date[which.min(confirmed)],
days_from_100 = (date100 %--% date)/ddays(1))
cases100_last <- filter(cases100, date == max(cases100$date))
Re-create the visualization:
cases100 %>%
ggplot(aes(x = days_from_100, y = confirmed, color = forcats::fct_reorder2(country, date, confirmed))) +
geom_line() +
geom_text(data = cases100_last, aes(label = country, x = days_from_100 + 0.5), hjust = "left", size = 3) +
geom_point(data = cases100_last) +
scale_y_log10() +
guides(color = FALSE)
This visualization appears to show that (assuming no issues in testing and reporting) growth rate has, fortunately, slowed for all nine countries included here.