Due date: the homework is due before class on Thursday.
Submission process: submit both the R Markdown file and the corresponding html file on canvas. Please submit both the .Rmd
and the .html
files separately and do not zip the two files together.
Download the RMarkdown file with these homework instructions to use as a template for your work. Make sure to replace “Your Name” in the YAML with your name.
FiveThirtyEight is a website founded by Statistician and writer Nate Silver to publish results from opinion poll analysis, politics, economics, and sports blogging. One of the featured articles considers flying etiquette. This article is based on data collected by FiveThirtyEight and publicly available on github. Use the code below to read in the data from the survey:
fly <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/flying-etiquette-survey/flying-etiquette.csv")
The next couple of lines of code provide a bit of cleanup of the demographic information by reordering the levels of the corresponding factor variables. Run this code in your session.
fly$Age <- factor(fly$Age, levels=c("18-29", "30-44", "45-60", "> 60", ""))
fly$Household.Income <- factor(fly$Household.Income, levels = c("$0 - $24,999","$25,000 - $49,999", "$50,000 - $99,999", "$100,000 - $149,999", "150000", ""))
fly$Education <- factor(fly$Education, levels = c("Less than high school degree", "High school degree", "Some college or Associate degree", "Bachelor degree", "Graduate degree", ""))
How.often.do.you.travel.by.plane.
). Reorder the levels in the variable by travel frequency from least frequent travel to most frequent. Draw a barchart of travel frequency and comment on it.library(ggplot2)
fly$How.often.do.you.travel.by.plane.<- factor(fly$How.often.do.you.travel.by.plane., levels=c("Never", "Once a year or less", "Once a month or less" ,"A few times per month", "A few times per week", "Everyday" ,""))
#barchart:
ggplot(fly, aes(How.often.do.you.travel.by.plane.)) + geom_bar() + coord_flip() + xlab("How Often Do You Travel by Plane?")
Based on the barchart above, it can be observed that most people travel once a year or less by plane, followed by never, once a month or less, a few times per month and lastly, a few times per week. The data also had a few no responses represented by the NA in the chart.
Education
, Age
, and Houshold.Income
), replace all occurrences of the empty string "" by a missing value NA
. How many responses do not have any missing values? (Hint: the function is.na
might come in handy)levels(fly$Education)[6]<- NA
levels(fly$Age)[5]<- NA
levels(fly$Household.Income)[6]<- NA
summary(is.na(fly$Education))
## Mode FALSE TRUE
## logical 1001 39
summary(is.na(fly$Age))
## Mode FALSE TRUE
## logical 1007 33
summary(is.na(fly$Household.Income))
## Mode FALSE TRUE
## logical 826 214
It can be observed via the is.na function on R that for Education, Age and Household.Income variables, there are 39, 33 and 214 missing values respectively.
fly$Education = with(fly, factor(Education, levels = rev(levels(Education))))
ggplot(data = fly, aes(x = 1)) +
geom_bar(aes(fill=Education), position="fill") +
coord_flip() +
theme(legend.position="bottom") +
scale_fill_brewer(na.value="grey50") +
ylab("Ratio")
The plot above shows the ratio of the different education levels to the total number of responses. It can also be interpreted in percentages. For instance, less than 12.5% people had no responses for this question, while 50% of the people responded have graduate degrees and bachelor degrees. This plot could be useful for someone who is more interested in the cumulative dataset, rather than individual variables.
In.general..is.itrude.to.bring.a.baby.on.a.plane.
to baby.on.plane.
. How many levels does the variable baby.on.plane
have, and what are these levels? Rename the level labeled "" to “Not answered”. Reorder the levels of baby.on.plane
from least rude to most rude. Put the level “Not answered” last. Draw a barchart of variable baby.on.plane
. Interpret the result.library(dplyr)
fly <- fly %>% rename(baby.on.plane.= In.general..is.itrude.to.bring.a.baby.on.a.plane.)
levels(fly$baby.on.plane.)
## [1] "" "No, not at all rude" "Yes, somewhat rude"
## [4] "Yes, very rude"
levels(fly$baby.on.plane.)[1]<- "Not Answered"
fly$baby.on.plane. <- factor(fly$baby.on.plane., levels=c("No, not at all rude", "Yes, somewhat rude", "Yes, very rude", "Not Answered"))
ggplot(fly, aes(baby.on.plane.)) + geom_bar() + coord_flip() + xlab("Is it rude to bring a baby on a plane?")
It can be observed from the barchart above that most people responded that “no, not at all rude” to have a baby on the plane, followed by “Not Answered,” Yes, somewhat rude," and lastly, “Yes, very rude.”
Do.you.have.any.children.under.18.
and baby.on.plane
. How is the attitude towards babies on planes shaped by gender and having children under 18? Find a plot that summarises your findings (use ggplot2
).levels(fly$Gender)[1]<- "NA"
levels(fly$Do.you.have.any.children.under.18.)[1]<- "NA"
ggplot(fly, aes(baby.on.plane., fill=Gender, alpha = Do.you.have.any.children.under.18.)) + geom_bar() + coord_flip()
## Warning: Using alpha for a discrete variable is not advised.
To evaluate more than 2 categorical variables, I used both the fill nad alpha function. The fill function uses color to distinguish the different category for gender. The green color bars represents the female gender, while the blue represents the male gender. The alpha function allows for the transparency of the colors in the bars to represent that no response, followed by “no” and “yes,” as transparency decreases. Hence, it can be observed that most responses for the “no, not at all rude” are females who also responded “no” to “having a baby under the age of 18.” On the contrary, most responses for “yes, very rude” are males, who also responded that “no” to “having a baby under the age of 18.” Those that responded “yes” to “having a a baby under the age of 18” mostly responded to “no, not rude at all” for having a baby on the plane.