```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE ) ``` Work on questions in R, make sure to keep a copy of your R code - you will be asked to submit this script at the end. Data on all flights in and out of Des Moines (DSM) for October 2008 are available at http://www.hofroe.net/data/dsm-flights.csv.
See http://stat-computing.org/dataexpo/2009/the-data.html for a description of the variables. 1. Load the flights data into R.
Determine, which flight was delayed the most on arrival. Report its row number, where it started, and by how much the flight was delayed on departure. ```{r} flights <- read.csv("http://www.hofroe.net/data/dsm-flights.csv") which.max(flights$ArrDelay) flights[which.max(flights$ArrDelay), c("Origin", "DepDelay")] ``` 2. Bring the variable 'Day' into the correct order, starting with 'Monday'. ```{r} summary(flights$Day) days <- levels(flights$Day) flights$Day <- factor(flights$Day, levels=days[c(2,6,7,5,1,3,4)]) summary(flights$Day) ``` 3. Create a new variable called 'Weekend' which has value TRUE for Saturdays and Sundays and FALSE otherwise. ```{r} # create new variable Weekend flights$Weekend <- flights$Day %in% c("Saturday", "Sunday") summary(flights$Weekend) ``` 4. Determine how many flights arrived in Des Moines on average each day of the week. ```{r} # idea 1: table(subset(flights, Dest=="DSM")$Day) # overall number of flights by day of week # problem: how many Mondays, Tuesdays, are there in October 2008? require(lubridate) octs <- data.frame( date = ymd(paste("2008/10/",1:31, sep=""))) octs$day = wday(octs$date, label=TRUE) table(octs$day) table(subset(flights, Dest=="DSM")$Day)/c(4,4,5,5,5,4,4) # idea 2: require(dplyr) require(lubridate) flights %>% filter(Dest == "DSM") %>% group_by(DayofMonth) %>% summarise( day = Day[1], n = n() ) %>% group_by(day) %>% summarize(avg = mean(n), n = n()) ``` 6. How many flights were scheduled to go to Denver (DEN)? What percentage of flights goes to Denver? ```{r} nrow(subset(flights, Dest=='DEN')) nrow(subset(flights, Dest=='DEN'))/nrow(subset(flights, Dest != 'DSM'))*100 ``` 7. Where do most flights arriving in Des Moines come from? (use IATA code)

```{r} flights %>% filter(Origin != "DSM") %>% group_by(Origin) %>% summarise(n = n()) %>% arrange(desc(n)) sort(table(flights$Origin), decreasing=T)[2] ``` 8. Plot boxplots of arrival delays by originating airports. Order boxplots according to increasing median arrival delay.

```{r} library(ggplot2) flights %>% filter(Dest =="DSM") %>% ggplot( aes(x = reorder(factor(Origin), ArrDelay, na.rm=T), y = ArrDelay)) + geom_boxplot() ``` 9. Using dplyr, determine for flights leaving DSM for each hour of the day - the number of flights scheduled to leave, - the percentage of flights departing late (FAA considers flights late, if they are 15 minutes or more behind the scheduled time), - the average departure delay, and - the top 3 destinations. Draw a scatterplot of average departure delay by scheduled hour of departure. Color points by top destination, adjust point size to reflect the number of flights for each hour. ```{r} dep.summary <- flights %>% filter(Origin == 'DSM') %>% mutate(hour = CRSDepTime%/%100) %>% group_by(hour, Dest) %>% mutate(test = n()) %>% group_by(hour) %>% mutate(rank = dense_rank(desc(test))) %>% summarise( count = n(), pct.delayed = sum(DepDelay>15, na.rm=TRUE)/n()*100, avg.delay = mean(DepDelay, na.rm=T), top.Dest.1 = Dest[which.max(test)], top.Dest.2 = first(Dest[which(rank == 2)]), top.Dest.3=first(Dest[which(rank == 3)])) dep.summary %>% ggplot(aes(x = hour, avg.delay, colour = top.Dest.1, size = count)) + geom_point() ```