Submission Details

Due date: the homework is due before class on Thursday.

Submission process: submit both the R Markdown file and the corresponding html file on canvas. Please submit both the .Rmd and the .html files separately and do not zip the two files together.


Ames housing

  1. Download the RMarkdown file with these homework instructions to use as a template for your work. Make sure to replace “Your Name” in the YAML with your name.

  2. The Ames based, non-profit company OAITI provides aoe open-source data sets. One of these data sets consists of information on all house sales in Ames between 2008 and 2010. The following piece of code allows you to read the dataset into your R session. How many house sales were there between 2008 and 2010? Which type of variables are we dealing with?

housing <- read.csv("https://raw.githubusercontent.com/OAITI/open-datasets/master/Housing%20Data/Ames-Housing.csv")
str(housing)
## 'data.frame':    1615 obs. of  10 variables:
##  $ SalePrice   : int  215000 124500 105000 172000 176500 157000 244000 237500 206900 345000 ...
##  $ Bedrooms    : int  3 2 2 3 3 4 3 4 4 4 ...
##  $ Baths       : int  1 1 1 1 1 2 2 2 2 2 ...
##  $ LotArea     : int  31770 13008 11622 14267 11029 10200 11160 12925 11075 13860 ...
##  $ LivingArea  : int  1656 882 896 1329 1414 1434 2110 2117 2112 2704 ...
##  $ GarageArea  : int  528 502 730 312 601 528 522 550 576 538 ...
##  $ Neighborhood: Factor w/ 33 levels "Blmngtn","Blueste",..: 19 19 19 19 19 19 19 19 19 19 ...
##  $ HouseStyle  : Factor w/ 8 levels "1-Story","1.5 Fin",..: 1 1 1 1 1 1 1 1 4 8 ...
##  $ YearBuilt   : int  1960 1956 1961 1958 1958 1974 1968 1970 1969 1972 ...
##  $ YearSold    : int  2010 2009 2010 2010 2008 2009 2010 2008 2008 2009 ...

There were 1,615 house sales between 2008 and 2010. In this dataset there are two factors (Neighborhood and HouseStyle) and the remaining variables are integers.

  1. Do sales prices change over time? (Don’t test significances) Provide a graphic that supports your statement.
ggplot(housing, aes(x= factor(YearSold), y=SalePrice)) + geom_boxplot()

Sale prices are fairly consistent from 2008 to 2010. The median sales price between years looks fairly similar with several houses in each year having high price outliers.

  1. What is the relationship between sales prices and the size of the house (living area)? Make a chart and describe the relationship.
ggplot(data = housing, aes (x = LivingArea, y = SalePrice)) + geom_point()

There appears to be a strong, positive relationship between sales prices and the size; i.e. as living area increases, we see tend to see an increase in sales price. In addition, there is greater variation in sales price as the living area increases.

  1. Use dplyr functions to:
  1. Draw a chart of the average sale prices by neighborhood and comment on it. Only consider neighborhoods with at least 10 sales.

    Bonus: write the code for this question and the previous one in a single statement for +0.5 point extra credit.
housing %>%
  mutate(price_per_sqft = SalePrice / LivingArea) %>%
  group_by(Neighborhood) %>%
  summarise(n = n(),
            mean_price_per_sqft = mean(price_per_sqft),
            mean_sales_price = mean(SalePrice)) %>%
  filter(n >= 10) %>%
  arrange(mean_sales_price) %>%
  mutate(Neighborhood = forcats::fct_reorder(Neighborhood, mean_sales_price)) %>%
  ggplot(aes(x = Neighborhood, y = mean_sales_price)) +
  geom_point() +
  coord_flip()

Mean sales prices range from $94,000 to $310,000, with NridgHt and NoRidge having mean sales prices over $300,000. It appears that around half of neighborhoods have mean sales prices below $150,000.

  1. Use dplyr functions to:
  1. Draw a chart of the previous data set. Draw side-by-side boxplots of the garage area by YBCut. Facet by the style of house. Describe and summarise the chart.

    Bonus: write the code for this question and the previous one in a single statement for +0.5 point extra credit.
housing %>%
  mutate(garage = case_when(GarageArea == 0 ~ FALSE, TRUE ~TRUE)) %>%
  filter(garage == TRUE, HouseStyle == "1-Story" | HouseStyle == "2-Story") %>%
  mutate(YBCut = cut(YearBuilt, c(1800, 1850, 1900, 1950, 2000, 2010), 
                     labels = c("1800-1850", "1850-1900", "1900-1950", "1950-2000", "2000+"))) %>%
  ggplot(aes(x = YBCut, y = GarageArea)) +
  geom_boxplot() +
  facet_wrap(~ HouseStyle)

For both 1-story and 2-story homes, the typical garage size has increased since the time block 1900-1950, but the typical garage size appears to have been larger in the 1850-1900 block than in the 1900-1950 time block. Interestingly, in the 2000’s there seem to be more 1-story homes with larger garages than 2-story homes, though the range for 1-story home garages is much larger.