Due date: the homework is due before class on Thursday.
Submission process: submit both the R Markdown file and the corresponding HTML file on Canvas. Please submit the .Rmd and the .html files separately and do not zip the two files together.
Download the RMarkdown file with these homework instructions to use as a template for your work. Make sure to replace “Your Name” in the YAML with your name.
The Ames-based non-profit company OAITI provides open-source data sets. One of these data sets contains information on all house sales in Ames between 2008 and 2010. The following code reads the data set into your R session. How many house sales were there between 2008 and 2010? Which types of variables are we dealing with?
housing <- read.csv("https://raw.githubusercontent.com/OAITI/open-datasets/master/Housing%20Data/Ames-Housing.csv")
str(housing)
## 'data.frame': 1615 obs. of 10 variables:
## $ SalePrice : int 215000 124500 105000 172000 176500 157000 244000 237500 206900 345000 ...
## $ Bedrooms : int 3 2 2 3 3 4 3 4 4 4 ...
## $ Baths : int 1 1 1 1 1 2 2 2 2 2 ...
## $ LotArea : int 31770 13008 11622 14267 11029 10200 11160 12925 11075 13860 ...
## $ LivingArea : int 1656 882 896 1329 1414 1434 2110 2117 2112 2704 ...
## $ GarageArea : int 528 502 730 312 601 528 522 550 576 538 ...
## $ Neighborhood: Factor w/ 33 levels "Blmngtn","Blueste",..: 19 19 19 19 19 19 19 19 19 19 ...
## $ HouseStyle : Factor w/ 8 levels "1-Story","1.5 Fin",..: 1 1 1 1 1 1 1 1 4 8 ...
## $ YearBuilt : int 1960 1956 1961 1958 1958 1974 1968 1970 1969 1972 ...
## $ YearSold : int 2010 2009 2010 2010 2008 2009 2010 2008 2008 2009 ...
1,615 houses were sold in Ames between 2008 and 2010. The data set consists of eight integer variables and two factor variables (Neighborhood and HouseStyle).
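Both answers can also be read off programmatically instead of from the str() printout. The following is a minimal sketch, assuming a small made-up data frame named toy stands in for housing (its values are illustrative and are not taken from the Ames data):

```r
# Hypothetical stand-in for `housing`; the values are made up for illustration.
toy <- data.frame(
  SalePrice  = c(215000L, 124500L, 105000L),
  Bedrooms   = c(3L, 2L, 2L),
  HouseStyle = factor(c("1-Story", "1-Story", "2-Story"))
)

n_sales   <- nrow(toy)            # one row per sale
var_types <- sapply(toy, class)   # class of each column
table(var_types)                  # number of columns of each type
```

Note that since R 4.0.0, read.csv() no longer converts strings to factors by default, so a current R session would likely report Neighborhood and HouseStyle as chr unless stringsAsFactors = TRUE is passed.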
library(dplyr)    # provides %>%
library(ggplot2)  # provides ggplot(), aes(), geom_boxplot()
housing %>%
  ggplot(aes(x = as.factor(YearSold), y = SalePrice)) + geom_boxplot()
Sale prices were fairly consistent from 2008 to 2010. The median sale price is similar across the three years, and each year has several high-price outliers.
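The visual impression from the boxplots can be backed up with per-year summaries. This is a sketch only, using a made-up data frame toy in place of housing, so the numbers it produces say nothing about the real data:

```r
# Hypothetical stand-in for `housing`; prices are made up for illustration.
toy <- data.frame(
  YearSold  = c(2008, 2008, 2008, 2009, 2009, 2009, 2010, 2010, 2010),
  SalePrice = c(150000, 160000, 400000,
                148000, 162000, 390000,
                155000, 158000, 410000)
)

# Median and interquartile range of sale price within each year
med_by_year <- tapply(toy$SalePrice, toy$YearSold, median)
iqr_by_year <- tapply(toy$SalePrice, toy$YearSold, IQR)
med_by_year
iqr_by_year
```

Running the same two tapply() calls on housing would give the medians and spreads that the boxplots display.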
ggplot(data = housing, aes(x = LivingArea, y = SalePrice)) + geom_point()
Sale price and living area have a strong, positive, quadratic relationship: as living area increases, so does sale price. There are several outliers in both directions.
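Whether the relationship is better described as quadratic than as linear can be checked by comparing the two fits. The sketch below assumes a made-up data frame toy with the same column names as housing; it illustrates the comparison and is not a result for the Ames data:

```r
# Hypothetical data, made up for illustration: price grows with living area.
toy <- data.frame(
  LivingArea = c(800, 1000, 1200, 1400, 1600, 1800, 2000, 2400, 2800, 3200),
  SalePrice  = c(95000, 118000, 135000, 152000, 171000,
                 190000, 214000, 262000, 318000, 381000)
)

# Straight-line fit and a fit with an added squared term
fit_lin  <- lm(SalePrice ~ LivingArea, data = toy)
fit_quad <- lm(SalePrice ~ LivingArea + I(LivingArea^2), data = toy)

# Strength of the linear association
r <- cor(toy$LivingArea, toy$SalePrice)

# F-test for whether the squared term improves on the straight line
anova(fit_lin, fit_quad)
```

A small p-value in the anova() table would support describing the trend as curved; a large one would suggest that a straight line is adequate.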
Use dplyr functions to:
housing %>%
  mutate(PriceSqFt = SalePrice / LivingArea) %>%
  group_by(Neighborhood) %>%
  summarise(
    n = n(),
    avg = mean(PriceSqFt, na.rm = TRUE)
  ) %>%
  filter(n > 10) %>%
  mutate(Neighborhood = reorder(Neighborhood, avg)) %>%
  ggplot(aes(x = Neighborhood, y = avg)) + geom_point() + theme(axis.text.x = element_text(angle = 90))
S&W ISU has the lowest average price per square foot, at about $50, and GrnHill has the highest average price per square foot, at about $183.
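The lowest and highest neighborhoods can be extracted from the summary rather than read off the plot. The following is a base R sketch of the same steps (price per square foot, per-neighborhood count and mean, count filter), written without dplyr to keep it dependency-free. The data frame toy, the neighborhood names "A", "B", "C", and the threshold min_n are all made up for illustration:

```r
# Hypothetical stand-in for `housing`; names and values are made up.
toy <- data.frame(
  Neighborhood = c("A", "A", "A", "B", "B", "B", "C"),
  SalePrice    = c(100000, 120000, 110000, 200000, 220000, 240000, 500000),
  LivingArea   = c(1000, 1200, 1000, 1000, 1100, 1200, 1000)
)
toy$PriceSqFt <- toy$SalePrice / toy$LivingArea

# Per-neighborhood count and mean price per square foot
n_by_nbhd   <- tapply(toy$PriceSqFt, toy$Neighborhood, length)
avg_by_nbhd <- tapply(toy$PriceSqFt, toy$Neighborhood, mean, na.rm = TRUE)

# Keep neighborhoods with more than `min_n` sales (the analysis above uses 10)
min_n <- 2
kept  <- avg_by_nbhd[n_by_nbhd > min_n]

lowest  <- names(kept)[which.min(kept)]
highest <- names(kept)[which.max(kept)]
```

Neighborhoods at or below the count threshold drop out before the minimum and maximum are taken, so any neighborhood named in the written answer should be one that survives the filter.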
Use dplyr functions to: restrict the data to one-story and two-story houses with a garage (see HouseStyle), and introduce a variable YBCut from YearBuilt that groups the year a house was built into the intervals 1800-1850, 1850-1900, 1900-1950, 1950-2000, 2000+ (see ?cut). Draw boxplots of garage area by YBCut. Facet by the style of house. Describe and summarise the chart.
housing %>%
  mutate(Garage = GarageArea != 0) %>%
  filter(Garage, HouseStyle %in% c("1-Story", "2-Story")) %>%
  mutate(YBCut = cut(YearBuilt,
                     breaks = c(1800, 1850, 1900, 1950, 2000, 2010),
                     labels = c("1800-1850", "1850-1900", "1900-1950", "1950-2000", "2000+"))) %>%
  ggplot(aes(x = YBCut, y = GarageArea)) +
  geom_boxplot() +
  facet_wrap(~HouseStyle) +
  theme(axis.text.x = element_text(angle = 90))
For one-story houses, the highest median garage area is in houses built after 2000, followed by houses built in 1950-2000, 1850-1900, and 1950-2000. There doesn’t seem to be much spread for houses built before 1950. For two-story houses, the highest median garage area is in houses built after 2000, followed by houses built in 1900-1950, 1850-1900, and 1950-2000.
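When interpreting the age categories, it helps to know where cut() puts years that fall exactly on a break point. By default the intervals are open on the left and closed on the right, so a house built in 2000 is labelled "1950-2000" and not "2000+". The sketch below uses the break points and labels from the pipeline above; the vector years is made up to sit on or next to the boundaries:

```r
# Break points and labels copied from the pipeline above
breaks <- c(1800, 1850, 1900, 1950, 2000, 2010)
labels <- c("1800-1850", "1850-1900", "1900-1950", "1950-2000", "2000+")

# Made-up years chosen to sit on or next to the break points
years <- c(1800, 1850, 1950, 2000, 2001, 2010)

# Default: right = TRUE, intervals are (lower, upper]
default_bins <- cut(years, breaks = breaks, labels = labels)

# Alternative: intervals are [lower, upper), with the top break included
left_closed <- cut(years, breaks = breaks, labels = labels,
                   right = FALSE, include.lowest = TRUE)

data.frame(years, default_bins, left_closed)
```

With the default settings, a year equal to the lowest break (1800 here) is not assigned to any interval and becomes NA. If the labels are meant to read as "from 2000 onward", the left-closed variant may match them more closely.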