Due date: the homework is due before class on Thursday.

Submission: submit your solution as an R Markdown file and, as a backup, the corresponding HTML file.

Economic Guide to Picking a College Major

  1. Download the R Markdown file with these homework instructions to use as a template for your work. Make sure to replace “Your Name” in the YAML header with your name.

  2. Using the command below, read the data set into your R session.

recent_grads <- read.csv("https://raw.githubusercontent.com/Stat480-at-ISU/materials-2020/master/02_r-intro/data/recent_grads.csv")

  3. Create a new variable, share_women, in the dataset that gives women as a share of the total, i.e., the number of women divided by the total number of men and women.

recent_grads$share_women <- recent_grads$Women / (recent_grads$Women + recent_grads$Men)

  4. Create a subset of the data containing only the rows where the Major_category is STEM.

recent_grads_subset <- recent_grads[recent_grads$Major_category == "STEM", ]

  5. For your subset, compute the average of share_women and its standard deviation. Also compute the mean of the median earnings (Median) and its standard deviation. Comment on the results. (You might have to deal with missing values appropriately.)

mean(recent_grads_subset$share_women, na.rm = TRUE)
## [1] 0.4756239
sd(recent_grads_subset$share_women, na.rm = TRUE)
## [1] 0.236922
mean(recent_grads_subset$Median)
## [1] 43086.54
sd(recent_grads_subset$Median)
## [1] 12765.18

For the STEM majors, the average share of women is approximately 47.6%, with a standard deviation of approximately 23.7 percentage points, so the share of women varies widely from major to major. The average of the median earnings is approximately $43,100, with a standard deviation of approximately $12,800.
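The note about missing values matters because `mean()` and `sd()` return `NA` as soon as one input is missing. The following is a minimal sketch on a made-up vector; the name `share` and its values are illustrative assumptions, not taken from the data set.

```r
# Hypothetical shares with one missing value (illustrative only)
share <- c(0.2, 0.5, NA, 0.8)

# Without na.rm, a single NA propagates to the result
mean_with_na <- mean(share)

# With na.rm = TRUE, the NA is dropped before computing
mean_no_na <- mean(share, na.rm = TRUE)
sd_no_na <- sd(share, na.rm = TRUE)

# Count the missing values to see how many observations were dropped
n_missing <- sum(is.na(share))
```

In the solution above, `mean(recent_grads_subset$Median)` returned a number without `na.rm`, which implies that `Median` has no missing values in the subset.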

  6. Again using the subset, compute the correlation between women as a share of total (share_women) and the median earnings (Median) and interpret your results.

cor(recent_grads_subset$share_women, recent_grads_subset$Median, use = "complete.obs")
## [1] -0.6125043

The correlation between these two variables is approximately -0.61, indicating a moderately strong negative linear association among the STEM majors. That is, as the share of women increases, median pay tends to decrease.
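To make clear what `use = "complete.obs"` does in the call above, here is a small sketch on made-up paired data. The vectors `x` and `y` are hypothetical; the point is only that the option drops every pair in which either value is missing.

```r
# Hypothetical paired data with missing values (illustrative only)
x <- c(0.1, 0.3, NA, 0.7, 0.9)
y <- c(60, 55, 50, NA, 30)

# use = "complete.obs" drops every pair in which either value is missing
r_complete <- cor(x, y, use = "complete.obs")

# Equivalent manual version: keep only pairs where both values are present
keep <- complete.cases(x, y)
r_manual <- cor(x[keep], y[keep])
```

Both approaches use the same three complete pairs here, so `r_complete` and `r_manual` agree.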

  7. Use the original dataset and ggplot2 to draw a scatterplot of women as share of total and the median earnings. Color points by the major category (Major_category). Comment on the result.

library(ggplot2)
ggplot(recent_grads, aes(x = share_women, y = Median, color = Major_category)) + geom_point()

The scatterplot again indicates a negative association between the two variables: as the share of women increases, median pay tends to decrease. This relationship appears to hold for the entire dataset, not just the STEM majors. Lastly, there is an outlier in the top left corner, a major with a low share of women and unusually high median earnings.
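One way to follow up on the plot is to look up the row behind the outlier and to compute the correlation over all rows. The sketch below uses a toy data frame as a stand-in for `recent_grads`; the values are made up, and the `Major` column name is an assumption about the data set.

```r
# Toy stand-in for recent_grads (values are made up for illustration)
toy <- data.frame(
  Major = c("A", "B", "C", "D", "E"),
  share_women = c(0.12, 0.35, 0.50, 0.70, 0.90),
  Median = c(110000, 52000, 45000, 38000, 30000)
)

# Row with the highest median earnings, a candidate for the top-left outlier
top_row <- toy[which.max(toy$Median), ]

# Correlation across all rows, for comparison with the subset value
r_all <- cor(toy$share_women, toy$Median, use = "complete.obs")
```

Applied to the real data, the same two calls would use `recent_grads` in place of `toy`.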