Economic Guide to Picking a College Major

  1. Download the RMarkdown file with these homework instructions to use as a template for your work. Make sure to replace “Your Name” in the YAML with your name.

  2. Using the command below, read in the data set into your R session.

recent_grads <- read.csv("https://raw.githubusercontent.com/Stat480-at-ISU/materials-2020/master/02_r-intro/data/recent_grads.csv")
  1. Create a new variable, share_women, that is women as share of total; i.e. the number of women divided by the total number of men and women.
recent_grads$share_women <- recent_grads$Women/(recent_grads$Men+recent_grads$Women)
  1. Create a subset of the data containing only the rows where the major_category is STEM.
STEM2 <- recent_grads %>% filter(Major_category == "STEM")
  1. For your subset, compute the average of share_women and its standard deviation. Also compute the average median earnings (median) and its standard deviation. Comment on the results. (You might have to deal with missing values appropriately).
mean(STEM2$share_women, na.rm = TRUE)
## [1] 0.4756239
sd(STEM2$share_women, na.rm = TRUE)
## [1] 0.236922
mean(STEM2$Median)
## [1] 43086.54
sd(STEM2$Median)
## [1] 12765.18

The average share of total that are women is 0.4756 or 47.56%, with a standard deviation of 0.2369. For average median earnings, this has a value of $43086.54 with a standard deviation of $12765.18. Both standard deviations seem to be rather high, which means that there can be a greater fluctuation of values from sample to sample for both share_women and median.

  1. Compute the correlation between women as a share of total (share_women) and the median earnings (median) and interpret your results.
cor(STEM2$share_women, STEM2$Median, use = "complete.obs")
## [1] -0.6125043

The correlation between ‘share_women’ and ‘median’ is -0.6125043. This means that as the share of total increases, the median earnings decreases.

  1. Use the original dataset and ggplot2 to draw a scatterplot of share of total and the median earnings. Color points by the major category (major_category). Comment on the result.
library("ggplot2")
ggplot(data = recent_grads, aes(x=share_women, y=Median, color= Major_category))+geom_point()
## Warning: Removed 1 rows containing missing values (geom_point).

It appears from the scaaterplot that the higher the total of share of women there is, the lower the median earnings are across all major categories. This corresponds to the findings from the previous question.

Due date: the homework is due before class on Thursday.

For the submission: submit your solution in an R Markdown file and (just for insurance) submit the corresponding html file with it.