class: center, middle, inverse, title-slide # Midterm Prep --- class:inverse # ⚠️ Disclaimer ⚠️ <br/> ### What these slides **ARE NOT**: - A comprehensive review of all topics covered on the midterm --- class: inverse, center, top background-image: url(https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/rmarkdown_rockstar.png) background-size: 600px background-position: 50% 75% # R Markdown review --- background-image: url("images/rmarkdown.png") background-size: 900px background-position: 0% 90% ## --- ## R Markdown <br/> ### Code chunk vs. Text chunk: Put your code answer in the code chunk Put your written answer in the text chunk <br/> ### Helpful code chunk options: **eval** - Run code in chunk (default = TRUE) **fig.align** - 'left', 'right', or 'center' (default = 'default') **fig.height**, **fig.width** - Dimensions of plots in inches --- ## Markdown syntax ``` *italic* **bold** # Header 1 ## Header 2 ### Header 3 - List item 1 - List item 2 - item 2a - item 2b 1. Numbered list item 1 1. Numbered list item 2 - item 2a - item 2b ``` .footnote[ Have a look at RStudio's [RMarkdown cheat sheet](https://www.rstudio.com/resources/cheatsheets/) ] --- class: inverse, middle, center # Getting help --- ## Getting help ### 1. Restart you Rstudio session [Ctrl/Command] + Shift + F10 ### 2. Read the documentation 📖 If you want to know what a specific `command()` is doing: `?command` ### 3. Review the slides <br> ### 4. Google the error 🔍 --- class: inverse .center[ # Topics ] .pull-left[ ### Intro to R - R intro - ggplot2 - logical vector & filters - factors & visualizing factors - forcats package ] .pull-right[ ### Tidyverse - dplyr - tidyr - reshaping data - dealing with messy data - lubridate - dates & times - visualizing time series ] --- ## Data types in R - logical: boolean values - ex. `TRUE` and `FALSE` - double: floating point numerical values (default numerical type) - ex. `1.335` and `7` - integer: integer numerical values (indicated with an L) - ex. `7L` and `1:3` - character: character string - ex. `"hello"` - factor: A special type of variable to indicate categories that includes both the labels and their order (i.e. numbers) - & more --- background-image: url(https://d33wubrfki0l68.cloudfront.net/cd4a81fe4e2b6bf9ea7af524faf5f8c9db039edb/5c672/images/hex-forcats.png) background-size: 120px background-position: 94% 5% ## Dealing with factors Factor variables often have to be re-ordered for ease of comparisons - We can manually specify the order of the levels by explicitly <br> listing them, see `help(factor)` - We can make the order of the levels in one variable dependent on the summary statistic of another variable - the `forcats` package (part of the `tidyverse` ⚠️ but not automatically loaded!) contains many functions to make this easier and less error-prone Helpful functions: `fct_infreq()`, `fct_reorder()`, `fct_reorder2()`, `fct_recode()`, `fct_collapse()`, `fct_lump()` --- background-image: url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/ggplot2.png) background-size: 120px background-position: 92% 5% ## Visualizing data <br> Explore how one (or more) variables are distributed: - barchart or histogram <br> Explore how two variables are related: - scatterplot, boxplot, tile plot <br> Explore how two variables are related, conditioned on other variables: - facetting, color & other aesthetics --- ## Histograms and barcharts What do we look for? - Symmetry/Skewness - Modes, Groups (big pattern: where is the bulk of the data?) - Gaps & Outliers (deviation from the big pattern: where are the other points?) --- ## Histograms A **histogram** approximates the distribution of a single numeric variable. It shows frequency of values in specified ranges. ### `geom_histogram()` - requires the `x` aesthetic inside `aes()` - Specify width of bars with the `bins` or `binwidth` argument - Can change appearance of the bars with `color`, `fill`, `alpha` arguments --- ## bar charts A **bar chart** displays counts of a categorical variable, and is the categorical equivalent of the histogram. ### `geom_bar()` - requires the `x` aesthetic inside `aes()` - optional `weight` aesthetic - Can change appearance of the bars with `color`, `fill`, `alpha` arguments - Can modify the order of the bars by reordering the factor --- ## scatterplots - need two continuous variables ### `geom_point()` - requires the `x`, `y` aesthetics inside `aes()` - Can change appearance of the points with `color`, `shape`, `size` arguments - Can use `fill` **if** `shape = 21` (or 22 or 23) --- ## line charts (time series) - nees two numeric variables ### `geom_line()` - requires the `x`, `y` aesthetics inside `aes()` - Can change appearance of the lines with `color`, `linetype`, `alpha` arguments - The `group` `aes` draws lines according to a grouping variable --- ## boxplots - need one numeric variable and one categorical variable. - are used for group comparisons and outlier identifications - usually only make sense in form of side-by-side boxplots. ### `geom_boxplot()` - requires the `x`, `y` aesthetics inside `aes()` - `y` is measurement, `x` is categorical - Can change appearance of the boxes with `color`, `fill`, `alpha` arguments --- ## Additional `aes()` mappings Add a third variable to a plot with: 🌈 **color** : `color, fill` ↕️ **size**: `size, stroke` 🔺 **shape**: `shape, linetype` --- ## Facets - plot subgroups separately - can be arranged by rows, columns, or both OR "wrapped" for many subgroups - great for exploratory analyses ### `facet_grid()` - create a grid of graphs, by rows and columns - has formula specification: `rows ~ cols` - no variable is written as `.`; i.e. `rowvar ~ .` are plots in a single column. ### `facet_wrap()` - create small multiples by "wrapping" a series of plots - has specification `~ variables` --- background-image: url(https://camo.githubusercontent.com/c1a290fbb20a6bb45e4c47fc2e7b7ddc8d8ccf12/68747470733a2f2f692e696d6775722e636f6d2f677965596c6b4c2e706e67) background-size: 400px background-position: 100% 55% ## The `tidyverse` .pull-left-large[ `tidyverse` is a package bundling several other R packages: - `ggplot2`, `dplyr`, `tidyr`, `purrr`, ... <br> Rules for functions: - First argument is always a data frame - Subsequent arguments say what to do with that data frame - Always return a data frame - Don't modify in place ] --- ## Common structure All functions of the tidyverse have `data` as their first argument: - `function(data, arg2, ...)` The subsequent arguments describe what to do with the data frame, using <br>the variable names (without quotes) - **Important**: do not use `$` notation for variables within these functions, e.g: `filter(fbi, Year>=2019, State=="Iowa")` The output is a new data frame that will print or it can be saved with the assignment operator, `<-`: - e.g: `fbi_iowa19 <- filter(fbi, Year >= 2019, State == "Iowa")` <br/> The common structure makes it easy to chain together multiple simple steps using the pipe function (`%>%`) --- background-image: url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/dplyr.png) background-size: 120px background-position: 92% 5% ## the `dplyr` package Functions are thought of as **verbs** that manipulate data frames - `filter()`: pick rows by matching some criteria - `slice()` pick rows using index(es) - `select()`: select columns of a data frame by name - `pull()`: grab a column as a vector - `arrange()`: reorder the rows of a data frame - `mutate()`: add new or change existing columns of the data frame (as functions of existing columns) - `summarize()`: collapse many values down into a summary of the data frame - ... These can all be used with `group_by()` which changes the scope of function from entire dataset to group-by-group. --- background-image: url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/tidyr.png) background-size: 120px background-position: 92% 5% ## the `tidyr` package A "tidy" dataset, is much easier to work with inside the tidyverse. Tidy data: - Each variable is one column. - Each observation is one row. - Each value must have its own cell. Helpful functions: `pivot_longer()`, `pivot_wider()`, `unite()`, `separate()` Helpful functions from other packages: `parse_number()` from readr, the join functions from dplyr --- background-image: url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/lubridate.png) background-size: 120px background-position: 92% 5% ## dates & times <br> Converter functions: - Want to convert strings or numbers to date-times. <br> Accessor functions: - Want to get or assign a component of a date/time object. <br> Intervals, Durations, & Periods --- class: yourturn # your turn 1. Start a fresh session of Rstudio ([Ctrl/Command] + Shift + F10) 2. Load the following packages: `knitr`, `rmarkdown`, `ggplot2`, `forcats`, `readr`, `dplyr`, `tidyr`, `lubridate` - ⚠️ if you do not have these installed, you will need to do so and then begin again at step 1. 3. Run `sessionInfo()` 4. Copy the list under "other attached packages:" 5. Paste it into the "your turn" submission on canvas. --- class: yourturn # your turn (cont.) **Example**: ``` other attached packages: [1] rmarkdown_2.1 knitr_1.27 lubridate_1.7.4 forcats_0.4.0 [5] stringr_1.4.0 dplyr_0.8.3 purrr_0.3.3 readr_1.3.1 [9] tidyr_1.0.2 tibble_2.1.3 ggplot2_3.2.1 tidyverse_1.3.0 [13] devtools_2.2.1 usethis_1.5.1 ``` <br> **Note:** - Our version numbers may differ. - If your numbers are less than mine, please update. - If your numbers are greater than mine, that should be fine. --- ## Resources - RStudio cheat sheets: - Artwork by [@allison_horst](https://twitter.com/allison_horst?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor)