Working with factor variables

class: center, middle, inverse, title-slide

# Working with factor variables

---

background-image: url(https://github.com/allisonhorst/stats-illustrations/raw/master/other-stats-artwork/nominal_ordinal_binary.png)
background-size: 500px
background-position: 50% 90%

## Factors

- A special type of variable to indicate categories

- Includes both the **labels** and their **order** (i.e. numbers)

- By default, text variables are stored in factors during input

- Numeric categorical variables have to be converted to factors manually

- `factor()` creates a new factor with specified labels

???
categorical variable = variables that have a fixed and known set of possible values

---
class: yourturn
# Your Turn

Load a sample of data from the [General Social Survey](http://gss.norc.org/) with the following code: `data(gss_cat, package = "forcats")`

Inspect the `gss_cat` object.

How many variables are there?

Which type does each of the variables have?

Make a summary of `year`.

Make `year` a factor variable: `gss_cat$year <- factor(gss_cat$year)`

Compare summary of `year` to the previous result.

Are there other variables that should be factors (or vice versa)?

---
class: inverse, center, middle
# Checking and casting data types

---
## Check the data type

Recall the data types in R:

- logical, double, integer, character, lists, & more

Checking for types:

- `str` or `typeof` provide info on type

- `is.XXX` (with XXX either `factor, int, numeric, logical, character, ...` ) checks for specific type

```r
typeof(TRUE)
```

```
## [1] "logical"
```

```r
typeof("hello")
```

```
## [1] "character"
```

```r
is.integer(1:3)
```

```
## [1] TRUE
```

---
## ⚠️ Data type coercion ⚠️

R is a dynamically typed language

It will happily convert between the various types without complaint

```r
c(1, "Hello")
```

```
## [1] "1"     "Hello"
```

```r
c(FALSE, 3L)
```

```
## [1] 0 3
```

```r
c(1.2, 3L)
```

```
## [1] 1.2 3.0
```

---
## Casting between types

`as.XXX` casts to specific type

```r
as.character(c(1, 9, 4.3, FALSE))
```

```
## [1] "1"   "9"   "4.3" "0"
```

```r
as.numeric(c(TRUE, FALSE, FALSE))
```

```
## [1] 1 0 0
```

```r
fruits <- factor(c("apple", "banana", "orange"))
as.numeric(fruits)
```

```
## [1] 1 2 3
```

---
## Casting between types

![](images/casting.png)

⚠️ `as.numeric` applied to a factor retrieves *order* of labels, not labels, even if those could be interpreted as numbers.

To get the labels of a factor as numbers, first cast to character and then to a number.

```r
as.numeric(factor(c(4,6,1,4,7)))
```

```
## [1] 2 3 1 2 4
```
---
class: inverse, center, middle
# working with factor variables

---
## Levels of factor variables

Use `levels(x)` to check the levels of factor variable `x` in their current order

```r
levels(gss_cat$race)
```

```
## [1] "Other"          "Black"          "White"          "Not applicable"
```

Use `summary(x)` to see order of the levels with counts for each level

```r
summary(gss_cat$race)
```

```
##          Other          Black          White Not applicable 
##           1959           3129          16395              0
```

`dplyr::count()` will also work but more on this next week...

```r
gss_cat %>% count(race)
```

```
## # A tibble: 3 x 2
## race n
## <fct> <int>
## 1 Other 1959
## 2 Black 3129
## 3 White 16395
```

---
## ⚠️ factors in bar charts ⚠️

A barchart can also provide us with a quick summary *provided* the levels have values.

.pull-left[

```r
ggplot(gss_cat, aes(race)) +
  geom_bar()
```

![](05_factors_files/figure-html/unnamed-chunk-8-1.png)
]

.pull-right[

```r
ggplot(gss_cat, aes(race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)
```

![](05_factors_files/figure-html/unnamed-chunk-9-1.png)
]

---
## ⚠️ factors in boxplots ⚠️

boxplots in ggplot2 only work properly if the x variable is a character variable or a factor:

```r
twoyear <- dplyr::filter(gss_cat, year %in% c(2000, 2014))
```

.pull-left[

```r
ggplot(data = twoyear, 
       aes(x = year, y = tvhours)) + 
  geom_boxplot()
```

![](05_factors_files/figure-html/unnamed-chunk-12-1.png)
]
.pull-right[

```r
ggplot(data = twoyear, 
       aes(x = factor(year), y = tvhours)) + 
  geom_boxplot()
```

![](05_factors_files/figure-html/unnamed-chunk-13-1.png)
]

---
class: inverse, center, middle
# Changing the order of the levels

---
background-image: url(https://d33wubrfki0l68.cloudfront.net/cd4a81fe4e2b6bf9ea7af524faf5f8c9db039edb/5c672/images/hex-forcats.png)
background-size: 120px
background-position: 94% 5%
## Changing the order

Factor variables often have to be re-ordered for ease of comparisons

- We can manually specify the order of the levels by explicitly listing them, see `help(factor)`

- We can make the order of the levels in one variable dependent on the summary statistic of another variable

- the `forcats` package (part of the `tidyverse` ⚠️ but not automatically loaded!) contains many functions to make this easier and less error-prone

![](05_factors_files/figure-html/unnamed-chunk-14-1.png)

---
## Reordering factor levels - the manual way

```r
levels(gss_cat$relig)
```

```
##  [1] "Moslem/islam"            "Buddhism"                "Other"                  
##  [4] "Not applicable"          "Native american"         "Protestant"             
##  [7] "Catholic"                "Inter-nondenominational" "Orthodox-christian"     
## [10] "Hinduism"                "No answer"               "None"                   
## [13] "Don't know"              "Other eastern"           "Jewish"                 
## [16] "Christian"
```

manually (⚠️ extremely sensitive to typos):

```r
levels(factor(gss_cat$relig, levels=c("Moslem/islam", "Hinduism", "Buddhism", "Other eastern", "Jewish", "Christian", "Orthodox-christian", "Catholic", "Protestant", "Inter-nondenominational", "Native american", "Other", "None", "No answer", "Don't know",  "Not applicable")))
```

```
##  [1] "Moslem/islam"            "Hinduism"                "Buddhism"               
##  [4] "Other eastern"           "Jewish"                  "Christian"              
##  [7] "Orthodox-christian"      "Catholic"                "Protestant"             
## [10] "Inter-nondenominational" "Native american"         "Other"                  
## [13] "None"                    "No answer"               "Don't know"             
## [16] "Not applicable"
```

Use the function `fct_infreq()` from the `forcats` package to order the categorical variable `relig` by its frequency.

```r
library(forcats)
ggplot(gss_cat, aes(x = fct_infreq(relig))) + 
  geom_bar() + 
  coord_flip()
```

![](05_factors_files/figure-html/unnamed-chunk-17-1.png)

---
## Reordering factor levels - using another variable

Base R option: `reorder(factor, numbers, function)`

reorder levels in factor by values in `numbers`. Use the argument `function` to summarise (average is used by default).

```r
levels(reorder(gss_cat$relig, gss_cat$tvhours, na.rm=TRUE))
```

```
##  [1] "Other eastern"           "Hinduism"                "Buddhism"               
##  [4] "Orthodox-christian"      "Moslem/islam"            "Jewish"                 
##  [7] "None"                    "No answer"               "Other"                  
## [10] "Christian"               "Inter-nondenominational" "Catholic"               
## [13] "Protestant"              "Native american"         "Don't know"             
## [16] "Not applicable"
```

missing values in `numbers`? make sure to use parameter `na.rm=TRUE`!

Option using the `forcats` package: `fct_reorder(factor, numbers, function)`

```r
relig_summary <- gss_cat %>% group_by(relig) %>% summarise(age = mean(age, na.rm = TRUE), tvhours = mean(tvhours, na.rm = TRUE), n = n())
```

.pull-left[

```r
ggplot(relig_summary, aes(x = tvhours, 
      y = relig)) + 
  geom_point()
```

![](05_factors_files/figure-html/unnamed-chunk-20-1.png)
]
 
.pull-right[

```r
ggplot(relig_summary, aes(x = tvhours, 
    y = fct_reorder(relig, tvhours))) + 
  geom_point()
```

![](05_factors_files/figure-html/unnamed-chunk-21-1.png)
]

---
class: yourturn
.center[
# Your turn 
]

For this your turn use the `fbi` object:

`fbi <- read.csv("https://raw.githubusercontent.com/Stat480-at-ISU/materials-2020/master/02_r-intro/data/fbi.csv")`

1. Introduce a rate of the number of reported offenses by population into the `fbi` data. You could use the *Ames standard* to make values comparable to a city of the size of Ames (population ~70,000).

2.  Plot boxplots of crime rates by different types of crime. How can you make axis text legible?

3. Reorder the boxplots of crime rates, such that the boxplots are ordered by their medians.

4. For one type of crime (subset!) plot boxplots of rates by state, reorder boxplots by median crime rates

---
class: inverse, center, middle
# Changing the values of the levels

---
## Changing Levels' names - using base R

```r
levels(gss_cat$partyid)
```

```
##  [1] "No answer"          "Don't know"         "Other party"       
##  [4] "Strong republican"  "Not str republican" "Ind,near rep"      
##  [7] "Independent"        "Ind,near dem"       "Not str democrat"  
## [10] "Strong democrat"
```

```r
levels(gss_cat$partyid)[2] <- "Not sure"

levels(gss_cat$partyid)
```

```
##  [1] "No answer"          "Not sure"           "Other party"       
##  [4] "Strong republican"  "Not str republican" "Ind,near rep"      
##  [7] "Independent"        "Ind,near dem"       "Not str democrat"  
## [10] "Strong democrat"
```

---
## Changing Levels' names - using forcats

The most general and powerful tool is `fct_recode()`

- can recode, or change, the value of each level
  - levels not mentioned are left as is

```r
levels(gss_cat$partyid)
```

```r
partyid2 = fct_recode(gss_cat$partyid,
    "Republican, strong"    = "Strong republican", "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep", "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat", "Democrat, strong"      = "Strong democrat"
)
levels(partyid2)
```

```
##  [1] "No answer"             "Not sure"              "Other party"          
##  [4] "Republican, strong"    "Republican, weak"      "Independent, near rep"
##  [7] "Independent"           "Independent, near dem" "Democrat, weak"       
## [10] "Democrat, strong"
```

---
## Lump levels together

If you want to collapse a lot of levels, `fct_collapse()` is a useful variant of `fct_recode()`.

For each new variable, you can provide a vector of old levels:

```r
partyid2 = fct_collapse(gss_cat$partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat"))
levels(partyid2)
```

```
## [1] "other"    "Not sure" "rep"      "ind"      "dem"
```

You can lump together `n` small groups with `fct_lump()`

```r
relig2 = fct_lump(gss_cat$relig, n = 6)
summary(relig2)
```

```
## Protestant   Catholic       None     Jewish  Christian      Other 
##      10846       5124       3523        388        689        913
```

---
## Resources

- reference/document: https://forcats.tidyverse.org/

- Wickham & Grolemund's <a href="http://r4ds.had.co.nz/factors.html">chapter on factors</a> in *R for Data Science*

- Roger Peng: [*stringsAsFactors: An unauthorized biography*](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/")

- Thomas Lumley: <a href="http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh">stringsAsFactors = &lt;sigh&gt;</a>

- Artwork by [@allison_horst](https://twitter.com/allison_horst?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor)