Factors
More dplyr: review of verbs we already know and learn new ones
Data cleaning
Takeaway: Categorical variables, also called discrete variables, are variables with a fixed set of possible values. R uses factors to work with these variables.
So a "factor" in R is a data structure for working with categorical variables more effectively:
What factors do:
Define a character vector (e.g., categorical variable) with four months and sort it. What do you notice?
x1 <- c("Dec", "Apr", "Jan", "Mar")
class(x1)
## [1] "character"
sort(x1)
## [1] "Apr" "Dec" "Jan" "Mar"
We run into a problem while sorting our variable x1:
R's default behavior is to sort character vectors alphabetically.
However, as humans, we understand that this is not the best way to sort months. Instead, we may want to sort months chronologically. To tell that to R, we need to convert them to factors.
Let’s walk through the steps to do that!
First, we define all possible values that the variable can take. We do so by creating another character vector with values in the exact order we want them to be:
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
month_levels
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
class(month_levels)
## [1] "character"
We then use the factor() function with the character vector we just created (month_levels) to convert the original character vector (x1) into a factor:
y1 <- factor(x1, levels = month_levels)
y1
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Sort the factor vector y1 and the original character vector x1 and observe the differences:
sort(y1)
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(x1)
## [1] "Apr" "Dec" "Jan" "Mar"
Another situation you might encounter is working with a numeric vector, rather than with a character vector, like this:
x2 <- c(12, 4, 1, 3)
class(x2)
## [1] "numeric"
In cases like this, the numbers are used to represent specific, discrete values. In our example, they are individual months.
In such cases, you want to define levels and labels separately to achieve the desired order (here from 1 to 12) and the right labels (here the names of the months, using our previously defined variable month_levels):
y2 <- factor(x2, levels = seq(from = 1, to = 12), labels = month_levels)
y2
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y2)
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
One important thing to remember when converting a variable into a factor is that the number of levels and labels must match; that is, each level is associated with exactly one label.
This code does not work:
y2 <- factor(x2, labels = month_levels)
This code works:
y2 <- factor(x2, labels = c("January", "March", "April", "December"))
y2
sort(y2)
forcats package
The previous slides summarized the theory and logic behind factors in R.
In practice, you can use the package forcats to simplify your life when you work with factors. It provides several functions, such as:
fct_reorder() reorders the levels of a factor variable by the values of another variable
fct_relevel() changes the order of the levels of a factor variable by hand
fct_infreq() reorders the levels of a factor variable by their frequency (with the most frequent first)
fct_lump() collapses the least/most frequent values of a factor into “other”
Documentation and Cheat Sheet: https://forcats.tidyverse.org/
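Before the worked example, here is a minimal sketch of three of these helpers on a toy vector. The `drink` and `price` vectors are hypothetical and only serve to show how each function changes the order of the levels; `fct_reorder()` is used with its default summary function (the median).

```r
library(forcats)

# Hypothetical toy data: one drink order per element, with its price
drink <- factor(c("tea", "coffee", "juice", "coffee", "tea", "coffee"))
price <- c(2, 3, 5, 3, 2, 3)

# Default levels are alphabetical
default_levels <- levels(drink)

# Most frequent level first: coffee (3 orders), tea (2), juice (1)
by_frequency <- levels(fct_infreq(drink))

# Ordered by the median price within each level: tea (2), coffee (3), juice (5)
by_price <- levels(fct_reorder(drink, price))

# Move "tea" to the front by hand; the other levels keep their order
by_hand <- levels(fct_relevel(drink, "tea"))

default_levels
by_frequency
by_price
by_hand
```

Only the level order changes in each case; the values stored in the vector stay the same.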
Let's see one example...
library(tidyverse)
df <- tibble(
  week = c("Mon", "Wed", "Fri", "Wed", "Thu", "Sat", "Sat"),
  tip = c(10, 12, 20, 8, 25, 25, 30)
)
df
## # A tibble: 7 × 2
##   week    tip
##   <chr> <dbl>
## 1 Mon      10
## 2 Wed      12
## 3 Fri      20
## 4 Wed       8
## 5 Thu      25
## 6 Sat      25
## 7 Sat      30
Our Goal: Create a bar chart showing the amount of tips (y) given on each day of the week (x), with days ordered from Mon to Sat.
Step 1. Use the correct function from the forcats package to order the data: convert week into a factor whose levels follow the order of the weekdays.
Step 2. Use ggplot to create the bar chart of the amount of tip given on each day.

First of all, why can't we just write this?
ggplot(df, aes(x = week)) + geom_bar()
Or this?
ggplot(df, aes(x = tip)) + geom_bar()
The best function to use in this example is fct_relevel(), which changes the order of a factor's levels by hand: https://forcats.tidyverse.org/reference/fct_relevel.html
# order
df %>%
  mutate(week = fct_relevel(week, "Mon", "Wed", "Thu", "Fri", "Sat")) %>%
  group_by(week) %>%
  summarize(total_tip = sum(tip)) %>%
  # plot
  ggplot(aes(x = week, y = total_tip)) +
  geom_bar(stat = "identity") + # stat = "identity" is needed because we supply y; geom_col() is equivalent
  labs(title = "Bar Chart with Reordered Weekdays using `fct_relevel()`",
       x = "Weekday",
       y = "Tip ($)")
Notice that we can do the same task with the base R function factor() rather than with functions from the forcats package:
# order
df %>%
  mutate(week = factor(week, levels = c("Mon", "Wed", "Thu", "Fri", "Sat"))) %>%
  group_by(week) %>%
  summarize(total_tip = sum(tip)) %>%
  # plot
  ggplot(aes(x = week, y = total_tip)) +
  geom_bar(stat = "identity") +
  labs(title = "Bar Chart with Reordered Weekdays using `factor()`",
       x = "Weekday",
       y = "Tip ($)")
The code might run, but the output may not achieve our goals! Make sure to check the documentation and your data to find the best function for your desired results.
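To see why the two shorter plots fall short of the goal, here is a minimal base R sketch. It rebuilds the same data as a plain data frame and inspects what a bar chart of week alone would draw (row counts, in alphabetical order) versus what we actually want (total tips, in weekday order). The object names `row_counts`, `alphabetical`, and `total_tip` are illustrative only.

```r
# Same data as df in the example, as a base R data frame
df <- data.frame(
  week = c("Mon", "Wed", "Fri", "Wed", "Thu", "Sat", "Sat"),
  tip  = c(10, 12, 20, 8, 25, 25, 30)
)

# geom_bar() with only x = week counts rows per day; it never looks at tip
row_counts <- table(df$week)
row_counts

# A character column is laid out alphabetically, not in weekday order
alphabetical <- levels(factor(df$week))
alphabetical

# What we want: total tip per day, with the days in weekday order
df$week <- factor(df$week, levels = c("Mon", "Wed", "Thu", "Fri", "Sat"))
total_tip <- tapply(df$tip, df$week, sum)
total_tip
```

The second attempt, with x = tip, has a similar problem: it counts how often each tip amount occurs and ignores the day entirely.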
With factor variables: remember that levels and labels must match!
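As a recap of the levels and labels rule, here is a minimal base R sketch. It uses R's built-in `month.abb` constant in place of month_levels (an assumption made to keep the block self-contained), and wraps the failing call in `tryCatch()` so the script keeps running. The last line is an aside not covered in the slides: a value that is absent from `levels` becomes NA without a warning.

```r
x2 <- c(12, 4, 1, 3)

# Without `levels`, factor() uses the sorted unique values (1, 3, 4, 12) as levels.
# Twelve labels cannot be paired with four levels, so R raises an error.
failed <- tryCatch(
  factor(x2, labels = month.abb),
  error = function(e) conditionMessage(e)
)
failed

# Supplying all twelve levels makes the pairing one-to-one
y2 <- factor(x2, levels = 1:12, labels = month.abb)
sorted_months <- as.character(sort(y2))
sorted_months

# Aside: a value that is not among the levels silently becomes NA
unmatched <- factor(c("Jan", "Sept"), levels = month.abb)
unmatched
```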
Download today's in-class materials for more practice exercises on working with factors!
dplyr for data transformation: review of verbs we already know and learn new ones
| function() | Action performed |
|---|---|
| filter() | Picks observations from the data frame based on their values (operates on rows) |
| arrange() | Changes the order of observations based on their values (operates on rows) |
| select() | Picks variables from the data frame based on their names (operates on columns) |
| rename() | Changes the name of columns in the data frame |
| mutate() | Creates new columns from existing ones |
| group_by() | Changes the unit of analysis from the complete data frame to individual groups |
| summarize() | Collapses the data frame to a smaller number of rows to summarize the larger data |
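The verbs in this table can be chained into one pipeline. The sketch below is a minimal illustration on a small hypothetical tibble called `tips` (the same values as the tip example earlier in these slides); the column names `weekday`, `big_tip`, `total_tip`, and `n_big` are made up for the example.

```r
library(dplyr)

# Hypothetical data: one row per tip received
tips <- tibble(
  day = c("Mon", "Wed", "Fri", "Wed", "Thu", "Sat", "Sat"),
  tip = c(10, 12, 20, 8, 25, 25, 30)
)

weekday_summary <- tips %>%
  filter(tip >= 10) %>%                 # keep rows with a tip of at least 10
  rename(weekday = day) %>%             # new_name = old_name
  mutate(big_tip = tip >= 20) %>%       # new column built from an existing one
  select(weekday, tip, big_tip) %>%     # pick columns by name
  group_by(weekday) %>%                 # analyze each weekday separately
  summarize(
    total_tip = sum(tip),               # one row per weekday
    n_big = sum(big_tip)
  ) %>%
  arrange(desc(total_tip))              # largest total first

weekday_summary
```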
| function() | Action performed |
|---|---|
| relocate() | Changes the order of variables based on their name (operates on columns) vs. arrange() which operates on rows |
| count() | Counts the total number of observations by group |
| n_distinct() | Counts the number of unique values in a given column, used together with summarize() |
| distinct() | Returns unique rows from a data frame based on specified columns |
| across() | Performs the same operation on multiple columns simultaneously |
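Here is a minimal sketch of the new verbs, one call each, on a small hypothetical tibble. The data and the names `tips`, `server`, and `bill` are invented for illustration and are not taken from the in-class materials.

```r
library(dplyr)

# Hypothetical data: one row per table served
tips <- tibble(
  day    = c("Mon", "Wed", "Wed", "Sat", "Sat"),
  server = c("Ana", "Ben", "Ana", "Ben", "Ben"),
  tip    = c(10, 12, 8, 25, 30),
  bill   = c(50, 60, 40, 100, 120)
)

# relocate(): move the tip column in front of day
relocated <- tips %>% relocate(tip, .before = day)

# count(): number of rows per day (result column is called n)
counts <- tips %>% count(day)

# n_distinct(): number of unique servers, used inside summarize()
n_serv <- tips %>% summarize(n_servers = n_distinct(server))

# distinct(): unique values of day, one row each
days <- tips %>% distinct(day)

# across(): apply mean() to several columns at once
means <- tips %>% summarize(across(c(tip, bill), mean))

relocated
counts
n_serv
days
means
```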
We review old and new verbs using lecture5-more-dplyr.Rmd from today's in-class materials (download them from the website)
Renaming variables: change variable names
Recoding variables: change the name of the levels of categorical variables
Code is in lecture5-rename-recode.Rmd from today's in-class materials (download them from the website)
A syntactic name is what R considers a valid name: it consists of letters, digits, . and _, and it cannot begin with _ or a digit (or with a . followed by a digit).
A non-syntactic name is a name that R does not consider valid: names that contain spaces, start with a digit or a symbol, or use reserved words such as TRUE, NULL, if, and function. See the complete list by typing ?Reserved in your Console.
Why does this matter??
Because you will encounter non-syntactic names more frequently than you might think, and in R, you need to use backticks (not quotes!) to handle them.
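A minimal base R sketch of what this looks like in practice. The data frame `sales` and its column names are hypothetical; `check.names = FALSE` is set so that `data.frame()` keeps the non-syntactic names instead of repairing them.

```r
# Hypothetical data frame with two non-syntactic column names:
# one starts with a digit and contains a space, the other is a reserved word
sales <- data.frame(
  `2023 sales` = c(10, 20),
  `if`         = c(TRUE, FALSE),
  check.names  = FALSE
)
names(sales)

# Backticks let us refer to the column; sales$2023 sales would be a syntax error
total <- sum(sales$`2023 sales`)
total

# Renaming to a syntactic name removes the need for backticks afterwards
names(sales)[names(sales) == "2023 sales"] <- "sales_2023"
sales$sales_2023

# make.names() shows how base R would repair a non-syntactic name automatically
repaired <- make.names("2023 sales")
repaired
```

Quotes would not work in place of the backticks: `"2023 sales"` is just a character string, not a reference to the column.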
Our book distinguishes between two kinds of missing data:
explicit: cells where you see an NA
implicit: absent data (e.g., an entire row is absent because it was not collected)
We focus on explicit missing data. I recommend reviewing Chapter 18 of "R for Data Science" for more info on implicit missing data.
We review the following three ways to handle explicit missing data:
is.na()
na.rm = TRUE
drop_na()
Code is in lecture5-missing.Rmd from today's in-class materials (download them from the website)
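A minimal sketch of the three approaches on a small hypothetical data frame (`scores` and its columns are invented for illustration). `is.na()` and `na.rm = TRUE` are base R; `drop_na()` comes from the tidyr package, which is part of the tidyverse.

```r
library(tidyr)

# Hypothetical data with one explicit missing value
scores <- data.frame(
  student = c("A", "B", "C"),
  score   = c(90, NA, 70)
)

# 1. is.na() flags which values are missing
missing_flags <- is.na(scores$score)
missing_flags

# 2. na.rm = TRUE tells a summary function to skip missing values;
#    without it, the result is itself NA
mean_with_na <- mean(scores$score)
mean_without_na <- mean(scores$score, na.rm = TRUE)
mean_with_na
mean_without_na

# 3. drop_na() removes every row that contains a missing value
complete_rows <- drop_na(scores)
complete_rows
```

Note that the argument is spelled `na.rm`, with a dot.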