MACS 30500 LECTURE 5

Topics: Factors in R. More dplyr. Data cleaning (recoding/renaming variables; missing data)

1 / 29

Agenda

  • Factors

  • More dplyr: a review of the verbs we already know, plus new ones

  • Data cleaning

    • variable names:
      • recoding and renaming variables
      • syntactic vs. non-syntactic variable names
    • missing data
2 / 29

Factors

3 / 29

What are factors in R?

Takeaway: Categorical variables, also called discrete variables, are variables with a fixed set of possible values. R uses factors to work with these variables.

So a "factor" in R is a data structure for working with categorical variables more effectively:

  • The default data structure for categorical variables: character vector
  • The data structure for categorical variables transformed into factors: factor

What factors do:

  • Enable sorting of levels or categories of a categorical variable in your desired order
  • Example: you have a Likert Scale and want to create a bar chart that sorts the bars from "Strongly Agree" to "Strongly Disagree"
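A minimal sketch of the Likert example, using made-up survey responses (the data and the names responses and likert_levels are assumptions for illustration). It shows that a factor keeps categories in the order we define, while a character vector falls back to alphabetical order:

```r
# Hypothetical survey responses (made-up data for illustration)
responses <- c("Agree", "Strongly Disagree", "Neutral", "Agree", "Strongly Agree")

# The order we want, from most to least agreement
likert_levels <- c("Strongly Agree", "Agree", "Neutral",
                   "Disagree", "Strongly Disagree")

responses_fct <- factor(responses, levels = likert_levels)

# Character vector: categories are tabulated alphabetically
table(responses)

# Factor: categories follow the level order, including the unused "Disagree"
table(responses_fct)
```

A bar chart built from responses_fct would place its bars in the same order as the levels.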
4 / 29

Steps to convert a character vector to factor

Define a character vector (i.e., a categorical variable) with four months and sort it. What do you notice?

x1 <- c("Dec", "Apr", "Jan", "Mar")
class(x1)
## [1] "character"
sort(x1)
## [1] "Apr" "Dec" "Jan" "Mar"
5 / 29

Steps to convert a character vector to factor

We run into a problem while sorting our variable x1:

R's default behavior is to sort character vectors alphabetically.

However, as humans, we understand that this is not the best way to sort months. Instead, we may want to sort months chronologically. To tell R about that order, we need to convert the vector to a factor.

Let’s walk through the steps to do that!

6 / 29

Step 1: Define Levels

First, we define all possible values that the variable can take. We do so by creating another character vector with values in the exact order we want them to be:

month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
month_levels
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
class(month_levels)
## [1] "character"
7 / 29

Step 2: Convert to Factor

We then use the factor() function with the character vector we just created (month_levels) to convert the original character vector (x1) into a factor:

y1 <- factor(x1, levels = month_levels)
y1
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
8 / 29

Step 3: Test by sorting

Sort the factor vector y1 and the original character vector x1 and observe the differences:

sort(y1)
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(x1)
## [1] "Apr" "Dec" "Jan" "Mar"
9 / 29

Specify levels and labels

Another situation you might encounter is working with a numeric vector, rather than with a character vector, like this:

x2 <- c(12, 4, 1, 3)
class(x2)
## [1] "numeric"

In cases like this, the numbers are used to represent specific, discrete values. In our example, they are individual months.

10 / 29

Specify levels and labels

In such cases, you want to define levels and labels separately to achieve the desired order (here from 1 to 12) and the right labels (here the names of the months, using our previously defined variable month_levels):

y2 <- factor(x2,
             levels = seq(from = 1, to = 12),
             labels = month_levels)
y2
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y2)
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
11 / 29

Specify levels and labels

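One important thing to remember when converting a variable into a factor is that the number of levels and the number of labels must match: each level is associated with exactly one label. If you do not supply levels, factor() uses the sorted unique values of the vector as levels.

This code does not work, because x2 has four unique values but month_levels supplies twelve labels:

y2 <- factor(x2,
             labels = month_levels)

This code works, because it supplies four labels, listed in the order of the sorted values (1, 3, 4, 12):

y2 <- factor(x2,
             labels = c("January", "March", "April", "December"))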
y2
sort(y2)
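
The two cases above can be checked side by side. This is a small sketch that wraps the failing call in tryCatch() so the script keeps running; the variable name result is an assumption for illustration:

```r
x2 <- c(12, 4, 1, 3)
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Without `levels`, factor() uses the sorted unique values of x2 as levels
sort(unique(x2))

# Twelve labels for four levels: factor() stops with an error,
# which we capture as a message instead of halting
result <- tryCatch(
  factor(x2, labels = month_levels),
  error = function(e) conditionMessage(e)
)
result

# Four labels for four levels, given in the order of the sorted values
y2 <- factor(x2, labels = c("January", "March", "April", "December"))
y2
sort(y2)
```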
12 / 29

forcats package

The previous slides summarized the theory and logic behind factors in R.

In practice, you can use the package forcats to simplify your life when you work with factors. It provides several functions, such as:

  • fct_reorder() reorders the levels of a factor by the values of another variable
  • fct_relevel() changes the order of a factor variable by hand
  • fct_infreq() reorders a factor variable by its frequency of values (with the largest first)
  • fct_lump() collapses the least/most frequent values of a factor into “other”
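
A short sketch of the four functions on a small made-up vector (the names day and tip and their values are assumptions for illustration). Each call only changes the order or grouping of the levels, not the underlying values:

```r
library(forcats)

# Made-up data for illustration
day <- factor(c("Mon", "Wed", "Fri", "Wed", "Thu", "Sat", "Sat"))
tip <- c(10, 12, 20, 8, 25, 25, 30)

levels(day)  # default order is alphabetical

# fct_reorder(): order levels by a summary (the median by default) of another variable
by_tip <- fct_reorder(day, tip)
levels(by_tip)

# fct_relevel(): move the named levels to the front, in the order given
by_hand <- fct_relevel(day, "Mon", "Wed", "Thu")
levels(by_hand)

# fct_infreq(): order levels by how often they occur, most frequent first
by_freq <- fct_infreq(day)
levels(by_freq)

# fct_lump(): keep the n most frequent levels and collapse the rest into "Other"
lumped <- fct_lump(day, n = 2)
levels(lumped)
```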

Documentation and Cheat Sheet: https://forcats.tidyverse.org/

Let's see one example...

13 / 29

Data: tips left at restaurant by individuals by weekday

library(tidyverse)
df <- tibble(
  week = c("Mon", "Wed", "Fri", "Wed", "Thu", "Sat", "Sat"),
  tip = c(10, 12, 20, 8, 25, 25, 30)
)
df
## # A tibble: 7 × 2
##   week    tip
##   <chr> <dbl>
## 1 Mon      10
## 2 Wed      12
## 3 Fri      20
## 4 Wed       8
## 5 Thu      25
## 6 Sat      25
## 7 Sat      30
14 / 29

Data: tips left at restaurant by individuals by weekday

Our Goal: Create a bar chart showing the amount of tips (y) given on each day of the week (x), with days ordered from Mon to Sat

Step 1. Use the correct function from the forcats package to order the data:

  • transform variable week into factor
  • reorder the day of the week according to the amount of tip given

Step 2. Use ggplot to create the bar chart

15 / 29

This is what our final graph should look like:

16 / 29

Let's try!

First of all, why can't we just write this?

ggplot(df,
       aes(x = week)) +
  geom_bar()

Or this?

ggplot(df,
       aes(x = tip)) +
  geom_bar()
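
One way to see the problem without drawing anything: with only x mapped, geom_bar() counts rows per category, and a character variable is placed on the axis in alphabetical order. The base R sketch below reproduces those two numbers with table() and tapply(); the name tips is an assumption, used here to avoid overwriting df:

```r
tips <- data.frame(
  week = c("Mon", "Wed", "Fri", "Wed", "Thu", "Sat", "Sat"),
  tip = c(10, 12, 20, 8, 25, 25, 30)
)

# What a bar chart of week alone would show: the number of rows per day,
# with the days in alphabetical order
table(tips$week)

# What we actually want on the y axis: the total tip per day
tapply(tips$tip, tips$week, sum)
```

Neither naive plot uses both variables, and neither controls the order of the days.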
17 / 29

Let's try!

The best function to use in this example is fct_relevel() which changes the order of a factor by hand: https://forcats.tidyverse.org/reference/fct_relevel.html

# order
df %>%
  mutate(week = fct_relevel(week,
                            "Mon", "Wed", "Thu", "Fri", "Sat")) %>%
  group_by(week) %>%
  summarize(total_tip = sum(tip)) %>%
  # plot
  ggplot(aes(x = week, y = total_tip)) +
  geom_bar(stat = "identity") + # stat = "identity" is needed because y is mapped; geom_col() is equivalent
  labs(title = "Bar Chart with Reordered Weekdays using `fct_relevel()`",
       x = "Weekday",
       y = "Tip ($)")
18 / 29

Let's try!

Notice that we can do the same task with the base R function factor() rather than with functions from the forcats package:

# order
df %>%
  mutate(week = factor(week,
                       levels = c("Mon", "Wed", "Thu", "Fri", "Sat"))) %>%
  group_by(week) %>%
  summarize(total_tip = sum(tip)) %>%
  # plot
  ggplot(aes(x = week, y = total_tip)) +
  geom_bar(stat = "identity") +
  labs(title = "Bar Chart with Reordered Weekdays using `factor()`",
       x = "Weekday",
       y = "Tip ($)")
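
A quick check, sketched outside the plotting pipeline, that the two approaches give the same factor for this data, plus one difference worth knowing. The variable names below are assumptions for illustration:

```r
library(forcats)

week <- c("Mon", "Wed", "Fri", "Wed", "Thu", "Sat", "Sat")

via_forcats <- fct_relevel(week, "Mon", "Wed", "Thu", "Fri", "Sat")
via_base <- factor(week, levels = c("Mon", "Wed", "Thu", "Fri", "Sat"))

# Same levels in the same order, same values
identical(levels(via_forcats), levels(via_base))
identical(as.character(via_forcats), as.character(via_base))

# Difference: a value missing from `levels` becomes NA with factor(),
# while fct_relevel() keeps it and places it after the named levels
base_partial <- factor(c("Mon", "Tue"), levels = "Mon")
forcats_partial <- fct_relevel(c("Mon", "Tue"), "Mon")
base_partial
forcats_partial
```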
19 / 29

Takeaways from working with factors in R:

  • The code might run, but the output may not achieve our goals! Make sure to check the documentation and your data for the best function to achieve your desired results

  • With factor variables: remember that levels and labels must match!

  • Download today's in-class materials for more practice exercises on working with factors!

20 / 29

More dplyr for data transformation: a review of the verbs we already know, plus new ones

21 / 29

These are the main verbs we learned last week:

function()     Action performed
filter()       Picks observations from the data frame based on their values (operates on rows)
arrange()      Changes the order of observations based on their values (operates on rows)
select()       Picks variables from the data frame based on their names (operates on columns)
rename()       Changes the name of columns in the data frame
mutate()       Creates new columns from existing ones
group_by()     Changes the unit of analysis from the complete data frame to individual groups
summarize()    Collapses the data frame to a smaller number of rows to summarize the larger data
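
As a refresher, here is one possible pipeline that chains all of these verbs on the tips data from the factor slides. The new column names (day, big_tip, total_tip, n_big) and the threshold of 20 are assumptions chosen for illustration:

```r
library(dplyr)

tips <- tibble(
  week = c("Mon", "Wed", "Fri", "Wed", "Thu", "Sat", "Sat"),
  tip = c(10, 12, 20, 8, 25, 25, 30)
)

summary_tbl <- tips %>%
  filter(tip >= 10) %>%                  # keep rows with a tip of at least 10
  rename(day = week) %>%                 # new_name = old_name
  mutate(big_tip = tip >= 20) %>%        # new column built from an existing one
  group_by(day) %>%                      # one group per day
  summarize(total_tip = sum(tip),        # one row per group
            n_big = sum(big_tip)) %>%
  arrange(desc(total_tip)) %>%           # largest total first
  select(day, total_tip, n_big)          # keep and order columns by name

summary_tbl
```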
22 / 29

Today we review the main verbs and add these new verbs:

function()     Action performed
relocate()     Changes the order of columns based on their names (compare arrange(), which reorders rows)
count()        Counts the number of observations in each group
n_distinct()   Counts the number of unique values in a given column, used together with summarize()
distinct()     Returns unique rows from a data frame based on specified columns
across()       Applies the same operation to multiple columns simultaneously

We review old and new verbs using lecture5-more-dplyr.Rmd from today's in-class materials (download them from the website)
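
Before opening the Rmd, here is a minimal sketch of each new verb. It is not taken from the in-class file; the bill column and its values are made up so that across() has two numeric columns to work on:

```r
library(dplyr)

tips <- tibble(
  week = c("Mon", "Wed", "Fri", "Wed", "Thu", "Sat", "Sat"),
  tip = c(10, 12, 20, 8, 25, 25, 30),
  bill = c(50, 60, 100, 40, 125, 125, 150)  # hypothetical extra column
)

# relocate(): move tip in front of week; rows are untouched
moved <- relocate(tips, tip, .before = week)

# count(): one row per day, with the number of rows in column n
per_day <- count(tips, week)

# n_distinct(): number of unique days, used inside summarize()
n_days <- summarize(tips, n_days = n_distinct(week))

# distinct(): the unique values of week, one per row
unique_days <- distinct(tips, week)

# across(): apply mean() to both numeric columns at once
means <- summarize(tips, across(c(tip, bill), mean))
```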

23 / 29

Data cleaning: recoding/renaming variables & syntactic vs. non-syntactic variable names

24 / 29

Data cleaning: recoding/renaming variables

Renaming variables: change variable names

Recoding variables: change the names of the levels of categorical variables

Code is in lecture5-rename-recode.Rmd from today's in-class materials (download them from the website)
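
A minimal sketch of both operations, assuming dplyr's rename() and forcats' fct_recode() as the tools (the in-class file may use other functions). The data frame survey, its columns, and the codes "y" and "n" are made up for illustration:

```r
library(dplyr)
library(forcats)

# Hypothetical survey data with terse names and codes
survey <- tibble(
  q1 = c("y", "n", "y"),
  resp_age = c(31, 45, 27)
)

cleaned <- survey %>%
  # Renaming: change the column names (new_name = old_name)
  rename(owns_car = q1, age = resp_age) %>%
  # Recoding: change the level names of a categorical variable (new = old)
  mutate(owns_car = fct_recode(owns_car, "Yes" = "y", "No" = "n"))

cleaned
```

Both functions use the same new = old direction, which is easy to get backwards.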

25 / 29

Data cleaning: syntactic vs. non-syntactic variable names

A syntactic name is what R considers a valid name: it consists of letters, digits, . and _, and it must start with a letter, or with a . that is not followed by a digit.

A non-syntactic name is a name that R does not consider valid: a name that contains spaces, starts with a digit or a symbol such as _, or is a reserved word such as TRUE, NULL, if, or function. See the complete list of reserved words by typing ?Reserved in your Console.

Code is in lecture5-rename-recode.Rmd from today's in-class materials (download them from the website)

Why does this matter?

Because you will encounter non-syntactic names more frequently than you might think, and in R, you need to use backticks (not quotes!) to handle them.
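
A small base R sketch of the backtick rule. The data frame scores and its column names are made up for illustration; check.names = FALSE is what tells data.frame() to keep the non-syntactic names as typed:

```r
# Two non-syntactic column names: one contains a space, one starts with a digit
scores <- data.frame(
  `student name` = c("Ana", "Ben"),
  `2023 score` = c(90, 85),
  check.names = FALSE
)

names(scores)

# Backticks let us refer to the column; quotes would just be a string
avg <- mean(scores$`2023 score`)
avg

# make.names() shows the syntactic versions R would create by default
fixed <- make.names(names(scores))
fixed
```

Renaming such columns to syntactic names early in a cleaning script usually makes the rest of the code easier to write.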

26 / 29

Data cleaning: missing data

27 / 29

What are missing data?

Our book distinguishes between two kinds of missing data:

  • explicit: cells where you see an NA

  • implicit: absent data (e.g., an entire row is absent because not collected, etc.)

We focus on explicit missing data. I recommend reviewing Chapter 18 of "R for Data Science" for more info on implicit missing data.
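
To make the distinction concrete, here is a base R sketch with made-up sales data (the data frame and its columns are assumptions). One value is explicitly missing, and one whole row is implicitly missing; building every year and half combination turns the implicit gap into an explicit NA. In the tidyverse, tidyr::complete() does a similar job.

```r
sales <- data.frame(
  year = c(2020, 2020, 2021),
  half = c(1, 2, 1),
  revenue = c(10, NA, 15)
)

# Explicit: the NA for 2020, second half
n_explicit <- sum(is.na(sales$revenue))

# Implicit: there is no row at all for 2021, second half.
# Build all combinations and merge, keeping every combination
all_combos <- expand.grid(year = unique(sales$year),
                          half = unique(sales$half))
completed <- merge(all_combos, sales, all.x = TRUE)

completed
n_after <- sum(is.na(completed$revenue))
```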

28 / 29

Common ways to handle (explicit) missing data

We review the following three ways:

  • is.na()
  • na.rm = TRUE
  • drop_na()

Code is in lecture5-missing.Rmd from today's in-class materials (download them from the website)
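
A minimal sketch of the three approaches on a small made-up data frame (tips_na and its values are assumptions, not the in-class data). is.na() and the na.rm argument are base R; drop_na() comes from tidyr:

```r
library(tidyr)

tips_na <- data.frame(
  week = c("Mon", "Wed", "Fri", "Sat"),
  tip = c(10, NA, 20, 25)
)

# is.na(): find the missing values, or count them
is.na(tips_na$tip)
n_missing <- sum(is.na(tips_na$tip))

# na.rm = TRUE: ignore missing values inside a summary function
mean_with_na <- mean(tips_na$tip)                   # NA, because one value is missing
mean_without_na <- mean(tips_na$tip, na.rm = TRUE)  # mean of 10, 20 and 25

# drop_na(): remove rows with missing values, in any column or in the named ones
complete_rows <- drop_na(tips_na)
complete_tip <- drop_na(tips_na, tip)
```

The first two leave the data unchanged, while drop_na() returns a smaller data frame, so it is worth checking how many rows it removes.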

29 / 29
