MACS 30500 LECTURE 5

.title[
# MACS 30500 LECTURE 5
]
.author[
### Topics: Factors in R. More dplyr. Data cleaning (recoding/renaming variables; missing data)
]

---

## Agenda

* Factors

* More `dplyr`: review of verbs we already know and learn new ones

* Data cleaning
  * variables names: 
      * recoding and renaming variables
      * syntactic vs. non-syntactic variable names
  * missing data

---

# Factors

---

### What are factors in R?

**Takeaway:** Categorical variables, also called discrete variables, are variables with a fixed set of possible values. R uses factors to work with these variables.

So a "factor" in R is a data structure for working with categorical variables more effectively:
* The default data structure for categorical variables: *character vector*
* The data structure for categorical variables transformed into factors: *factor*

**What factors do:**
* Enable sorting of levels or categories of a categorical variable in your desired order
* Example: you have a Likert Scale and want to create a bar chart that sorts the bars from "Strongly Agree" to "Strongly Disagree"

---

### Steps to convert a character vector to factor

Define a character vector (e.g., categorical variable) with four months and sort it. What do you notice?

```r
x1 <- c("Dec", "Apr", "Jan", "Mar")
class(x1)
```

```
## [1] "character"
```

```r
sort(x1)
```

```
## [1] "Apr" "Dec" "Jan" "Mar"
```

---

### Steps to convert a character vector to factor

**We run into a problem while sorting our variable `x1`:**

R's default behavior is to sort character vectors alphabetically.

However, as humans, we understand that this is not the best way to sort months. Instead, we may want to sort months chronologically. To tell that to R, we need to convert them to factors.

Let’s walk through the steps to do that!

---

### Step 1: Define Levels

First, we define all possible values that the variable can take. We do so by creating another character vector with values in the exact order we want them to be:

```r
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", 
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
month_levels
```

```
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
```

```r
class(month_levels)
```

```
## [1] "character"
```

---

### Step 2: Convert to Factor

We then use the `factor()` function with the character vector we just created (`month-levels`) to covert the original character vector (`x1`) into a factor:

```r
y1 <- factor(x1, levels = month_levels)  
y1
```

```
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
```

---

### Step 3: Test by sorting

Sort the factor vector `y1` and the original character vector `x1` and observe the differences:

```r
sort(y1)
```

```
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
```

```r
sort(x1)
```

```
## [1] "Apr" "Dec" "Jan" "Mar"
```

---

### Specify levels and labels

Another situation you might encounter is working with a numeric vector, rather than with a character vector, like this:

```r
x2 <- c(12, 4, 1, 3)
class(x2)
```

```
## [1] "numeric"
```

In cases like this, the numbers are used to represent specific, discrete values. In our example, they are individual months.

---

### Specify levels and labels

In such cases, you want to define levels and labels separately to achieve the desired order (here from 1 to 12) and the right labels (here the names of the months, using our previously defined variable `month_levels`):

```r
y2 <- factor(x2, 
  levels = seq(from = 1, to = 12),
  labels = month_levels)
y2
```

```
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
```

```r
sort(y2)
```

```
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
```

---

### Specify levels and labels

One important thing to remember when converting a variable into a factor is that **the number of levels and labels must match**, that is each level is associated with one label.

This code *does not work:*
```
y2 <- factor(x2, 
             labels = month_levels)
```

This code works:
```
y2 <- factor(x2, 
             labels = c("January", "March", "April", "December"))
y2
sort(y2)
```

---

## `forcats` package

The previous slides summarized the theory and logic behind factors in R.

In practice, you can use the package `forcats` to simplify your life when you work with factors. It provides several functions, such as:

- `fct_reorder()` reorders a factor variable by levels of another variable
- `fct_relevel()` changes the order of a factor variable by hand
- `fct_infreq()` reorders a factor variable by its frequency of values (with the largest first)
- `fct_lump()` collapses the least/most frequent values of a factor into “other”

Documentation and Cheat Sheet: https://forcats.tidyverse.org/

*Let's see one example...*

---

### Data: tips left at restaurant by individuals by weekday

```r
library(tidyverse)

df <- tibble(
  week = c("Mon", "Wed", "Fri", "Wed", "Thu", "Sat", "Sat"),
  tip = c(10, 12, 20, 8, 25, 25, 30)
)
df
```

```
## # A tibble: 7 × 2
##   week    tip
##   <chr> <dbl>
## 1 Mon      10
## 2 Wed      12
## 3 Fri      20
## 4 Wed       8
## 5 Thu      25
## 6 Sat      25
## 7 Sat      30
```

---

### Data: tips left at restaurant by individuals by weekday

**Our Goal: Create a bar chart showing the amount of tips (y) given on each day of the week (x), with days ordered from Mon to Sat**

Step 1. Use the correct function from the `forcats()` package to order the data:

* transform variable `week` into factor 
* reorder the day of the `week` according to the amount of `tip` given

Step 2. Use `ggplot` to create the bar chart

---

### This is how our final graph should look like:

---

### Let's try!

First of all why we cannot just write this? 
```
ggplot(df,
  aes(x = week)) +
  geom_bar()
```

Or this?
```
ggplot(df,
  aes(x = tip)) +
  geom_bar()
```

<!-- show this code and ask why does not work

df %>%
  mutate(week = fct_reorder(week, tip)) %>% 
  ggplot(aes(x = week)) +
  geom_bar() +
  labs(title = "Bar Chart Showing Count of Weekdays",
       x = "Weekday",
       y = "Count")

Short version of the same code, works but not correct do not use for teaching

# order
df %>%
  mutate(week = fct_reorder(week,        # variable to reorder
                            tip)) %>%    # levels of the other variable
# plot
  ggplot(aes(x = week, y = tip)) +
  geom_bar(stat = "identity") +
  labs(title = "Bar Chart with Reordered Weekdays",
       x = "Weekday",
       y = "Tip ($)")

-->

---

### Let's try!

**The best function to use in this example is `fct_relevel()` which changes the order of a factor by hand: https://forcats.tidyverse.org/reference/fct_relevel.html**

```r
# order 
df %>%
  mutate(week = fct_relevel(week, 
                            "Mon", "Wed", "Thu", "Fri", "Sat")) %>%
  group_by(week) %>%
  summarize(total_tip = sum(tip)) %>%
  
# plot
  ggplot(aes(x = week, y = total_tip)) +
  geom_bar(stat = "identity") +  # stat = "identity" for clarity, can omit
  labs(title = "Bar Chart with Reordered Weekdays using `fct_relevel()`",
       x = "Weekday",
       y = "Tip ($)")
```

---

### Let's try!

**Notice, we can do the same task with the base R function `factor()` rather than using functions from the `forcats` pagage **

```r
# order 
df %>%
  mutate(week = factor(week, 
                       c("Mon", "Wed", "Thu", "Fri", "Sat"))) %>%
  group_by(week) %>%
  summarize(total_tip = sum(tip)) %>%
  
# plot
  ggplot(aes(x = week, y = total_tip)) +
  geom_bar(stat = "identity") + 
  labs(title = "Bar Chart with Reordered Weekdays using `factor()`",
       x = "Weekday",
       y = "Tip ($)")
```

<!--

NOTES FOR ME NEXT TIME I TEACH THIS

Add this on slides: Keep these tricks in mind for the homework assignments, especially HW3 (e.g., how to change a variable into a factor and manually pass labels for your graphs; and how to reorder a variable chronologically (here Mon to Sat) for plotting purposes)

With factors: levels and labels must match!!

Consider putting this challenge in the practice exercises for today as well

run the code and check the graphs are correct

For the dplyr part: add in the Rmd file demo an example of what happens if you calculate the mean of a boolean or logical vectors (T and F), say that T are interpreted as 1 and F are 0 for R; this is useful again for HW3!  
-->

---

### Takeaways from working with factors in R:

* The code might run, but the output may not achive our goals! Make sure to check the documentation and your data for the best function to achieve your desired results

* With factor variables: remember that levels and labels must match!

* **Download today's in-class materials for more practice exercises on working with factors!**

---

## More `dplyr` for data transformation: review of verbs we already know and learn new ones

---

### These are the **main verbs** we learned last week:

`function()`  | Action performed
--------------|--------------------------------------------------------
`filter()`    | Picks observations from the data frame based on their values (operates on rows)
`arrange()`   | Changes the order of observations based on their values (operates on rows)
`select()`    | Picks variables from the data frame based on their names (operates on columns)
`rename()`    | Changes the name of columns in the data frame
`mutate()`    | Creates new columns from existing ones
`group_by()`  | Changes the unit of analysis from the complete data frame to individual groups
`summarize()` | Collapses the data frame to a smaller number of rows to summarize the larger data

---

### Today we review the main verbs and add these **new verbs**:

`function()`  | Action performed
--------------|--------------------------------------------------------
`relocate()`  | Changes the order of variables based on their name (operates on columns) vs. `arrange()` which operates on rows 
`count()`     | Counts total observation by group
`n_distinct()`| Counts the number of unique values in a given column, used together with `summarize()`
`distinct()`  | Returns unique rows from a dataframe based on specified columns
`across()`    | Performs the same operation to multiple columns simultaneously

We review old and new verbs using `lecture5-more-dplyr.Rmd` from today's in-class materials (download them from the website)

---

## Data cleaning: recoding/renaming variables & syntactic vs. non-syntactic variables names

---

### Data cleaning: recoding/renaming variables

**Renaming variables:** change variable names

**Recoding variables:** change the name of the levels of categorical variables

Code is in `lecture5-rename-recode.Rmd` from today's in-class materials (download them from the website)

---

### Data cleaning: syntactic vs. non-syntactic variables names

**A syntactic name** is what R considers a valid name: letters, digits, `.` and `_` but it can’t begin with symbols or a with a digit.

**A non-syntactic name** is a name that R does not consider a valid name: names that contain spaces, start with a digit or a symbol, or use reserved words such as TRUE, NULL, if, or function names. See the complete list by typing `?Reserved` in your Console.

Code is in `lecture5-rename-recode.Rmd` from today's in-class materials (download them from the website)

*Why does this matter??*

Because you will encounter non-syntactic names more frequently than you might think, and in R, you need to use backticks (not quotes!) to handle them.

---

## Data cleaning: missing data

---

### What are missing data?

Our book distinguishes missing data between:

* explicit: cells where you see a "NA"

* implicit: absent data (e.g., an entire row is absent because not collected, etc.)

We focus on explicit missing data. I recommend reviewing [Chapter 18](https://r4ds.hadley.nz/missing-values) of "R for Data Science" for more info on implicit missing data.

---

### Common ways to handle (explicit) missing data

We review the following three ways:

* `is.na()` 
* `na_rm = TRUE`
* `drop_na()`

Code is in `lecture5-missing.Rmd` from today's in-class materials (download them from the website)