class: center, middle, inverse, title-slide .title[ # MACS 30500 LECTURE 5 ] .author[ ### Topics: Factors in R. More dplyr. Data cleaning (recoding/renaming variables; missing data) ] --- class: inverse, middle ## Agenda * Factors * More `dplyr`: review of verbs we already know and learn new ones * Data cleaning * variables names: * recoding and renaming variables * syntactic vs. non-syntactic variable names * missing data <!-- FALL 2024 NOTES FOR NEXT TIME I added an in-class live coding (me typing and students following) on dplyr verbs. It worked but improvements for next time: * it was bit slow (close to 1hour): make it faster * add some practice exercises for students to break the flow Jean observed this class, incorporate her suggestions as well (see email) --> --- class: inverse, middle # Factors --- ### What are factors in R? **Takeaway:** Categorical variables, also called discrete variables, are variables with a fixed set of possible values. R uses factors to work with these variables. So a "factor" in R is a data structure for working with categorical variables more effectively: * The default data structure for categorical variables: *character vector* * The data structure for categorical variables transformed into factors: *factor* -- **What factors do:** * Enable sorting of levels or categories of a categorical variable in your desired order * Example: you have a Likert Scale and want to create a bar chart that sorts the bars from "Strongly Agree" to "Strongly Disagree" --- ### Steps to convert a character vector to factor Define a character vector (e.g., categorical variable) with four months and sort it. What do you notice? ```r x1 <- c("Dec", "Apr", "Jan", "Mar") class(x1) ``` ``` ## [1] "character" ``` ```r sort(x1) ``` ``` ## [1] "Apr" "Dec" "Jan" "Mar" ``` --- ### Steps to convert a character vector to factor **We run into a problem while sorting our variable `x1`:** R's default behavior is to sort character vectors alphabetically. However, as humans, we understand that this is not the best way to sort months. Instead, we may want to sort months chronologically. To tell that to R, we need to convert them to factors. Let’s walk through the steps to do that! <!-- note we use sort() because this is a standalone vector, while we used arrange() when working with dataframes --> --- ### Step 1: Define Levels First, we define all possible values that the variable can take. We do so by creating another character vector with values in the exact order we want them to be: ```r month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec") month_levels ``` ``` ## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec" ``` ```r class(month_levels) ``` ``` ## [1] "character" ``` --- ### Step 2: Convert to Factor We then use the `factor()` function with the character vector we just created (`month-levels`) to covert the original character vector (`x1`) into a factor: ```r y1 <- factor(x1, levels = month_levels) y1 ``` ``` ## [1] Dec Apr Jan Mar ## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ``` --- ### Step 3: Test by sorting Sort the factor vector `y1` and the original character vector `x1` and observe the differences: ```r sort(y1) ``` ``` ## [1] Jan Mar Apr Dec ## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ``` ```r sort(x1) ``` ``` ## [1] "Apr" "Dec" "Jan" "Mar" ``` --- ### Specify levels and labels Another situation you might encounter is working with a numeric vector, rather than with a character vector, like this: ```r x2 <- c(12, 4, 1, 3) class(x2) ``` ``` ## [1] "numeric" ``` -- In cases like this, the numbers are used to represent specific, discrete values. In our example, they are individual months. --- ### Specify levels and labels In such cases, you want to define levels and labels separately to achieve the desired order (here from 1 to 12) and the right labels (here the names of the months, using our previously defined variable `month_levels`): ```r y2 <- factor(x2, levels = seq(from = 1, to = 12), labels = month_levels) y2 ``` ``` ## [1] Dec Apr Jan Mar ## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ``` ```r sort(y2) ``` ``` ## [1] Jan Mar Apr Dec ## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ``` --- ### Specify levels and labels One important thing to remember when converting a variable into a factor is that **the number of levels and labels must match**, that is each level is associated with one label. This code *does not work:* ``` y2 <- factor(x2, labels = month_levels) ``` This code works: ``` y2 <- factor(x2, labels = c("January", "March", "April", "December")) y2 sort(y2) ``` <!-- Check how to use the `factor()` function by typing `?factor` in your console, or by typing `factor` in the search bar in the Help tab --> --- ## `forcats` package The previous slides summarized the theory and logic behind factors in R. In practice, you can use the package `forcats` to simplify your life when you work with factors. It provides several functions, such as: - `fct_reorder()` reorders a factor variable by levels of another variable - `fct_relevel()` changes the order of a factor variable by hand - `fct_infreq()` reorders a factor variable by its frequency of values (with the largest first) - `fct_lump()` collapses the least/most frequent values of a factor into “other” Documentation and Cheat Sheet: https://forcats.tidyverse.org/ *Let's see one example...* --- ### Data: tips left at restaurant by individuals by weekday ```r library(tidyverse) df <- tibble( week = c("Mon", "Wed", "Fri", "Wed", "Thu", "Sat", "Sat"), tip = c(10, 12, 20, 8, 25, 25, 30) ) df ``` ``` ## # A tibble: 7 × 2 ## week tip ## <chr> <dbl> ## 1 Mon 10 ## 2 Wed 12 ## 3 Fri 20 ## 4 Wed 8 ## 5 Thu 25 ## 6 Sat 25 ## 7 Sat 30 ``` --- ### Data: tips left at restaurant by individuals by weekday **Our Goal: Create a bar chart showing the amount of tips (y) given on each day of the week (x), with days ordered from Mon to Sat** Step 1. Use the correct function from the `forcats()` package to order the data: * transform variable `week` into factor * reorder the day of the `week` according to the amount of `tip` given Step 2. Use `ggplot` to create the bar chart --- ### This is how our final graph should look like: <img src="index_files/figure-html/unnamed-chunk-2-1.png" width="80%" style="display: block; margin: auto;" /> --- ### Let's try! First of all why we cannot just write this? ``` ggplot(df, aes(x = week)) + geom_bar() ``` Or this? ``` ggplot(df, aes(x = tip)) + geom_bar() ``` <!-- show this code and ask why does not work df %>% mutate(week = fct_reorder(week, tip)) %>% ggplot(aes(x = week)) + geom_bar() + labs(title = "Bar Chart Showing Count of Weekdays", x = "Weekday", y = "Count") Short version of the same code, works but not correct do not use for teaching # order df %>% mutate(week = fct_reorder(week, # variable to reorder tip)) %>% # levels of the other variable # plot ggplot(aes(x = week, y = tip)) + geom_bar(stat = "identity") + labs(title = "Bar Chart with Reordered Weekdays", x = "Weekday", y = "Tip ($)") --> --- ### Let's try! **The best function to use in this example is `fct_relevel()` which changes the order of a factor by hand: https://forcats.tidyverse.org/reference/fct_relevel.html** ```r # order df %>% mutate(week = fct_relevel(week, "Mon", "Wed", "Thu", "Fri", "Sat")) %>% group_by(week) %>% summarize(total_tip = sum(tip)) %>% # plot ggplot(aes(x = week, y = total_tip)) + geom_bar(stat = "identity") + # stat = "identity" for clarity, can omit labs(title = "Bar Chart with Reordered Weekdays using `fct_relevel()`", x = "Weekday", y = "Tip ($)") ``` --- ### Let's try! **Notice, we can do the same task with the base R function `factor()` rather than using functions from the `forcats` pagage ** ```r # order df %>% mutate(week = factor(week, c("Mon", "Wed", "Thu", "Fri", "Sat"))) %>% group_by(week) %>% summarize(total_tip = sum(tip)) %>% # plot ggplot(aes(x = week, y = total_tip)) + geom_bar(stat = "identity") + labs(title = "Bar Chart with Reordered Weekdays using `factor()`", x = "Weekday", y = "Tip ($)") ``` <!-- NOTES FOR ME NEXT TIME I TEACH THIS Add this on slides: Keep these tricks in mind for the homework assignments, especially HW3 (e.g., how to change a variable into a factor and manually pass labels for your graphs; and how to reorder a variable chronologically (here Mon to Sat) for plotting purposes) With factors: levels and labels must match!! Consider putting this challenge in the practice exercises for today as well run the code and check the graphs are correct For the dplyr part: add in the Rmd file demo an example of what happens if you calculate the mean of a boolean or logical vectors (T and F), say that T are interpreted as 1 and F are 0 for R; this is useful again for HW3! --> --- ### Takeaways from working with factors in R: * The code might run, but the output may not achive our goals! Make sure to check the documentation and your data for the best function to achieve your desired results * With factor variables: remember that levels and labels must match! * **Download today's in-class materials for more practice exercises on working with factors!** --- class: inverse, middle ## More `dplyr` for data transformation: review of verbs we already know and learn new ones --- ### These are the **main verbs** we learned last week: `function()` | Action performed --------------|-------------------------------------------------------- `filter()` | Picks observations from the data frame based on their values (operates on rows) `arrange()` | Changes the order of observations based on their values (operates on rows) `select()` | Picks variables from the data frame based on their names (operates on columns) `rename()` | Changes the name of columns in the data frame `mutate()` | Creates new columns from existing ones `group_by()` | Changes the unit of analysis from the complete data frame to individual groups `summarize()` | Collapses the data frame to a smaller number of rows to summarize the larger data --- ### Today we review the main verbs and add these **new verbs**: `function()` | Action performed --------------|-------------------------------------------------------- `relocate()` | Changes the order of variables based on their name (operates on columns) vs. `arrange()` which operates on rows `count()` | Counts total observation by group `n_distinct()`| Counts the number of unique values in a given column, used together with `summarize()` `distinct()` | Returns unique rows from a dataframe based on specified columns `across()` | Performs the same operation to multiple columns simultaneously We review old and new verbs using `lecture5-more-dplyr.Rmd` from today's in-class materials (download them from the website) --- class: inverse, middle ## Data cleaning: recoding/renaming variables & syntactic vs. non-syntactic variables names --- ### Data cleaning: recoding/renaming variables **Renaming variables:** change variable names **Recoding variables:** change the name of the levels of categorical variables Code is in `lecture5-rename-recode.Rmd` from today's in-class materials (download them from the website) --- ### Data cleaning: syntactic vs. non-syntactic variables names **A syntactic name** is what R considers a valid name: letters, digits, `.` and `_` but it can’t begin with symbols or a with a digit. **A non-syntactic name** is a name that R does not consider a valid name: names that contain spaces, start with a digit or a symbol, or use reserved words such as TRUE, NULL, if, or function names. See the complete list by typing `?Reserved` in your Console. Code is in `lecture5-rename-recode.Rmd` from today's in-class materials (download them from the website) -- *Why does this matter??* Because you will encounter non-syntactic names more frequently than you might think, and in R, you need to use backticks (not quotes!) to handle them. --- class: inverse, middle ## Data cleaning: missing data --- ### What are missing data? Our book distinguishes missing data between: * explicit: cells where you see a "NA" * implicit: absent data (e.g., an entire row is absent because not collected, etc.) We focus on explicit missing data. I recommend reviewing [Chapter 18](https://r4ds.hadley.nz/missing-values) of "R for Data Science" for more info on implicit missing data. --- ### Common ways to handle (explicit) missing data We review the following three ways: * `is.na()` * `na_rm = TRUE` * `drop_na()` Code is in `lecture5-missing.Rmd` from today's in-class materials (download them from the website)