class: center, middle, inverse, title-slide

.title[
# MACS 30500 LECTURE 3
]
.author[
### Topics: Introduction to dplyr. Operators. Pipes.
]

---

class: inverse, middle

# Agenda

* Intro to `dplyr` and to Programming as Problem-Solving
* Operators
* Main `dplyr` functions
* Pipes, which are written as `%>%` or `|>`

---

class: inverse, middle

# Intro to `dplyr` and to Programming as Problem-Solving

---

## `penguins`

[Meet the Palmer Penguins!](https://allisonhorst.github.io/palmerpenguins/#meet-the-palmer-penguins)

The palmerpenguins package (already installed on Workbench) contains two datasets:

* `penguins`: clean data on three species of penguins (Adelie, Chinstrap, Gentoo) from three islands in Antarctica; 344 penguins in total
* `penguins_raw`: raw data

Today we will be using the first of the two: `penguins`

---

## `penguins`

```r
library(tidyverse)
library(palmerpenguins)
head(penguins)
```

```
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>
```

--

If I ask "**What is the average body mass of Adelie penguins?**"

Think about (1) the *conceptual steps* you need to answer this question, and (2) how to *translate them into `dplyr` code*

<!--
recall there are 344 penguins across the three species, 152 of them Adelie.
Now envision the result: it will be one single number, the mean body mass of Adelie penguins!
-->

---

### Q1: What is the average body mass of an Adelie penguin?

.panelset[
.panel[.panel-name[Conceptual]

1. First, we need to get the **input** data: `penguins`
1. Next, we need to **filter** the data to keep only the observations whose `species` is Adelie
1. Finally, we need to calculate the **mean** of the variable `body_mass_g` for this group

```
## # A tibble: 5 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## # ℹ 2 more variables: sex <fct>, year <int>
```

]

.panel[.panel-name[Code]

```r
data(penguins)
penguins_adelie <- filter(.data = penguins, species == "Adelie")
summarize(.data = penguins_adelie, avg_mass_adelie = mean(body_mass_g, na.rm = TRUE))
```

```
## # A tibble: 1 × 1
##   avg_mass_adelie
##             <dbl>
## 1           3701.
```

]
]

---

### Q2: What is the average body mass for each penguin species?

.panelset[
.panel[.panel-name[Conceptual]

1. First, we need to get the **input** data: `penguins`
1. Next, we need to **group** the observations by `species`
1. Finally, we need to calculate the **mean** of the variable `body_mass_g` for each group

```
## # A tibble: 5 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## # ℹ 2 more variables: sex <fct>, year <int>
```

]

.panel[.panel-name[Code]

```r
data(penguins)
penguins_species <- group_by(.data = penguins, species)
summarize(.data = penguins_species, avg_mass = mean(body_mass_g, na.rm = TRUE))
```

```
## # A tibble: 3 x 2
##   species   avg_mass
##   <fct>        <dbl>
## 1 Adelie       3701.
## 2 Chinstrap    3733.
## 3 Gentoo       5076.
```

]
]

<!--
count(penguins, species)
table(penguins$species)

Before we dig deeper into dplyr, it is useful to know the operators in R, because you will need to use operators a lot with dplyr, for example with filter
-->

---

class: inverse, middle

# Operators

---

### Assignment Operators

```
x <- 5   # store a value in a variable
x = 5    # also assigns, but `=` is best kept for passing a value to a function's argument
```

--

```
penguins_species <- group_by(.data = penguins, species)
```

---

### Logical Operators

```
x == y   # is equal (TRUE or FALSE)
x != y   # is not equal (TRUE or FALSE)
x < y    # less than
x <= y   # less than or equal to
y > x    # more than
y >= x   # more than or equal to
```

--

```
penguins_adelie <- filter(.data = penguins, species == "Adelie")
```

--

```
penguins_no_adelie <- filter(.data = penguins, species != "Adelie")
```

--

```
penguins_heavy <- filter(.data = penguins, body_mass_g > 4500)
```

---

### More Logical Operators

```
x | y    # x OR y
x & y    # x AND y
x & !y   # x AND NOT y
```

--

Example use of the `|` operator using extended syntax...

```
penguins_adelie_chin <- filter(.data = penguins, species == "Adelie" | species == "Chinstrap")
```

--

...and using the `%in%` syntax:

```
penguins_adelie_chin <- filter(.data = penguins, species %in% c("Adelie", "Chinstrap"))
```

<!--
ask: compare the two code chunks, what do you notice?:
parentheses
.data
what arguments does filter take
-->

---

### Rewriting code more efficiently

All the code we have seen so far can be rewritten more concisely, and we will explore how to do this later today (hint: with pipes!).

For now, a quick improvement is to pass the data frame directly as the first argument, without spelling out the `.data =` argument name.

--

Compare:

```
# extended syntax
penguins_adelie_chin <- filter(.data = penguins, species %in% c("Adelie", "Chinstrap"))

# more common syntax
penguins_adelie_chin <- filter(penguins, species %in% c("Adelie", "Chinstrap"))
```

---

### In-class Practice: operators with `filter()`

These operators are often used together with the verb `filter()` in `dplyr`.
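
Here is a minimal sketch, in base R, of what these operators return on a plain vector. The `mass` vector is a made-up toy example (not taken from `penguins`); the point is that each comparison yields one `TRUE`, `FALSE`, or `NA` per element, and `filter()` keeps only the rows where the result is `TRUE`.

```r
# Toy vector for illustration only (hypothetical values, not from `penguins`)
mass <- c(3750, 3800, NA, 4600)

heavy  <- mass > 4500                   # one logical value per element
mid    <- mass >= 3000 & mass <= 4000   # AND: both conditions must hold
either <- mass < 3760 | mass > 4500     # OR: at least one condition holds
in_set <- mass %in% c(3750, 4600)       # membership test

print(heavy)
print(mid)
print(either)
print(in_set)
```

Note that a comparison involving `NA` returns `NA`, whereas `%in%` returns `FALSE` for it; `filter()` drops the row in both cases.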
Let's practice using them with our penguins data:

* TASK 1: Get all Adelie penguins with flipper length greater than or equal to 180 mm
* TASK 2: Get all female penguins on Dream island with body mass between 3000 and 4000 g, inclusive
* TASK 3: Get all penguins on the Dream and Torgersen islands that are not female

<!--
I am going to type the first two in R, and you tell me what to type!

### PRACTICE: Using operators with dplyr `filter()`

TASK 1: Get all Adelie penguins with flipper length greater than or equal to 180 mm

```
library(tidyverse)
library(palmerpenguins)
data(penguins)

# filter only
filter(.data = penguins, species == "Adelie" & flipper_length_mm >= 180)

# filter and save in the adelie_fl_180 variable
adelie_fl_180 <- filter(.data = penguins, species == "Adelie" & flipper_length_mm >= 180)

# count
nrow(adelie_fl_180)
count(adelie_fl_180)
```

TASK 2: Get all female penguins on Dream island with body mass between 3000 and 4000 g, inclusive

```
filter(penguins, sex == "female" & island == "Dream" & body_mass_g %in% 3000:4000)
```

OR

```
filter(penguins, sex == "female" & island == "Dream" & between(body_mass_g, 3000, 4000))
```

TASK 3: Get all penguins on the Dream and Torgersen islands that are not female

filter(penguins, island %in% c("Dream", "Torgersen") & sex != "female")
filter(penguins, island %in% c("Dream", "Torgersen") & sex == "male")
filter(penguins, (island == "Dream" | island == "Torgersen") & sex != "female")
-->

---

class: inverse, middle

# Main `dplyr` functions

---

<!--
<img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/main/rstats-artwork/dplyr_wrangling.png" alt="Cartoon of a fuzzy monster with a cowboy hat and lasso, riding another fuzzy monster labeled 'dplyr', lassoing a group of angry / unruly looking creatures labeled 'data.'" width="55%" style="display: block; margin: auto;" />

.footnote[Source of image: [Allison Horst](https://github.com/allisonhorst/stats-illustrations)]
-->

### Recap

**`dplyr` is our main tool for data transformation in R.
Conceptually, we have learned that performing (any) data transformation in `dplyr` requires us to:**

1. Get the data frame
1. Use `dplyr` verbs/functions to tell R what to do with the data frame. These functions:
    * work like grammatical verbs: they define actions to be performed on the data
    * can be combined to perform powerful manipulation tasks
1. Generate a new data frame that holds the results

--

There are many `dplyr` functions!

**Our goal: memorize the most important ones and look up the others in the [dplyr documentation](https://dplyr.tidyverse.org/) as needed...**

So, what are the most important `dplyr` functions?

---

### Main `dplyr` functions

`function()`  | Action performed
--------------|--------------------------------------------------------
`filter()`    | Picks observations from the data frame based on their values (operates on rows)
`arrange()`   | Changes the order of observations, based on their values (operates on rows)
`select()`    | Picks variables from the data frame based on their names (operates on columns)
`rename()`    | Changes the name of columns in the data frame
`mutate()`    | Creates new columns from existing ones
`group_by()`  | Changes the unit of analysis from the complete data frame to individual groups
`summarize()` | Collapses the data frame to a smaller number of rows that summarize the larger data (commonly used with `mean()`, `sum()`, `n_distinct()`, etc.)

---

### American vs. British English

`dplyr` and the rest of the tidyverse accept both spellings, but just for clarity:

* US `summarize()` = UK `summarise()`
* US `color` = UK `colour` (for example, in `ggplot2` arguments)

<!--
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">The holy grail: "For consistency, aim to use British (rather than American) spelling." <a href="https://twitter.com/hashtag/rstats?src=hash">#rstats</a> <a href="http://t.co/7qQSWIowcl">http://t.co/7qQSWIowcl</a>.
Colour is right!</p>— Hadley Wickham (@hadleywickham) <a href="https://twitter.com/hadleywickham/status/405707093770244097">November 27, 2013</a></blockquote>
<script async src="http://platform.twitter.com/widgets.js" charset="utf-8"></script>
-->

---

### In-class Practice: What are the average bill length and body mass for Adelie penguins by sex?

With your colleague(s)...

1. First, THINK: How would you conceptually approach this question? Break down the steps clearly and think them through.
2. Then, ACT: translate these steps into R code using the relevant `dplyr` verbs from the table of main `dplyr` functions

<!--
-->

---

### Solution 1: group by, then filter

```r
penguins_sex <- group_by(.data = penguins, sex)
penguins_sex_adelie <- filter(.data = penguins_sex, species == "Adelie")
summarize(
  .data = penguins_sex_adelie,
  avg_bill = mean(bill_length_mm, na.rm = TRUE),
  avg_mass = mean(body_mass_g, na.rm = TRUE)
)
```

```
## # A tibble: 3 x 3
##   sex    avg_bill avg_mass
##   <fct>     <dbl>    <dbl>
## 1 female     37.3    3369.
## 2 male       40.4    4043.
## 3 <NA>       37.8    3540
```

---

### Solution 2: filter, then group by

```r
penguins_adelie <- filter(.data = penguins, species == "Adelie")
penguins_adelie_sex <- group_by(.data = penguins_adelie, sex)
summarize(
  .data = penguins_adelie_sex,
  avg_bill = mean(bill_length_mm, na.rm = TRUE),
  avg_mass = mean(body_mass_g, na.rm = TRUE)
)
```

```
## # A tibble: 3 x 3
##   sex    avg_bill avg_mass
##   <fct>     <dbl>    <dbl>
## 1 female     37.3    3369.
## 2 male       40.4    4043.
## 3 <NA>       37.8    3540
```

---

### Saving transformed data

.panelset[
.panel[.panel-name[Printed, but not saved]

```r
filter(diamonds, cut == "Ideal")
```

```
## # A tibble: 21,551 x 10
##    carat cut   color clarity depth table price     x     y     z
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.23 Ideal J     VS1      62.8    56   340  3.93  3.9   2.46
##  3  0.31 Ideal J     SI2      62.2    54   344  4.35  4.37  2.71
##  4  0.3  Ideal I     SI2      62      54   348  4.31  4.34  2.68
##  5  0.33 Ideal I     SI2      61.8    55   403  4.49  4.51  2.78
##  6  0.33 Ideal I     SI2      61.2    56   403  4.49  4.5   2.75
##  7  0.33 Ideal J     SI1      61.1    56   403  4.49  4.55  2.76
##  8  0.23 Ideal G     VS1      61.9    54   404  3.93  3.95  2.44
##  9  0.32 Ideal I     SI1      60.9    55   404  4.45  4.48  2.72
## 10  0.3  Ideal I     SI2      61      59   405  4.3   4.33  2.63
## # ... with 21,541 more rows
```

]

.panel[.panel-name[Saved, but not printed]

```r
diamonds_ideal <- filter(.data = diamonds, cut == "Ideal")
```

]

.panel[.panel-name[Saved and printed]

```r
(diamonds_ideal <- filter(.data = diamonds, cut == "Ideal"))
```

```
## # A tibble: 21,551 x 10
##    carat cut   color clarity depth table price     x     y     z
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.23 Ideal J     VS1      62.8    56   340  3.93  3.9   2.46
##  3  0.31 Ideal J     SI2      62.2    54   344  4.35  4.37  2.71
##  4  0.3  Ideal I     SI2      62      54   348  4.31  4.34  2.68
##  5  0.33 Ideal I     SI2      61.8    55   403  4.49  4.51  2.78
##  6  0.33 Ideal I     SI2      61.2    56   403  4.49  4.5   2.75
##  7  0.33 Ideal J     SI1      61.1    56   403  4.49  4.55  2.76
##  8  0.23 Ideal G     VS1      61.9    54   404  3.93  3.95  2.44
##  9  0.32 Ideal I     SI1      60.9    55   404  4.45  4.48  2.72
## 10  0.3  Ideal I     SI2      61      59   405  4.3   4.33  2.63
## # ... with 21,541 more rows
```

]
]

---

class: inverse, middle

# Pipes `%>%` or `|>`

---

### Pipes `%>%` or `|>`

> Pipes in R allow you to write a sequence of multiple operations by **passing the result of one function to another one, in sequence**

Compare these two chunks of code:

```
# without pipes
penguins_adelie <- filter(penguins, species == "Adelie")
penguins_adelie_island <- group_by(penguins_adelie, island)
summarize(penguins_adelie_island, body_mass = mean(body_mass_g, na.rm = TRUE))

# with pipes
penguins %>%
  filter(species == "Adelie") %>%
  group_by(island) %>%
  summarize(body_mass = mean(body_mass_g, na.rm = TRUE))
```

---

### Pipes `%>%` or `|>`

Pipes simplify your code and make your operations more intuitive, BUT they are not the only way to write your R code. In fact, R didn't have pipes for a long time!
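
To make the idea concrete, here is a minimal sketch in base R of what a pipe does: `x |> f(y)` is another way of writing `f(x, y)`. The vector `x` is a made-up example, and the native pipe `|>` assumes R 4.1.0 or later; `%>%` gives the same result here once the tidyverse (or `magrittr`) is loaded.

```r
# Toy input (hypothetical values, for illustration only)
x <- c(1, 4, 9)

# Without pipes: nested calls, read from the inside out
nested <- round(mean(sqrt(x)), 1)

# With the native pipe: the left-hand side becomes the first argument
# of the function on the right-hand side
piped <- x |> sqrt() |> mean() |> round(1)

print(nested)
print(piped)
print(identical(nested, piped))
```

Both versions compute the same value; only the reading order changes.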
*Let's look at ways you can write the same code with and without pipes...*

---

### Pipes `%>%` or `|>`

Imagine we are given the following task: **using the penguins dataset, calculate the average body mass for Adelie penguins on different islands**

First, THINK: break down the problem into smaller steps

1. Filter penguins to only keep observations where `species` is Adelie
1. Group the filtered penguins by `island`
1. Summarize the grouped and filtered penguins by calculating the average body mass

Then, ACT: How do we implement the code?

---

### Option 1: save each step in a new data frame

```r
penguins_adelie <- filter(penguins, species == "Adelie")
penguins_adelie_island <- group_by(penguins_adelie, island)
penguins_final <- summarize(penguins_adelie_island, body_mass = mean(body_mass_g, na.rm = TRUE))
print(penguins_final)
```

```
## # A tibble: 3 × 2
##   island    body_mass
##   <fct>         <dbl>
## 1 Biscoe        3710.
## 2 Dream         3688.
## 3 Torgersen     3706.
```

--

This is valid code.

Drawback: we have to save each intermediate object. This is tedious and error-prone, since we must keep track of which object exists at each step and reference the correct one. Giving the intermediate objects shorter names reduces typos, but short names document the code poorly.

---

### Option 2: replace the original data frame

```r
penguins <- filter(penguins, species == "Adelie")
penguins <- group_by(penguins, island)
(penguins <- summarize(penguins, body_mass = mean(body_mass_g, na.rm = TRUE)))
```

```
## # A tibble: 3 x 2
##   island    body_mass
##   <fct>         <dbl>
## 1 Biscoe        3710.
## 2 Dream         3688.
## 3 Torgersen     3706.
```

--

Instead of creating intermediate objects, we can overwrite the original data frame with the modified one. This is also valid code.

Drawback: if we make an error in the middle of the process, we need to redo the entire operation from the beginning, because we have overwritten the original data. Not the best!
---

### Option 3: function composition

```r
data(penguins)
summarize(group_by(filter(penguins, species == "Adelie"), island), body_mass = mean(body_mass_g, na.rm = TRUE))
```

```
## # A tibble: 3 x 2
##   island    body_mass
##   <fct>         <dbl>
## 1 Biscoe        3710.
## 2 Dream         3688.
## 3 Torgersen     3706.
```

--

This is valid code.

Drawback: it is hard for humans to read (we need to read it from the inside out) and is prone to errors.

---

### Option 4: pipes (the winner!)

```r
penguins %>%
  filter(species == "Adelie") %>%
  group_by(island) %>%
  summarize(body_mass = mean(body_mass_g, na.rm = TRUE))
```

```
## # A tibble: 3 x 2
##   island    body_mass
##   <fct>         <dbl>
## 1 Biscoe        3710.
## 2 Dream         3688.
## 3 Torgersen     3706.
```

--

This is valid and clear code! Notice the clearer syntax and the focus on **actions**, not objects.

Pipes **chain a series of functions together**: they automatically pass the output of each function to the next one as its input, producing code that is easy for humans to read.

<!--
## Piping (`%>%`)

.panelset.sideways[
.panel[.panel-name[No pipes]

```r
by_dest <- group_by(
  .data = flights,
  dest
)
delays <- summarise(
  .data = by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
delays <- filter(
  .data = delays,
  count > 20,
  dest != "HNL"
)
```

]

.panel[.panel-name[With pipes]

```r
delays <- flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(
    count > 20,
    dest != "HNL"
  )
```

]
]
-->

---

### Common Pipe Errors: examples with `flights` data

```r
library(nycflights13)
data(flights)
head(flights)
```

```
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
```

This is a dataset of all flights (n = 336,776) that departed from NYC in 2013.

---

### Common Pipe Errors: Example 1

TASK: Group flights by destination (`dest`), calculate the average arrival delay (`arr_delay`) for each destination, and keep only destinations with more than 20 flights.

*What's wrong with our code?*

--

.pull-left[
#### Invalid code

```
delays <- flights %>%
  by_dest <- group_by(dest) %>%
  delay <- summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  d <- filter(count > 20)
```

]

--

.pull-right[
#### Correct code

```r
delays <- flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20)
```

]

Don't assign anything within the pipe: **do not use `<-` inside the piped operation** for intermediate steps. Only use it at the beginning, if you want to save the output.
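
---

### If you do need an intermediate result

If an intermediate result is genuinely needed, one option is to end the pipe at that step, assign once at the start, and begin a new pipe from the saved object. The sketch below uses a tiny made-up data frame, `toy_flights`, as a stand-in for `flights`, and a threshold of 1 instead of 20 so that the toy data produces output; both are assumptions for illustration only.

```r
library(dplyr)

# Toy stand-in for `flights` (hypothetical values, for illustration only)
toy_flights <- data.frame(
  dest      = c("ORD", "ORD", "ORD", "LAX", "LAX", "HNL"),
  arr_delay = c(10, NA, 20, 5, 15, 30),
  stringsAsFactors = FALSE
)

# First pipe: stop at the step you want to keep; `<-` appears only at the start
by_dest <- toy_flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)
  )

# Second pipe: continue from the saved object
delays <- by_dest %>%
  filter(count > 1)

print(delays)
```

Each pipe still has a single assignment at its beginning, so the rule from Example 1 holds.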
---

### Common Pipe Errors: Example 2

.pull-left[
#### Invalid code

```
delays <- flights %>%
  group_by(dest)
  summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE))
  filter(count > 20)
```

]

--

.pull-right[
#### Correct code

```r
delays <- flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20)
```

]

Remember to **add the pipe at the end of every step in the piped operation except the last one.**

---

### Common Pipe Errors: Example 3

.pull-left[
#### Invalid code

```
delays <- flights %>%
  group_by(.data = flights, dest) %>%
  summarize(.data = flights,
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(.data = flights, count > 20)
```

]

--

.pull-right[
#### Correct code

```r
delays <- flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20)
```

]

When using pipes, **only reference the data frame once at the beginning of the pipe sequence**; you don't need to reference it with each function.

---

### Common Pipe Errors: Example 4

.pull-left[
#### Invalid code

```
delays <- flights +
  group_by(dest) +
  summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)
  ) +
  filter(count > 20)
```

]

--

.pull-right[
#### Correct code

```r
delays <- flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20)
```

]

**Don't use the `+` sign:** we are not adding layers to build a graph as in `ggplot2`. Instead, we are using multiple `dplyr` functions to transform data.

---

class: inverse, middle

## Practice using `dplyr`, pipes, and operators

Download today's in-class exercises from the website.