MACS 30500 LECTURE 3

---

# Agenda

* Intro to `dplyr` and to Programming as Problem-Solving

* Operators

* Main `dplyr` functions

* Pipes, which are written as `%>%` or `|>`

---

# Intro to `dplyr` and to Programming as Problem-Solving

---

## `penguins`

[Meet the Palmer Penguins!](https://allisonhorst.github.io/palmerpenguins/#meet-the-palmer-penguins)

The palmerpenguins package (already installed on Workbench) contains two datasets:

* `penguins`: clean data on three species of penguins (Adelie, Chinstrap, Gentoo) from three islands in Antarctica; 344 penguins in total

* `penguins_raw`: raw data

Today we will be using the first of the two: `penguins`

---

## `penguins`

```r
library(tidyverse)
library(palmerpenguins)
head(penguins)
```

```
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>
```

If I ask "**What is the average body mass of Adelie penguins?**", think about (1) the *conceptual steps* you need to answer this question, and (2) how to *translate them into `dplyr` code*.

---

### Q1: What is the average body mass of an Adelie penguin?

1. First, we need to get the **input** data: `penguins`
1. Next, we need to **filter** only the observations classified as `species` Adelie
1. Finally, we need to calculate the **mean** of the variable `body_mass_g` for this group

```
## # A tibble: 5 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## # ℹ 2 more variables: sex <fct>, year <int>
```


```r
data(penguins)
penguins_adelie <- filter(.data = penguins, species == "Adelie")
summarize(.data = penguins_adelie, avg_mass_adelie = mean(body_mass_g, na.rm = TRUE))
```

```
## # A tibble: 1 × 1
##   avg_mass_adelie
##             <dbl>
## 1           3701.
```


---

### Q2: What is the average body mass for each penguin species?

1. First, we need to get the **input** data: `penguins`
1. Next, we need to **group** the observations by `species`
1. Finally, we need to calculate the **mean** of the variable `body_mass_g` for all groups


```r
data(penguins)
penguins_species <- group_by(.data = penguins, species)
summarize(.data = penguins_species, avg_mass = mean(body_mass_g, na.rm = TRUE))
```

```
## # A tibble: 3 x 2
##   species   avg_mass
##   <fct>        <dbl>
## 1 Adelie       3701.
## 2 Chinstrap    3733.
## 3 Gentoo       5076.
```


<!--
count(penguins, species)
table(penguins$species)

Before we dig deeper into dplyr useful to know operator in R because you will need to use operators a lot with dplyr, for example with filter 
-->

---

# Operators

---

### Assignment Operators
```
x <- 5         # store a value in a variable (preferred for assignment)
mean(x = 5)    # `=` passes a value to a function's argument
```

```
penguins_species <- group_by(.data = penguins, species)
```
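
To make the difference concrete, here is a minimal base R sketch (the values are made up for illustration). It shows that `<-` creates an object, while `=` inside a function call only names an argument and leaves no object behind.

```r
# `<-` creates (or overwrites) a variable in your environment
x <- 5

# `=` inside a call names an argument of the function being called.
# Here `na.rm` is an argument of mean(), not a new variable.
avg <- mean(c(1, 2, NA), na.rm = TRUE)

avg               # 1.5
exists("na.rm")   # FALSE: no object called `na.rm` was created
```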

---

### Logical Operators
```
x == y    # is equal (TRUE or FALSE)
x != y    # is not equal (TRUE or FALSE)
x < y     # less than
x <= y    # less than or equal to
y > x     # greater than
y >= x    # greater than or equal to
```

```
penguins_adelie <- filter(.data = penguins, species == "Adelie")
```

```
penguins_no_adelie <- filter(.data = penguins, species != "Adelie")
```

```
penguins_heavy <- filter(.data = penguins, body_mass_g > 4500)
```
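
These operators work element by element on whole columns, which is what `filter()` relies on: it keeps only the rows where the comparison is `TRUE`. A small base R sketch with invented body masses (not taken from `penguins`) shows the idea, including what happens with a missing value.

```r
# Hypothetical body masses in grams (illustrative values only)
mass <- c(3750, 4800, NA, 4500)

mass > 4500    # FALSE  TRUE    NA FALSE
mass >= 4500   # FALSE  TRUE    NA  TRUE
mass == 3750   #  TRUE FALSE    NA FALSE
mass != 3750   # FALSE  TRUE    NA  TRUE
```

A comparison with `NA` returns `NA` rather than `TRUE`, so `filter()` drops those rows.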

---

### More Logical Operators
```
x | y     # x OR y
x & y     # x AND y
x & !y    # x AND NOT y
```

Example use of the `|` operator using extended syntax...
```
penguins_adelie_chin <- filter(.data = penguins, 
                                species == "Adelie" | species == "Chinstrap")
```

...and using the `%in%` syntax:
```
penguins_adelie_chin <- filter(.data = penguins, species %in% c("Adelie", "Chinstrap"))
```
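
To see what these operators return before using them inside `filter()`, here is a short base R sketch on two invented vectors (the names `is_adelie`, `on_dream`, `either_or`, and `with_in` are made up for this example). It also checks that `%in%` gives the same answer as a chain of `==` joined by `|`.

```r
# Hypothetical values for illustration
species <- c("Adelie", "Chinstrap", "Gentoo", "Adelie")
island  <- c("Dream", "Dream", "Biscoe", "Torgersen")

is_adelie <- species == "Adelie"   #  TRUE FALSE FALSE  TRUE
on_dream  <- island == "Dream"     #  TRUE  TRUE FALSE FALSE

is_adelie | on_dream     #  TRUE  TRUE FALSE  TRUE
is_adelie & on_dream     #  TRUE FALSE FALSE FALSE
is_adelie & !on_dream    # FALSE FALSE FALSE  TRUE

# %in% is shorthand for several == comparisons joined by |
either_or <- species == "Adelie" | species == "Chinstrap"
with_in   <- species %in% c("Adelie", "Chinstrap")
identical(either_or, with_in)   # TRUE
```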

---

### Rewriting code more efficiently

All the code we have seen so far can be rewritten more efficiently, and we will explore how to do this later today (hint: with pipes!).

For now, a quick improvement is to pass the data name directly instead of specifying the argument name.

Compare:

```
# extended syntax
penguins_adelie_chin <- filter(.data = penguins, species %in% c("Adelie", "Chinstrap"))
                
# more common syntax
penguins_adelie_chin <- filter(penguins, species %in% c("Adelie", "Chinstrap"))
```

---

### In-class Practice: operators with `filter()`

These operators are often used together with the verb `filter()` in `dplyr`. Let's practice using them with our penguins data:

* TASK 1: Get all Adelie penguins with flipper length greater than or equal to 180 mm

* TASK 2: Get all female penguins on Dream island with body mass between 3000 and 4000 g (inclusive)

* TASK 3: Get all penguins on the Dream and Torgersen islands that are not female

<!--

I am going to type that in R the first two, and you tell me what to type!

### PRACTICE: Using operators with dplyr `filter()`

TASK 1: Get all Adelie penguins with flipper length longer or equal than 180

```
library(palmerpenguins)
data(penguins)

# filter only
filter(.data = penguins, species == "Adelie" & flipper_length_mm >= 180)

# filter and save in the adelie_fl_180 variable
adelie_fl_180 <- filter(.data = penguins, species == "Adelie" & flipper_length_mm >= 180)

# count
nrow(adelie_fl_180)
count(adelie_fl_180)
```

TASK 2: Select all female penguins on the Dream island with body mass between 3000 and 4000 included

```
filter(penguins, sex == "female" & island == "Dream" & body_mass_g %in% 3000:4000) 
```
OR

```
filter(penguins, sex == "female" & island == "Dream" & between(body_mass_g, 3000, 4000))
```

TASK 3: Get all penguins on the Dream and Torgersen islands that are not female

filter(penguins, island %in% c("Dream", "Torgersen") & sex != "female")
filter(penguins, island %in% c("Dream", "Torgersen") & sex == "male")

filter(penguins, (island == "Dream" | island == "Torgersen") & sex != "female")
-->

---

# Main `dplyr` functions

---

<!--
<img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/main/rstats-artwork/dplyr_wrangling.png" alt="Cartoon of a fuzzy monster with a cowboy hat and lasso, riding another fuzzy monster labeled 'dplyr', lassoing a group of angry / unruly looking creatures labeled 'data.'" width="55%" style="display: block; margin: auto;" />
-->

### Recap

**`dplyr` is our main tool for data transformation in R. Conceptually, we have learned that performing (any) data transformation in `dplyr` requires us to:**

1. Get the data frame

1. Use `dplyr` verbs/functions to tell R what to do with the data frame. These functions: 
  * work like grammatical verbs: define actions to be performed on the data
  * these verbs can be combined together to perform powerful manipulation tasks

1. Generate a new data frame that holds the results

There are many `dplyr` functions! **Our goal: memorize the most important ones and look up (here: https://dplyr.tidyverse.org/) the others as needed...** So, what are the most important `dplyr` functions?

---

### Main `dplyr` functions

`function()`  | Action performed
--------------|--------------------------------------------------------
`filter()`    | Picks observations from the data frame based on their values (operates on rows)
`arrange()`   | Changes the order of observations, based on their values (operates on rows)
`select()`    | Picks variables from the data frame based on their names (operates on columns)
`rename()`    | Changes the name of columns in the data frame
`mutate()`    | Creates new columns from existing ones
`group_by()`  | Changes the unit of analysis from the complete data frame to individual groups
`summarize()` | Collapses the data frame to a smaller number of rows to summarize the larger data (commonly used with `mean()`, `sum()`, `n_distinct()`, etc.)
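
Four of these verbs have not appeared in the examples so far: `arrange()`, `select()`, `rename()`, and `mutate()`. The sketch below tries each one on a tiny invented tibble (`birds` and its values are hypothetical, used only so the results are easy to check by eye).

```r
library(dplyr)

# Hypothetical mini data set (values invented for illustration)
birds <- tibble(
  species     = c("Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3700, 5000, 3750),
  year        = c(2007, 2008, 2009)
)

# arrange(): reorder rows, heaviest first
arrange(birds, desc(body_mass_g))

# select(): keep only some columns
select(birds, species, body_mass_g)

# rename(): the pattern is new_name = old_name
rename(birds, mass_g = body_mass_g)

# mutate(): add a column computed from an existing one
mutate(birds, body_mass_kg = body_mass_g / 1000)
```

None of these calls changes `birds` itself; each returns a new data frame.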

---

### American vs. British English

The tidyverse accepts both spellings, but just for clarity:

* US `summarize()` = UK `summarise()` (in `dplyr`)

* US `color` = UK `colour` (an argument name in `ggplot2`, not a `dplyr` function)
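
As a quick check, the sketch below runs both spellings on a small invented tibble (`df` and its values are hypothetical) and compares the results.

```r
library(dplyr)

# Hypothetical values for illustration
df <- tibble(mass = c(3700, 5000, 3750))

us <- summarize(df, avg_mass = mean(mass))
uk <- summarise(df, avg_mass = mean(mass))

identical(us, uk)   # TRUE: both spellings give the same result
```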

---

### In-class Practice: What is the average bill length and body mass for Adelie penguins by sex?

With your colleague(s)...

1. First, THINK: How would you conceptually approach this question? Break down the steps clearly and think them through.

2. Then, ACT: translate these steps into R code using the relevant `dplyr` verbs from the previous slide

---

### Solution 1: group by, then filter

```r
penguins_sex <- group_by(.data = penguins, sex)
penguins_sex_adelie <- filter(.data = penguins_sex, species == "Adelie")
summarize(
  .data = penguins_sex_adelie,
  avg_bill = mean(bill_length_mm, na.rm = TRUE),
  avg_mass = mean(body_mass_g, na.rm = TRUE)
)
```

```
## # A tibble: 3 x 3
##   sex    avg_bill avg_mass
##   <fct>     <dbl>    <dbl>
## 1 female     37.3    3369.
## 2 male       40.4    4043.
## 3 <NA>       37.8    3540
```

---

### Solution 2: filter, then group by

```r
penguins_adelie <- filter(.data = penguins, species == "Adelie")
penguins_adelie_sex <- group_by(.data = penguins_adelie, sex)
summarize(
  .data = penguins_adelie_sex,
  avg_bill = mean(bill_length_mm, na.rm = TRUE),
  avg_mass = mean(body_mass_g, na.rm = TRUE)
)
```

```
## # A tibble: 3 x 3
##   sex    avg_bill avg_mass
##   <fct>     <dbl>    <dbl>
## 1 female     37.3    3369.
## 2 male       40.4    4043.
## 3 <NA>       37.8    3540
```
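
Both solutions give the same answer because the filter condition (`species == "Adelie"`) looks at each row on its own, so it does not matter whether the data are already grouped. The sketch below checks this on a small invented tibble (`birds` and its values are hypothetical); if the filter compared rows to a group-level value such as `mean(mass)`, the order could matter.

```r
library(dplyr)

# Hypothetical mini data set (values invented for illustration)
birds <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo"),
  sex     = c("female", "male", "female", "male"),
  mass    = c(3400, 4000, 3300, 5500)
)

# Order A: group first, then filter
by_sex        <- group_by(birds, sex)
by_sex_adelie <- filter(by_sex, species == "Adelie")
result_a      <- summarize(by_sex_adelie, avg_mass = mean(mass))

# Order B: filter first, then group
adelie        <- filter(birds, species == "Adelie")
adelie_by_sex <- group_by(adelie, sex)
result_b      <- summarize(adelie_by_sex, avg_mass = mean(mass))

all.equal(result_a, result_b)   # TRUE
```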

---

### Saving transformed data

```r
filter(diamonds, cut == "Ideal")
```

```
## # A tibble: 21,551 x 10
##    carat cut   color clarity depth table price     x     y     z
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.23 Ideal J     VS1      62.8    56   340  3.93  3.9   2.46
##  3  0.31 Ideal J     SI2      62.2    54   344  4.35  4.37  2.71
##  4  0.3  Ideal I     SI2      62      54   348  4.31  4.34  2.68
##  5  0.33 Ideal I     SI2      61.8    55   403  4.49  4.51  2.78
##  6  0.33 Ideal I     SI2      61.2    56   403  4.49  4.5   2.75
##  7  0.33 Ideal J     SI1      61.1    56   403  4.49  4.55  2.76
##  8  0.23 Ideal G     VS1      61.9    54   404  3.93  3.95  2.44
##  9  0.32 Ideal I     SI1      60.9    55   404  4.45  4.48  2.72
## 10  0.3  Ideal I     SI2      61      59   405  4.3   4.33  2.63
## # ... with 21,541 more rows
```


```r
diamonds_ideal <- filter(.data = diamonds, cut == "Ideal")
```


```r
(diamonds_ideal <- filter(.data = diamonds, cut == "Ideal"))
```


---

# Pipes `%>%` or `|>`

---

### Pipes `%>%`  or `|>`

> Pipes in R allow you to write a sequence of multiple operations by **passing the result of one function to another one, in sequence**

Compare these two chunks of code:
```
# without pipes
penguins_adelie <- filter(penguins, species == "Adelie")
penguins_adelie_island <- group_by(penguins_adelie, island)
summarize(penguins_adelie_island, body_mass = mean(body_mass_g, na.rm = TRUE))

# with pipes
penguins %>%
  filter(species == "Adelie") %>%
  group_by(island) %>%
  summarize(body_mass = mean(body_mass_g, na.rm = TRUE))
```

---

### Pipes `%>%`  or `|>`

Pipes simplify your code and make your operations more intuitive, BUT they are not the only way to write your R code.

In fact, R didn't have pipes for a long time!

*Let's look at ways you can write the same code with and without pipes...*

---

### Pipes `%>%`  or `|>`

Imagine we are given the following task:
**using the penguins dataset, calculate the average body mass for Adelie penguins on different islands**

First, THINK: break down the problem into smaller steps
1. Filter penguins to only keep observations where `species` is Adelie
1. Group the filtered penguins by `island`
1. Summarize the grouped and filtered penguins by calculating the average body mass

Then, ACT: How do we implement the code?

---

### Option 1: save each step in a new data frame

```r
penguins_adelie <- filter(penguins, species == "Adelie")
penguins_adelie_island <- group_by(penguins_adelie, island)
penguins_final <- summarize(penguins_adelie_island, 
                            body_mass = mean(body_mass_g, na.rm = TRUE))
print(penguins_final)
```

```
## # A tibble: 3 × 2
##   island    body_mass
##   <fct>         <dbl>
## 1 Biscoe        3710.
## 2 Dream         3688.
## 3 Torgersen     3706.
```

This is valid code. Drawback: we have to save each intermediate object, which is tedious and error-prone because we must keep track of which object holds the data at each step and reference the correct one. Shorter names for the intermediate objects would reduce typos, but they would make the code less self-documenting.

---

### Option 2: replace the original data frame

```r
penguins <- filter(penguins, species == "Adelie")
penguins <- group_by(penguins, island)
(penguins <- summarize(penguins, body_mass = mean(body_mass_g, na.rm = TRUE)))
```

```
## # A tibble: 3 x 2
##   island    body_mass
##   <fct>         <dbl>
## 1 Biscoe        3710.
## 2 Dream         3688.
## 3 Torgersen     3706.
```

Instead of creating intermediate objects, we can overwrite the original data frame with the modified one. This is also valid code. Drawback: if we make an error in the middle of the process, we need to re-do the entire operation from the beginning, because we are writing over the original data. Not the best!

---

### Option 3: function composition

```r
data(penguins)
summarize(group_by(filter(penguins, species == "Adelie"), island), 
          body_mass = mean(body_mass_g, na.rm = TRUE))
```

```
## # A tibble: 3 x 2
##   island    body_mass
##   <fct>         <dbl>
## 1 Biscoe        3710.
## 2 Dream         3688.
## 3 Torgersen     3706.
```

This is valid code. Drawback: it is hard for humans to read (we need to read it from the inside out) and is prone to errors.

---

### Option 4: pipes (the winner!)

```r
penguins %>%
  filter(species == "Adelie") %>%
  group_by(island) %>%
  summarize(body_mass = mean(body_mass_g, na.rm = TRUE))
```

```
## # A tibble: 3 x 2
##   island    body_mass
##   <fct>         <dbl>
## 1 Biscoe        3710.
## 2 Dream         3688.
## 3 Torgersen     3706.
```

This is valid and clear code! Notice the clearer syntax and the focus on **actions**, not objects.

Pipes **chain a series of functions together**: they automatically pass the output of each function to the next one as its input, producing code that is easy for humans to read.
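
The examples above use the `%>%` pipe, which is loaded with the tidyverse. The native pipe `|>` is built into R from version 4.1.0 onward and can be used the same way in simple chains like this one. The sketch below runs the same steps with both pipes on a small invented tibble (`birds` and its values are hypothetical) and compares the results.

```r
library(dplyr)

# Hypothetical mini data set (values invented for illustration)
birds <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo"),
  island  = c("Dream", "Dream", "Biscoe", "Biscoe"),
  mass    = c(3400, 4000, 3700, 5500)
)

# magrittr pipe
with_magrittr <- birds %>%
  filter(species == "Adelie") %>%
  group_by(island) %>%
  summarize(body_mass = mean(mass))

# native pipe (assumes R >= 4.1.0)
with_native <- birds |>
  filter(species == "Adelie") |>
  group_by(island) |>
  summarize(body_mass = mean(mass))

identical(with_magrittr, with_native)   # TRUE
```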

<!--

## Piping (`%>%`)

.panelset.sideways[
.panel[.panel-name[No pipes]

```r
by_dest <- group_by(
  .data = flights,
  dest
)

delays <- summarise(
  .data = by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)

delays <- filter(
  .data = delays,
  count > 20,
  dest != "HNL"
)
```

]

```r
delays <- flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(
    count > 20,
    dest != "HNL"
  )
```

]

]
-->

---

### Common pipe errors: examples with `flights` data

```r
library(nycflights13)
data(flights)
head(flights)
```

```
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
```

This is a dataset of all flights (n = 336,776) that departed from NYC in 2013.

---

### Common Pipe Errors: Example 1

TASK: Group flights by destination (`dest`), calculate their average delay (`arr_delay`), and keep only destinations with more than 20 flights. *What's wrong with our code?*

#### Invalid code

```
delays <- flights %>% 
  by_dest <- group_by(dest) %>% 
  delay <- summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  d <- filter(count > 20)
```


#### Correct code

```r
delays <- flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20)
```


Don't assign anything within the pipe: **do not use `<-` inside the piped operation** for intermediate steps. Only use it at the beginning, if you want to save the output.

---

### Common Pipe Errors: Example 2

#### Invalid code

```
delays <- flights %>%
  group_by(dest)
  summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE))
  filter(count > 20)
```


#### Correct code

```r
delays <- flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)
    ) %>%
  filter(count > 20)
```


Remember to **add the pipe at the end of each line involved in the piped operation**, except the last one.

---

### Common Pipe Errors: Example 3

#### Invalid code

```
delays <- flights %>% 
  group_by(.data = flights, dest) %>% 
  summarize(.data = flights,
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(.data = flights, count > 20)
```

#### Correct code

```r
delays <- flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20)
```


When using pipes, **only reference the data frame once at the beginning of the pipe sequence**; you don't need to reference it with each function.
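
The reason is that the pipe places whatever is on its left-hand side into the first argument of the function on its right. A minimal sketch on an invented tibble (`trips` and its values are hypothetical, not taken from `flights`) shows that the piped and non-piped calls are equivalent.

```r
library(dplyr)

# Hypothetical mini data set (values invented for illustration)
trips <- tibble(
  dest  = c("ORD", "ORD", "LAX"),
  delay = c(10, 30, -5)
)

# The pipe inserts its left-hand side as the first argument...
piped <- trips %>% filter(delay > 0)

# ...so it is equivalent to writing the data frame there yourself
direct <- filter(trips, delay > 0)

identical(piped, direct)   # TRUE
```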

---

### Common Pipe Errors: Example 4

#### Invalid code

```
delays <- flights +
  group_by(dest) +
  summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)
    ) +
  filter(count > 20)
```

#### Correct code

```r
delays <- flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20)
```


**Don't use the `+` sign:** we are not adding layers to build a graph as in `ggplot2`. Instead, we are using multiple `dplyr` functions to transform data.

---

## Practice using `dplyr`, pipes, and operators

Download today's in-class exercises from the website.