MACS 30500 LECTURE 10

---

# Agenda
  
  * What are Control Structures? Definition and Main Control Structures
  
  * For loops
  
  * Alternatives to for loops in R

  * While loops
  
<!-- NOTES TO IMPROVE CURRENT LOOP SLIDES

Add a few slides that show the differences between looping over 
indexes and looping over elements, and teach that concept from the slides; then leave the in-class demo code with the key and let students explore it in teams and go around for questions. Total time: 10 min slides, 10 min going over code, 5 min review

In the demo for loop: add code on data structure, e.g. show how to access columns of a df and their elements with the double and single square brackets

Show difference of accessing single column in a df (that is a vector so use []) or elements of columns for which you need to use [[]]

When you teach for loops add the break and continue statements (currently not in slides!)
-->

---

## What are Control Structures?

---

### Control Structures: Definition

All code we have written so far can be seen as a finite and fixed sequence of commands.

What's next?

**Control structures allow us to change the flow of execution in our code. By incorporating logical conditions, they enable different lines of code to run based on the given conditions.**

This approach differs from executing the same code in the same way each time, like we have done so far in this course!

---

### Main Control Structures

* **Conditional statements**: test one or more conditions and act on the result

* **`for` loop**: execute a block of code for a fixed number of times

* **`while` loop**: execute a block of code while a given condition is true, and stop once the condition evaluates to false

---

## For loops
 
 * For loop definition and simple examples
 * Demo example: same task with and without a for loop on a dataframe

---

### Definition of for loops

For loops are the most common looping construct in many programming languages. They **iterate over the elements of an object** (usually a list or vector) and do something with each one of them.

Syntax:
```
for (item in sequence of items) {
  statement(s)    
}
```

Example:
```
for (item in c(1:3)) {
  print(item)
}
```

---

### For loop example #1

```r
for (item in c(1:3)) {
  print(item)
}
```

```
## [1] 1
## [1] 2
## [1] 3
```

Let's unpack this example:

* the statement executed here is simple: we print `item` using the `print()` function

* `item` is a placeholder: during each iteration of the loop, its value changes to the next element of the sequence

* the number of times the statement block repeats depends on the number of items in the sequence of numbers provided: in this example, three times

* `item` can be given any name you like: R does not care, as long as you use the same name inside the loop body

---

### For loop example #2

For loops can be nested. In this case, the **outer loop determines the number of complete iterations of the inner loop**: for each iteration of the outer loop, the inner loop runs in full, once for every element of its own sequence.

```
for (i in c(1:3)) {
  print(i)
  for (j in c("cat", "dog", "squirrel", "rabbit")) {
    print(j)
  }
}
```

What will this nested loop output?
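
One way to check your reasoning is to count how many times the inner body runs instead of printing. This is a minimal sketch; the names `animals` and `inner_runs` are introduced here only for illustration.

```r
animals <- c("cat", "dog", "squirrel", "rabbit")
inner_runs <- 0

for (i in 1:3) {
  for (j in animals) {
    # count one execution of the inner body
    inner_runs <- inner_runs + 1
  }
}

# 3 outer iterations, each running the inner loop over 4 animals
inner_runs
```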

---

### For loop example #3

```
for (i in c(1:3)) {
  print(i)
  print("Hello")
  sum <- i + 100
  print(sum)
}
```

What will this loop output?

---

### Save for loop output: rewrite example #3

In these examples, we are not saving the output of our for loop: we are only printing it. However, in practice, we usually want to save the results. We can rewrite the previous example to store the results of the one operation we are doing (sum), like this:

```r
# i + 100 is a double (100 is a double literal), so pre-allocate a double vector
output <- vector(mode = "double", length = length(c(1:3)))

for (i in c(1:3)) {
  output[i] <- i + 100
}

output
```

```
## [1] 101 102 103
```

```r
length(output)
```

```
## [1] 3
```

---

### Back to the definition

For loops are the most common looping construct in many programming languages. They **iterate over the elements of an object** (usually a list or vector) and do something with each one of them.

For loops include three components:

* output: to store results (best practice: pre-allocate the output with a given length)

* sequence: what goes in the loop

* body: statements/actions to be executed every time through the loop
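
The three components can be labeled in a small loop. This is only a sketch on made-up data (`values` and `squares` are illustrative names). It loops over positions with `seq_along()`, which is what lets us fill the matching slot of the output.

```r
values <- c(10, 20, 30)

# output: pre-allocate a double vector with one slot per input element
squares <- vector(mode = "double", length = length(values))

# sequence: seq_along(values) yields the positions 1, 2, 3
for (i in seq_along(values)) {
  # body: read the input at position i and fill the same position of the output
  squares[[i]] <- values[[i]]^2
}

squares

# looping over the elements themselves gives the values, but not their positions
for (v in values) {
  print(v)
}
```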

---

### Same task *without* and *with* a for loop on a data frame

Get the data:

```r
library(tidyverse)
library(palmerpenguins)
data(penguins)
```

Transform the data and save in a new data frame:

```r
penguins_clean <- select(penguins, 3:6) %>% drop_na()
head(penguins_clean, n = 3)
```

```
## # A tibble: 3 × 4
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##            <dbl>         <dbl>             <int>       <int>
## 1           39.1          18.7               181        3750
## 2           39.5          17.4               186        3800
## 3           40.3          18                 195        3250
```

---

### Calculate the mean value of each column *without a loop*

To calculate the mean value of each column of this toy data frame, we can take the `mean()` function and apply it to each column. Recall that we can use the `$` operator to access columns within a data frame.

```
mean(penguins_clean$bill_length_mm)
mean(penguins_clean$bill_depth_mm)
mean(penguins_clean$flipper_length_mm)
mean(penguins_clean$body_mass_g)
```

This works, but is a lot of copying/pasting...

---

### We can do the same but *with a loop*

First, initialize an empty vector to store results. 
Second, apply a for loop to each column of this toy dataframe.

```r
output <- vector(mode = "double", length = ncol(penguins_clean))

for (i in seq_along(penguins_clean)) {
  output[[i]] <- mean(penguins_clean[[i]])
}
output
```

```
## [1]   43.92193   17.15117  200.91520 4201.75439
```

### Let's unpack this example: open the **demo.Rmd** from today's class materials!

<!--

### Benefits of preallocation

This explains why we are pre-allocating in the first place, and why we do so with a vector:
having an object that is already the same length as the output, where we just plug in individual values, is faster than the more naive approach in which we store results in an empty vector or another empty object (e.g. a dataframe) of length zero and then append each value as we calculate it

For example, let's take this mpg data (built in dataframe in R about auto, we do not really care about the content of the data); here what we are doing is creating duplicates of that dataframe 100 times and we are then putting them together into a single data frame.

Without preallocation: we can create an empty dataframe (here with the tibble function), iterate over 100 times, take this empty dataframe and combine the rows of it with the rows of the original dataframe, and replace the original object with the new copy and save in output (so we are appending 100 rows every time we iterate!)

If we do proper preallocation: we create a list of 100 empty elements, every time we store the results in the list, then we use the bind_rows() functions at the end

The first approach does not preallocate space to store the output; the second does. See the difference in execution time: from 80 milliseconds to less than 3. So you can see how inefficient it is not to preallocate, since most of our data will have more than 100 rows!

.panelset[
.panel[.panel-name[Code]
```r
# no preallocation
mpg_no_preall <- tibble()
for(i in 1:100){
  mpg_no_preall <- bind_rows(mpg_no_preall, mpg)
}

# with preallocation using a list
mpg_preall <- vector(mode = "list", length = 100)
for(i in 1:100){
  mpg_preall[[i]] <- mpg
}
mpg_preall <- bind_rows(mpg_preall)
```
]

.panel[.panel-name[Plot]
<img src="index_files/figure-html/unnamed-chunk-6-1.png" width="70%" style="display: block; margin: auto;" />
]
]
-->

---

## Alternatives to for loops in R

Writing loops over a data frame is possible, and sometimes advisable. However, R provides alternatives to for loops that are *generally* better suited to data frames:

* Iteration with `map_*()` functions
* Iteration with `across()`

---

## Map functions

**In R `for` loops are good, but `map()` functions may be even better!**

These functions come from the `purrr` package in R: https://purrr.tidyverse.org/

There are **different `map()` functions**, and each creates a different type of output (this is the same idea as specifying the `mode` of the output vector in a `for` loop):

- `map()` makes a list
- `map_lgl()` makes a logical vector
- `map_int()` makes an integer vector
- `map_dbl()` makes a double vector
- `map_chr()` makes a character vector
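
A quick way to see the difference is to run each variant on the same small input. This is a sketch on a made-up list (`scores` and the result names are illustrative only). Each call applies one function to every element and returns the output type its name promises.

```r
library(purrr)

# made-up list used only for illustration
scores <- list(a = c(1, 2, 3), b = c(4, 5), c = 6)

as_list <- map(scores, length)               # list
as_int  <- map_int(scores, length)           # integer vector
as_dbl  <- map_dbl(scores, mean)             # double vector
as_lgl  <- map_lgl(scores, ~ length(.x) > 1) # logical vector
as_chr  <- map_chr(scores, ~ class(.x))      # character vector

as_int
as_dbl
```

If a function returns a value of the wrong type (for example, a character string inside `map_dbl()`), the call fails with an error instead of silently converting the result.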

---

## Map functions

Let's see a few examples using the same `penguins_clean` dataset we have been using so far:

```r
penguins_clean <- select(penguins, 3:6) %>% drop_na()

head(penguins_clean, n = 3)
```

---

## Map functions

Pick the appropriate `map()` function and specify at least two main arguments (for more options check the documentation!): 
* what you are iterating over
* what you are calculating

```r
map_dbl(penguins_clean, median)
```

```
##    bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
##             44.45             17.30            197.00           4050.00
```

```r
map_dbl(penguins_clean, mean, na.rm = TRUE)
```

```
##    bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
##          43.92193          17.15117         200.91520        4201.75439
```

---

## Map functions

We can also use `map()` functions with the `%>%` operator:

```r
penguins_clean %>%
  map_dbl(mean, na.rm = TRUE)
```

```
##    bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
##          43.92193          17.15117         200.91520        4201.75439
```

---

## Across function

We’ve seen how to use loops and `map()` functions to solve the same task. **Let's review one final approach: the `across()` function.**

Notice:
* `across()` comes from the `dplyr` package
* `map()` functions come from the `purrr` package

Advantages:
* `across()` makes it easy to apply the same transformation to multiple columns in a data frame
* since it comes from `dplyr`, it is well integrated with `dplyr` verbs!

---

### Single column

Using the `dplyr` verb `summarize()`, we can easily calculate the mean of one column without the help of `map()` or `across()`:

```r
penguins_clean %>%
  summarize(mean_bill_length = mean(bill_length_mm))
```

```
## # A tibble: 1 × 1
##   mean_bill_length
##              <dbl>
## 1             43.9
```

---

### Multiple columns

We can extend the same operation to multiple columns, as follows:

```r
penguins_clean %>%
  summarize(
    bill_length = mean(bill_length_mm),
    bill_depth = mean(bill_depth_mm),
    flipper_length = mean(flipper_length_mm),
    body_mass = mean(body_mass_g)
  )
```

```
## # A tibble: 1 × 4
##   bill_length bill_depth flipper_length body_mass
##         <dbl>      <dbl>          <dbl>     <dbl>
## 1        43.9       17.2           201.     4202.
```

This works... but we can do this same operation with less repetition using `across()`

---

### `dplyr::across()`

`across()` has two main arguments:

* `.cols`: columns you want to operate on; can select columns by position, name, and type
* `.fns`: a function or list of functions to apply to each column

We now examine a few examples of `across()` in conjunction with its favorite verb, `summarize()`.

Check the documentation for more: https://dplyr.tidyverse.org/reference/across.html
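
The three ways of filling `.cols` mentioned above (position, name, type) can be compared on a small table. This is a sketch on made-up data (`toy` and its columns are illustrative only); all three calls select the same two numeric columns.

```r
library(dplyr)

# made-up data used only for illustration
toy <- tibble(
  id = c("a", "b", "c"),
  height_cm = c(150, 160, 170),
  weight_kg = c(50, 60, 70)
)

# select columns by position
by_position <- toy %>% summarize(across(2:3, mean))

# select columns by name
by_name <- toy %>% summarize(across(c(height_cm, weight_kg), mean))

# select columns by type
by_type <- toy %>% summarize(across(where(is.numeric), mean))

by_type
```

Selecting by type or by name is usually safer than selecting by position, because it keeps working if the column order changes.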

---

### `summarize()`, `across()`, and `everything()`

```r
# calculate the mean of all columns, use everything() to select all variables
penguins_clean %>%
  summarize(across(everything(), 
                   ~ mean(., na.rm = TRUE)))
```

```
## # A tibble: 1 × 4
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##            <dbl>         <dbl>             <dbl>       <dbl>
## 1           43.9          17.2              201.       4202.
```

```r
# to apply multiple summaries, store the functions in a list
penguins_clean %>%
  summarize(across(everything(),
                   .fns = list(min, max)))
```

```
## # A tibble: 1 × 8
##   bill_length_mm_1 bill_length_mm_2 bill_depth_mm_1 bill_depth_mm_2
##              <dbl>            <dbl>           <dbl>           <dbl>
## 1             32.1             59.6            13.1            21.5
## # ℹ 4 more variables: flipper_length_mm_1 <int>, flipper_length_mm_2 <int>,
## #   body_mass_g_1 <int>, body_mass_g_2 <int>
```

```r
# provide names to variables, to clearly distinguish each summarized variable
penguins_clean %>%
  summarize(across(everything(), 
                   .fns = list(min = min, max = max)))
```

```
## # A tibble: 1 × 8
##   bill_length_mm_min bill_length_mm_max bill_depth_mm_min bill_depth_mm_max
##                <dbl>              <dbl>             <dbl>             <dbl>
## 1               32.1               59.6              13.1              21.5
## # ℹ 4 more variables: flipper_length_mm_min <int>, flipper_length_mm_max <int>,
## #   body_mass_g_min <int>, body_mass_g_max <int>
```

---

## More examples using the [`worldbank` data](https://data.worldbank.org/)

```r
data("worldbank", package = "rcis")
worldbank
```

```
## # A tibble: 78 x 14
##    iso3c date  iso2c country   perc_en~1 rnd_g~2 percg~3 real_~4 gdp_c~5 top10~6
##    <chr> <chr> <chr> <chr>         <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 ARG   2005  AR    Argentina      89.1   0.379    15.5   6198.   5110.    35  
##  2 ARG   2006  AR    Argentina      88.7   0.400    22.1   7388.   5919.    33.9
##  3 ARG   2007  AR    Argentina      89.2   0.402    22.8   8182.   7245.    33.8
##  4 ARG   2008  AR    Argentina      90.7   0.421    21.6   8576.   9021.    32.5
##  5 ARG   2009  AR    Argentina      89.6   0.519    18.9   7904.   8225.    31.4
##  6 ARG   2010  AR    Argentina      89.5   0.518    17.9   8803.  10386.    32  
##  7 ARG   2011  AR    Argentina      88.9   0.537    17.9   9528.  12849.    31  
##  8 ARG   2012  AR    Argentina      89.0   0.609    16.5   9301.  13083.    29.7
##  9 ARG   2013  AR    Argentina      89.0   0.612    15.3   9367.  13080.    29.4
## 10 ARG   2014  AR    Argentina      87.7   0.613    16.1   8903.  12335.    29.9
## # ... with 68 more rows, 4 more variables: employment_ratio <dbl>,
## #   life_exp <dbl>, pop_growth <dbl>, pop <dbl>, and abbreviated variable names
## #   1: perc_energy_fosfuel, 2: rnd_gdpshare, 3: percgni_adj_gross_savings,
## #   4: real_netinc_percap, 5: gdp_capita, 6: top10perc_incshare
```

---

### `summarize()`, `across()`, and `where()`

```r
# use across() with where() to pick variables based on type (e.g. is.numeric(), etc.)
worldbank %>% 
  group_by(country) %>%
  # pass na.rm inside the function: extra arguments via `...` are deprecated in dplyr >= 1.1.0
  summarize(across(.cols = where(is.numeric), .fns = ~ mean(.x, na.rm = TRUE)))
```

```
## # A tibble: 6 x 11
##   country        perc_~1 rnd_g~2 percg~3 real_~4 gdp_c~5 top10~6 emplo~7 life_~8
##   <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 Argentina         89.1  0.501     17.5   8560.  10648.    31.6    55.4    75.4
## 2 China             87.6  1.67      48.3   3661.   5397.    30.8    69.8    74.7
## 3 Indonesia         65.3  0.0841    30.5   2041.   2881.    31.2    62.5    69.5
## 4 Norway            58.9  1.60      37.2  70775.  85622.    21.9    67.3    81.3
## 5 United Kingdom    86.3  1.68      13.5  34542.  43416.    26.2    58.7    80.4
## 6 United States     84.2  2.69      17.6  42824.  51285.    30.1    60.2    78.4
## # ... with 2 more variables: pop_growth <dbl>, pop <dbl>, and abbreviated
## #   variable names 1: perc_energy_fosfuel, 2: rnd_gdpshare,
## #   3: percgni_adj_gross_savings, 4: real_netinc_percap, 5: gdp_capita,
## #   6: top10perc_incshare, 7: employment_ratio, 8: life_exp
```

```r
# or pick variables by type and by name (here, names that begin with "perc")
worldbank %>%
  group_by(country) %>%
  summarize(across(
    .cols = where(is.numeric) & starts_with("perc"),
    .fns = ~ mean(.x, na.rm = TRUE)
  ))
```

```
## # A tibble: 6 x 3
##   country        perc_energy_fosfuel percgni_adj_gross_savings
##   <chr>                        <dbl>                     <dbl>
## 1 Argentina                     89.1                      17.5
## 2 China                         87.6                      48.3
## 3 Indonesia                     65.3                      30.5
## 4 Norway                        58.9                      37.2
## 5 United Kingdom                86.3                      13.5
## 6 United States                 84.2                      17.6
```

---

### `across()` and `filter()`

Inside `filter()`, replace `across()` with one of its companion functions, `if_any()` or `if_all()`. Both take the same `.cols` and `.fns` arguments as `across()`.

```r
# if_any() keeps rows where the predicate is true for at least one column
worldbank %>%
  filter(if_any(everything(), ~ !is.na(.x)))
```

```r
# if_all() keeps rows where the predicate is true for all selected columns
worldbank %>%
  filter(if_all(everything(), ~ !is.na(.x)))
```
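
The difference between the two is easier to see on a table small enough to check by eye. This is a sketch on made-up data (`toy`, `any_present`, and `all_present` are illustrative names only).

```r
library(dplyr)

# made-up data: row 1 is complete, row 2 has one NA, row 3 is all NA
toy <- tibble(
  x = c(1, NA, NA),
  y = c(2, 5, NA)
)

# keep rows with at least one non-missing value: rows 1 and 2
any_present <- toy %>% filter(if_any(everything(), ~ !is.na(.x)))

# keep rows with no missing values at all: row 1 only
all_present <- toy %>% filter(if_all(everything(), ~ !is.na(.x)))

nrow(any_present)
nrow(all_present)
```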

---

## Practice

See today's class materials posted on the website

---

## While loops

* Definition
* Examples
* Main use

---

## Definition of while loops

* A while loop begins by evaluating a condition

* If the condition is TRUE, R executes the loop body

* Once the loop body is executed, R starts over: the condition is evaluated again, and so forth, until the condition is FALSE

* At that point, R stops the while loop

---

### While loop example

Syntax:
```
while (condition to be evaluated) {
  statement(s)
}
```

Example:

```r
counter <- 1

while(counter <= 4) {
  print(counter)
  counter <- counter + 1
}
```

```
## [1] 1
## [1] 2
## [1] 3
## [1] 4
```

---

### While loop example

Let's take a similar example, but this time we also print `counter` at the end of the loop body. Why are the results different?

```r
counter <- 1

while(counter <= 3) {
  print(counter)
  counter <- counter + 1
  print(counter)
}
```

```
## [1] 1
## [1] 2
## [1] 2
## [1] 3
## [1] 3
## [1] 4
```

---

### While loop example

Let's take the same example, but this time we do not increment our `counter` variable:

```
counter <- 1
while(counter < 3){
  print(counter)
}
```

What is the output of this code?
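
Because `counter` never changes, the condition above never becomes FALSE. If you want to experiment with a loop like this safely, one option is to add a cap on the number of iterations and leave the loop with `break`. This is only a sketch: `iterations` and `max_iterations` are hypothetical names added for illustration, and the cap of 5 is arbitrary.

```r
counter <- 1
iterations <- 0
max_iterations <- 5  # hypothetical safety cap, chosen only for this sketch

while (counter < 3) {
  iterations <- iterations + 1
  if (iterations > max_iterations) {
    break  # leave the loop even though the condition is still TRUE
  }
  print(counter)
}
```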

---

### While loop example

What is the output of this code?

```r
counter <- 1
while(counter < 4){
  print(counter)
  multiply <- counter * 100
  print(multiply)
  counter <- counter + 1
  print(counter)
}
```

```
## [1] 1
## [1] 100
## [1] 2
## [1] 2
## [1] 200
## [1] 3
## [1] 3
## [1] 300
## [1] 4
```

---

### While loops uses

While loops are best used **when you do not know the length of your input**: you do not know the exact number of times you need to iterate and want to continue until a condition is met.

For example:

* Loop until you get three heads in a row in a random sequence of coin flips
* Loop until you reach your target number for data collection (e.g. keep accepting user inputs until you have enough responses from users)
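
The coin-flip case can be sketched as follows. We cannot know in advance how many flips are needed, so a `while` loop is a natural fit. The variable names and the seed are arbitrary choices made for this illustration.

```r
set.seed(1)  # arbitrary seed so the run is reproducible

flips <- 0
heads_in_a_row <- 0

while (heads_in_a_row < 3) {
  # flip one fair coin
  flip <- sample(c("H", "T"), size = 1)
  flips <- flips + 1

  if (flip == "H") {
    # extend the current streak of heads
    heads_in_a_row <- heads_in_a_row + 1
  } else {
    # a tail resets the streak
    heads_in_a_row <- 0
  }
}

flips  # total number of flips needed
```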

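While loops usually rely on a **"count variable"** (like `counter` in the examples above) that is initialized before the loop and updated inside the loop body; otherwise the condition never becomes FALSE and the loop never stops.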

While loops are important but **less common than for loops**, especially for the types of tasks we do in this course. For this reason, we don’t cover them in depth.

<!--
## Acknowledgments

The content of these slides is derived in part from Benjamin Soltoff’s “Computing for the Social Sciences” course materials, licensed under the CC BY NC 4.0 Creative Commons License. Any errors or oversights are mine alone.
-->