class: center, middle, inverse, title-slide .title[ # MACS 30500 LECTURE 10 ] .author[ ### Topics: Loops ] --- class: inverse, middle # Agenda * What are Control Structures? Definition and Main Control Structures * For loops * Alternatives to for loops in R * While loops <!-- NOTES TO IMPROVE CURRENT LOOP SLIDES Add a few slides that show the differences of looping over indexes or looping over elements and teach that concept form the slides; then leave the in-class demo code with the key and let students explore it in team and go around for questions. Total time 10 min slides and 10 go over code + 5 review In the demo for loop: add code on data structure, e.g. show how to access columns of a df and their elements with the double and single square brackets Show difference of accessing single column in a df (that is a vector so use []) or elements of columns for which you need to use [[]] When you teach for loops add the break and continue statements (currently not in slides!) --> --- class: inverse, middle ## What are Control Structures? --- ### Control Structures: Definition All code we have written so far can be seen as a finite and fixed sequence of commands. What's next? **Control structures allow us to change the flow of execution in our code. By incorporating logical conditions, they enable different lines of code to run based on the given conditions.** This approach differs from executing the same code in the same way each time, like we have done so far in this course! --- ### Main Control Structures * **Conditional statements**: test one or more condition(s) and act on it or them * **`for` loop**: execute a block of code for a fixed number of times * **`while` loop**: execute a block of code while a given condition is true, and stops only once the condition is evaluated as false <!-- We will now explain each of these control structures, and do practice exercises, but make sure to review the assigned readings for further details: one is from Chapter 13 “Control Structures” in *R programming for Data Science* and Chapter 21 “Iteration” in *R for Data Science* for details. --> --- class: inverse, middle ## For loops * For loop definition and simple examples * Demo example: same task with and without a for loop on a dataframe --- ### Definition of for loops For loops are the most common looping construct in many programming languages to **iterate over the elements of an object** (usually a list or vector) and do something on each one of them. Syntax: ``` for (item in sequence of items) { statement(s) } ``` -- Example: ``` for (item in c(1:3)) { print(item) } ``` --- ### For loop example #1 ```r for (item in c(1:3)) { print(item) } ``` ``` ## [1] 1 ## [1] 2 ## [1] 3 ``` Let's unpack this example: * the statement executed here is simple: we print `item` using the `print()` function * `item` is a placeholder: during each iteration of the loop, the its specific value changes * the number of times the statement block repeats depends on the number of items in the sequence of numbers provided: in this example three times * `item` can be labeled any name you like, R does not care as long as you are consistent --- ### For loop example #2 For loops can be nested. In this case, the **outer loop determines the number of complete iterations of the inner loop**: for each execution of the outer loop, the inner loop runs N times ``` for (i in c(1:3)) { print(i) for (j in c("cat", "dog", "squirrel", "rabbit")) { print(j) } } ``` What will this nested loop output? <!-- for (i in 1:3) { print(i) for (j in c('a', 'b')) { print(i) print(paste(i, "outer")) print(j) print(paste(i,j)) } } --> --- ### For loop example #3 ``` for (i in c(1:3)) { print(i) print("Hello") sum <- i + 100 print(sum) } ``` What will this loop output? <!-- What is this loop doing? Have someone describing it what happens if I add a print(i) at the end outside the loop: prints last i--> --- ### Save for loop output: rewrite example #3 In these examples, we are not saving the output of our for loop: we are only printing it. However, in practice, we usually want to save the results. We can rewrite the previous example to store the results of the one operation we are doing (sum), like this: ```r output <- vector(mode = "integer", length = length(c(1:3))) for (i in c(1:3)) { output[i] <- i + 100 } output ``` ``` ## [1] 101 102 103 ``` ```r length(output) ``` ``` ## [1] 3 ``` <!-- Maybe add about here looping over indexes vs element of a loop --> --- ### Back to the definition For loops are the most common looping construct in many programming languages to **iterate over the elements of an object** (usually a list or vector) and do something on each one of them. For loops include three components: * output: to store results (best practice: pre-allocate the output with a given length) * sequence: what goes in the loop * body: statements/actions to be executed every time through the loop --- ### Same task *without* and *with* a for loop on a data frame Get the data: ```r library(tidyverse) library(palmerpenguins) data(penguins) ``` Transform the data and save in a new data frame: ```r penguins_clean <- select(penguins, 3:6) %>% drop_na() head(penguins_clean, n = 3) ``` ``` ## # A tibble: 3 × 4 ## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## <dbl> <dbl> <int> <int> ## 1 39.1 18.7 181 3750 ## 2 39.5 17.4 186 3800 ## 3 40.3 18 195 3250 ``` --- ### Calculate the mean value of each column *without a loop* To calculate the mean value of each column of this toy dataframe, we can take the `mean()` function, and apply it to each column. Recall, we can use the `$` sign to access columns within a data frame. <!-- penguins_clean %>% summarize(mean_bill_length_mm = mean(bill_length_mm) --> ``` mean(penguins_clean$bill_length_mm) mean(penguins_clean$bill_depth_mm) mean(penguins_clean$flipper_length_mm) mean(penguins_clean$body_mass_g) ``` This works, but is a lot of copying/pasting... --- ### We can do the same but *with a loop* First, initialize an empty vector to store results. Second, apply a for loop to each column of this toy dataframe. ```r output <- vector(mode = "double", length = ncol(penguins_clean)) for (i in seq_along(penguins_clean)) { output[[i]] <- mean(penguins_clean[[i]]) } output ``` ``` ## [1] 43.92193 17.15117 200.91520 4201.75439 ``` ### Let's unpack this example: open the **demo.Rmd** from today's class materials! <!-- ### Benefits of preallocation This explains why we are pre-allocating in the first place, and why we do so with a vector: having an object that is already of the same length of the output, where we are just plugging in individual values increases speed, rather the more naive approach in which we store reuslts using an mpty vector or an empty other object (e.g. a dataframe) of length zero, and then append or add on each of the values as we calculate them For example, let's take this mpg data (built in dataframe in R about auto, we do not really care about the content of the data); here what we are doing is creating duplicates of that dataframe 100 times and we are then putting them together into a single data frame. Without preallocation: we can create an empty dataframe (here with the tibble function), iterate over 100 times, take this empty dataframe and combine the rows of it with the rows of the original dataframe, and replace the original object with the new copy and save in output (so we are appending 100 rows every time we iterate!) If we do proper preallocation: we create a list of 100 empty elements, every time we store the results in the list, then we use the bind_rows() functions at the end The first approach does not preallocate by creating an empty space to store the output, the second does. See the difference in time of execution. From 80 milliseconds to less than 3. So you can see how inefficient is not to allocate since most of our data will have more than 100 rows! .panelset[ .panel[.panel-name[Code] ```r # no preallocation mpg_no_preall <- tibble() for(i in 1:100){ mpg_no_preall <- bind_rows(mpg_no_preall, mpg) } # with preallocation using a list mpg_preall <- vector(mode = "list", length = 100) for(i in 1:100){ mpg_preall[[i]] <- mpg } mpg_preall <- bind_rows(mpg_preall) ``` ] .panel[.panel-name[Plot] <img src="index_files/figure-html/unnamed-chunk-6-1.png" width="70%" style="display: block; margin: auto;" /> ] ] --> --- class: inverse, middle ## Alternatives to for loops in R Writing loops within a dataframe is possible, and sometimes is advisable. However, R provides alternatives to for loops that are *generally* better to use with dataframes: * Iteration with `map_*()` functions * Iteration with `across()` --- ## Map functions <!-- As we have seen in the slides and practice, we kind of have to write a lot of code for a for loop, for example to calculate a straightforward operation like the mean or median etc. so R gives us a shortcut: map functions which come from the purrr package in R and are much more compact (go to the documentation page) --> **In R `for` loops are good, but `map()` functions may be even better!** These functions come from the `purr` package in R: https://purrr.tidyverse.org/ There are **different `map()` functions** each creates a different type of output (this is the same idea as in the `for loop` when we specify the `mode` of our output vector): - `map()` makes a list - `map_lgl()` makes a logical vector - `map_int()` makes an integer vector - `map_dbl()` makes a double vector - `map_chr()` makes a character vector --- ## Map functions Let's see a few examples using the same `penguins_clean` dataset we have been using so far: <!-- data(mtcars) head(mtcars) --> ```r penguins_clean <- select(penguins, 3:6) %>% drop_na() head(penguins_clean, n = 3) ``` ``` ## # A tibble: 3 × 4 ## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## <dbl> <dbl> <int> <int> ## 1 39.1 18.7 181 3750 ## 2 39.5 17.4 186 3800 ## 3 40.3 18 195 3250 ``` --- ## Map functions Pick the appropriate `map()` function and specify at least two main arguments (for more options check the documentation!): * what you are iterating over * what you are calculating ```r map_dbl(penguins_clean, median) ``` ``` ## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## 44.45 17.30 197.00 4050.00 ``` ```r map_dbl(penguins_clean, mean, na.rm = TRUE) ``` ``` ## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## 43.92193 17.15117 200.91520 4201.75439 ``` --- ## Map functions We can use `map()` functions also with the `%>%` operator: ```r penguins_clean %>% map_dbl(mean, na.rm = TRUE) ``` ``` ## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## 43.92193 17.15117 200.91520 4201.75439 ``` --- ## Across function We’ve seen how to use loops and `map()` functions to solve the same task. **Let's review one final approach: the `across()` function.** Notice: * `across()` comes from the `dplyr` package * `map()` functions come from the `purr` package Advantages: * `across()` makes it easy to apply the same transformation to multiple columns in a data frame * since it comes from `dplyr()`, it is well integrated with `dplyr` verbs! --- ### Single column Using the `dplyr` verb `summarize()`, we can easily calculate the mean of one column without the help of `map()` or `across()`: <!-- ``` mtcars %>% summarize(miles_gallon = mean(mpg)) ``` --> ```r penguins_clean %>% summarize(mean_bill_length = mean(bill_length_mm)) ``` ``` ## # A tibble: 1 × 1 ## mean_bill_length ## <dbl> ## 1 43.9 ``` --- ### Multiple columns We can extend the same operation to multiple columns, as follows: <!-- ``` mtcars %>% summarize( miles_gallon = mean(mpg), cylinders = mean(cyl), displacement = mean(disp), horsepower = mean(hp), rear_axle_ratio = mean(drat), weight = mean(wt) ) ``` --> ```r penguins_clean %>% summarize( bill_length = mean(bill_length_mm), bill_depth = mean(bill_depth_mm), flipper_length = mean(flipper_length_mm), body_mass = mean(body_mass_g) ) ``` ``` ## # A tibble: 1 × 4 ## bill_length bill_depth flipper_length body_mass ## <dbl> <dbl> <dbl> <dbl> ## 1 43.9 17.2 201. 4202. ``` This works... but we can do this same operator more efficiently using `across()` --- ### `dplyr::across()` `across()` has two main arguments: * `.cols`: columns you want to operate on; can select columns by position, name, and type * `.fns`: is a function or list of functions to apply to each column We now examine a few examples of `across()` in conjunction with its favorite verb, `summarize()` Check the documentation for more: https://dplyr.tidyverse.org/reference/across.html --- ### `summarize()`, `across()`, and `everything()` .panelset[ .panel[.panel-name[Single function] ```r # calculate the mean of all columns, use everything() to select all variables penguins_clean %>% summarize(across(everything(), ~ mean(., na.rm = TRUE))) ``` ``` ## # A tibble: 1 × 4 ## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## <dbl> <dbl> <dbl> <dbl> ## 1 43.9 17.2 201. 4202. ``` ] .panel[.panel-name[Multiple functions] ```r # to apply multiple summaries, store the functions in a list penguins_clean %>% summarize(across(everything(), .fns = list(min, max))) ``` ``` ## # A tibble: 1 × 8 ## bill_length_mm_1 bill_length_mm_2 bill_depth_mm_1 bill_depth_mm_2 ## <dbl> <dbl> <dbl> <dbl> ## 1 32.1 59.6 13.1 21.5 ## # ℹ 4 more variables: flipper_length_mm_1 <int>, flipper_length_mm_2 <int>, ## # body_mass_g_1 <int>, body_mass_g_2 <int> ``` ] .panel[.panel-name[Multiple named functions] ```r # provide names to variables, to clearly distinguish each summarized variable penguins_clean %>% summarize(across(everything(), .fns = list(min = min, max = max))) ``` ``` ## # A tibble: 1 × 8 ## bill_length_mm_min bill_length_mm_max bill_depth_mm_min bill_depth_mm_max ## <dbl> <dbl> <dbl> <dbl> ## 1 32.1 59.6 13.1 21.5 ## # ℹ 4 more variables: flipper_length_mm_min <int>, flipper_length_mm_max <int>, ## # body_mass_g_min <int>, body_mass_g_max <int> ``` ] ] <!-- https://willhipson.netlify.app/post/dplyr_across/dplyr_across/ --> --- ## More examples using the [`worldbank` data](https://data.worldbank.org/) ```r data("worldbank", package = "rcis") worldbank ``` ``` ## # A tibble: 78 x 14 ## iso3c date iso2c country perc_en~1 rnd_g~2 percg~3 real_~4 gdp_c~5 top10~6 ## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 ARG 2005 AR Argentina 89.1 0.379 15.5 6198. 5110. 35 ## 2 ARG 2006 AR Argentina 88.7 0.400 22.1 7388. 5919. 33.9 ## 3 ARG 2007 AR Argentina 89.2 0.402 22.8 8182. 7245. 33.8 ## 4 ARG 2008 AR Argentina 90.7 0.421 21.6 8576. 9021. 32.5 ## 5 ARG 2009 AR Argentina 89.6 0.519 18.9 7904. 8225. 31.4 ## 6 ARG 2010 AR Argentina 89.5 0.518 17.9 8803. 10386. 32 ## 7 ARG 2011 AR Argentina 88.9 0.537 17.9 9528. 12849. 31 ## 8 ARG 2012 AR Argentina 89.0 0.609 16.5 9301. 13083. 29.7 ## 9 ARG 2013 AR Argentina 89.0 0.612 15.3 9367. 13080. 29.4 ## 10 ARG 2014 AR Argentina 87.7 0.613 16.1 8903. 12335. 29.9 ## # ... with 68 more rows, 4 more variables: employment_ratio <dbl>, ## # life_exp <dbl>, pop_growth <dbl>, pop <dbl>, and abbreviated variable names ## # 1: perc_energy_fosfuel, 2: rnd_gdpshare, 3: percgni_adj_gross_savings, ## # 4: real_netinc_percap, 5: gdp_capita, 6: top10perc_incshare ``` --- ### `summarize()`, `across()`, and `where()` .panelset[ .panel[.panel-name[Single condition] ```r # use across() with where() to pick variables based on type (e.g. is.numeric(), etc.) worldbank %>% group_by(country) %>% summarize(across(.cols = where(is.numeric), .fns = mean, na.rm = TRUE)) ``` ``` ## # A tibble: 6 x 11 ## country perc_~1 rnd_g~2 percg~3 real_~4 gdp_c~5 top10~6 emplo~7 life_~8 ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Argentina 89.1 0.501 17.5 8560. 10648. 31.6 55.4 75.4 ## 2 China 87.6 1.67 48.3 3661. 5397. 30.8 69.8 74.7 ## 3 Indonesia 65.3 0.0841 30.5 2041. 2881. 31.2 62.5 69.5 ## 4 Norway 58.9 1.60 37.2 70775. 85622. 21.9 67.3 81.3 ## 5 United Kingdom 86.3 1.68 13.5 34542. 43416. 26.2 58.7 80.4 ## 6 United States 84.2 2.69 17.6 42824. 51285. 30.1 60.2 78.4 ## # ... with 2 more variables: pop_growth <dbl>, pop <dbl>, and abbreviated ## # variable names 1: perc_energy_fosfuel, 2: rnd_gdpshare, ## # 3: percgni_adj_gross_savings, 4: real_netinc_percap, 5: gdp_capita, ## # 6: top10perc_incshare, 7: employment_ratio, 8: life_exp ``` ] .panel[.panel-name[Compound condition] ```r # or pick variables based on type and whose name begins with "perc" worldbank %>% group_by(country) %>% summarize(across( .cols = where(is.numeric) & starts_with("perc"), .fn = mean, na.rm = TRUE )) ``` ``` ## # A tibble: 6 x 3 ## country perc_energy_fosfuel percgni_adj_gross_savings ## <chr> <dbl> <dbl> ## 1 Argentina 89.1 17.5 ## 2 China 87.6 48.3 ## 3 Indonesia 65.3 30.5 ## 4 Norway 58.9 37.2 ## 5 United Kingdom 86.3 13.5 ## 6 United States 84.2 17.6 ``` ] ] --- ### `across()` and `filter()` To use `across()` with `filter()`, we need an extra step: `if_any()` or `if_all()` ```r # if_any() keeps rows where the predicate is true for at least one column worldbank %>% filter(if_any(everything(), ~ !is.na(.x))) ``` ```r # if_all() keeps rows where the predicate is true for all selected columns worldbank %>% filter(if_all(everything(), ~ !is.na(.x))) ``` --- class: inverse, middle ## Practice See today's class materials posted on the website --- class: inverse, middle ## While loops * Definition * Examples * Main use --- ## Definition of while loops * A while loop begins by evaluating a condition * If the condition is TRUE, R executes the loop body * Once the loop body is executed, R starts over: the condition is evaluated again, and so forth, until the condition is FALSE * At that point, R stops the while loop --- ### While loop example Syntax: ``` while (condition to be evaluated) { statement(s) } ``` Example: ```r counter <- 1 while(counter <= 4) { print(counter) counter <- counter + 1 } ``` ``` ## [1] 1 ## [1] 2 ## [1] 3 ## [1] 4 ``` <!-- set the counter outside the loop, usually to 1 set a condition to be evaluated: here the condition says the counter has to be smaller or equal to 4 if the condition is TRUE, the loop is executed, here we only print(counter) thus the first time it prints 1 but if we leave it as it is (without the last line of code), the while loops will keep going infinitely: we need a way to break the loop thus we increment our counter inside the loop by redefining it as counter + 1 --> --- ### While loop example Let's take the same example, but this time we print `counter` also at the end. Why are the results different? ```r counter <- 1 while(counter <= 3) { print(counter) counter <- counter + 1 print(counter) } ``` ``` ## [1] 1 ## [1] 2 ## [1] 2 ## [1] 3 ## [1] 3 ## [1] 4 ``` --- ### While loop example Let's take the same example, but this time we do not increment our `counter` variable: ``` counter <- 1 while(counter < 3){ print(counter) } ``` What is the output of this code? --- ### While loop example What is the output of this code? ```r counter <- 1 while(counter < 4){ print(counter) multiply <- counter * 100 print(multiply) counter <- counter + 1 print(counter) } ``` ``` ## [1] 1 ## [1] 100 ## [1] 2 ## [1] 2 ## [1] 200 ## [1] 3 ## [1] 3 ## [1] 300 ## [1] 4 ``` --- ### While loops uses While loops are best used **when you do not know the length of your input**: you do not know the exact numbre of times you need to iterate and want to continue until a condition is met For example: * Loop until you get three heads in a row in a random sequence of numbers * Loop until you reach your target number for data collection (e.g. keep accepeting user inputs until you have enough responses from users) While loops require a **"count variable"** to be set outside the loop. While loops are important but **less common than for loops** especially for the types of tasks we do in this course. For this reason, we don’t cover them in-depth. <!-- ## Acknowledgments The content of these slides is derived in part from Benjamin Soltoff’s “Computing for the Social Sciences” course materials, licensed under the CC BY NC 4.0 Creative Commons License. Any errors or oversights are mine alone. -->