class: center, middle, inverse, title-slide .title[ # MACS 30500 LECTURE 12 ] .author[ ### Topics: Functions (cont.). Debugging. ] --- class: inverse, middle ## Agenda **More about functions:** * Practice Part 2 (from last lecture) * Writing more complex functions: example * Functions with more than one argument * Using functions to organize your code * Documenting your function * Functions and `{{ }}` * Practice Part 3 **Debugging:** * Define a computer bug * Practice finding bugs in code * Debugging techniques * Defensive programming (preventing bugs) <!-- NOTES FALL 2024 ON FUNCTIONS revise the new in-class practice exercises for this lecture in functions-bugs folder form css-materials 1. there is an error in code when looping element says v instead than i 2. prompt said without built-in function, but I am using seq_along() or length(); add that as exception or need to determine length of vector in loop first INTRO TO TOPIC DEBUGGING we are going to define what is a bug then we dive in with some practice finding and fixing bugs, I'd like to ask you to do this in groups of 2-3 people this time as it can be fun! then we de-brief and talk about some basic tips for debugging the rest of today is on preventing bugs before they occur that is defensive programming: it can be achieved with style, and condition handling -- other than with experience but that's not something we can fix with a tip --> --- class: inverse, middle ## More about functions! * Writing more complex functions: example * Functions with more than one argument * Use functions to organize your code * Documenting your function * Functions and `{{ }}` * Practice 3 (from last lecture in-class materials) --- ### Writing more complex functions: example Consider this simulated data of 4 columns and 10 observations (example taken from Ch. 21 of the 1st Edition of your book): ```r simulated_df <- tibble( a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10) ) head(simulated_df, n = 3) ``` ``` ## # A tibble: 3 × 4 ## a b c d ## <dbl> <dbl> <dbl> <dbl> ## 1 -1.21 -0.477 0.134 1.10 ## 2 0.277 -0.998 -0.491 -0.476 ## 3 1.08 -0.776 -0.441 -0.709 ``` <!-- rnorm(10) generates 10 random numbers from a standard normal distribution (mean = 0, sd = 1). --> --- ### Writing more complex functions: example To compute the mean for each column, we learned we can use a **for loop**: ```r output <- vector("double", length(simulated_df)) for (i in seq_along(simulated_df)) { output[i] <- mean(simulated_df[[i]]) } output ``` ``` ## [1] -0.3831574 -0.1181707 -0.3879468 -0.7661931 ``` --- ### Writing more complex functions: example If you are going to compute the mean for a column frequently, **put your for loop into a function**. This way you can write the code once, and call the function whenever you need it. -- Put the code from the previous slide into a function and call it `column_mean()` <!-- copy this code and do it interactively on workbench --> ```r column_mean <- function(df) { output <- vector("double", length(df)) for (i in seq_along(df)) { output[i] <- mean(df[[i]]) } return(output) } column_mean(simulated_df) ``` ``` ## [1] -0.3831574 -0.1181707 -0.3879468 -0.7661931 ``` --- ### Writing more complex functions: example What if, instead of only computing the mean, we also want to include the median and standard deviation? We **could write a function for each**, replacing `mean()` with `median()`, `sd()`, etc. ```r column_median <- function(df) { output <- vector("double", length(df)) for (i in seq_along(df)) { output[i] <- median(df[[i]]) } return(output) } ``` ```r column_sd <- function(df) { output <- vector("double", length(df)) for (i in seq_along(df)) { output[i] <- sd(df[[i]]) } return(output) } ``` --- ### Functions with more than one argument But this way we copied/pasted the code multiple times, and the differences among these codes are hard to spot! **How can we rewrite this code more efficiently?** **By setting up a function with more than one argument!** --- ### Functions with more than one argument In this example, the first argument is the dataframe, the second is the operation to perform, which can be another function, like `mean()`, `median()`, `sd()`, etc. ```r column_stats <- function(df, stat) { output <- vector("double", length(df)) for (i in seq_along(df)) { output[i] <- stat(df[[i]]) } return(output) } ``` ```r column_stats(simulated_df, median) ``` ``` ## [1] -0.5555419 -0.4941011 -0.4656169 -0.6053490 ``` ```r column_stats(simulated_df, sd) ``` ``` ## [1] 0.9957875 1.0673376 0.6660013 0.8942458 ``` --- ### Functions as a way to organize your code Pretty cool right?! **But how many things we can put in a single function?** * Ideally, functions should short and do one task. Good habit: break your code into sequential blocks, and write one function for each action (e.g. one function to import data, one to rename variables, one to calculate descriptive statistics, etc.) * You can chain functions, so that the output of one function is the first argument of the next function: this is called **functions composition** --- ### Functions as a way to organize your code Given this vector: ```r v <- c(1:10, NA, NA) v ``` ``` ## [1] 1 2 3 4 5 6 7 8 9 10 NA NA ``` I want to perform the following two operations, sequentially: remove NAs and calculate the mean; and I set up two functions to perform them. ```r vect_remove_NA <- function(vector){ return(vector[!is.na(vector)]) } ``` ```r vect_mean <- function(vector) { return(mean(vector)) } ``` --- ### Functions as a way to organize your code First, I call the first function, `vect_remove_NA()`, on my vector `v`, and store the output. Then, I pass it to the second function, `vect_mean()`: ```r v_no_missing <- vect_remove_NA(v) v_mean <- vect_mean(v_no_missing) v_mean ``` ``` ## [1] 5.5 ``` I can do the same thing using pipes: ```r v %>% vect_remove_NA() %>% vect_mean() ``` ``` ## [1] 5.5 ``` <!--for more see [here](https://rpubs.com/tjmahr/pipelines_2015) --> --- ### Functions as a way to organize your code **Takeaway: you can do lots of things with functions!** - assign them to variables - pass them as arguments to other functions - return them as the result of a function <!-- https://github.com/annakrystalli/UNAM/blob/master/Functions_in_R.Rmd --> --- ### Codying style: document your function Functions should be well-documented! How? Each function should include the following: * a one-sentence description of the function summarizing what it does * all function's arguments, denoted as `Args:`, with their data types * a description of the return value, denoted by `Returns:` --- ### Codying style: document your function ``` column_stats <- function(df, stat) { # Computes summary statistics for each column in a data frame # Args: # df (tibble): input data # stat (function): a function to compute stats (e.g., mean, median, sd) # Returns: # numeric: a vector with the computed statistics for each column of the df output <- vector("double", length(df)) for (i in seq_along(df)) { output[i] <- stat(df[[i]]) } return(output) } ``` --- **Example from Google's R Style Guide** [here](https://web.stanford.edu/class/cs109l/unrestricted/resources/google-style.html#:~:text=Function%20Documentation,-Functions%20should%20contain&text=These%20comments%20should%20consist%20of,value%2C%20denoted%20by%20Returns%3A%20.): ``` CalculateSampleCovariance <- function(x, y, verbose = TRUE) { # Computes the sample covariance between two vectors. # Args: # x: One of two vectors whose sample covariance is to be calculated. # y: The other vector. x and y must have the same length, greater than one, # with no missing values. # verbose: If TRUE, prints sample covariance; if not, not. Default is TRUE. # Returns: # The sample covariance between x and y. n <- length(x) if (n <= 1 || n != length(y)) { stop("Arguments x and y have invalid lengths: ", length(x), " and ", length(y), ".") } if (TRUE %in% is.na(x) || TRUE %in% is.na(y)) { stop(" Arguments x and y must not have missing values.") } covariance <- var(x, y) if (verbose) cat("Covariance = ", round(covariance, 4), ".\n", sep = "") return(covariance) } ``` --- ### Functions and `{{ }}` You can write custom functions and use them together with `dplyr` or `ggplot` functions! You need the following: * syntax to write a custom function * special operators that allows R to evaluate your custom functions correctly: * bang-bang `!!` with `enquo()` (this is the old version) * curly-curly or embracing `{{ }}` (this replaced the former method!) Check out Chapter 25 of our book for more! --- ### Functions and `{{ }}` Example: create a data frame of age and height of 5 individuals: ```r library(dplyr) df <- tibble(age = c(10, 12, 13, 14, 10), height = c(110, 140, 155, 170, 130)) df ``` ``` ## # A tibble: 5 × 2 ## age height ## <dbl> <dbl> ## 1 10 110 ## 2 12 140 ## 3 13 155 ## 4 14 170 ## 5 10 130 ``` --- ### Functions and `{{ }}` Use `summarize()` from `dplyr` to compute mean and standard deviation on the column `age` of this data frame: ```r df %>% summarize(mean_age = mean(age), sd_age = sd(age)) ``` ``` ## # A tibble: 1 × 2 ## mean_age sd_age ## <dbl> <dbl> ## 1 11.8 1.79 ``` --- ### Functions and `{{ }}` Rewrite the previous code using a function and the curly-curly operator: ```r custom_summary <- function(data, col) { data %>% summarize(mean_value = mean({{ col }}), sd_value = sd({{ col }})) } custom_summary(df, age) ``` ``` ## # A tibble: 1 × 2 ## mean_value sd_value ## <dbl> <dbl> ## 1 11.8 1.79 ``` The `{{ }}` operator tells `dplyr` not to treat "col" as the name of the variable, but instead to look inside it for the variable we actually need (here "age") --- ### Functions and `{{ }}` Rewrite it with multiple columns as arguments: ```r custom_summary <- function(data, col1, col2) { results <- list( summarise(data, mean_value = mean({{ col1 }}), sd_value = sd({{ col1 }})), summarise(data, mean_value = mean({{ col2 }}), sd_value = sd({{ col2 }})) ) # Combine output into a single data frame bind_rows(results) } custom_summary(df, age, height) ``` ``` ## # A tibble: 2 × 2 ## mean_value sd_value ## <dbl> <dbl> ## 1 11.8 1.79 ## 2 141 23.0 ``` --- ### Functions and `{{ }}` **When to use the `{{ }}` operator?** In all cases of tidy evaluation. Two main subtypes (see book Ch. 25 for details): * Data Masking: in functions such as `arrange()`, `filter()`, `summarize()` often when you need to pass a variable name; includes also `aes()` from `gglot2` * Tidy Selection: in functions such as `select()`, `relocate()`, `rename()` that select variables <!-- ### Functions and `{{ }}`: additional resources **Using bang-bang `!!`** * Chapter ["7 Tidy evaluation"](https://krlmlr.github.io/tidyprog/tidy-evaluation.html): explanation with examples on how and why to use `!!` with `enquo` * Blog post ["Writing Custom Tidyverse Functions"](https://jonthegeek.com/2018/06/04/writing-custom-tidyverse-functions/): explanation with step-by-step example on how and when to use `!!` with `ensym` and with `enquo` * Another example [here](http://zevross.com/blog/2018/09/11/writing-efficient-and-streamlined-r-code-with-help-from-the-new-rlang-package/) **Using curly-curly `{{ }}`** * Blog post ["Curly-Curly, the successor of Bang-Bang"](https://www.r-bloggers.com/2019/06/curly-curly-the-successor-of-bang-bang/) * Another example [here](https://agstats.io/post/writing-r-functions/) (under "Functions and Tidy Evaluation") --> --- class: inverse, middle ## Practice writing functions. Part 3 ### Download today's class materials from the website. Complete questions 6 and 7 --- class: inverse, middle ## Debugging: * Define a computer bug * Practice finding bugs in code * Debugging * Defensive programming (preventing bugs) Review "Debugging R code" in [What They Forgot to Teach You About R](https://rstats.wtf/debugging-r), for more! --- ### Bugs A **bug** is an error or flaw in a computer program that causes it to stop or to behave in unintended ways (e.g., produce an incorrect result). When we talk about bugs, we refer to **two main types:** * syntactic bugs (such as typos) * bugs resulting from a misunderstanding of R's rules We all encounter both types of bugs! Fixing the latter requires time and experience, but we can address syntactic bugs faster by improving our coding style! -- *Before we dive into debugging tips, let's begin with some practice!* --- class: inverse, middle ## Practice finding bugs in code ### Download today's class materials from the website. --- ### Basic debugging techniques What debugging tips did you learn from the practice exercises? -- 1. Realize that you have a bug 1. Figure out where it is - read your code out loud - use `print()` statements to check each line - take a deep breath! 1. Fix it --- ### The call stack **Often, the actual cause of a problem is not in the line of code we run.** For example, consider these two functions: `f` which processes an input `x`, and `g`, which calls function `f`. In our code, we call function `g` and we get an error, but the problem is actually in function `f`: ```r f <- function(x) x + 1 g <- function(x) f(x) g("a") ``` ``` ## Error in x + 1: non-numeric argument to binary operator ``` We cannot fix function `g`, because the problem does not occur there. We need to fix function `f` which triggers the call sequence! -- Use `traceback()`, which is often shown automatically in RStudio, and read it from bottom to top (the line at the top is where the error occurred) --- class: inverse, middle ## Defensive programming --- ### Defensive programming **Defensive programming** means preventing bugs from occurring in the first place! This involves coding style and condition handling. #### 1. Coding Style: * Notation and names * Syntax (spacing, curly braces, line length, indentation, etc.) * Comments #### 2. Condition handling --- class: inverse, middle ## 1. Defensive programming with **Coding Style** Review the [tidyverse style guide](https://style.tidyverse.org/) We will go over some key concepts from this guide + review general style guidelines we learned in week 1! <!-- do this rather briefly as for the most part no new concepts; so we spend more time on the second tip for defensive programming: condition handling --> --- ### Writing code Programming | Language ------------|---------- Scripts | Essays Sections | Paragraphs Lines Breaks | Sentences Parentheses | Punctuation Functions | Verbs Variables | Nouns --- ### A text with no syntax Text taken from a speech given by Ronald Reagan in 1987, after the spacial Challenger exploded on take off. Reagan's address: https://youtu.be/Qa7icmqgsow "weve grown used to wonders in this century its hard to dazzle us but for 25 years the united states space program has been doing just that weve grown used to the idea of space and perhaps we forget that weve only just begun were still pioneers they the members of the Challenger crew were pioneers and i want to say something to the school children of America who were watching the live coverage of the shuttles takeoff i know it is hard to understand but sometimes painful things like this happen its all part of the process of exploration and discovery its all part of taking a chance and expanding mans horizons the future doesnt belong to the fainthearted it belongs to the brave the challenger crew was pulling us into the future and well continue to follow them the crew of the space shuttle challenger honored us by the manner in which they lived their lives we will never forget them nor the last time we saw them this morning as they prepared for the journey and waved goodbye and slipped the surly bonds of earth to touch the face of god" --- ### A text with syntax "We've grown used to wonders in this century. It's hard to dazzle us. But for 25 years the United States space program has been doing just that. We've grown used to the idea of space, and perhaps we forget that we've only just begun. We're still pioneers. They, the members of the Challenger crew, were pioneers. And I want to say something to the school children of America who were watching the live coverage of the shuttle's takeoff. I know it is hard to understand, but sometimes painful things like this happen. It's all part of the process of exploration and discovery. It's all part of taking a chance and expanding man's horizons. The future doesn't belong to the fainthearted; it belongs to the brave. The Challenger crew was pulling us into the future, and we'll continue to follow them.... The crew of the space shuttle Challenger honoured us by the manner in which they lived their lives. We will never forget them, nor the last time we saw them, this morning, as they prepared for the journey and waved goodbye and 'slipped the surly bonds of earth' to 'touch the face of God.'" --- ### Short and clear object names ```r # optimal (short and use of snake case) day_one_month day_one # not optimal the_first_day_of_the_month dayone DayOneMonth dm1 ``` <!-- ### Do not overwrite objects ```r # examples of wrong code T <- FALSE x <- seq(from = 1, to = 10) mean <- function(x) sum(x) mean(x) ``` --> --- ### Break your lines and use indentation ``` # optimal scdb_vote <- scdb_vote %>% mutate(chief = factor(chief, levels = c("Jay", "Rutledge", "Ellsworth", "Marshall", "Taney", "Chase", "Waite", "Fuller", "White", "Taft", "Hughes", "Stone", "Vinson", "Warren", "Burger", "Rehnquist", "Roberts"))) # not optimal scdb_vote <- mutate(scdb_vote, chief = factor(chief, levels = c("Jay", "Rutledge", "Ellsworth", "Marshall", "Taney", "Chase", "Waite", "Fuller", "White", "Taft", "Hughes", "Stone", "Vinson", "Warren", "Burger", "Rehnquist", "Roberts"))) ``` --- ### Calling functions that have the same name Sometimes functions have the same name in different packages. For example the `map()` function is defined both in the `purrr` package (to work with vectors and functions) and `maps` package (to create maps) If you load both, by default R will use the function from the package most recently loaded: ```r library(purrr) library(maps) map() ``` But what if you want to use the `map()` function from the `purrr` package? --- ### Calling functions that have the same name You can be intentional about which function to use with the `::` notation! ```r library(purrr) library(maps) purrr::map() # use map() from the purrr library maps::map() # use map() from the maps library ``` -- Or you can avoid loading a given package, and just load the specific function that you need from it: ```r library(purrr) map() # use map() from the purrr library maps::map() # use map() from the maps library ``` <!-- ### Use functions correctly See assigned reading ["How to read an R help page"](https://socviz.co/appendix.html#how-to-read-an-r-help-page) --> --- ### Auto-formatting in R Studio R Studio helps out with these issues: * option 1: `Code > Reformat Code` (Shift + Cmd/Ctrl + A) * option 2: `Code > Reindent Lines` (Cmd/Ctrl + I) * option 3: install the package [`styler`](http://styler.r-lib.org/) -- Open an R script, and try it out: ``` y<-10 if (y < 20) { x <- "Too low" } else { x <-"Too high"} ``` <!-- * [This code example](/notes/style-guide/#exercise-style-this-code) --> --- class: inverse, middle ## 2. Defensive programming with **Condition handling** <!-- need to teach try except! --> --- ### Three types of conditions to handle Coding style is one way to practice defensive programming and prevent bugs. The second way is condition handling: **set up our code in a way that it tells us if something is problematic.** Three types of conditions: * Errors (break your code) * Warnings * Messages --- ### Errors: Incorrect code or impossible requests in R For example, this `addition()` function takes two arguments and adds them together. The `if-else` condition checks if either `x` or `y` is not a number. If that's TRUE, the `stop()` function triggers a error and notifies the user: ```r addition <- function(x, y) { if (!is_numeric(c(x, y))) { stop("One of your inputs is not a number") } else { return(x + y) } } addition(3, "2") ``` ``` ## Error in addition(3, "2"): One of your inputs is not a number ``` --- ### Errors: Incorrect code or impossible requests in R A function can test for more than one error; but you need to check each of them separately with `if-else` statements. The function stops at the first error it encounters. How to decide which errors to check for: 1. More conditional tests make the function more robust against incorrect uses 1. Think about who is going to use that function, and how frequently 1. Provide documentation to reduce the need for extensive testing -- Try-Except Blocks: Use try-except blocks to manage exceptions (beyond this course). --- ### Warnings: Code runs but may have potential issues For example, this code defines a function that takes as input `x` a probability value (between 0 and 1) and converts it to a natural logarithm value. R will execute this code, but when the function is called with values outside the probability range, it gives a warning that says the result produces a “NaN” value ("Not a Number", impossible to calculate): ```r logit <- function(x) { return(log(x / (1 - x))) } logit(-1) ``` ``` ## Warning in log(x/(1 - x)): NaNs produced ``` ``` ## [1] NaN ``` --- ### Warnings: Code runs but may have potential issues To fix this warning, we can... Option 1: add a condition that signals and triggers an error instead than a warning. For example, if `x` is not between 0 and 1, then stop the code: ```r logit <- function(x) { if (x < 0 | x > 1) { stop('x not between 0 and 1') } else { return(log(x / (1 - x))) } } logit(-1) ``` ``` ## Error in logit(-1): x not between 0 and 1 ``` <!-- ## Warnings Same code of the previous slide, written more compactly: ```r logit <- function(x) { if (x < 0 | x > 1) stop('x not between 0 and 1') log(x / (1 - x)) } logit(-1) ``` ``` ## Error in logit(-1): x not between 0 and 1 ``` Notice here we can write `if` and the condition one the same line without the `{}` and still preserve code legibility of this single `if` statement; we can also remove `return` --> --- ### Warnings: Code runs but may have potential issues Option 2: To avoid stopping the code, we can also fix a warning without triggering an error. For example, we can check if `x` is outside the given range; if so, replace it with a missing value and trigger a warning if `x` is a missing value. ```r logit <- function(x) { x <- if_else(x < 0 | x > 1, NA_real_, x) if (is.na(x)) { warning('x not between 0 and 1') return(log(x / (1 - x))) } } logit(-1) ``` ``` ## Warning in logit(-1): x not between 0 and 1 ``` ``` ## [1] NA ``` --- ### Messages: Informative notices Messages do not indicate that something is wrong, but provide useful info to the user. For example, here we are plotting with `geom_smooth()` which automatically decides which smoothing algorithm to use (default is `gam` based on sample size): ```r ggplot(diamonds, aes(carat, price)) + geom_point() + geom_smooth() ``` ``` ## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")' ``` <img src="index_files/figure-html/message_ggplot-1.png" width="35%" style="display: block; margin: auto;" /> --- ### Messages: Informative notices To write a message in your code, do not use the `print()` function, but the `message()` function" ```r demo_message <- function() message("This is a message") demo_message() ``` ``` ## This is a message ``` You can also suppress a message with `suppressMessages()` but this won't work if you create a message using `print()`: ```r suppressMessages(demo_message()) # no output ```