class: center, middle, inverse, title-slide .title[ # MACS 30500 LECTURE 8 ] .author[ ### Topics: Base R & Data Structures ] --- class: inverse, middle ## Agenda <!-- TO IMPROVE FOR NEXT TIME NOTES FALL 2024 Add more exercises on base R and tidyverse, how to combine them and how subsettings work in each Use the same question for HW4 Question 3 and see the tutorial you developed which is currently stored in css-materials and under the api materials; could take it and reframe as in-class exercise for today then assign a similar thing for the homework --> * What is Base R? * R Data Types and Data Structures * Using Base R commands to manipulate data structures (indexing, mathematical operations, etc.) <!-- R data structures: * Vectors: super important in R! * Lists * Matrices * Data frames * Arrays Content of today lecture is a little dry/technical but nonetheless is an exciting lecture. If we did it on day 1 probably not that exciting, but now you had R experience so things we have introduced separately or have encountered in your workflow, should click together! --> --- class: inverse, middle ## What is Base R? --- ### What is Base R? **When people say "base R" they generally mean...** * programming techniques (if-else, loops, functions) * data types (numeric, integer, character, logical, factor) * data structures (vectors, matrices, lists, data frames, etc.) * operations (indexing/subsetting, arithmetic operations) <!-- ON DATA STRUCTURES AND DATA TYPES In many cases, the choice of data structure in R determines the data type(s) that can be stored within it. Vectors and matrices typically hold elements of a single data type (e.g., numeric, character). Lists can hold elements of different data types within each list element. TODAY we do the first three points today on programming techniques we have seen some yesterday and we come back to functions on Monday --> --- ### Base R and the tidyverse **Base R** * R as programming language was developed in the early 1990s * includes key commands that are integrated with the `tidyverse` * uses CRAN (Comprehensive R Archive Network): central repository for R packages **Tidyverse** * collection of packages developed after the 2000s (`ggplot2` 2007, `dplyr` 2009, etc.) * powerful for data wrangling and analysis; generates clean and easy to read code * builds on base R and also uses CRAN -- **Our goal:** leverage the `tidyverse` for your daily tasks, but ensure you are familiar with the fundamentals of base R! --- class: inverse, middle ## R Data Types and Data Structures: Overview * We start with an overview of data types and data structures * After that we dive deeper into data structures, especially vectors and lists --- <!-- other data types: factor and date/time sapply(list(x, y, z, j), typeof) To check the data type use `typeof()` or ask directly using `is.numeric()`, `is.double()`, `is.integer()`, `is.character()`, etc. --> ### R Data Types: type of data that an object can hold **Main data types in R:** * Numeric * Double `y <- 4.1` * Integer `x <- 4L` * Character or String `z <- "4.1"` * Logical `j <- TRUE` -- Often the **data structure** we need to use determines the **data type(s)** that can be stored within that data structure... so let's talk about data structures! --- ### R Data Structures: the way data is stored **Main data structures in R:** * (Atomic) Vectors: super important in R! * Matrices * Lists * Data frames: in the `tidyverse` they are called tibbles There are others but those are the most common and important you want to remember. --- ### R Data Structures: organization These data structures can be organized by: * their dimensions: * 1d: vectors * 2d: matrices and data frames * nd: lists and arrays * the data type of their content: * homogeneous (all content must be of the same data type): vectors and matrices * heterogeneous (content can be of different data types): lists and data frames <!-- Let's now see how each of them look like according to this organization. Then we come back to each with more in depth explanation Next slides: review these data structures, how to subset them (taking elements from them), and arithmetic operations we can do with them! we review them all but particular importance on vectors In R everything is a vector... * Review the major types of vectors * Demonstrate how to subset vectors * Demonstrate vector recycling --> --- ### Vector: 1d, only homogenous elements (same type) Define a numeric vector ```r num_vec <- c(1:9) #num_vec <- seq(from = 1, to = 9) #num_vec <- rep(1:9, each = 1) #num_vec <- vector(mode = "numeric", length = 9) ``` Check its data type with `class()` or `typeof()` Check its dimensions with `dim()` R internal storage vs. how we see and understand these objects <!-- MAKE TWO POINTS 1. difference btw class() and typeof(): first more abstract second internal R storage; first is more used 2. why dim() is NULL? When an object has no dimensions set, the dim attribute is null. we understand and talk about vectors as 1d but for the internal storage in R this is null because there are not dimensions set or that you can change is always 1d. If you set the dimensions of an atomic vector, it becomes a matrix and so the dim attribute will no longer be null typeof(): internal storage mode of the object, the underlying data type of the object class(): the higher-level abstraction (e.g. df type is a list, but class is df) that is the class the object belong to --> --- ### Matrix: 2d, only homogeneous elements (same type) ```r m <- matrix(1:15, nrow = 3, ncol = 5, byrow = TRUE) m ``` ``` ## [,1] [,2] [,3] [,4] [,5] ## [1,] 1 2 3 4 5 ## [2,] 6 7 8 9 10 ## [3,] 11 12 13 14 15 ``` Check its data type with `class()` or `typeof()` Check its dimensions with `dim()` --- ### List: nd, can have heterogeneous elements (different types) ```r library(tidyverse) lst <- list( num_vec = c(1:9), mat = matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE), another_num_vec = c(1,2,4), char_vec = c("Sabrina", "Mónica"), d = tibble(var_1 = c(1:4), var_2 = c(2:5)) ) ``` Check its data type with `class()` or `typeof()` Check its dimensions with `dim()` and compare with `length()` Things are a bit more complicated when accessing elements with list. More on this in a bit! <!-- Things can get a bit more complicated when accessing elements with lists. Observe this code: ```r dim(lst) # no fixed dimension! ``` ``` ## NULL ``` ```r dim(lst$mat) ``` ``` ## [1] 3 3 ``` ```r dim(lst[[2]]) ``` ``` ## [1] 3 3 ``` Practice: code to check the dimension of the dataframe `d` in this list? dim(lst$d) dim(lst[[5]]) We are using `$` symbol from base R to access elements of an object; it works with every object, but commonly used with lists. Same for `[[]]` --> --- ### Data frame: 2nd, can have heterogeneous elements ```r df <- data.frame( id = 1:3, name = c("Sabrina", "Mónica", "Lucas"), age = c(15, 17, 20) ) df ``` ``` ## id name age ## 1 1 Sabrina 15 ## 2 2 Mónica 17 ## 3 3 Lucas 20 ``` ```r dim(df) ``` ``` ## [1] 3 3 ``` <!-- NOTES FOR NEXT TIME: add here the final slides about dataframe, tibble, etc. or remove them altogheter --> --- ### When to use each data structure? The answer is, "It depends!" and it take time. Here are some guidelines to help you choose the appropriate data structure at this point: * Vectors and Data Frames: most commonly used data structures, both in this course and beyond. * Matrices: Primarily used for matrix algebra operations. * Lists: Although data frames are generally preferred, lists can be useful when you need flexibility and need to store heterogeneous content. --- class: inverse, middle ## Review and expand what we learned so far Download today's in class materials and open the `intro-base-r.R` script --- class: inverse, middle ## A closer look at Data Structures Now that we've seen what each of these data structures looks like, let's delve deeper into some of them: **vectors and lists** --- ### R is a vector-based program So far, we have been using predominantly data frames (technically tibbles!), which are very common when working with social science data. However, data frames are not the most fundamental type of object in R: **vectors are the ultimate building blocks of objects in R...** <!-- * a matrix is made of vectors * a list is made of vectors (a list is still a vector in R but not an atomic one) * a data frame is made of vectors --> --- ### Atomic vectors: logical, integer, double, character When people say "vectors" they usually imply "atomic vectors": the building blocks of R! * numeric vector * integer vector * double vector * character vector * logical vector **All values in an atomic vector must to be of the same type**. --- ### Types of atomic vectors: numeric **Numeric**: can be integer or double (default) ```r integer_vector <- c(1, 5, 3, 4, 12423) double_vector <- c(4.2, 4, 6, 53.2) is.vector(integer_vector) ## [1] TRUE is.atomic(integer_vector) ## [1] TRUE is.integer(integer_vector) ## [1] FALSE # notice the last line of code gives FALSE, why so? # R reads all numeric vectors as double, use L to force it as integer integer_vector <- c(1L, 5L, 3L, 4L, 12423L) is.integer(integer_vector) ## [1] TRUE ``` <!-- NOTES FOR NEXT TIME: add is.integer(integer_vector) which shows FALSE, then do str(integer_vector), then redefine the vector like that integer_vector <- c(1L, 5L, 3L, 4L, 12423L) and try again is.integer(integer_vector) --> --- ### Types of atomic vectors: character **Character**: note you can use single or double quotations, you just need to be consistent ```r character_vector <- c("Scary", "'1,2,3 ready!'", "Halloween", '10/31/2022') is.vector(character_vector) ## [1] TRUE is.atomic(character_vector) ## [1] TRUE is.character(character_vector) ## [1] TRUE ``` --- ### Types of atomic vectors: logical **Logical**: you use it every time you use a conditional test or operation (e.g., when you filter a data frame) ```r logical_vector <- c(TRUE, TRUE, FALSE, TRUE) is.vector(logical_vector) ## [1] TRUE is.atomic(logical_vector) ## [1] TRUE is.logical(logical_vector) ## [1] TRUE ``` <!-- typeof(): internal storage mode of the object, the underlying data type of the object class(): the higher-level abstraction (e.g. df type is a list, but class is df) that is the class the object belong to --> --- ### Types of atomic vectors: logical Example of filtering rows in a dataframe using logical vectors! ```r library(tidyverse) library(palmerpenguins) data("penguins") # with dplyr filter(): does the whole operation for us and gives back a dataframe filtered_penguins <- penguins %>% filter(body_mass_g > 4000) head(filtered_penguins) ``` ``` ## # A tibble: 6 × 8 ## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## <fct> <fct> <dbl> <dbl> <int> <int> ## 1 Adelie Torgersen 39.2 19.6 195 4675 ## 2 Adelie Torgersen 42 20.2 190 4250 ## 3 Adelie Torgersen 34.6 21.1 198 4400 ## 4 Adelie Torgersen 42.5 20.7 197 4500 ## 5 Adelie Torgersen 46 21.5 194 4200 ## 6 Adelie Dream 39.2 21.1 196 4150 ## # ℹ 2 more variables: sex <fct>, year <int> ``` --- We can do the same thing with base R to demonstrate that under the hood (also in the `tidyverse`), R defines a logical vector and applies it to the `penguins` dataframe: ```r # use base R with $ to get the column of interest: gives a logical vector filter_vector <- penguins$body_mass_g > 4000 is.vector(filter_vector) ``` ``` ## [1] TRUE ``` ```r class(filter_vector) ``` ``` ## [1] "logical" ``` ```r # apply this vector to the dataframe using base R and get back a filter dataframe filtered_penguins_2 <- penguins[filter_vector, ] # df[rows, columns] #head(filtered_penguins_2) ``` --- ### Types of atomic vectors: logical Another important thing to remember about logical vectors is that they can take two values: TRUE or FALSE. When you do operations with them TRUE is evaluated as 1 and FALSE is evaluated as 0 ```r # check logical_vector ``` ``` ## [1] TRUE TRUE FALSE TRUE ``` ```r mean(logical_vector, rm.na = TRUE) ``` ``` ## [1] 0.75 ``` ```r sum(logical_vector) ``` ``` ## [1] 3 ``` --- ### Non-Atomic Vectors Let's define a vector with heterogeneous elements and check its data type: ```r mix_vec <- c(1, "two", 3.4) class(mix_vec) ``` ``` ## [1] "character" ``` -- **What happened?** We are back to an atomic vector! **Why?** If we define a vector, R will coerce its elements to a common data type to maintain homogeneity! This is because vectors in R store only elements of the same type (aka are meant to be atomic). For example: * if you mix numeric and character values: numeric values coerced to character * if you mix logical and numeric values: logical values coerced to numeric --- ### Non-Atomic Vectors So what's an example of a non-atomic vector? A list! ```r non_atomic <- list( a = c(1, 2, 3), b = c("zach", "aidan")) is.atomic(non_atomic) ``` ``` ## [1] FALSE ``` ```r is.list(non_atomic) ``` ``` ## [1] TRUE ``` ```r is.vector(non_atomic) ``` ``` ## [1] TRUE ``` --- ### A particular atomic vector: scalar In math a scalar is defined as a single real number. R has no concept of a scalar: **in R, a scalar is simply a vector of length 1** ``` # set up a vector x of length 10 (x <- sample(10)) # add 100 to x using the long way x + c(100, 100, 100, 100, 100, 100, 100, 100, 100, 100) # add 100 to x using the "R" way: vector recycling! x + 100 ``` The second way to add the numbers is more efficient but can also be dangerous... --- ### Vector Recycling When two vectors are involved in an operation, **R repeats the elements of the shorter vector to match the length of the longer vector**. For example, let's define two numeric vectors `x1` and `x2`: ```r # x1 is sequence of numbers from 1 to 2 (x1 <- seq(from = 1, to = 2)) ``` ``` ## [1] 1 2 ``` ```r # x2 is a sequence of numbers from 1 to 10 (x2 <- seq(from = 1, to = 10)) ``` ``` ## [1] 1 2 3 4 5 6 7 8 9 10 ``` --- ### Vector Recycling If we add `x1` and `x2` together, R will do it, but the result might not be what we expect: ```r (x1 + x2) ``` ``` ## [1] 2 4 4 6 6 8 8 10 10 12 ``` The shorter vector `x1` is duplicated five times in order to match the length of the longer vector `x2.` The same behavior is for other operations like subtraction, multiplication, logical comparison, etc. --- ### Vector Recycling ```r (x1 - x2) ``` ``` ## [1] 0 0 -2 -2 -4 -4 -6 -6 -8 -8 ``` ```r (x1 * x2) ``` ``` ## [1] 1 4 3 8 5 12 7 16 9 20 ``` --- ### Vector Recycling This behavior is called **vector recycling** and happens automatically in R. You need check if this is what you intended to do. If not, extend the length of the shorter vector manually first, then add them up. ```r (x1 <- c(1, 2, rep(0, 7))) ``` ``` ## [1] 1 2 0 0 0 0 0 0 0 ``` ```r (x1 + x2) ``` ``` ## [1] 2 4 3 4 5 6 7 8 9 11 ``` **Note that if the shorter vector is not a multiple of the longer one, R will print a warning message!** --- ### Subsetting vectors: slicing To subset a vector we use the index location of its elements: ```r x <- c("one", "two", "three", "four", "five") ``` ``` # keep the first element x[1] # keep the first through third elements x[c(1, 2, 3)] # long way x[1:3] # shorter x[c(seq(1, 3))] # sequence x[-c(4:5)] # negative indexing (values that you do not want to keep) x[-c(4,5)] # negative indexing x[c(-1,2,3)] # error! do not mix negative and positive subscripts ``` --- ### Lists Lists are another type of vector, but they are non-atomic vector. They differ from atomic vectors in two main ways: 1. **store heterogeneous elements** vs. atomic vector 2. **are structured differently** and are created with the `list()` function, not with the `c()` function ```r x <- list(1, 2, 3) ``` --- ### Lists structure List objects are structured as a list of **independent elements** ```r x <- list(1, 2, 3) str(x) ``` ``` ## List of 3 ## $ : num 1 ## $ : num 2 ## $ : num 3 ``` Here we have a list of length 3, and each of the elements of this list is a numeric atomic vector of length 1. --- ### Lists elements Unlike atomic vectors, lists can contain **multiple data types**, and we can also name each of them: ```r x_named <- list(a = "abc", b = 2, c = c(1, 2, 3)) str(x_named) ``` ``` ## List of 3 ## $ a: chr "abc" ## $ b: num 2 ## $ c: num [1:3] 1 2 3 ``` Here we have a list of length 3, and each of the elements of this list is a different object: we have a character vector of length 1, one numeric vector of length 1, and one numeric vector of length 3. <!-- NOTES FOR NEXT TIME: this is a repetition from previous slides, could either remove or say that --> --- ### Nested lists <!-- NOTES FOR NEXT TIME: add Things can get complex with list... you can also store lists inside a list: **nested list structure**. --> You can also store lists inside a list: **nested list structure**. In this object `z` we have two lists: ```r z <- list(list(1, 2), list(3, 4)) str(z) ``` ``` ## List of 2 ## $ :List of 2 ## ..$ : num 1 ## ..$ : num 2 ## $ :List of 2 ## ..$ : num 3 ## ..$ : num 4 ``` When you interact with API to get data from the web, you might get this type of nested list as output. --- ### Subsetting lists Lists have a more complex structure than vectors, thus subsetting them (e.g., access their elements) also requires more attention The main thing to remember is that you can use both `[]` and `[[]]` to subset lists (vs. you use only `[]` to subset vectors). In lists: * `[]` extracts a sublist (still a list) * `[[]]` extracts the element directly (not a list) **Let's move to today's class materials to get some practice!** --- class: inverse, middle ## Practice subsetting vectors and lists Download today's in-class materials from the website --- ### Dataframes Recall the dataframe at the beginning of the slides ```r df ``` ``` ## id name age ## 1 1 Sabrina 15 ## 2 2 Mónica 17 ## 3 3 Lucas 20 ``` Similarity of data frames with lists and vectors: * a data frame as a whole contains multiple elements of different data types, but all of the same length: so a data frame as a whole can be seen as a list * a data frame is made of individual columns which are all vectors In short: a data frame is a special type of list where each element (or column) of the list is a vector of a specific data type (numeric, character, etc.). This structure allows you to perform vectorized operations and apply functions to each column independently. --- <!-- NOTES FOR NEXT TIME: remove all these slides from here and go directly to practice exercises; take these slides and move them at the top, after you introdue df! --> ### Dataframe show the list-like structure You can access columns (vectors) of a data frame using list indexing, such as `df$column_name` or `df[["column_name"]]` ```r # directly referencing a specific column df$name ``` ``` ## [1] "Sabrina" "Mónica" "Lucas" ``` ```r # access a column by its name stored in a variable df[["name"]] ``` ``` ## [1] "Sabrina" "Mónica" "Lucas" ``` <!-- add one slide or so on this topic (AI answer): Dataframe Column: Vector (e.g., numeric, character, factor, etc.) Dataframe Structure: List of vectors (each column being a separate vector) The [[ ]] syntax accesses the dataframe column as a vector (simplifying the structure down to the column’s vector). By contrast, single brackets df["column_name"] return a subset of the dataframe with the structure intact as a 1-column dataframe. The [[ ]] syntax, therefore, is useful for accessing columns directly as vectors when you want to work with just the values in that column, without keeping the dataframe structure. --> --- ### Tibble: same as data frame! ```r df <- tibble( id = 1:3, name = c("Sabrina", "Mónica", "Lucas"), age = c(25, 30, 35) ) dim(df) ``` ``` ## [1] 3 3 ``` Small differences (e.g., tibbles are from the `tidyverse` and are more memory efficient)