MACS 30500 LECTURE 8

---

## Agenda

<!-- TO IMPROVE FOR NEXT TIME NOTES FALL 2024

Add more exercises on base R and tidyverse, how to combine them and how subsettings work in each

Use the same question for HW4 Question 3 and see the tutorial you developed which is currently stored in css-materials and under the api materials; could take it and reframe as in-class exercise for today then assign a similar thing for the homework
-->

* What is Base R?

* R Data Types and Data Structures

* Using Base R commands to manipulate data structures (indexing, mathematical operations, etc.)

---

##  What is Base R?

---

### What is Base R?

**When people say "base R" they generally mean...**

* programming techniques (if-else, loops, functions)

* data types (numeric, integer, character, logical, factor)

* data structures (vectors, matrices, lists, data frames, etc.)

* operations (indexing/subsetting, arithmetic operations)

<!-- 
ON DATA STRUCTURES AND DATA TYPES
In many cases, the choice of data structure in R determines the data type(s) that can be stored within it. 
Vectors and matrices typically hold elements of a single data type (e.g., numeric, character).
Lists can hold elements of different data types within each list element.

TODAY
we do the first three points today
on programming techniques we have seen some yesterday and we come back to functions on Monday  
-->

---

### Base R and the tidyverse

**Base R** 
* R as a programming language was developed in the early 1990s 
* includes key commands that are integrated with the `tidyverse`
* uses CRAN (Comprehensive R Archive Network): central repository for R packages

**Tidyverse**
* collection of packages developed from the mid-2000s onward (`ggplot2` 2007, `dplyr` 2014, etc.) 
* powerful for data wrangling and analysis; generates clean and easy to read code
* builds on base R and also uses CRAN

**Our goal:** leverage the `tidyverse` for your daily tasks, but ensure you are familiar with the fundamentals of base R!

---

## R Data Types and Data Structures: Overview

* We start with an overview of data types and data structures 
* After that we dive deeper into data structures, especially vectors and lists

---

<!-- 
other data types: factor and date/time 
sapply(list(x, y, z, j), typeof)

To check the data type use `typeof()` or ask directly using `is.numeric()`, `is.double()`, `is.integer()`, `is.character()`, etc.
-->

### R Data Types: type of data that an object can hold

**Main data types in R:**

* Numeric 
  * Double `y <- 4.1` 
  * Integer `x <- 4L`
  
* Character or String  `z <- "4.1"`

* Logical `j <- TRUE`

Often the **data structure** we need to use determines the **data type(s)** that can be stored within that data structure... so let's talk about data structures!
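
A quick way to confirm these types is `typeof()`. The sketch below reuses the four objects from the bullets above; the `is.numeric()` checks are added only for illustration.

```r
# the four objects from the bullets above
y <- 4.1     # double
x <- 4L      # integer (note the L suffix)
z <- "4.1"   # character, even though it looks like a number
j <- TRUE    # logical

# typeof() reports how R stores each object
types <- sapply(list(y = y, x = x, z = z, j = j), typeof)
types

# is.*() functions ask about one specific type
is.numeric(x)    # TRUE: integers count as numeric
is.numeric(z)    # FALSE: "4.1" is text
```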

---

### R Data Structures: the way data is stored

**Main data structures in R:**

* (Atomic) Vectors: super important in R!

* Matrices

* Lists

* Data frames: in the `tidyverse` they are called tibbles

There are others, but these are the most common and important ones to remember.

---

### R Data Structures: organization

These data structures can be organized by:

* their dimensions: 
  * 1d: vectors
  * 2d: matrices and data frames
  * nd: lists and arrays
  
* the data type of their content:
  * homogeneous (all content must be of the same data type): vectors and matrices
  * heterogeneous (content can be of different data types): lists and data frames

<!--
Let's now see how each of them look like according to this organization. Then we come back to each with more in depth explanation

Next slides: review these data structures, how to subset them (taking elements from them), and arithmetic operations we can do with them!

we review them all but particular importance on vectors 
In R everything is a vector...
* Review the major types of vectors
* Demonstrate how to subset vectors
* Demonstrate vector recycling
-->

---

### Vector: 1d, only homogeneous elements (same type)

Define a numeric vector

```r
num_vec <- c(1:9)

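# equivalent alternatives:
# num_vec <- seq(from = 1, to = 9)
# num_vec <- rep(1:9, each = 1)
# not equivalent: vector() creates nine zeros, not the numbers 1 to 9
# num_vec <- vector(mode = "numeric", length = 9)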
```

Check its data type with `class()` or `typeof()`

Check its dimensions with `dim()`

R internal storage vs. how we see and understand these objects
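
The checks mentioned on this slide can be run as follows. This is a minimal sketch on the same `num_vec`; `length()` is added here as the usual way to count the elements of a vector.

```r
num_vec <- c(1:9)

typeof(num_vec)   # "integer": how R stores the values internally
class(num_vec)    # "integer": the higher-level label most functions use
dim(num_vec)      # NULL: an atomic vector has no dim attribute
length(num_vec)   # 9: use length() to count the elements of a vector
```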

<!-- 
MAKE TWO POINTS

1. difference btw class() and typeof(): first more abstract second internal R storage; first is more used

2. why dim() is NULL? When an object has no dimensions set, the dim attribute is null. we understand and talk about vectors as 1d 
but for the internal storage in R this is null because there are not dimensions set or that you can change is always 1d. If you set the dimensions of an atomic vector, it becomes a matrix and so the dim attribute will no longer be null

typeof(): internal storage mode of the object, the underlying data type of the object
class(): the higher-level abstraction (e.g. df type is a list, but class is df) that is the class the object belong to
-->

---

###  Matrix: 2d, only homogeneous elements (same type)

```r
m <- matrix(1:15, nrow = 3, ncol = 5, byrow = TRUE)
m
```

```
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15
```

Check its data type with `class()` or `typeof()`

Check its dimensions with `dim()`
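
A short sketch of these checks on the same matrix `m`, followed by two basic subsetting calls with the `m[row, column]` form:

```r
m <- matrix(1:15, nrow = 3, ncol = 5, byrow = TRUE)

typeof(m)      # "integer": the elements are stored as integers
is.matrix(m)   # TRUE
dim(m)         # 3 5: rows first, then columns

# subsetting uses m[row, column]
m[2, 3]        # 8: second row, third column
m[, 1]         # 1 6 11: the whole first column, returned as a vector
```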

---

### List: nd, can have heterogeneous elements (different types)

```r
library(tidyverse)

lst <- list(
  num_vec = c(1:9),
  mat = matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE),
  another_num_vec = c(1,2,4),
  char_vec = c("Sabrina", "Mónica"),
  d = tibble(var_1 = c(1:4),
              var_2 = c(2:5))
)
```

Check its data type with `class()` or `typeof()`

Check its dimensions with `dim()` and compare with `length()`

Accessing elements of a list is a bit more complicated. More on this in a bit!

<!--
Things can get a bit more complicated when accessing elements with lists. Observe this code:

```r
dim(lst) # no fixed dimension! 
```

```
## NULL
```

```r
dim(lst$mat)
```

```
## [1] 3 3
```

```r
dim(lst[[2]])
```

```
## [1] 3 3
```

Practice: code to check the dimension of the dataframe `d` in this list?
dim(lst$d)
dim(lst[[5]])

We are using `$` symbol from base R to access elements of an object; it works with every object, but commonly used with lists. Same for `[[]]` 
-->

---

### Data frame: 2d, can have heterogeneous elements

```r
df <- data.frame(
  id = 1:3,
  name = c("Sabrina", "Mónica", "Lucas"),
  age = c(15, 17, 20)
)
df
```

```
##   id    name age
## 1  1 Sabrina  15
## 2  2  Mónica  17
## 3  3   Lucas  20
```

```r
dim(df)
```

```
## [1] 3 3
```

---

### When to use each data structure?

The answer is, "It depends!" and learning to choose takes time. Here are some guidelines to help you choose the appropriate data structure at this point:

* Vectors and Data Frames: most commonly used data structures, both in this course and beyond.

* Matrices: Primarily used for matrix algebra operations.

* Lists: Although data frames are generally preferred, lists can be useful when you need flexibility and need to store heterogeneous content.

---

Download today's in-class materials and open the `intro-base-r.R` script

---

Now that we've seen what each of these data structures looks like, let's delve deeper into some of them: **vectors and lists**

---

### R is a vector-based language

So far, we have been using predominantly data frames (technically tibbles!), which are very common when working with social science data.

However, data frames are not the most fundamental type of object in R: **vectors are the ultimate building blocks of objects in R...**

---

### Atomic vectors: logical, integer, double, character

When people say "vectors" they usually mean "atomic vectors": the building blocks of R!

* numeric vector 
  * integer vector
  * double vector
  
* character vector

* logical vector

**All values in an atomic vector must be of the same type**.

---

### Types of atomic vectors: numeric

**Numeric**: can be integer or double (default)

```r
integer_vector <- c(1, 5, 3, 4, 12423)

double_vector <- c(4.2, 4, 6, 53.2)

is.vector(integer_vector)
## [1] TRUE
is.atomic(integer_vector)
## [1] TRUE
is.integer(integer_vector) 
## [1] FALSE

# notice the last line of code gives FALSE. Why?
# R stores plain numbers as double by default; add the L suffix to create integers
integer_vector <- c(1L, 5L, 3L, 4L, 12423L)
is.integer(integer_vector)
## [1] TRUE
```
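
A few more ways to see the integer/double distinction. This is a sketch for illustration; `converted` is a name introduced here, not part of the class script.

```r
typeof(c(1, 5, 3))       # "double": plain numbers default to double
typeof(c(1L, 5L, 3L))    # "integer": the L suffix requests integers
typeof(1:3)              # "integer": the colon operator returns integers

# convert an existing double vector to integer
converted <- as.integer(c(1, 5, 3))
is.integer(converted)    # TRUE

# the values compare as equal even though the storage types differ
1L == 1                  # TRUE
identical(1L, 1)         # FALSE: identical() also compares the type
```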

---

### Types of atomic vectors: character

**Character**: you can use single or double quotation marks, as long as each string opens and closes with the same kind

```r
character_vector <- c("Scary", "'1,2,3 ready!'", "Halloween", '10/31/2022')

is.vector(character_vector)
## [1] TRUE
is.atomic(character_vector)
## [1] TRUE
is.character(character_vector)
## [1] TRUE
```

---

### Types of atomic vectors: logical

**Logical**: you use it every time you use a conditional test or operation (e.g., when you filter a data frame)

```r
logical_vector <- c(TRUE, TRUE, FALSE, TRUE)

is.vector(logical_vector)
## [1] TRUE
is.atomic(logical_vector)
## [1] TRUE
is.logical(logical_vector)
## [1] TRUE
```

---

### Types of atomic vectors: logical

Example of filtering rows in a dataframe using logical vectors!

```r
library(tidyverse)
library(palmerpenguins)
data("penguins")

# with dplyr filter(): does the whole operation for us and gives back a dataframe 
filtered_penguins <- penguins %>% filter(body_mass_g > 4000)
head(filtered_penguins)
```

```
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.2          19.6               195        4675
## 2 Adelie  Torgersen           42            20.2               190        4250
## 3 Adelie  Torgersen           34.6          21.1               198        4400
## 4 Adelie  Torgersen           42.5          20.7               197        4500
## 5 Adelie  Torgersen           46            21.5               194        4200
## 6 Adelie  Dream               39.2          21.1               196        4150
## # ℹ 2 more variables: sex <fct>, year <int>
```

---

We can do the same thing with base R to demonstrate that under the hood (also in the `tidyverse`), R defines a logical vector and applies it to the `penguins` dataframe:

```r
# use $ to extract the column of interest, then compare it to 4000: this gives a logical vector
filter_vector <- penguins$body_mass_g > 4000
is.vector(filter_vector)
```

```
## [1] TRUE
```

```r
class(filter_vector) 
```

```
## [1] "logical"
```

```r
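# apply this vector to the data frame using base R and get back a filtered data frame
# note: any rows where body_mass_g is NA come back as all-NA rows here, while filter() drops them
filtered_penguins_2 <- penguins[filter_vector, ]  # df[rows, columns]
#head(filtered_penguins_2)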
```
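
One difference between the two approaches is how missing values are handled. The sketch below uses a small made-up data frame (`toy`, with a hypothetical `mass` column) rather than `penguins`, so it runs without extra packages.

```r
# toy data: one missing value in the column we filter on
toy <- data.frame(id = 1:4, mass = c(3500, 4200, NA, 4800))

keep <- toy$mass > 4000
keep                          # FALSE TRUE NA TRUE

with_na <- toy[keep, ]        # the NA produces an extra row filled with NA
no_na <- toy[which(keep), ]   # which() keeps only the positions that are TRUE

nrow(with_na)   # 3
nrow(no_na)     # 2
```

Wrapping the logical vector in `which()` is one common way to get the same rows that `filter()` would return.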

---

### Types of atomic vectors: logical

Another important thing to remember about logical vectors is that they take two values, `TRUE` or `FALSE` (plus `NA` for missing values). In arithmetic operations, `TRUE` is evaluated as 1 and `FALSE` as 0:

```r
# check
logical_vector
```

```
## [1]  TRUE  TRUE FALSE  TRUE
```

```r
mean(logical_vector, na.rm = TRUE)
```

```
## [1] 0.75
```

```r
sum(logical_vector)
```

```
## [1] 3
```
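
Because `TRUE` counts as 1, `sum()` and `mean()` on a comparison give a count and a proportion. A small sketch with made-up values (`heights` and `tall` are hypothetical names), including a missing value to show why `na.rm = TRUE` matters:

```r
heights <- c(150, 172, NA, 181, 165)
tall <- heights > 170            # FALSE TRUE NA TRUE FALSE

sum(tall)                        # NA: the missing value propagates
sum(tall, na.rm = TRUE)          # 2: how many are taller than 170
mean(tall, na.rm = TRUE)         # 0.5: share of non-missing values taller than 170
```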

---

### Non-Atomic Vectors

Let's define a vector with heterogeneous elements and check its data type:

```r
mix_vec <- c(1, "two", 3.4)
class(mix_vec)
```

```
## [1] "character"
```

**What happened?** We are back to an atomic vector!

**Why?** When we build a vector with `c()`, R coerces its elements to a common data type to keep the vector homogeneous, because atomic vectors can only store elements of one type. For example: 
* if you mix numeric and character values: numeric values coerced to character
* if you mix logical and numeric values: logical values coerced to numeric
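
The coercion rules can be checked directly with `typeof()`. This is a minimal sketch; `back` is a name introduced here for illustration.

```r
typeof(c(TRUE, 2L))      # "integer": logical is coerced to integer
typeof(c(2L, 3.5))       # "double": integer is coerced to double
typeof(c(3.5, "two"))    # "character": numbers are coerced to character

c(TRUE, FALSE, 5)        # 1 0 5

# coercion is not undone automatically: converting back can produce NA
mix_vec <- c(1, "two", 3.4)
back <- suppressWarnings(as.numeric(mix_vec))
back                     # 1 NA 3.4
```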

---

### Non-Atomic Vectors

So what's an example of a non-atomic vector? A list!

```r
non_atomic <- list(
  a = c(1, 2, 3),             
  b = c("zach", "aidan"))

is.atomic(non_atomic)  
```

```
## [1] FALSE
```

```r
is.list(non_atomic)
```

```
## [1] TRUE
```

```r
is.vector(non_atomic)
```

```
## [1] TRUE
```

---

### A particular atomic vector: scalar

In math a scalar is defined as a single real number. R has no concept of a scalar: **in R, a scalar is simply a vector of length 1**

```r
# set up a vector x of length 10
(x <- sample(10))

# add 100 to x using the long way
x + c(100, 100, 100, 100, 100, 100, 100, 100, 100, 100)

# add 100 to x using the "R" way: vector recycling!
x + 100
```

The second way to add the numbers is more efficient but can also be dangerous...
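
A short check of the claim that a single number is a vector of length 1. The sketch uses `1:10` in place of `sample(10)` so the result is reproducible.

```r
length(100)       # 1: a single number is a vector of length 1
is.vector(100)    # TRUE

x <- 1:10
long_way <- x + rep(100, length(x))   # spell out all ten values
r_way <- x + 100                      # let R recycle the single value

identical(long_way, r_way)            # TRUE
```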

---

### Vector Recycling

When two vectors are involved in an operation, **R repeats the elements of the shorter vector to match the length of the longer vector**.

For example, let's define two numeric vectors `x1` and `x2`:

```r
# x1 is sequence of numbers from 1 to 2
(x1 <- seq(from = 1, to = 2))
```

```
## [1] 1 2
```

```r
# x2 is a sequence of numbers from 1 to 10
(x2 <- seq(from = 1, to = 10))
```

```
##  [1]  1  2  3  4  5  6  7  8  9 10
```

---

### Vector Recycling

If we add `x1` and `x2` together, R will do it, but the result might not be what we expect:

```r
(x1 + x2)
```

```
##  [1]  2  4  4  6  6  8  8 10 10 12
```

The shorter vector `x1` is repeated five times in order to match the length of the longer vector `x2`.

The same behavior applies to other operations like subtraction, multiplication, logical comparison, etc.

---

### Vector Recycling

```r
(x1 - x2)
```

```
##  [1]  0  0 -2 -2 -4 -4 -6 -6 -8 -8
```

```r
(x1 * x2)
```

```
##  [1]  1  4  3  8  5 12  7 16  9 20
```

---

### Vector Recycling

This behavior is called **vector recycling** and happens automatically in R. You need to check whether this is what you intended to do. If not, extend the length of the shorter vector manually first, then add them up.

```r
(x1 <- c(1, 2, rep(0, 8)))
```

```
##  [1] 1 2 0 0 0 0 0 0 0 0
```

```r
(x1 + x2)
```

```
##  [1]  2  4  3  4  5  6  7  8  9 10
```

**Note that if the length of the longer vector is not a multiple of the length of the shorter one, R will print a warning message!**
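
The warning can be seen with two vectors of lengths 3 and 10. This is a sketch; `short`, `long`, `msg`, and `result` are names introduced here for illustration.

```r
short <- c(1, 2, 3)
long <- 1:10

# capture the warning message instead of letting it print
msg <- tryCatch(short + long, warning = function(w) conditionMessage(w))
msg

# the result is still computed; the last element reuses short[1]
result <- suppressWarnings(short + long)
result    # 2 4 6 5 7 9 8 10 12 11
```

R still returns a result here, so the warning is easy to miss in a long script.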

---

### Subsetting vectors: slicing

To subset a vector we use the index location of its elements:

```r
x <- c("one", "two", "three", "four", "five")
```

```r
# keep the first element
x[1]

# keep the first through third elements
x[c(1, 2, 3)]   # long way
x[1:3]          # shorter
x[c(seq(1, 3))] # sequence 
x[-c(4:5)]      # negative indexing (drops the elements you do not want to keep)
x[-c(4, 5)]     # negative indexing

# x[c(-1, 2, 3)]  # error! you cannot mix negative and positive subscripts
```
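
Besides positions, vectors can also be subset with a logical vector or, if the elements are named, by name. A small sketch with a made-up named vector (`scores` is hypothetical):

```r
scores <- c(a = 10, b = 25, c = 40, d = 5)

# logical subsetting: keep the elements where the condition is TRUE
scores[scores > 20]                    # b = 25, c = 40
scores[c(TRUE, FALSE, TRUE, FALSE)]    # a = 10, c = 40

# subsetting by name
scores[c("a", "d")]                    # a = 10, d = 5

# which() turns a logical vector into positions
which(scores > 20)                     # positions 2 and 3
```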

---

### Lists

Lists are another type of vector, but they are non-atomic vectors. They differ from atomic vectors in two main ways:

1. they **store heterogeneous elements**, whereas atomic vectors store elements of a single type
2. they **are structured differently** and are created with the `list()` function, not with the `c()` function

```r
x <- list(1, 2, 3)
```

---

### List structure

List objects are structured as a list of **independent elements**

```r
x <- list(1, 2, 3)
str(x)
```

```
## List of 3
##  $ : num 1
##  $ : num 2
##  $ : num 3
```

Here we have a list of length 3, and each of the elements of this list is a numeric atomic vector of length 1.

---

### List elements

Unlike atomic vectors, lists can contain **multiple data types**, and we can also name each of them:

```r
x_named <- list(a = "abc", b = 2, c = c(1, 2, 3))
str(x_named)
```

```
## List of 3
##  $ a: chr "abc"
##  $ b: num 2
##  $ c: num [1:3] 1 2 3
```

Here we have a list of length 3, and each of the elements of this list is a different object: we have a character vector of length 1, one numeric vector of length 1, and one numeric vector of length 3.

---

### Nested lists

You can also store lists inside a list: **nested list structure**.

In this object `z` we have two lists:

```r
z <- list(list(1, 2), list(3, 4))
str(z)
```

```
## List of 2
##  $ :List of 2
##   ..$ : num 1
##   ..$ : num 2
##  $ :List of 2
##   ..$ : num 3
##   ..$ : num 4
```

When you interact with an API to get data from the web, you might get this type of nested list as output.
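
To reach a value inside a nested list, chain `[[ ]]` calls, one per level. A minimal sketch on the same `z`:

```r
z <- list(list(1, 2), list(3, 4))

z[[1]]           # the first inner list
z[[1]][[2]]      # 2: second element of the first inner list
z[[2]][[1]]      # 3: first element of the second inner list

# flatten the whole structure into one atomic vector
unlist(z)        # 1 2 3 4
```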

---

### Subsetting lists

Lists have a more complex structure than atomic vectors, so subsetting them (i.e., accessing their elements) also requires more attention.

The main thing to remember is that you can use both `[]` and `[[]]` to subset lists (whereas atomic vectors are typically subset with `[]` only).

In lists:

* `[]` extracts a sublist (still a list)
* `[[]]` extracts the element directly (not a list)
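
The difference between the two operators can be sketched with the named list from the earlier slide (`sub` and `elem` are names introduced here for illustration):

```r
x_named <- list(a = "abc", b = 2, c = c(1, 2, 3))

sub <- x_named["c"]       # single brackets: a list of length 1
class(sub)                # "list"

elem <- x_named[["c"]]    # double brackets: the numeric vector itself
class(elem)               # "numeric"

x_named$c                 # same result as x_named[["c"]]
x_named[["c"]][2]         # 2: subset the extracted vector
x_named[c("a", "b")]      # single brackets can return several elements
```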

**Let's move to today's class materials to get some practice!**

---

## Practice subsetting vectors and lists

Download today's in-class materials from the website

---

### Dataframes

Recall the dataframe at the beginning of the slides

```r
df
```

```
##   id    name age
## 1  1 Sabrina  15
## 2  2  Mónica  17
## 3  3   Lucas  20
```

Similarity of data frames with lists and vectors:

* a data frame as a whole contains multiple elements of different data types, but all of the same length: so a data frame as a whole can be seen as a list
* a data frame is made of individual columns which are all vectors

In short: a data frame is a special type of list where each element (or column) of the list is a vector of a specific data type (numeric, character, etc.). This structure allows you to perform vectorized operations and apply functions to each column independently.
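
The list-like nature of a data frame can be verified directly. The sketch below rebuilds the same `df`; `stringsAsFactors = FALSE` is set explicitly as an assumption so that `name` stays a character vector on R versions before 4.0.

```r
df <- data.frame(
  id = 1:3,
  name = c("Sabrina", "Mónica", "Lucas"),
  age = c(15, 17, 20),
  stringsAsFactors = FALSE
)

typeof(df)    # "list": the internal storage is a list
class(df)     # "data.frame": the higher-level class
length(df)    # 3: the number of columns, as for any list

# each column is an atomic vector with its own type
sapply(df, typeof)

# vectorized operation on one column
df$age * 2    # 30 34 40
```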

---

### Data frames show a list-like structure

You can access columns (vectors) of a data frame using list indexing, such as `df$column_name` or `df[["column_name"]]`

```r
# directly referencing a specific column
df$name
```

```
## [1] "Sabrina" "Mónica"  "Lucas"
```

```r
# access a column by its name given as a string (this also works when the name is stored in a variable)
df[["name"]]
```

```
## [1] "Sabrina" "Mónica"  "Lucas"
```
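
Single and double brackets behave on data frames as they do on lists. A sketch on the same `df` (`one_col_df`, `col_vec`, and `col` are names introduced here; `stringsAsFactors = FALSE` is an assumption for older R versions):

```r
df <- data.frame(
  id = 1:3,
  name = c("Sabrina", "Mónica", "Lucas"),
  age = c(15, 17, 20),
  stringsAsFactors = FALSE
)

one_col_df <- df["name"]    # single brackets: still a data frame, with one column
col_vec <- df[["name"]]     # double brackets: the column as a character vector

class(one_col_df)           # "data.frame"
class(col_vec)              # "character"

# double brackets also accept a column name stored in a variable
col <- "age"
df[[col]]                   # 15 17 20
mean(df[[col]])             # average age
```

Note that `df$col` would look for a column literally called `col`, so `[[ ]]` is the form to use when the column name is stored in a variable.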

<!-- add one slide or so on this topic (AI answer):

Dataframe Column: Vector (e.g., numeric, character, factor, etc.)

Dataframe Structure: List of vectors (each column being a separate vector)

The [[ ]] syntax accesses the dataframe column as a vector (simplifying the structure down to the column’s vector). By contrast, single brackets df["column_name"] return a subset of the dataframe with the structure intact as a 1-column dataframe.

The [[ ]] syntax, therefore, is useful for accessing columns directly as vectors when you want to work with just the values in that column, without keeping the dataframe structure.

-->

---

### Tibble: same as data frame!

```r
df <- tibble(
  id = 1:3,
  name = c("Sabrina", "Mónica", "Lucas"),
  age = c(25, 30, 35)
)

dim(df)
```

```
## [1] 3 3
```

Small differences (e.g., tibbles come from the `tidyverse`, print more compactly, and never convert strings to factors)