MACS 30500 LECTURE 6

.title[
# MACS 30500 LECTURE 6
]
.author[
### Topics: Importing/exporting data. Relational data with dplyr.
]

---

## Agenda

* HW2: check-in, rubric, etc.

* Importing and Exporting Data in R with `readr`
  * most important functions: `read_csv()` and `write_csv()`
  * related topics: working directory; relative vs. absolute paths; RStudio Projects

* Relational Data with `dplyr`

* HW3: in-class demo

---

## Importing and Exporting Data in R with `readr`

`readr` documentation: https://readr.tidyverse.org/

---

### Importing CSV files

To load data into R we need **importing functions**. There are a number of them depending on the **type of file** we want to import. Refer to "R for Data Science" 2nd Ed. Chaper 7 for details!

The most common importing functions read **comma-separated values** files. Two main versions:

- from **base-R** we have `read.csv()`
- from **[`readr`](https://readr.tidyverse.org/)** we have `read_csv()`

They are similar, but we use `read_csv()` in this course because is more recent, faster, and does not automatically changes data types (e.g., does not convert characters into factors)

Type `?read.csv()` and `?read_csv()` in your Console to further explore.

---

## `read_csv()`

It takes several arguments all listed in the [documentation]( https://readr.tidyverse.org/reference/read_delim.html)

Some of the most commonly used arguments are:

```
read_csv(file, col_names = TRUE, col_types = NULL, na = c("", "NA"))
```
--

The `file` argument must specified, the other arguments can be left as default:

```
library(readr)

# load data into my local R Studio
read_csv(file = "/Users/Sabrina Nardin/Desktop/testdata.csv")

# load data into my Workbench 
read_csv(file = "/home/nardin/testdata.csv")

# another way to load data if you are not sure where it is located
read_csv(file = file.choose())
```

---

### Practice loading data in R

1. Create a `testdata.csv` file with four columns (id, name, age, food) with different data types and some missing data. Save it on your desktop with a `csv` extension.

1. Open Workbench: upload the file to the server. Skip this step if you are using R on your machine.

1. Look at your current working directory by typing `getwd()` in the console. That's where R looks at files by default.

1. Load the `library(tidyverse)` and then load the data into R using the `read_csv()` function. Make sure to specify the correct path. If you do not provide a path, R looks in your working directory. Here are two examples:
 * `read_csv("/Users/Sabrina Nardin/Desktop/testdata.csv")` 
 * `read_csv("testdata.csv")`

---

### Changing default arguments: Exercise!

In the next slides, we are going to modify some of the most common arguments of the `read_csv()` function.

Let's start by using the function without modifying them, by simply typing `read_csv(file = "testdata.csv")`

What do you notice?

---

### Modify `col_type`

The default is `read_csv(file, col_types = NULL)`. We can change it to manually set the column types, as shown below (two options):

```
# option 1
read_csv(file = "testdata.csv",
         col_types = cols(id = col_integer(),
                          name = col_character(),
                          age = col_integer(),
                          food = col_character()))
# option 2
read_csv("testdata.csv", col_types = ("icic"))
```

Pick one option, and run the code in your Console to re-import the data. What do you notice?

---

### Modify `na`

The default is `read_csv(file, na = c("", "NA"))`. We can change it to add more missing data options, like that:

```
read_csv("testdata.csv", col_types = ("icic"), na = c("", "NA", "na", "None"))
```

What do you notice? Notice you can customize what goes into the vector `c()`

---

### Modify `col_names`

The default is `read_csv(file, col_names = TRUE)`. We can change it to `col_names = FALSE`, like that:

```
read_csv(file = "testdata.csv", col_names = FALSE)

```

What do you notice?

---

### Modify `skip`

The default is `read_csv(file, skip = 0)`. We can change it to `skip = 2` or any to any other integer.

```
read_csv(file = "testdata.csv", skip = 2)

```

What do you notice?

This is useful when the data you are working with have problematic rows. Note that `read.csv()` (with a dot, from base R) does not have this option! If you are trying to skip lines and it is not working, ensure you are using the right function (`read_csv()` from `readr`)

---

### Takehome from this exercise

Importing files correctly is important as it saves time later in the process

Check the function arguments: there are many of them available that can help you accomplish almost anything you might need!

<br>

*Let's clarify a few additional concepts related to importing and exporting data...*

---

### Working directory

The working directory is the folder that R takes as **default directory** every time you try to load a file, script, etc.

To check your current working directory: start a new session of R and type `getwd()`. In Workbench it should be `"/home/your_cnetid"`

---

### Relative Path vs. Absolute Path

When you import a file (e.g., from Workbench "Files" tab) to R, you want to use so-called "relative path" rather than a so-called "full or absolute path"

**Relative path:** relative to the project folder where this project is stored. This the best approach. It works only if the file is located in R default working directory!
```
read_csv(file = "testdata.csv")
```

**Absolute path:** specify the full path 
```
read_csv(file = "/home/nardin/testdata.csv")
```

---

### RStudio Projects

R studio has something called "RStudio Projects" (**`.Rproj`**). R will automatically detect the working directory based on your project.

This ensures portability and a reliable behavior!

For example, every homework and in-class exercise folder that we have been using in this course has a `.Rproj` file. This file helps R to automatically detect the working directory. If you switch between projects, the working directory changes automatically.

---

### Exercise: create an RStudio Project

Step 1: Open RStudio and navigate to the top-left menu. Then File > New Project...

Step 2: Choose New Directory, then select New Project

Step 3: Name your project (e.g., "lecture6-project") and save it.

Step 4: Click Create Project. RStudio will open a new session within your project environment. Done!

Step 5: Let's test it! In your new project, create a new R script or R Markdown document and inside it type and run `getwd()`. What do you notice?

---

### Other `readr` functions to import data

The `readr` package include several functions to load into R almost all possible file formats that you might encounter (when given an option though, choose a `csv` over other formats).

For example:

* **Comma separated csv** use `read_csv()` from the `readr` package
* **Semi column separated csv** use `read_csv2()`from the `readr` package
* **Tab separated files** use `read_tsv()`from the `readr` package
* **RDS** use `readRDS()` or `read_rds()`
* **Excel** use `read_excel()` from the `readxl` package
* **SAS/SPSS/Stata** use the `haven` package (several functions)

Cheat Sheet for `readr`:
**Help > Cheat Sheets > Browse Cheat Sheets**

---

### `haven` and SAS

```r
library(haven)

read_sas(data_file = system.file("examples", "iris.sas7bdat",
  package = "haven"
))
```

```
## # A tibble: 150 × 5
##    Sepal_Length Sepal_Width Petal_Length Petal_Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <chr>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows
```

---

### `haven` and SPSS

```r
read_sav(file = system.file("examples", "iris.sav",
  package = "haven"
))
```

```
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##           <dbl>       <dbl>        <dbl>       <dbl> <dbl+lbl> 
##  1          5.1         3.5          1.4         0.2 1 [setosa]
##  2          4.9         3            1.4         0.2 1 [setosa]
##  3          4.7         3.2          1.3         0.2 1 [setosa]
##  4          4.6         3.1          1.5         0.2 1 [setosa]
##  5          5           3.6          1.4         0.2 1 [setosa]
##  6          5.4         3.9          1.7         0.4 1 [setosa]
##  7          4.6         3.4          1.4         0.3 1 [setosa]
##  8          5           3.4          1.5         0.2 1 [setosa]
##  9          4.4         2.9          1.4         0.2 1 [setosa]
## 10          4.9         3.1          1.5         0.1 1 [setosa]
## # ℹ 140 more rows
```

---

### `haven` and Stata

```r
read_dta(file = system.file("examples", "iris.dta",
  package = "haven"
))
```

```
## # A tibble: 150 × 5
##    sepallength sepalwidth petallength petalwidth species
##          <dbl>      <dbl>       <dbl>      <dbl> <chr>  
##  1        5.10       3.5         1.40      0.200 setosa 
##  2        4.90       3           1.40      0.200 setosa 
##  3        4.70       3.20        1.30      0.200 setosa 
##  4        4.60       3.10        1.5       0.200 setosa 
##  5        5          3.60        1.40      0.200 setosa 
##  6        5.40       3.90        1.70      0.400 setosa 
##  7        4.60       3.40        1.40      0.300 setosa 
##  8        5          3.40        1.5       0.200 setosa 
##  9        4.40       2.90        1.40      0.200 setosa 
## 10        4.90       3.10        1.5       0.100 setosa 
## # ℹ 140 more rows
```

---

## `write_csv()`

So far we talked about IMPORTING DATA using `readr` and we focused on its most common function `read_csv()`

It turns out that `readr` has also several functions for EXPORTING DATA. The most common is `write_csv()` which **generates csv files from R data frames**

Documentation: https://readr.tidyverse.org/reference/write_delim.html

```
# import
test <- read_csv("testdata.csv", col_types = ("icic"), na = c("", "NA", "na", "None"))

# a bunch of data analysis and visualization code etc.

# export 
write_csv(test, file = "testdata_cleaned.csv")
```

---

## Relational Data with `dplyr`

---

### Definition of relational data

**Relational database**: set of multiple "tables" that are linked based on data common to them. You can think of a "table" as a dataframe.

**These tables provide meaningful insights only when combined together!**

Answers to research questions are not defined by individual rows or columns in a single table; rather, they emerge from the relationships among tables.

<!--
why you want to do so? e.g. store data in different tables?

Data you need for the analysis is not and cannot be stored in one single table
but it is split across tables; usually two but potentially more
-->

---

### Our focus

There are several software and languages to deal with relational databases.

The most common is SQL but that's beyond our course. For more on this, see ["Chapter 21 Databases"](https://r4ds.hadley.nz/databases) from our book.

**R also allows you to join tables by simply using the `dplyr` package. That's our focus!**

<!-- dplyr for relational data focuses on joining tables; it is basically merging data and no really needing sql terminology and concepts (e.g., table, attributes, relations) but there is value in using them, as explained in the assigned readings from today, which is Ch 19 joins from the book

So we do a mini theoretical intro to relational databases logic and terminology and then we move to dplyr
-->

---

### We use the flights example from the readings

`library(nycflights13)` in ["Chapter 19 Joins"](https://r4ds.hadley.nz/joins) from "R for Data Science" 2nd Edition.

We have five tables (e.g., five distinct datasets): 
* `flights` info about flights, identified by multiple variables
* `airlines` each airplane company name, identified by the abbreviated `career` code
* `airports` info about each airport, identified by the `faa` code
* `planes` info about each plane, identified by its `tailnum` number
* `weather` info about the weather at each NYC airport for each hour, identified by various variables

Load the data into R (package already installed on Workbench): `library(nycflights13)`

---

### We use the flights example from the readings

Visual representation of the relations among the 5 tables in `nycflights13`:

To understand diagrams like this, remember that each relation concerns a **pair of tables**.

---

### Formal definitions

A **KEY** of a table is a one or a subset of columns (formally called "attributes"). Two types of keys:

* **PRIMARY KEY**
uniquely identifies an observation in its own table; e.g., `tailnum` is the primary key of the `planes` table; a primary key can be one or multiple columns.

* **FOREIGN KEY**
matches the primary key of another table; e.g., `tailnum` is a foreign key in the `flights` table (it links each flight to a unique plane by matching the tantalum primary key from the planes table

A **RELATION** is defined between **pairs of tables**: primary key + foreign key in another table.

---

## In pratice...  using `dplyr` to work with relational data

### Mutating joins

### Filtering joins

---

### Mutating joins

Let's review the mutating joins first. This group includes:

**inner join**: keeps observations that appear in both tables

**outer join**: keeps observations that appear in at least one of the tables
  * **left join**: keeps all observations in left table
  * **right join**: keeps all observations in right table
  * **full join**: keeps all observations

---

### inner_join()

Keeps observations that appear in both tables (identified by keys, which are the colored numbers here). Unmatched rows are not included in the results.

]

```
inner_join(x, y, by = "key")

# with pipes
x %>% inner_join(y, by = "key")
  
# if two cols do not have same name
inner_join(x, y, by = c("a" = "b")) 
  
```

]

<!-- by convention, x is assigned as the first dataframe or left one, and y as the second or right one;

the by argument specifies that we are joining it based on the key column (which you cannot see from the slide but its the column name of the colored columns in each x and y). Compare this to the left_join() operation which is another form of mutating join
-->

---

### left_join()

Keeps all observations in the left table (x), even if there is not a match in the right table (y).

]

```
left_join(x, y, by = "key")
```

]

---

### right_join()

Keeps all observations in the right table (y), even if there is not a match in the left table (x).

]

```
right_join(x, y, by = "key")
```

]

---

### full_join()

Keeps all observations, matches and non-matches

]

```
full_join(x, y, by = "key")
```

]

---

### Mutating join venn diagram

---

## Filtering joins

Other than mutating joins, `dplyr` has filtering joins;

- **semi_join**: keeps all observations in x that have a match in y
- **anti_join** drops all observations in x that have a match in y

Essentially these function use information from the second data frame (y) to filter observations from the first data frame (x).

---

### semi_join()

Keeps all observations in x that have a match in y. Only keeps columns from the first table you pass in the code (x).

]

```
semi_join(x, y, by = "key")
```

]

---

### anti_join()

Drops all observations in x that have a match in y. Only keeps columns from the first table you pass in the code (x).

]

```
anti_join(x, y, by = "key")
```
]

---

## Practice working with relational data with `dplyr`

Download today's in-class materials from the website (`relational-data.Rmd`)

---

## HW3: in-class demo to reveiw today's concepts

Download today's in-class materials from the website (`scotus-lecture.Rmd`)