class: center, middle, inverse, title-slide .title[ # MACS 30500 LECTURE 6 ] .author[ ### Topics: Importing/exporting data. Relational data with dplyr. ] --- class: inverse, middle ## Agenda * HW2: check-in, rubric, etc. * Importing and Exporting Data in R with `readr` * most important functions: `read_csv()` and `write_csv()` * related topics: working directory; relative vs. absolute paths; RStudio Projects * Relational Data with `dplyr` * HW3: in-class demo <!-- FALL 2024 SUGGESTIONS FOR NEXT TIME I ended up moving HW3 to next lecture, I might just do that directly given that there already lots of content in this lecture Also: rename these slides to a more appropriate name that reflects content like import-data-relational-data --> --- class: inverse, middle ## Importing and Exporting Data in R with `readr` `readr` documentation: https://readr.tidyverse.org/ --- ### Importing CSV files To load data into R we need **importing functions**. There are a number of them depending on the **type of file** we want to import. Refer to "R for Data Science" 2nd Ed. Chaper 7 for details! -- The most common importing functions read **comma-separated values** files. Two main versions: - from **base-R** we have `read.csv()` - from **[`readr`](https://readr.tidyverse.org/)** we have `read_csv()` They are similar, but we use `read_csv()` in this course because is more recent, faster, and does not automatically changes data types (e.g., does not convert characters into factors) Type `?read.csv()` and `?read_csv()` in your Console to further explore. <!-- `read.csv` is a special case of `read.table`, while `read_csv` is special case of `read_delim`. Look them up to check the differences -- r difference between using one ? and ?? in searching for doc the ? find the exact match the ?? finds the general match --> --- ## `read_csv()` It takes several arguments all listed in the [documentation]( https://readr.tidyverse.org/reference/read_delim.html) Some of the most commonly used arguments are: ``` read_csv(file, col_names = TRUE, col_types = NULL, na = c("", "NA")) ``` -- The `file` argument must specified, the other arguments can be left as default: ``` library(readr) # load data into my local R Studio read_csv(file = "/Users/Sabrina Nardin/Desktop/testdata.csv") # load data into my Workbench read_csv(file = "/home/nardin/testdata.csv") # another way to load data if you are not sure where it is located read_csv(file = file.choose()) ``` <!-- Make sure the file is located in the given path and you are typing the path correctly. Let's practice! --> --- ### Practice loading data in R 1. Create a `testdata.csv` file with four columns (id, name, age, food) with different data types and some missing data. Save it on your desktop with a `csv` extension. 1. Open Workbench: upload the file to the server. Skip this step if you are using R on your machine. 1. Look at your current working directory by typing `getwd()` in the console. That's where R looks at files by default. 1. Load the `library(tidyverse)` and then load the data into R using the `read_csv()` function. Make sure to specify the correct path. If you do not provide a path, R looks in your working directory. Here are two examples: * `read_csv("/Users/Sabrina Nardin/Desktop/testdata.csv")` * `read_csv("testdata.csv")` --- ### Changing default arguments: Exercise! In the next slides, we are going to modify some of the most common arguments of the `read_csv()` function. Let's start by using the function without modifying them, by simply typing `read_csv(file = "testdata.csv")` What do you notice? <!-- This file is a good example of messy data! type of column is shown at the top, e.g. id is double, name is char, but so is age, which should not be. Why so? the "na" is interpret as a character rather than missing data and all column values are forced to character. --> --- ### Modify `col_type` The default is `read_csv(file, col_types = NULL)`. We can change it to manually set the column types, as shown below (two options): ``` # option 1 read_csv(file = "testdata.csv", col_types = cols(id = col_integer(), name = col_character(), age = col_integer(), food = col_character())) # option 2 read_csv("testdata.csv", col_types = ("icic")) ``` Pick one option, and run the code in your Console to re-import the data. What do you notice? <!-- all columns types have been converted to the datatype we specified. R is also guessing that the na in age is actually missing data and so converts it as such, but we get a warning message; type problems() to see more --> --- ### Modify `na` The default is `read_csv(file, na = c("", "NA"))`. We can change it to add more missing data options, like that: ``` read_csv("testdata.csv", col_types = ("icic"), na = c("", "NA", "na", "None")) ``` What do you notice? Notice you can customize what goes into the vector `c()` <!-- we can enlarge the set of missing data to include everything we want --> --- ### Modify `col_names` The default is `read_csv(file, col_names = TRUE)`. We can change it to `col_names = FALSE`, like that: ``` read_csv(file = "testdata.csv", col_names = FALSE) ``` What do you notice? <!-- first line is not more read as variable names --> --- ### Modify `skip` The default is `read_csv(file, skip = 0)`. We can change it to `skip = 2` or any to any other integer. ``` read_csv(file = "testdata.csv", skip = 2) ``` What do you notice? -- This is useful when the data you are working with have problematic rows. Note that `read.csv()` (with a dot, from base R) does not have this option! If you are trying to skip lines and it is not working, ensure you are using the right function (`read_csv()` from `readr`) --- ### Takehome from this exercise Importing files correctly is important as it saves time later in the process Check the function arguments: there are many of them available that can help you accomplish almost anything you might need! -- <br> *Let's clarify a few additional concepts related to importing and exporting data...* --- ### Working directory The working directory is the folder that R takes as **default directory** every time you try to load a file, script, etc. To check your current working directory: start a new session of R and type `getwd()`. In Workbench it should be `"/home/your_cnetid"` --- ### Relative Path vs. Absolute Path When you import a file (e.g., from Workbench "Files" tab) to R, you want to use so-called "relative path" rather than a so-called "full or absolute path" **Relative path:** relative to the project folder where this project is stored. This the best approach. It works only if the file is located in R default working directory! ``` read_csv(file = "testdata.csv") ``` **Absolute path:** specify the full path ``` read_csv(file = "/home/nardin/testdata.csv") ``` <!-- You can also manually set your directory to an absolute path, for example using `setwd()` but that is not the best approach for reproducible. Use relative paths instead! --> --- ### RStudio Projects R studio has something called "RStudio Projects" (**`.Rproj`**). R will automatically detect the working directory based on your project. This ensures portability and a reliable behavior! For example, every homework and in-class exercise folder that we have been using in this course has a `.Rproj` file. This file helps R to automatically detect the working directory. If you switch between projects, the working directory changes automatically. <!-- for HW3, released tmr, we are asking to load data, we will be giving you a rproj like in HW2, so all you have to do to load teh data is to be in the correct project and then use a relative path with the name of the data.csv --> --- ### Exercise: create an RStudio Project Step 1: Open RStudio and navigate to the top-left menu. Then File > New Project... Step 2: Choose New Directory, then select New Project Step 3: Name your project (e.g., "lecture6-project") and save it. Step 4: Click Create Project. RStudio will open a new session within your project environment. Done! Step 5: Let's test it! In your new project, create a new R script or R Markdown document and inside it type and run `getwd()`. What do you notice? --- ### Other `readr` functions to import data The `readr` package include several functions to load into R almost all possible file formats that you might encounter (when given an option though, choose a `csv` over other formats). For example: * **Comma separated csv** use `read_csv()` from the `readr` package * **Semi column separated csv** use `read_csv2()`from the `readr` package * **Tab separated files** use `read_tsv()`from the `readr` package * **RDS** use `readRDS()` or `read_rds()` * **Excel** use `read_excel()` from the `readxl` package * **SAS/SPSS/Stata** use the `haven` package (several functions) -- Cheat Sheet for `readr`: **Help > Cheat Sheets > Browse Cheat Sheets** --- ### `haven` and SAS ```r library(haven) read_sas(data_file = system.file("examples", "iris.sas7bdat", package = "haven" )) ``` ``` ## # A tibble: 150 × 5 ## Sepal_Length Sepal_Width Petal_Length Petal_Width Species ## <dbl> <dbl> <dbl> <dbl> <chr> ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ## 7 4.6 3.4 1.4 0.3 setosa ## 8 5 3.4 1.5 0.2 setosa ## 9 4.4 2.9 1.4 0.2 setosa ## 10 4.9 3.1 1.5 0.1 setosa ## # ℹ 140 more rows ``` --- ### `haven` and SPSS ```r read_sav(file = system.file("examples", "iris.sav", package = "haven" )) ``` ``` ## # A tibble: 150 × 5 ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## <dbl> <dbl> <dbl> <dbl> <dbl+lbl> ## 1 5.1 3.5 1.4 0.2 1 [setosa] ## 2 4.9 3 1.4 0.2 1 [setosa] ## 3 4.7 3.2 1.3 0.2 1 [setosa] ## 4 4.6 3.1 1.5 0.2 1 [setosa] ## 5 5 3.6 1.4 0.2 1 [setosa] ## 6 5.4 3.9 1.7 0.4 1 [setosa] ## 7 4.6 3.4 1.4 0.3 1 [setosa] ## 8 5 3.4 1.5 0.2 1 [setosa] ## 9 4.4 2.9 1.4 0.2 1 [setosa] ## 10 4.9 3.1 1.5 0.1 1 [setosa] ## # ℹ 140 more rows ``` --- ### `haven` and Stata ```r read_dta(file = system.file("examples", "iris.dta", package = "haven" )) ``` ``` ## # A tibble: 150 × 5 ## sepallength sepalwidth petallength petalwidth species ## <dbl> <dbl> <dbl> <dbl> <chr> ## 1 5.10 3.5 1.40 0.200 setosa ## 2 4.90 3 1.40 0.200 setosa ## 3 4.70 3.20 1.30 0.200 setosa ## 4 4.60 3.10 1.5 0.200 setosa ## 5 5 3.60 1.40 0.200 setosa ## 6 5.40 3.90 1.70 0.400 setosa ## 7 4.60 3.40 1.40 0.300 setosa ## 8 5 3.40 1.5 0.200 setosa ## 9 4.40 2.90 1.40 0.200 setosa ## 10 4.90 3.10 1.5 0.100 setosa ## # ℹ 140 more rows ``` --- ## `write_csv()` So far we talked about IMPORTING DATA using `readr` and we focused on its most common function `read_csv()` It turns out that `readr` has also several functions for EXPORTING DATA. The most common is `write_csv()` which **generates csv files from R data frames** Documentation: https://readr.tidyverse.org/reference/write_delim.html -- ``` # import test <- read_csv("testdata.csv", col_types = ("icic"), na = c("", "NA", "na", "None")) # a bunch of data analysis and visualization code etc. # export write_csv(test, file = "testdata_cleaned.csv") ``` --- class: inverse, middle ## Relational Data with `dplyr` --- ### Definition of relational data **Relational database**: set of multiple "tables" that are linked based on data common to them. You can think of a "table" as a dataframe. -- **These tables provide meaningful insights only when combined together!** Answers to research questions are not defined by individual rows or columns in a single table; rather, they emerge from the relationships among tables. <!-- why you want to do so? e.g. store data in different tables? Data you need for the analysis is not and cannot be stored in one single table but it is split across tables; usually two but potentially more --> --- ### Our focus There are several software and languages to deal with relational databases. The most common is SQL but that's beyond our course. For more on this, see ["Chapter 21 Databases"](https://r4ds.hadley.nz/databases) from our book. **R also allows you to join tables by simply using the `dplyr` package. That's our focus!** <!-- dplyr for relational data focuses on joining tables; it is basically merging data and no really needing sql terminology and concepts (e.g., table, attributes, relations) but there is value in using them, as explained in the assigned readings from today, which is Ch 19 joins from the book So we do a mini theoretical intro to relational databases logic and terminology and then we move to dplyr --> --- ### We use the flights example from the readings `library(nycflights13)` in ["Chapter 19 Joins"](https://r4ds.hadley.nz/joins) from "R for Data Science" 2nd Edition. We have five tables (e.g., five distinct datasets): * `flights` info about flights, identified by multiple variables * `airlines` each airplane company name, identified by the abbreviated `career` code * `airports` info about each airport, identified by the `faa` code * `planes` info about each plane, identified by its `tailnum` number * `weather` info about the weather at each NYC airport for each hour, identified by various variables -- Load the data into R (package already installed on Workbench): `library(nycflights13)` --- ### We use the flights example from the readings Visual representation of the relations among the 5 tables in `nycflights13`: <img src="relational-nycflights.png" width="60%" style="display: block; margin: auto;" /> To understand diagrams like this, remember that each relation concerns a **pair of tables**. --- ### Formal definitions A **KEY** of a table is a one or a subset of columns (formally called "attributes"). Two types of keys: * **PRIMARY KEY** uniquely identifies an observation in its own table; e.g., `tailnum` is the primary key of the `planes` table; a primary key can be one or multiple columns. * **FOREIGN KEY** matches the primary key of another table; e.g., `tailnum` is a foreign key in the `flights` table (it links each flight to a unique plane by matching the tantalum primary key from the planes table <!-- A variable can be both a primary key and a foreign key. For example, origin is part of the weather primary key, and is also a foreign key for the airports table. --> -- A **RELATION** is defined between **pairs of tables**: primary key + foreign key in another table. <!-- Relations can be * one-to-one * one-to-many: each flight has one plane, but each plane has many flights * many-to-many --> --- class: inverse, middle ## In pratice... using `dplyr` to work with relational data ### Mutating joins ### Filtering joins --- ### Mutating joins Let's review the mutating joins first. This group includes: **inner join**: keeps observations that appear in both tables **outer join**: keeps observations that appear in at least one of the tables * **left join**: keeps all observations in left table * **right join**: keeps all observations in right table * **full join**: keeps all observations --- ### inner_join() Keeps observations that appear in both tables (identified by keys, which are the colored numbers here). Unmatched rows are not included in the results. .pull-left[ <img src="join-setup.png" width="50%" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="join-inner.png" width="110%" style="display: block; margin: auto;" /> ``` inner_join(x, y, by = "key") # with pipes x %>% inner_join(y, by = "key") # if two cols do not have same name inner_join(x, y, by = c("a" = "b")) ``` ] <!-- by convention, x is assigned as the first dataframe or left one, and y as the second or right one; the by argument specifies that we are joining it based on the key column (which you cannot see from the slide but its the column name of the colored columns in each x and y). Compare this to the left_join() operation which is another form of mutating join --> --- ### left_join() Keeps all observations in the left table (x), even if there is not a match in the right table (y). .pull-left[ <img src="join-setup.png" width="50%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="join-outer-left.png" width="100%" style="display: block; margin: auto;" /> ``` left_join(x, y, by = "key") ``` ] --- ### right_join() Keeps all observations in the right table (y), even if there is not a match in the left table (x). .pull-left[ <img src="join-setup.png" width="50%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="join-outer-right.png" width="100%" style="display: block; margin: auto;" /> ``` right_join(x, y, by = "key") ``` ] <!-- same thing as left join but reversing the order of the data frame or table typically right join is utilized less because by convention we think at the left or x table as the primary data for these kind of operations --> --- ### full_join() Keeps all observations, matches and non-matches <!-- more missing values --> .pull-left[ <img src="join-setup.png" width="50%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="join-outer-full.png" width="100%" style="display: block; margin: auto;" /> ``` full_join(x, y, by = "key") ``` ] --- ### Mutating join venn diagram <img src="join-venn.png" width="60%" style="display: block; margin: auto;" /> --- ## Filtering joins Other than mutating joins, `dplyr` has filtering joins; - **semi_join**: keeps all observations in x that have a match in y - **anti_join** drops all observations in x that have a match in y Essentially these function use information from the second data frame (y) to filter observations from the first data frame (x). --- ### semi_join() Keeps all observations in x that have a match in y. Only keeps columns from the first table you pass in the code (x). .pull-left[ <img src="join-setup.png" width="50%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="join-semi.png" width="100%" style="display: block; margin: auto;" /> ``` semi_join(x, y, by = "key") ``` ] --- ### anti_join() Drops all observations in x that have a match in y. Only keeps columns from the first table you pass in the code (x). .pull-left[ <img src="join-setup.png" width="50%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="join-anti.png" width="100%" style="display: block; margin: auto;" /> ``` anti_join(x, y, by = "key") ``` ] --- class: inverse, middle ## Practice working with relational data with `dplyr` Download today's in-class materials from the website (`relational-data.Rmd`) --- class: inverse, middle ## HW3: in-class demo to reveiw today's concepts Download today's in-class materials from the website (`scotus-lecture.Rmd`)