class: center, middle, inverse, title-slide .title[ # MACS 30500 LECTURE 2 ] .author[ ### Topics: Grammar of Graphics and ggplot2. Coding Style in R. ] --- class: inverse, middle # Agenda * Motivation: Why Visualize Data? * The Grammar of Graphics and `ggplot2` * Coding Style in R * Back to the Grammar of Graphics * In-class Practice: Gapminder Data <!-- FALL 2024 ADD: 25 min: Start class with Git/GitHub slides from Lecture 1 and: say we went fast, quick recap + homework 1 workflow demo before you start these slides: say there is a reason why we jump into R tidyverse content right away and do not start form "what is a variable?": more intuitive and fun to learn this first (you can see results right away) and then go back to variables; but for evreyone and especially if you are fully new to programming, I added a short intro to R tutorial in today's class material: take a look and do the practice --> --- class: inverse, middle # Motivation: Why Visualize Data? --- class: center, middle <!-- Consider the following 13 datasets: --> | ID| `\(N\)`| `\(\bar{X}\)`| `\(\bar{Y}\)`| `\(\sigma_{X}\)`| `\(\sigma_{Y}\)`| `\(R\)`| |--:|---:|---------:|---------:|------------:|------------:|----------:| | 1| 142| 54.26610| 47.83472| 16.76983| 26.93974| -0.0641284| | 2| 142| 54.26873| 47.83082| 16.76924| 26.93573| -0.0685864| | 3| 142| 54.26732| 47.83772| 16.76001| 26.93004| -0.0683434| | 4| 142| 54.26327| 47.83225| 16.76514| 26.93540| -0.0644719| | 5| 142| 54.26030| 47.83983| 16.76774| 26.93019| -0.0603414| | 6| 142| 54.26144| 47.83025| 16.76590| 26.93988| -0.0617148| | 7| 142| 54.26881| 47.83545| 16.76670| 26.94000| -0.0685042| | 8| 142| 54.26785| 47.83590| 16.76676| 26.93610| -0.0689797| | 9| 142| 54.26588| 47.83150| 16.76885| 26.93861| -0.0686092| | 10| 142| 54.26734| 47.83955| 16.76896| 26.93027| -0.0629611| | 11| 142| 54.26993| 47.83699| 16.76996| 26.93768| -0.0694456| | 12| 142| 54.26692| 47.83160| 16.77000| 26.93790| -0.0665752| | 13| 142| 54.26015| 47.83972| 16.76996| 26.93000| -0.0655833| --- class: center, middle <div style="font-size: 28px;"> Each of these 13 datasets has nearly the same mean value of x, mean value of y, standard deviation, and correlation coefficient... <br> </br> Let's say we estimate linear regression models for each dataset: we obtain virtually identical coefficients, which might suggest that the relationships are identical! </div> --- class: center, middle But what happens if we actually visualize of each of these 13 datasets before analyzing them? <img src="index_files/figure-html/datasaurus-graph-1.gif" width="80%" style="display: block; margin: auto;" /> --- class: center, middle These 13 datasets give the same linear regression results, yet they are drastically different! <img src="index_files/figure-html/datasaurus-graph-static-1.png" width="70%" style="display: block; margin: auto;" /> -- **Takeaway:** A graph can reveal much; explore and visualize your data before modeling! --- class: inverse, middle # The Grammar of Graphics and `ggplot2` --- ### `ggplot2` We start our R journey by learning `ggplot2`, which is the main package in R for data visualization. Why we do not start from the very basics? And why do we start with `ggplot2`? -- We learn `ggplot2` basics today, and keep using this package throughout the course! You are going to have a good grasp of this package by the end of the course <!-- DISCLAIMER: I am going to use the terms package and library interchangeably and most people do that, but technically in R: a package is ggplot2, dplyr, tidyr, stringr... so a collection of functions, data, and code that we install on R and use! a library is a the directory where the package is installed or a specific function library() to load an installed packages into R --> --- ### Key facts about `ggplot2` **Part of the `tidyverse`**, which is a collection of R packages for data analysis that are well integrated one with another: https://www.tidyverse.org/. We will learn several of these packages in this course -- **Developed by Hadley Wickham**, the author of our main textbook: "R for Data Science" -- **Based on the Grammar of Graphics**, a framework or a way of thinking about visualization, developed by Leland Wilkinson in the late 1990s. Hadley Wickham built `ggplot2` in R using the principles of the Grammar of Graphics! <!-- the number 2 after ggplot it is because is the second iteration of it, the most popular --> --- ### Install and load `ggplot2` **General approach for all packages in R:** You first **install** a package by typing the following code in your Console. Do this *only once per machine*. Everything is already installed for us in Workbench, so you can skip this step. ``` # install only ggplot install.packages("ggplot2") # or install the entire tidyverse install.packages("tidyverse") ``` -- Then you **load** the package into your current R session by typing the following code in your Console or *at the top* of your R script (best practice). Do this *every time you use the package*. ``` # load only ggplot library(ggplot2) # or load the entire tidyverse library(tidyverse) ``` <!-- Notice the "" when you use the install.packages() function: R requires the package name to be a string because install.packages() expects the name of the package, not the object itself. When we use the librar() function does not matter, you can pass the name of the package in "" or the name itself which is more common! --> --- ### Grammar and "Grammar of Graphics" > A **Grammar** can be broadly defined as the complete structure of a language, governed by a set of *rules* (syntax and morphology). Ultimately, our grammar is what enable communication! -- **Applied to R and visualizations...** > A **Grammar of Graphics** is a grammar that makes it possible to create a wide range of graphics! Just like the regular grammar, it is governed by a set of *rules*. --- ### Main components of the Grammar of Graphics <!-- * assigned readings for detailed explanation! * ggplot2 book "Components of the layered grammar" [here](https://ggplot2-book.org/mastery#sec-components) we are going to see them more in depth in a a bit but for now ... --> The Grammar of Graphics defines a plot a combination of: * Layer: creates the object that we see in the plot, composed of five parts * DATA: name of the dataset we want to visualize * GEOM: geometric object, which defines the type of plot * MAPPING: translate variables into so-called aesthetics (x and y, color, size, shape, etc.) * STAT: statistical transformation, defines whether any data transformations are needed * POSITION: defines the positions of elements on a plot <!-- position: useful for handling overlapping points, bars, or other geometric object --> * COORDINATE SYSTEM: to map the positions of objects onto the plot * FACET: to create subsets of the data * SCALE: one for each aesthetic mapping used on a plot <!-- A plot can have multiple layers. For example for the `geom` we can have a scatterplot, which is produced with `geom_point()` and on top of it we can create a smoothed line, which is produced with `geom_smooth()`. But these two layers can have all the other parts (data, mapping, etc.) --> -- *First, we'll translate these elements into code, then we discuss them. For a deeper explanation, review the readings!* --- ### Grammar of Graphics: Code Template Code template with seven key parameters (the bracketed terms), forming the core of the Grammar of Graphics: ``` ggplot(data = <DATA>) + <GEOM>( mapping = aes(<MAPPING>), stat = <STAT>, position = <POSITION>) + <COORDINATE> + <FACET> ``` *Note: this template is a helpful starting point, but it's not the only way to write the code. We will explore alternatives later today. For now, let's demonstrate how to apply this template using the `mpg` cars dataset...* --- ### Filling the code template using the `mpg` cars dataset The `mpg` dataset: One of multiple built-in datasets available in R for teaching purposes (like "iris", "mtcars", "gapminder", "penguins"). Contains data about different car models characteristics, between 1999 and 2008, with 234 observations and 11 variables. --- First, we load the tidyverse into R (we could load only `ggplot2`, but it is less common). Then, we load and explore the `mpg` data. Copy this code into your Console to reproduce the output! <!-- take the code and type it into R workbench explain the code, add mpg[1:6, ] with base R spend some time describing the table: what is a tibble show the tibble reports the datatype of each variable and explain difference between integer and double which are two types of number data: integer are whole numbers without decimals usually with L, while double have decimal points describe variables displ, hwy, class --> ```r library(tidyverse) data(mpg) head(mpg) ``` ``` ## # A tibble: 6 × 11 ## manufacturer model displ year cyl trans drv cty hwy fl class ## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr> ## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa… ## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa… ## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa… ## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa… ## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa… ## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa… ``` --- ### Filling the code template using the `mpg` data .pull-left[ ``` # template ggplot(data = <DATA>) + <GEOM>( mapping = aes(<MAPPING>), stat = <STAT>, position = <POSITION>) + <COORDINATE> + <FACET> ``` ] .pull-right[ ``` # template filled with mpg data ggplot(data = mpg) + geom_point( mapping = aes(x = displ, y = hwy, color = class), stat = "identity", position = "identity") + coord_cartesian() + facet_wrap(facets = vars(class), nrow = 1) ``` ] Variables description: * `displ` (dbl): car’s engine size, in liters * `hwy` (int): car’s fuel efficiency on the highway * `class` (chr): categories of cars (e.g., compact, midsize, SUV, etc.) <!-- facet_wrap(~ class, nrow = 1) --> --- class: inverse, middle # Coding Style in R --- ### On Coding Style in R Before continuing with the Grammar of Graphics, let's take a moment to focus on coding style. Below is the same code from the previous slide. Look at it, but this time focus on its style. What do you notice? ``` ggplot(data = mpg) + geom_point( mapping = aes(x = displ, y = hwy, color = class), stat = "identity", position = "identity") + coord_cartesian() + facet_wrap(facets = vars(class), nrow = 1) ``` <!-- open and close parenthesis spacing new lines difference between = (in functions to assing values to arguments of functions) and <- (assiging a value to a variable!) when to use "" and when not case matters --> --- ### On Coding Style in R Coding Style Resources: * [Chapter 4 "Workflow: code style"](https://r4ds.hadley.nz/workflow-style) from "R for Data Science" 2nd Edition * ["The `tidyverse` style guide"](https://style.tidyverse.org/index.html): check it whenever you are unsure **Why should we care about coding style?** -- Read this article ["Why does coding style matter?"](https://www.smashingmagazine.com/2012/10/why-coding-style-matters/) --- ### On Coding Style in R **Key things to watch for (not an exhaustive list):** * use `<-` for assigning values to variables * use `=` for assigning values to function arguments * follow naming conventions (names should be: descriptive, meaningful, not too long or short, use lowercase and snake_case) * use spaces to improve readability around operators, but not inside parenthesis or brackets * parenthesis control the order of operations and function calls; when you open one, remember to close it! * use indentation and break long lines * comments: use them, but do so parsimoniously --- class: inverse, middle # Back to the Grammar of Graphics --- ### Now, let's continue with the Grammar of Graphics! We have seen what it is, the code template, and provided a brief overview of its components. Now, let's take a closer look at *some* of these components... <!-- Components: * Layers (with five parts: data, mapping, position, stat, geom) * Coordinate System * Facet * Scale --> --- ### Layers One layer is defined by five parts (data, geom, mapping, stat, position). Layers create the objects on a plot. Layers typically are related to one another, sharing common features. For example, you can use `geom_point()` to build a scatterplot and `geom_smooth()` to overlay a regression line on top of it: ``` data(mpg) ggplot(data = mpg) + geom_point( mapping = aes(x = displ, y = hwy, color = class), stat = "identity", position = "identity") + geom_smooth( mapping = aes(x = displ, y = hwy), method = "lm", se = FALSE) ``` <!-- go over this only if people ask, reserve it for later! ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(color = class)) + geom_smooth(method = "lm", se = FALSE) --> --- ### The five parts of a layer: data, mapping, position, stat, geom **Data:** * Defines the source of the information to be visualized. You simply pass the name of the dataset. * Example: See previous slide. -- **Mapping:** * Defines how variables are transformed to "aesthetics" and applied to the plot. At a minimum, one variable is mapped to the x-axis and another to the y-axis; but usually you will have more mappings like color, size, etc., or mappings specific to a particular geometry. * Example: See previous slide. --- ### The five parts of a layer: data, mapping, position, stat, geom **Position:** * Adjusts the position of elements on the plot to avoid overlaps. Commonly used to spread out dense data. * Example 1: See previous slide. Often you use `position = "identity"` because no adjustments are needed; but if you want to change the previous slide you need to check the documentation to see the positions available in `geom_point` (a common one is `position = "jitter"` that spreads out overlapping points). * Example 2: Bar plots frequently use `position = "stack"` or `position = "dodge"` that move the bars to avoid overlaps. <!-- geom_bar(position = position_dodge(width = 0.7)) geom_bar(position = "stack") --> --- ### The five parts of a layer: data, mapping, position, stat, geom **Statistical transformation (stat):** * Transforms the data in some way. A stat is a function that takes a dataset as input and returns a transformed version of it as output. * You can specify the stat explicitly, or the transformation can be done implicitly. * Example 1: A common example that you will encounter when you create a bar graph... `geom_bar()` implicitly (e.g, by default) uses `stat_count()` to transform the raw data before building the plot (counts observations within each category) * Example 2: Often, you don’t need to make a statistical transformation at all (e.g., in a scatterplot you use the raw values). In these cases, the transformation is so-called an "identity transformation" with `stat = "identity"` --- ### The five parts of a layer: data, mapping, position, stat, geom **Geometric objects (geom):** * Determines the type of plot you create. For example, a point geom produces a scatterplot, a line geom produces a line plot, etc. * Each geom can only display certain aesthetics or visual attributes of the geom. For example, a `geom_point()` has position, color, shape, and size aesthetics. * How do you know which `aes` a geom takes? Consult the documentation! For example for `geom_point()`: https://ggplot2.tidyverse.org/reference/geom_point.html --- ### Test our understanding of geom **Practice: What do these codes produce? Run them in your Console and observe the results.** They all should run, but are they all correct? Why yes/no? Look at the color `aes` and the parentheses! ``` ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class)) ``` ``` ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue") ``` ``` ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = "blue")) ``` --- class: middle ``` ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class)) ``` CORRECT CODE. This maps the colors of the scatterplot points to the class variable in the `mpg` dataset. It produces a graph with distinct colors for each car class. --- class: middle ``` ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue") ``` CORRECT CODE. This code produces a graph with all blue points. Here, the color doesn’t convey info about a variable: we are manually *setting* a color of our choice for the plot rather than *mapping* a color to a variable. Note that `color` is placed outside the `aes()`: it is not part of the mapping, it is an argument of the `geom_point()` function. --- class: middle ``` ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = "blue")) ``` WRONG CODE. This code produces a graph with default colors. This is wrong because we are using color as a mapping `aes` but we are setting it to the name of a color "blue" VS. a variable in the data (e.g. the variable `class` in our `mpg` data example). --- ### Other elements: Coordinate System & Faceting **Coordinate System:** Maps the position of objects onto the plot plane. The most common is the Cartesian System **Faceting:** Splits the dataset up into subsets, and visualizes each of them in the same plot. In the code, type the variable you want to use to split the data, and say how they should be arranged. Example: ``` ggplot(data = mpg) + geom_point( mapping = aes(x = displ, y = hwy, color = class), stat = "identity", position = "identity" ) + coord_cartesian() + facet_wrap(facets = vars(class), nrow = 1) # facet_wrap(~ class, nrow = 1) ``` --- ### How to simplify our code template? With defaults! <!-- Remember I previously said that writing code using this template way is a great starting point but probably not your end point? We can rewrite the same code, simplified that way (taking advantage of fact that layers share common features) --> Example 1: we use the example from the previous slides Long code: ``` ggplot(data = mpg) + geom_point( mapping = aes(x = displ, y = hwy, color = class), stat = "identity", position = "identity") + geom_smooth( mapping = aes(x = displ, y = hwy), method = "lm", se = FALSE) ``` -- Short code: ``` ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(color = class)) + geom_smooth(method = "lm", se = FALSE) ``` --- ### How to simplify our code template? With defaults! Example 2: scatterplot between cars' engine size (`displ`) and highway fuel efficiency (`hwy`) Long code: ``` ggplot() + layer( data = mpg, mapping = aes(x = displ, y = hwy), geom = "point", stat = "identity", position = "identity") + scale_y_continuous() + scale_x_continuous() + coord_cartesian() ``` -- Short code: ``` ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_point() ``` --- class: inverse, middle # Practice: Gapminder Download today's in-class materials from the website! --- ### Gapminder The `gapminder` dataset: Contains data on various socio-economic indicators for countries around the world over multiple years (1957-2008). It includes information on life expectancy, GDP per capita, and population Gapminder info: https://cran.r-project.org/web/packages/gapminder/readme/README.html and https://www.gapminder.org/ <!-- # Exercise: Gapminder <img src="index_files/figure-html/gapminder-over-time-1.gif" width="80%" style="display: block; margin: auto;" /> # Acknowledgments The content of these slides is derived in part from Benjamin Soltoff’s “Computing for the Social Sciences” course materials, licensed under the CC BY NC 4.0 Creative Commons License. Any errors or oversights are mine alone. -->