MACS 30500 LECTURE 4
Topics: Exploratory vs. Confirmatory Data Analysis. Using Graphs for Data Analysis.

Agenda

Exploratory vs. Confirmatory Data Analysis
Using Graphs for Data Analysis (practice with the scorecard data):
- Display variation & co-variation
- Match type of plot to variable type!


Exploratory vs. Confirmatory Data Analysis

Exploratory vs. Confirmatory Data Analysis

Exploratory Data Analysis (EDA)

The full set of exploratory investigations used to understand your data and generate questions. Goals: discover patterns and insights, spot anomalies (outliers), generate questions, and formulate hypotheses.

Confirmatory Data Analysis (CDA)

Also called Explanatory Data Analysis. It is generally hypothesis-driven and comes after EDA, to formally test hypotheses (e.g., through modeling).

Today, and in this course, we mostly focus on EDA!


EDA as Iterative and Creative Process

EDA is an Iterative Process:

Generate exploratory questions about your data
Search for answers in the data
Use what you learn to refine your questions and/or generate new questions
Repeat as often as necessary

EDA is also a Creative Process:

EDA is not an exact science; it requires curiosity about the data, intuition, and patience. At the most basic level, it involves answering two questions: How do values within a single variable vary? How do values of two variables co-vary?


EDA relies on...

Descriptive stats such as frequency counts, measures of central tendency (mean, mode, median), and dispersion (range, variance, standard deviation).

Visualizations such as histograms, bar charts, scatter plots, etc. We focus on visualizations, and specifically we display:

Variation: how values within a single variable vary (univariate analysis)
Covariation: how values of two variables co-vary (bivariate analysis)
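The descriptive statistics listed above can be computed in a few lines of base R. The sketch below is only an illustration: it assumes the palmerpenguins package is installed and uses its penguins data, and the object names (species_counts, mass_sd, and so on) are made up for this example.

```r
library(palmerpenguins)

# Frequency counts for a categorical variable
species_counts <- table(penguins$species)
species_counts

# Mode of a categorical variable: the most frequent category
names(species_counts)[which.max(species_counts)]

# Central tendency for a continuous variable.
# na.rm = TRUE drops missing values; without it the result would be NA.
mass_mean   <- mean(penguins$body_mass_g, na.rm = TRUE)
mass_median <- median(penguins$body_mass_g, na.rm = TRUE)

# Dispersion: range, variance, standard deviation
mass_range <- range(penguins$body_mass_g, na.rm = TRUE)
mass_var   <- var(penguins$body_mass_g, na.rm = TRUE)
mass_sd    <- sd(penguins$body_mass_g, na.rm = TRUE)

c(mean = mass_mean, median = mass_median, sd = mass_sd)
```

Comparing the mean and the median is a quick first check for skew or outliers before drawing any plot.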

Visualizations are essential tools in both EDA and CDA, each with distinct purposes. Even within EDA, we can utilize visualizations in multiple ways. Let’s look at an example with the penguins data to illustrate this point.


Example: First Plot

library(tidyverse)       # loads ggplot2, used for the plots below
library(palmerpenguins)  # provides the penguins data
data("penguins")
head(penguins)

## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>

TASK: build a plot of two continuous variables (penguins body mass and flipper length).


First Plot

ggplot(
  data = penguins,
  mapping = aes(
    x = body_mass_g,
    y = flipper_length_mm)
  ) +
  geom_point() +
  geom_smooth()

Once you have drawn an initial plot like this, ask yourself:

Substantive questions: What does this graph tell us? Are there patterns? Outliers? What hypotheses can we generate from it? Is the chosen plot appropriate here?
Stylistic questions: What are the strengths and limitations of this quick visualization? How could we improve it?
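A couple of quick numeric checks can help with the substantive questions. The sketch below is one possible follow-up, not part of the original slides: it computes the correlation between the two plotted variables and counts the rows the scatter plot has to drop because of missing values. The names r and n_missing are illustrative.

```r
library(palmerpenguins)

# Pearson correlation between body mass and flipper length.
# use = "complete.obs" keeps only rows where both values are present.
r <- cor(
  penguins$body_mass_g,
  penguins$flipper_length_mm,
  use = "complete.obs"
)

# Rows with a missing body mass or flipper length: these are the
# points geom_point() drops (with a warning) when drawing the plot.
n_missing <- sum(
  is.na(penguins$body_mass_g) | is.na(penguins$flipper_length_mm)
)

r
n_missing
```

A single correlation coefficient summarizes only the linear part of the relationship, so it complements the plot rather than replacing it.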


Refined Plot

ggplot(
  data = penguins,
  mapping = aes(
    x = body_mass_g,
    y = flipper_length_mm)
  ) +
  geom_point(alpha = .1) +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(
    title = "Relationship between body mass and\nflipper length of a penguin",
    subtitle = "Sample of 344 penguins",
    x = "Body mass (g)",
    y = "Flipper length (mm)"
  ) +
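  # theme_xaringan() comes from the xaringanthemer package (load it with
  # library(xaringanthemer)); it styles the plot to match the slide theme
  theme_xaringan(
    title_font_size = 18,
    text_font_size = 16
  )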


Takeaway

In this course, our focus is primarily on Exploratory Data Analysis (EDA); we won’t delve into Confirmatory Data Analysis or formal hypothesis testing.

Still, your approach to plotting should follow this sequence:

Begin by creating several quick plots to explore the data:
- Pros: you save coding time and can focus on observing the data.
- Avoid adjusting stylistic components at this stage; instead, refine by removing outliers, changing graph types, wrangling data, etc., until you’re satisfied with the plot.
Once you have settled on a plot:
- Improve its visual appeal by refining the code. Add elements such as labels, legends, color adjustments, scales, themes, facets, etc.
- Use this polished version for reports, presentations, and your final plots for the assignments in this course!


Using Graphs for Data Analysis

We use the scorecard data to practice using graphs for data analysis, and specifically:
- Display variation & co-variation
- Match type of plot to variable type!

The `scorecard` dataset

library(tidyverse)
library(rcis)
data("scorecard")

The Department of Education collects annual statistics on colleges and universities in the United States. The data include university names, state, type, admission rate, costs, etc.

We are going to look at a subset of this data, from 2018-19.


The `scorecard` dataset

glimpse(scorecard)

## Rows: 1,732
## Columns: 14
## $ unitid    <dbl> 100654, 100663, 100706, 100724, 100751, 100830, 100858, 1009…
## $ name      <chr> "Alabama A & M University", "University of Alabama at Birmin…
## $ state     <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", …
## $ type      <fct> "Public", "Public", "Public", "Public", "Public", "Public", …
## $ admrate   <dbl> 0.9175, 0.7366, 0.8257, 0.9690, 0.8268, 0.9044, 0.8067, 0.53…
## $ satavg    <dbl> 939, 1234, 1319, 946, 1261, 1082, 1300, 1230, 1066, NA, 1076…
## $ cost      <dbl> 23053, 24495, 23917, 21866, 29872, 19849, 31590, 32095, 3431…
## $ netcost   <dbl> 14990, 16953, 15860, 13650, 22597, 13987, 24104, 22107, 2071…
## $ avgfacsal <dbl> 69381, 99441, 87192, 64989, 92619, 71343, 96642, 56646, 5400…
## $ pctpell   <dbl> 0.7019, 0.3512, 0.2536, 0.7627, 0.1772, 0.4644, 0.1455, 0.23…
## $ comprate  <dbl> 0.2974, 0.6340, 0.5768, 0.3276, 0.7110, 0.3401, 0.7911, 0.69…
## $ firstgen  <dbl> 0.3658281, 0.3412237, 0.3101322, 0.3434343, 0.2257127, 0.381…
## $ debt      <dbl> 15250, 15085, 14000, 17500, 17671, 12000, 17500, 16000, 1425…
## $ locale    <fct> City, City, City, City, City, City, City, City, City, Suburb…
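Before plotting, it can help to confirm which columns are categorical and how much data is missing. The following is a minimal sketch, assuming the rcis package is installed as in the loading code above; the object names type_counts and n_missing_sat are made up for illustration.

```r
library(dplyr)
library(rcis)
data("scorecard")

# How many schools fall into each category of the factor `type`?
type_counts <- count(scorecard, type)
type_counts

# How many schools have no average SAT score recorded?
# (The glimpse() output above already shows at least one NA in satavg.)
n_missing_sat <- sum(is.na(scorecard$satavg))
n_missing_sat
```

Factor columns such as type and locale call for categorical plots, while the dbl columns such as cost and netcost are continuous.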


Types of Visualizations and Best Graph Types

Do I want to represent variations within...

Single variable
Two variables
Three variables

Are my variables...

Continuous
Categorical
Other types (discrete, ordinal, nominal, etc.)


Types of Visualizations and Best Graph Types

Single variable (Univariate Analysis) to display how values within a single variable vary

One continuous variable: histogram
One categorical variable: bar plot
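As a rough sketch of the two univariate cases, the code below uses the penguins data from earlier in the lecture rather than scorecard, so the practice tasks stay open. The binwidth of 200 g is an arbitrary choice for illustration, and the object names p_hist and p_bar are made up.

```r
library(ggplot2)
library(palmerpenguins)

# One continuous variable: histogram of body mass.
# The binwidth is a judgment call; try several values.
p_hist <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)

# One categorical variable: bar plot of the number of penguins per species.
# geom_bar() counts the rows in each category for you.
p_bar <- ggplot(penguins, aes(x = species)) +
  geom_bar()

p_hist
p_bar
```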

Two variables (Bivariate Analysis) to display how they co-vary

Two continuous variables: scatter plot
Two categorical variables: (grouped or stacked) bar plot; dot or mosaic plot
One categorical and one continuous variable: box plot; faceted histogram
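The bivariate cases can be sketched the same way. Again this uses the penguins data as a stand-in, and the particular variable pairings and object names are only illustrative choices.

```r
library(ggplot2)
library(palmerpenguins)

# Two continuous variables: scatter plot
p_scatter <- ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point()

# Two categorical variables: grouped bar plot.
# position = "dodge" places the bars side by side instead of stacking them.
p_grouped <- ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "dodge")

# One categorical and one continuous variable: box plot
p_box <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot()

# Same pairing as a faceted histogram: one panel per species
p_facet_hist <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200) +
  facet_wrap(~ species)

p_scatter
p_grouped
p_box
p_facet_hist
```

Dropping position = "dodge" gives the stacked version of the bar plot, which is the geom_bar() default.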

Three variables (Multivariate Analysis)

One categorical and two continuous variables: faceted scatterplot; scatterplot with colors
One continuous and two categorical variables: box plot grouped by categorical variables
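A third variable is usually added through an extra aesthetic (color, fill) or through facets. The sketch below illustrates both with the penguins data. Dropping the rows with unknown sex is an assumption made here only to keep the last plot tidy, and all object names are made up.

```r
library(ggplot2)
library(palmerpenguins)

# One categorical and two continuous variables: scatter plot with colors
p_color <- ggplot(
  penguins,
  aes(x = body_mass_g, y = flipper_length_mm, color = species)
) +
  geom_point()

# Same three variables as a faceted scatter plot: one panel per species
p_facet <- ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point() +
  facet_wrap(~ species)

# One continuous and two categorical variables: grouped box plot.
# Rows with unknown sex are dropped so no NA group appears in the legend.
penguins_known_sex <- penguins[!is.na(penguins$sex), ]
p_box_grouped <- ggplot(
  penguins_known_sex,
  aes(x = species, y = body_mass_g, fill = sex)
) +
  geom_boxplot()

p_color
p_facet
p_box_grouped
```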


Practice!

The next slide lists a set of tasks. In small groups, use the scorecard dataset (refer to the code on previous slides to load it) to create the most suitable graph for each task. Afterward, we’ll regroup to share code and discuss.

Before plotting: Consider the type of variable and the type of variation you need to represent.
While plotting: Keep it simple, as you would for an initial EDA. There’s no need to add titles, axis labels, etc. for this exercise.
After plotting: Stare at the graph... look for patterns, outliers, or any notable features, and substantively interpret the graph.


Practice!

TASK 1: Display the annual total cost of school attendance across the U.S. Hint: only one variable (cost)

TASK 2: Display the total number of schools in the U.S. by school type. Hint: only one variable (type)

TASK 3: Display the annual total cost and net cost of attendance at schools in the U.S. Hint: two variables (cost and netcost)

TASK 4: Display the total number of schools in the U.S. by school type (n = 3) and by state (n = 54). Note: the initial graph you generate here may lack visual appeal. Focus on identifying potential improvements rather than implementing them for now.

TASK 5: Display the annual total cost of attendance by school type (variables cost and type)

TASK 6: Display the annual total cost of attendance and net cost of attendance by school type (variables cost, netcost, type)


Discussion

File for sharing solutions: https://codeshare.io/vAzK44

Download today’s class materials from our website for further insights into these tasks and for additional practice exercises!


Further insights on these tasks, and more exercises using this dataset, are in today's class materials (downloadable from the website).
