MACS 30500 LECTURE 4

Topics: Exploratory vs. Confirmatory Data Analysis. Using Graphs for Data Analysis.

1 / 17

Agenda

  • Exploratory vs. Confirmatory Data Analysis

  • Using Graphs for Data Analysis (practice with the scorecard data):

    • Display variation & co-variation
    • Match type of plot to variable type!
2 / 17

Exploratory vs. Confirmatory Data Analysis

3 / 17

Exploratory vs. Confirmatory Data Analysis

Exploratory Data Analysis (EDA)

The set of exploratory investigations you run to understand your data. Goals: discover patterns and insights, spot anomalies (outliers), generate questions, and formulate hypotheses.

Confirmatory Data Analysis (CDA)

Also called Explanatory Data Analysis. Generally hypothesis-driven; it comes after EDA and formally tests hypotheses (e.g., through modeling).

Today, and in this course, we mostly focus on EDA!

4 / 17

EDA as Iterative and Creative Process

EDA is an Iterative Process:

  1. Generate exploratory questions about your data
  2. Search for answers in the data
  3. Use what you learn to refine your questions and/or generate new questions
  4. Repeat as necessary

EDA is also a Creative Process:

EDA is not an exact science; it requires curiosity about the data, intuition, and patience. At the most basic level, it involves answering two questions: How do values within a single variable vary? How do values of two variables co-vary?

5 / 17

EDA relies on...

Descriptive stats such as frequency counts, measures of central tendency (mean, mode, median), and dispersion (range, variance, standard deviation).

Visualizations such as histograms, bar charts, scatter plots, etc. We focus on visualizations, and specifically we display:

  • Variation: how values within a single variable vary (univariate analysis)
  • Covariation: how values of two variables co-vary (bivariate analysis)
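
The descriptive statistics listed above can be computed with base R alone. The following is a minimal sketch, not part of the course materials: the two vectors hold made-up values that only loosely resemble the penguins measurements, and the names `center`, `spread`, `mode_value`, and `r` are chosen here for illustration.

```r
# Hypothetical values, made up for illustration only
body_mass_g <- c(3750, 3800, 3250, 3450, 3650, 3800)
flipper_length_mm <- c(181, 186, 195, 193, 190, 186)

# Variation within a single variable (univariate)
freq <- table(body_mass_g)  # frequency counts
center <- c(
  mean = mean(body_mass_g),
  median = median(body_mass_g)
)
# Base R has no built-in statistical mode; take the most frequent value
mode_value <- as.numeric(names(freq)[which.max(freq)])
spread <- c(
  range = diff(range(body_mass_g)),  # max minus min
  variance = var(body_mass_g),
  sd = sd(body_mass_g)
)

# Covariation between two variables (bivariate): Pearson correlation
r <- cor(body_mass_g, flipper_length_mm)
```

Numbers like these summarize variation and covariation compactly, but they can hide the shape of the data, which is one reason to pair them with plots.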

Visualizations are essential tools in both EDA and CDA, each with distinct purposes. Even within EDA, we can utilize visualizations in multiple ways. Let’s look at an example with the penguins data to illustrate this point.

6 / 17

Example: First Plot

library(ggplot2)         # provides ggplot(), used on the following slides
library(palmerpenguins)  # provides the penguins data
data("penguins")
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 2 more variables: sex <fct>, year <int>

TASK: build a plot of two continuous variables (penguins' body mass and flipper length).

7 / 17

First Plot

ggplot(
  data = penguins,
  mapping = aes(
    x = body_mass_g,
    y = flipper_length_mm
  )
) +
  geom_point() +
  geom_smooth()

Once you have drawn an initial plot like this, ask yourself:

  • Substantive questions: What does this graph tell us? Are there patterns? Outliers? What hypotheses can we generate from it? Is the chosen plot appropriate here?
  • Stylistic questions: What are the strengths and limitations of this quick visualization? How could we improve it?
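
One way to back up the "Outliers?" question with numbers is the 1.5 × IQR rule that box plots use. The sketch below is an illustration under assumptions, not the course's prescribed method: the vector holds made-up values (including one deliberately extreme entry), and the 1.5 multiplier is a common convention rather than a fixed standard.

```r
# Hypothetical body-mass values; the last one is deliberately extreme
body_mass_g <- c(3750, 3800, 3250, 3450, 3650, 9000)

# boxplot.stats() flags points lying more than coef * IQR beyond the hinges
stats <- boxplot.stats(body_mass_g, coef = 1.5)
outliers <- stats$out

# Keep only the values that were not flagged
without_outliers <- body_mass_g[!body_mass_g %in% outliers]
```

A flagged value is a prompt to look closer, not an automatic reason to drop it: it may be a data-entry error or a genuinely unusual observation.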
7 / 17

Refined Plot

# theme_xaringan() is provided by the xaringanthemer package
ggplot(
  data = penguins,
  mapping = aes(
    x = body_mass_g,
    y = flipper_length_mm
  )
) +
  geom_point(alpha = .1) +
  geom_smooth(
    method = "lm",
    se = FALSE
  ) +
  labs(
    title = "Relationship between body mass and\nflipper length of a penguin",
    subtitle = "Sample of 344 penguins",
    x = "Body mass (g)",
    y = "Flipper length (mm)"
  ) +
  theme_xaringan(
    title_font_size = 18,
    text_font_size = 16
  )

7 / 17

Takeaway

In this course, our focus is primarily on Exploratory Data Analysis (EDA); we won’t delve into Confirmatory Data Analysis or formal hypothesis testing.

Still, your approach to plotting should follow this sequence:

  • Begin by creating several quick plots to explore the data:

    • Pros: save coding time and focus on observing the data.
    • Avoid adjusting stylistic components at this stage; instead, refine by removing outliers, changing graph types, wrangling data, etc., until you’re satisfied with the plot.
  • Once you have settled on a plot:

    • Improve its visual appeal by refining the code. Add elements such as labels, legends, color adjustments, scales, themes, facets, etc.
    • Use this polished version for reports, presentations, and your final plots for the assignments in this course!
8 / 17

Using Graphs for Data Analysis

We use the scorecard data to practice using graphs for data analysis, and specifically:

  • Display variation & co-variation
  • Match type of plot to variable type!
9 / 17

The scorecard dataset

library(tidyverse)
library(rcis)
data("scorecard")

The Department of Education collects annual statistics on colleges and universities in the United States. Data include: university names, state, type, admission rate, costs, etc.

We are going to look at a subset of this data, from 2018-19.

10 / 17

The scorecard dataset

glimpse(scorecard)
## Rows: 1,732
## Columns: 14
## $ unitid <dbl> 100654, 100663, 100706, 100724, 100751, 100830, 100858, 1009…
## $ name <chr> "Alabama A & M University", "University of Alabama at Birmin…
## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", …
## $ type <fct> "Public", "Public", "Public", "Public", "Public", "Public", …
## $ admrate <dbl> 0.9175, 0.7366, 0.8257, 0.9690, 0.8268, 0.9044, 0.8067, 0.53…
## $ satavg <dbl> 939, 1234, 1319, 946, 1261, 1082, 1300, 1230, 1066, NA, 1076…
## $ cost <dbl> 23053, 24495, 23917, 21866, 29872, 19849, 31590, 32095, 3431…
## $ netcost <dbl> 14990, 16953, 15860, 13650, 22597, 13987, 24104, 22107, 2071…
## $ avgfacsal <dbl> 69381, 99441, 87192, 64989, 92619, 71343, 96642, 56646, 5400…
## $ pctpell <dbl> 0.7019, 0.3512, 0.2536, 0.7627, 0.1772, 0.4644, 0.1455, 0.23…
## $ comprate <dbl> 0.2974, 0.6340, 0.5768, 0.3276, 0.7110, 0.3401, 0.7911, 0.69…
## $ firstgen <dbl> 0.3658281, 0.3412237, 0.3101322, 0.3434343, 0.2257127, 0.381…
## $ debt <dbl> 15250, 15085, 14000, 17500, 17671, 12000, 17500, 16000, 1425…
## $ locale <fct> City, City, City, City, City, City, City, City, City, Suburb…
11 / 17

Types of Visualizations and Best Graph Types

Do I want to represent variation within...

  • Single variable
  • Two variables
  • Three variables

Are my variables...

  • Continuous
  • Categorical
  • Other types (discrete, ordinal, nominal, etc.)
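
A quick way to answer the second question is to inspect each column's class. The sketch below is an assumption-laden illustration: `toy` is a hypothetical data frame with scorecard-style column names and made-up values, and `variable_kind()` is a helper invented here, not a function from any package.

```r
# Hypothetical mini data frame with scorecard-style columns (made-up values)
toy <- data.frame(
  name = c("School A", "School B", "School C", "School D"),
  type = factor(c("Public", "Private, nonprofit", "Public", "Private, for-profit")),
  cost = c(23053, 41200, 19849, 30500),
  admrate = c(0.92, 0.35, 0.81, NA)
)

# Hypothetical helper: label a column by the broad kind of variable it holds
variable_kind <- function(x) {
  if (is.numeric(x)) {
    "continuous"
  } else if (is.factor(x) || is.character(x) || is.logical(x)) {
    "categorical"
  } else {
    "other"
  }
}

kinds <- vapply(toy, variable_kind, character(1))
```

The class is only a first hint. An identifier such as `name` is stored as text but is not a useful categorical variable to plot, and a numeric column with a handful of distinct values may be better treated as discrete.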
12 / 17

Types of Visualizations and Best Graph Types

Single variable (Univariate Analysis) to display how values within a single variable vary

  • One continuous variable: histogram
  • One categorical variable: bar plot

Two variables (Bivariate Analysis) to display how they co-vary

  • Two continuous variables: scatter plot
  • Two categorical variables: (grouped or stacked) bar plot; dot or mosaic plot
  • One categorical and one continuous variable: box plot; faceted histogram

Three variables (Multivariate Analysis)

  • One categorical and two continuous variables: faceted scatterplot; scatterplot with colors
  • One continuous and two categorical variables: box plot grouped by categorical variables
13 / 17

Practice!

The next slide lists a set of tasks. In small groups, use the scorecard dataset (refer to the code on previous slides to load it) to create the most suitable graph for each task. Afterward, we’ll regroup to share code and discuss.

  • Before plotting: Consider the type of variable and the type of variation you need to represent.

  • While plotting: Keep it simple, as you would for an initial EDA. There’s no need to add titles, axis labels, etc. for this exercise.

  • After plotting: Stare at the graph... look for patterns, outliers, or any notable features, and substantively interpret the graph.

14 / 17

Practice!

TASK 1: Display the annual total cost of school attendance across the U.S. Hint: only one variable (cost)

TASK 2: Display the total number of schools in the U.S. by school type. Hint: only one variable (type)

TASK 3: Display the annual total cost and net cost of attendance at schools in the U.S.

TASK 4: Display the total number of schools in the U.S. by school type (n = 3) and by state (n = 54). Note: the initial graph you generate here may lack visual appeal. Focus on identifying potential improvements rather than implementing them for now.

TASK 5: Display the annual total cost of attendance by school type (variables cost and type)

TASK 6: Display the annual total cost of attendance and net cost of attendance by school type (variables cost, netcost, type)

15 / 17

Discussion

File for sharing solutions: https://codeshare.io/vAzK44

Download today’s class materials from our website for further insights into these tasks and for additional practice exercises!

16 / 17

Further insights on these tasks, and more exercises using this dataset, are in today's class materials (downloadable from the website).

17 / 17
