Exploratory vs. Confirmatory Data Analysis
Using Graphs for Data Analysis (practice with the scorecard data):
Exploratory Data Analysis (EDA)
The set of exploratory investigations you carry out to understand your data and generate questions. Goals: discover patterns and insights, spot anomalies (outliers), generate questions, and formulate hypotheses.
Confirmatory Data Analysis (CDA)
Also called Explanatory Data Analysis. Generally hypothesis-driven; it comes after EDA and formally tests hypotheses (e.g., through modeling).
Today, and in this course, we mostly focus on EDA!
EDA is an Iterative Process:
EDA is also a Creative Process:
EDA is not an exact science; it requires curiosity about the data, intuition, and patience. At the most basic level, it involves answering two questions: How do values within a single variable vary? How do values of two variables co-vary?
Descriptive stats such as frequency counts, measures of central tendency (mean, mode, median), and dispersion (range, variance, standard deviation).
Visualizations such as histograms, bar charts, scatter plots, etc. We focus on visualizations, and specifically we display:
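Although the focus here is on visualizations, the descriptive statistics mentioned above are quick to compute. The following is a minimal sketch, assuming the dplyr and palmerpenguins packages are installed; the object names mass_summary and species_counts are illustrative only.

```r
library(dplyr)
library(palmerpenguins)

# Central tendency and dispersion for one continuous variable.
# na.rm = TRUE drops the penguins whose body mass is missing.
mass_summary <- penguins |>
  summarise(
    n_missing = sum(is.na(body_mass_g)),
    mean      = mean(body_mass_g, na.rm = TRUE),
    median    = median(body_mass_g, na.rm = TRUE),
    sd        = sd(body_mass_g, na.rm = TRUE),
    min       = min(body_mass_g, na.rm = TRUE),
    max       = max(body_mass_g, na.rm = TRUE)
  )
mass_summary

# Frequency counts for one categorical variable, largest group first.
species_counts <- penguins |>
  count(species, sort = TRUE)
species_counts
```

Numbers like these summarize a variable, but they can hide the shape of its distribution, which is one reason to plot it as well.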
Visualizations are essential tools in both EDA and CDA, each with distinct purposes. Even within EDA, we can utilize visualizations in multiple ways. Let’s look at an example with the penguins data to illustrate this point.
library(tidyverse)  # provides ggplot(), used in the plots that follow
library(palmerpenguins)
data("penguins")
head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>
TASK: build a plot of two continuous variables (penguins body mass and flipper length).
ggplot(
  data = penguins,
  mapping = aes(x = body_mass_g, y = flipper_length_mm)
) +
  geom_point() +
  geom_smooth()
Once you have drawn such an initial plot, ask yourself:
# theme_xaringan() comes from the xaringanthemer package
ggplot(
  data = penguins,
  mapping = aes(x = body_mass_g, y = flipper_length_mm)
) +
  geom_point(alpha = .1) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Relationship between body mass and\nflipper length of a penguin",
    subtitle = "Sample of 344 penguins",
    x = "Body mass (g)",
    y = "Flipper length (mm)"
  ) +
  theme_xaringan(
    title_font_size = 18,
    text_font_size = 16
  )
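Because EDA is iterative, a plot like this usually prompts a follow-up question. One possible next step, sketched here as an illustration rather than as part of the original example, is to ask whether the relationship looks the same within each species by mapping species to color. The object name p_species is hypothetical.

```r
library(ggplot2)
library(palmerpenguins)

# Same two variables as in the plots above, with species mapped to color.
# Because color is set in the top-level aes(), each species gets its own
# points and its own linear trend line.
p_species <- ggplot(
  data = penguins,
  mapping = aes(x = body_mass_g, y = flipper_length_mm, color = species)
) +
  geom_point(alpha = .5) +
  geom_smooth(method = "lm", se = FALSE)

p_species
```

A quick variant like this stays unpolished on purpose; titles and themes are worth adding only once you have settled on the plot you want to show.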
In this course, our focus is primarily on Exploratory Data Analysis (EDA); we won’t delve into Confirmatory Data Analysis or formal hypothesis testing.
Still, your approach to plotting should follow this sequence:
Begin by creating several quick plots to explore the data:
Once you have settled on a plot:
We use the scorecard data to practice using graphs for data analysis, and specifically:
scorecard dataset
library(tidyverse)
library(rcis)
data("scorecard")
The Department of Education collects annual statistics on colleges and universities in the United States. The data include university names, state, type, admission rate, costs, etc.
We are going to look at a subset of this data, from 2018-19.
scorecard dataset
glimpse(scorecard)
## Rows: 1,732
## Columns: 14
## $ unitid    <dbl> 100654, 100663, 100706, 100724, 100751, 100830, 100858, 1009…
## $ name      <chr> "Alabama A & M University", "University of Alabama at Birmin…
## $ state     <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", …
## $ type      <fct> "Public", "Public", "Public", "Public", "Public", "Public", …
## $ admrate   <dbl> 0.9175, 0.7366, 0.8257, 0.9690, 0.8268, 0.9044, 0.8067, 0.53…
## $ satavg    <dbl> 939, 1234, 1319, 946, 1261, 1082, 1300, 1230, 1066, NA, 1076…
## $ cost      <dbl> 23053, 24495, 23917, 21866, 29872, 19849, 31590, 32095, 3431…
## $ netcost   <dbl> 14990, 16953, 15860, 13650, 22597, 13987, 24104, 22107, 2071…
## $ avgfacsal <dbl> 69381, 99441, 87192, 64989, 92619, 71343, 96642, 56646, 5400…
## $ pctpell   <dbl> 0.7019, 0.3512, 0.2536, 0.7627, 0.1772, 0.4644, 0.1455, 0.23…
## $ comprate  <dbl> 0.2974, 0.6340, 0.5768, 0.3276, 0.7110, 0.3401, 0.7911, 0.69…
## $ firstgen  <dbl> 0.3658281, 0.3412237, 0.3101322, 0.3434343, 0.2257127, 0.381…
## $ debt      <dbl> 15250, 15085, 14000, 17500, 17671, 12000, 17500, 16000, 1425…
## $ locale    <fct> City, City, City, City, City, City, City, City, City, Suburb…
Do I want to represent variations within...
Is my variable (or are my variables)...
Single variable (Univariate Analysis) to display how values within a single variable vary
Two variables (Bivariate Analysis) to display how they co-vary
Three variables (Multivariate Analysis)
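To make the three cases concrete, here is a minimal sketch using the penguins data from earlier (not the scorecard data, so the upcoming tasks are left open). The choice of geoms reflects common practice rather than a fixed rule, and the object names are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Univariate, continuous: a histogram shows how one variable varies.
p_uni_cont <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 30)

# Univariate, categorical: a bar chart shows the count in each category.
p_uni_cat <- ggplot(penguins, aes(x = species)) +
  geom_bar()

# Bivariate, categorical vs. continuous: one boxplot per category.
p_bi <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot()

# Multivariate: two continuous variables plus a categorical one as color.
p_multi <- ggplot(
  penguins,
  aes(x = body_mass_g, y = flipper_length_mm, color = species)
) +
  geom_point()

# Print any of the objects to draw it, for example:
p_bi
```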
The next slide lists a set of tasks. In small groups, use the scorecard dataset (refer to the code on previous slides to load it) to create the most suitable graph for each task. Afterward, we’ll regroup to share code and discuss.
Before plotting: Consider the type of variable and the type of variation you need to represent.
While plotting: Keep it simple, as you would for an initial EDA. There’s no need to add titles, axis labels, etc. for this exercise.
After plotting: Stare at the graph... look for patterns, outliers, or any notable features, and substantively interpret the graph.
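For the "before plotting" step, one way to check variable types is sketched below. It assumes the tidyverse and rcis packages are installed and that scorecard has the columns shown in the glimpse() output earlier; var_types and n_missing are illustrative names.

```r
library(tidyverse)
library(rcis)
data("scorecard")

# Class of every column: numeric columns suit histograms and scatter
# plots, while factor and character columns suit bar charts.
var_types <- map_chr(scorecard, ~ class(.x)[1])
var_types

# Number of missing values per column; ggplot2 drops rows with missing
# values (with a warning), so it helps to know how many to expect.
n_missing <- map_int(scorecard, ~ sum(is.na(.x)))
n_missing
```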
TASK 1: Display the annual total cost of school attendance across the U.S. Hint: only one variable (cost)
TASK 2: Display the total number of schools in the U.S. by school type. Hint: only one variable (type)
TASK 3: Display the annual total cost and net cost of attendance to schools in the U.S.
TASK 4: Display the total number of schools in the U.S. by school type (n = 3) and by state (n = 54). Note: the initial graph you generate here may lack visual appeal. Focus on identifying potential improvements rather than implementing them for now.
TASK 5: Display the annual total cost of attendance by school type (variables cost and type)
TASK 6: Display the annual total cost of attendance and net cost of attendance by school type (variables cost, netcost, type)
File for sharing solutions: https://codeshare.io/vAzK44
Download today’s class materials from our website for further insights into these tasks and for additional practice exercises!