class: center, middle, inverse, title-slide

.title[
# MACS 30500 LECTURE 4
]
.author[
### Topics: Exploratory vs. Confirmatory Data Analysis. Using Graphs for Data Analysis.
]

---
class: inverse, middle

# Agenda

* Exploratory vs. Confirmatory Data Analysis

* Using Graphs for Data Analysis (practice with the `scorecard` data):
    * Display variation & co-variation
    * Match type of plot to variable type!

---

### Exploratory vs. Confirmatory Data Analysis

**Exploratory Data Analysis (EDA)**

The set of exploratory investigations used to understand your data and generate questions.

Goals: discover patterns and insights, spot anomalies (outliers), generate questions, formulate hypotheses.

**Confirmatory Data Analysis (CDA)**

Also called Explanatory Data Analysis. Generally hypothesis-driven; it comes after EDA and formally tests hypotheses (e.g., through modeling).

--

*Today, and in this course, we mostly focus on EDA!*

---

### EDA as Iterative and Creative Process

**EDA is an Iterative Process:**

1. Generate exploratory questions about your data
1. Search for answers in the data
1. Use what you learn to refine your questions and/or generate new questions
1. Repeat as necessary

--

**EDA is also a Creative Process:**

EDA is not an exact science, and requires curiosity about the data, intuition, and patience.
At the most basic level, it involves answering two questions: How do values within a single variable vary? How do values of two variables co-vary?

---

### EDA relies on...

**Descriptive stats** such as frequency counts, measures of central tendency (mean, mode, median), and dispersion (range, variance, standard deviation).

**Visualizations** such as histograms, bar charts, scatter plots, etc.

We focus on visualizations, and specifically we display:

- Variation: how values within a single variable vary (univariate analysis)
- Covariation: how values of two variables co-vary (bivariate analysis)

--

*Visualizations are essential tools in both EDA and CDA, each with distinct purposes. Even within EDA, we can utilize visualizations in multiple ways. Let's look at an example with the penguins data to illustrate this point.*

<!-- In Confirmatory Analysis, you generate only a few graphs, and each graph is well refined and will go into your final report or research. -->

---

### Example: First Plot

```r
library(palmerpenguins)
data("penguins")
head(penguins)
```

```
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>
```

TASK: build a plot of two continuous variables (penguins' body mass and flipper length).

<!-- ASK: if we have two continuous quantitative variables and so we want to show co-variation among them... what is the best plot to use?
-->

---
count: false

### First Plot

.panel1-penguins-eda-auto[
```r
*ggplot(
*  data = penguins,
*  mapping = aes(
*    x = body_mass_g,
*    y = flipper_length_mm)
*  )
```
]

.panel2-penguins-eda-auto[
<img src="index_files/figure-html/penguins-eda_auto_01_output-1.png" width="80%" style="display: block; margin: auto;" />
]

---
count: false

### First Plot

.panel1-penguins-eda-auto[
```r
ggplot(
  data = penguins,
  mapping = aes(
    x = body_mass_g,
    y = flipper_length_mm)
  ) +
* geom_point()
```
]

.panel2-penguins-eda-auto[
<img src="index_files/figure-html/penguins-eda_auto_02_output-1.png" width="80%" style="display: block; margin: auto;" />
]

---
count: false

### First Plot

.panel1-penguins-eda-auto[
```r
ggplot(
  data = penguins,
  mapping = aes(
    x = body_mass_g,
    y = flipper_length_mm)
  ) +
  geom_point() +
* geom_smooth()
```
]

.panel2-penguins-eda-auto[
<img src="index_files/figure-html/penguins-eda_auto_03_output-1.png" width="80%" style="display: block; margin: auto;" />
]

<style>
.panel1-penguins-eda-auto {
  color: black;
  width: 38.6060606060606%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel2-penguins-eda-auto {
  color: black;
  width: 59.3939393939394%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel3-penguins-eda-auto {
  color: black;
  width: NA%;
  hight: 33%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
</style>

<!-- notice in the code I put the data and mapping above, right after ggplot() because both geometries use the same data and mapping -->

--

Once you draw such an initial plot, ask yourself:

* *Substantive questions:* What does this graph tell us? Are there patterns? Outliers? What hypotheses can we generate from it? Is the chosen plot appropriate here?
* *Stylistic questions:* What are the strengths and limitations of this quick visualization? How could we improve it?
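---

### Checking the First Plot with Descriptive Stats

The substantive questions on the previous slide can also be checked numerically. Below is a minimal sketch, assuming the `palmerpenguins` package is installed; the object names (`vars`, `n_missing`, `mass_summary`, `r`) are illustrative and not part of the original slides.

```r
library(palmerpenguins)

# Keep only the two variables shown in the scatter plot
vars <- palmerpenguins::penguins[, c("body_mass_g", "flipper_length_mm")]

# Rows with a missing value in either variable cannot be drawn as points
n_missing <- sum(!complete.cases(vars))

# Central tendency and dispersion of body mass, ignoring missing values
mass_summary <- c(
  mean   = mean(vars$body_mass_g, na.rm = TRUE),
  median = median(vars$body_mass_g, na.rm = TRUE),
  sd     = sd(vars$body_mass_g, na.rm = TRUE)
)

# Pearson correlation between the two variables, on complete rows only
r <- cor(vars$body_mass_g, vars$flipper_length_mm, use = "complete.obs")

n_missing
mass_summary
r
```

A strongly positive `r` would be consistent with an upward trend in the scatter plot, and a noticeable gap between mean and median can hint at skew or outliers worth a closer look.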
---
count: false

### Refined Plot

.panel1-penguins-final-auto[
```r
*ggplot(
*  data = penguins,
*  mapping = aes(
*    x = body_mass_g,
*    y = flipper_length_mm)
*  )
```
]

.panel2-penguins-final-auto[
<img src="index_files/figure-html/penguins-final_auto_01_output-1.png" width="80%" style="display: block; margin: auto;" />
]

---
count: false

### Refined Plot

.panel1-penguins-final-auto[
```r
ggplot(
  data = penguins,
  mapping = aes(
    x = body_mass_g,
    y = flipper_length_mm)
  ) +
* geom_point(alpha = .1)
```
]

.panel2-penguins-final-auto[
<img src="index_files/figure-html/penguins-final_auto_02_output-1.png" width="80%" style="display: block; margin: auto;" />
]

---
count: false

### Refined Plot

.panel1-penguins-final-auto[
```r
ggplot(
  data = penguins,
  mapping = aes(
    x = body_mass_g,
    y = flipper_length_mm)
  ) +
  geom_point(alpha = .1) +
* geom_smooth(method = "lm",
*             se = FALSE)
```
]

.panel2-penguins-final-auto[
<img src="index_files/figure-html/penguins-final_auto_03_output-1.png" width="80%" style="display: block; margin: auto;" />
]

---
count: false

### Refined Plot

.panel1-penguins-final-auto[
```r
ggplot(
  data = penguins,
  mapping = aes(
    x = body_mass_g,
    y = flipper_length_mm)
  ) +
  geom_point(alpha = .1) +
  geom_smooth(method = "lm",
              se = FALSE) +
* labs(
*   title = "Relationship between body mass and\nflipper length of a penguin",
*   subtitle = "Sample of 344 penguins",
*   x = "Body mass (g)",
*   y = "Flipper length (mm)"
* )
```
]

.panel2-penguins-final-auto[
<img src="index_files/figure-html/penguins-final_auto_04_output-1.png" width="80%" style="display: block; margin: auto;" />
]

---
count: false

### Refined Plot

.panel1-penguins-final-auto[
```r
ggplot(
  data = penguins,
  mapping = aes(
    x = body_mass_g,
    y = flipper_length_mm)
  ) +
  geom_point(alpha = .1) +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(
    title = "Relationship between body mass and\nflipper length of a penguin",
    subtitle = "Sample of 344 penguins",
    x = "Body mass (g)",
    y = "Flipper length (mm)"
  ) +
*
theme_xaringan(
*   title_font_size = 18,
*   text_font_size = 16
* )
```
]

.panel2-penguins-final-auto[
<img src="index_files/figure-html/penguins-final_auto_05_output-1.png" width="80%" style="display: block; margin: auto;" />
]

<style>
.panel1-penguins-final-auto {
  color: black;
  width: 38.6060606060606%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel2-penguins-final-auto {
  color: black;
  width: 59.3939393939394%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel3-penguins-final-auto {
  color: black;
  width: NA%;
  hight: 33%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
</style>

* Add elements such as labels, legends, color adjustments, scales, themes, facets, etc.
* Use this polished version for reports, presentations, and your final plots for the assignments in this course!

<!-- you should approach HW2 with this slide in mind! -->

---
class: inverse, middle

# Using Graphs for Data Analysis

#### We use the `scorecard` data to practice using graphs for data analysis, and specifically:

* Display variation & co-variation
* Match type of plot to variable type!

---

### The `scorecard` dataset

```r
library(tidyverse)
library(rcis)
data("scorecard")
```

The Department of Education collects annual statistics on colleges and universities in the United States.

Data include: university names, state, type, admission rate, costs, etc.

We are going to look at a subset of this data, from 2018-19.
---

### The `scorecard` dataset

```r
glimpse(scorecard)
```

```
## Rows: 1,732
## Columns: 14
## $ unitid    <dbl> 100654, 100663, 100706, 100724, 100751, 100830, 100858, 1009…
## $ name      <chr> "Alabama A & M University", "University of Alabama at Birmin…
## $ state     <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", …
## $ type      <fct> "Public", "Public", "Public", "Public", "Public", "Public", …
## $ admrate   <dbl> 0.9175, 0.7366, 0.8257, 0.9690, 0.8268, 0.9044, 0.8067, 0.53…
## $ satavg    <dbl> 939, 1234, 1319, 946, 1261, 1082, 1300, 1230, 1066, NA, 1076…
## $ cost      <dbl> 23053, 24495, 23917, 21866, 29872, 19849, 31590, 32095, 3431…
## $ netcost   <dbl> 14990, 16953, 15860, 13650, 22597, 13987, 24104, 22107, 2071…
## $ avgfacsal <dbl> 69381, 99441, 87192, 64989, 92619, 71343, 96642, 56646, 5400…
## $ pctpell   <dbl> 0.7019, 0.3512, 0.2536, 0.7627, 0.1772, 0.4644, 0.1455, 0.23…
## $ comprate  <dbl> 0.2974, 0.6340, 0.5768, 0.3276, 0.7110, 0.3401, 0.7911, 0.69…
## $ firstgen  <dbl> 0.3658281, 0.3412237, 0.3101322, 0.3434343, 0.2257127, 0.381…
## $ debt      <dbl> 15250, 15085, 14000, 17500, 17671, 12000, 17500, 16000, 1425…
## $ locale    <fct> City, City, City, City, City, City, City, City, City, Suburb…
```

---

### Types of Visualizations and Best Graph Types

Do I want to represent variation within...

* A single variable
* Two variables
* Three variables

Are my variables...

* Continuous
* Categorical
* Other types (discrete, ordinal, nominal, etc.)
---

### Types of Visualizations and Best Graph Types

**Single variable (Univariate Analysis) to display how values within a single variable vary**

* One continuous variable: histogram
* One categorical variable: bar plot

**Two variables (Bivariate Analysis) to display how they co-vary**

* Two continuous variables: scatter plot
* Two categorical variables: (grouped or stacked) bar plot; dot or mosaic plot
* One categorical and one continuous variable: box plot; faceted histogram

**Three variables (Multivariate Analysis)**

* One categorical and two continuous variables: faceted scatterplot; scatterplot with colors
* One continuous and two categorical variables: box plot grouped by categorical variables

<!--
*We are going to review all plots in bold today, both code and interpretation. The plots listed in this slide are the most common options, but there are more possibilities to explore.*

*Takeaways: experiment and match the type of plot with the variable types!*
-->

---

## Practice!

The next slide lists a set of tasks. In small groups, use the `scorecard` dataset (refer to the code on previous slides to load it) to create the most suitable graph for each task. Afterward, we’ll regroup to share code and discuss.

* **Before plotting:** Consider the type of variable and the type of variation you need to represent.
* **While plotting:** Keep it simple, as you would for an initial EDA. There’s no need to add titles, axis labels, etc. for this exercise.
* **After plotting:** Stare at the graph... look for patterns, outliers, or any notable features, and substantively interpret the graph.

---

## Practice!

TASK 1: Display the annual total cost of school attendance across the U.S.
*Hint: only one variable (`cost`)*

TASK 2: Display the total number of schools in the U.S. by school type.
*Hint: only one variable (`type`)*

TASK 3: Display the annual total cost and net cost of attendance to schools in the U.S.

TASK 4: Display the total number of schools in the U.S.
by school type (n = 3) and by state (n = 54).
*Note: the initial graph you generate here may lack visual appeal. Focus on identifying potential improvements rather than implementing them for now.*

TASK 5: Display the annual total cost of attendance by school type (variables `cost` and `type`)

TASK 6: Display the annual total cost of attendance and net cost of attendance by school type (variables `cost`, `netcost`, `type`)

---

## Discussion

File for sharing solutions:

Download today’s class materials from our website for further insights into these tasks and for additional practice exercises!

---
class: inverse, middle

### Further insights on these tasks, and more exercises using this dataset, are in today's class materials (downloadable from the website)
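
---

### Sketch: Matching Plot Type to Variable Type

The rule "match type of plot to variable type" can be sketched in code as one possible starting point for Tasks 1, 2, 3, and 5. This is a minimal sketch, assuming `ggplot2` is installed. To stay self-contained it uses a small made-up data frame, `scorecard_toy` (hypothetical; the values are invented for illustration), that borrows the column names `type`, `cost`, and `netcost` from `scorecard`. Swap in `scorecard` to work with the real data.

```r
library(ggplot2)

# Hypothetical stand-in for `scorecard`; values are invented for illustration
scorecard_toy <- data.frame(
  type = factor(c(
    "Public", "Public", "Private, nonprofit",
    "Private, nonprofit", "Private, for-profit", "Public"
  )),
  cost    = c(23000, 24500, 52000, 47000, 31000, 19800),
  netcost = c(15000, 17000, 28000, 25000, 22000, 14000)
)

# Task 1: one continuous variable -> histogram
p_hist <- ggplot(scorecard_toy, aes(x = cost)) +
  geom_histogram(bins = 5)

# Task 2: one categorical variable -> bar plot
p_bar <- ggplot(scorecard_toy, aes(x = type)) +
  geom_bar()

# Task 3: two continuous variables -> scatter plot
p_scatter <- ggplot(scorecard_toy, aes(x = cost, y = netcost)) +
  geom_point()

# Task 5: one categorical and one continuous variable -> box plot
p_box <- ggplot(scorecard_toy, aes(x = type, y = cost)) +
  geom_boxplot()

# Print a plot object to draw it
print(p_hist)
```

Storing each plot in an object keeps the four variable-type cases side by side for comparison; in a quick EDA session you would usually just run each `ggplot()` call directly.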