class: center, middle, inverse, title-slide .title[ # MACS 30500 LECTURE 4 ] .author[ ### Topics: Exploratory vs. Confirmatory Data Analysis. Using Graphs for Data Analysis. ] --- class: inverse, middle # Agenda * Exploratory vs. Confirmatory Data Analysis * Using Graphs for Data Analysis (practice with the `scorecard` data): * Display variation & co-variation * Match type of plot to variable type! <!-- NOTES FROM FALL 2024 EDITION, AND WHAT TO IMPROVE FOR NEXT YEAR did in class exercise as I had in my note below and worked much better than me talking through the slides. Took more time than anticipated and prompts (the 6 tasks could be shortened and clarified, maybe cut one). Had to move Factors to next lecture; prefer not to. Further cut and improve initial slides on EDA (should take max 15 min) --> <!-- Main takeaways from today's lecture: 1. match plot to variable type 1. do not stop after one plot... experiment with different visualizations 1. effective use of visuals for EDA often require data transformation (`dplyr` + `ggplot`) Goals for this time (summer 2024) were (1) to make this lecture more interactive and (2) to improve the message of matching the plot with the right variable types There is still room for improvement; idea for next time: start with the intro on what is EDA (me lecturing, but also keep the interactive session on other ways to code that penguins example); then start the section titled "EDA with scorecard dataset" with a slide that lists all "tasks" and ask students to code them first, in team of 2 and give specific instructions: identify the type of variables you need to use, explain which plot is the most suitable to represent that relationship and why (more than one plot might be suitable, pick one); produce the code and put code and plot (copy-paste it?) in this shared document under your "task") have students work on it for 10 min then we share insights and compare results; have student share their code on google or something that everyone can see so we can run the plot interactively in class might want to increase or vary the difficulty level of tasks and to give an intro to the dataset together before jumping into it the bulk of the lecture should be this activity SUGGESTIONS FOR NEXT TIME: say that this is a key lecture and you want to return to it in future hw, including hte last two (hw6 and final project that are more open ended), mostly we want variety in graphs (bar graphs are great but there are other ways to display two cat variables) show more type of graphs and give an exercise on that add the vis cheat sheet the ggplot2 cheatsheet is great for ideas on different types of graph (click on Download pdf) https://rstudio.github.io/cheatsheets/html/data-visualization.html?_gl=1*16veqlu*_ga*MjA5MDcxNzYwNC4xNzE1Nzg4MDI2*_ga_2C0WZ1JHG0*MTcyMTMxOTYwNC43LjAuMTcyMTMxOTYwNC4wLjAuMA --> --- class: inverse, middle # Exploratory vs. Confirmatory Data Analysis --- ### Exploratory vs. Confirmatory Data Analysis **Exploratory Data Analysis (EDA)** All set of exploratory investigations to understand your data and generate questions. Goals: discover patterns and insights, spot anomalies (outliers), generate questions, formulate hypotheses. **Confirmatory Data Analysis (CDA)** Also called Explanatory Data Analysis. Generally hypothesis-driven, and comes after EDA, to formally test hypotheses (e.g., modeling) -- *Today, and in this course, we mostly focus on EDA!* --- ### EDA as Iterative and Creative Process **EDA is an Iterative Process:** 1. Generate exploratory questions about your data 1. Search for answers in the data 1. Use what you learn to refine your questions and/or generate new questions 1. Repeat until necessary -- **EDA is a also a Creative Process:** EDA is not an exact science, and requires curiosity about the data, intuition, and patience. At the most basic level, it involves answering two questions: how values within a single variable vary? how values of two variables co-vary? --- ### EDA relies on... **Descriptive stats** such as frequency counts, measures of central tendency (mean, mode, median), and dispersion (range, variance, standard deviation). **Visualizations** such histograms, bar charts, scatter plots, etc. We focus on visualizations, and specifically we display: - Variation: how values within a single variable vary (univariate analysis) - Covariation: how values of two variables co-vary (bivariate analysis) -- *Visualizations are essential tools in both EDA and CDA, each with distinct purposes. Even within EDA, we can utilize visualizations in multiple ways. Let’s look at an example with the penguins data to illustrate this point.* <!-- In Exploratory Analysis you might generate 10, 20 or even 100 graphs, but not all of them will be useful for your research. In Confirmatory Analysis, you generate only a few graphs and each graph is well refined and will be put in your final report or research. --> --- ### Example: First Plot ```r library(palmerpenguins) data("penguins") head(penguins) ``` ``` ## # A tibble: 6 × 8 ## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## <fct> <fct> <dbl> <dbl> <int> <int> ## 1 Adelie Torgersen 39.1 18.7 181 3750 ## 2 Adelie Torgersen 39.5 17.4 186 3800 ## 3 Adelie Torgersen 40.3 18 195 3250 ## 4 Adelie Torgersen NA NA NA NA ## 5 Adelie Torgersen 36.7 19.3 193 3450 ## 6 Adelie Torgersen 39.3 20.6 190 3650 ## # ℹ 2 more variables: sex <fct>, year <int> ``` TASK: build a plot of two continuous variables (penguins body mass and flipper length). <!-- ASK: if we have two continuous antiquate variables and so we want to show co-variation among them... what is the best plot to use? --> --- count: false ### First Plot .panel1-penguins-eda-auto[ ```r *ggplot( * data = penguins, * mapping = aes( * x = body_mass_g, * y = flipper_length_mm) * ) ``` ] .panel2-penguins-eda-auto[ <img src="index_files/figure-html/penguins-eda_auto_01_output-1.png" width="80%" style="display: block; margin: auto;" /> ] --- count: false ### First Plot .panel1-penguins-eda-auto[ ```r ggplot( data = penguins, mapping = aes( x = body_mass_g, y = flipper_length_mm) ) + * geom_point() ``` ] .panel2-penguins-eda-auto[ <img src="index_files/figure-html/penguins-eda_auto_02_output-1.png" width="80%" style="display: block; margin: auto;" /> ] --- count: false ### First Plot .panel1-penguins-eda-auto[ ```r ggplot( data = penguins, mapping = aes( x = body_mass_g, y = flipper_length_mm) ) + geom_point() + * geom_smooth() ``` ] .panel2-penguins-eda-auto[ <img src="index_files/figure-html/penguins-eda_auto_03_output-1.png" width="80%" style="display: block; margin: auto;" /> ] <style> .panel1-penguins-eda-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-penguins-eda-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-penguins-eda-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> <!-- notice in the code I put the data and mapping above, right after ggplot() because both geometries use the same data and mapping --> -- Once you draw such initial plot, ask yourself: * *Substantive questions:* What does this graph tell us? Are there patterns? Outliers? What hypotheses can we generate from it? Is the chosen plot appropriate here? * *Stylistic questions:* What are the strengths and limitations of this quick visualization? How could we improve it? <!-- Pros: minimum code, easy to replicate, good for your internal use Cons: not well refined, not good for publication or external audience --> --- count: false ### Refined Plot .panel1-penguins-final-auto[ ```r *ggplot( * data = penguins, * mapping = aes( * x = body_mass_g, * y = flipper_length_mm) * ) ``` ] .panel2-penguins-final-auto[ <img src="index_files/figure-html/penguins-final_auto_01_output-1.png" width="80%" style="display: block; margin: auto;" /> ] --- count: false ### Refined Plot .panel1-penguins-final-auto[ ```r ggplot( data = penguins, mapping = aes( x = body_mass_g, y = flipper_length_mm) ) + * geom_point(alpha = .1) ``` ] .panel2-penguins-final-auto[ <img src="index_files/figure-html/penguins-final_auto_02_output-1.png" width="80%" style="display: block; margin: auto;" /> ] --- count: false ### Refined Plot .panel1-penguins-final-auto[ ```r ggplot( data = penguins, mapping = aes( x = body_mass_g, y = flipper_length_mm) ) + geom_point(alpha = .1) + * geom_smooth(method = "lm", * se = FALSE) ``` ] .panel2-penguins-final-auto[ <img src="index_files/figure-html/penguins-final_auto_03_output-1.png" width="80%" style="display: block; margin: auto;" /> ] --- count: false ### Refined Plot .panel1-penguins-final-auto[ ```r ggplot( data = penguins, mapping = aes( x = body_mass_g, y = flipper_length_mm) ) + geom_point(alpha = .1) + geom_smooth(method = "lm", se = FALSE) + * labs( * title = "Relationship between body mass and\nflipper length of a penguin", * subtitle = "Sample of 344 penguins", * x = "Body mass (g)", * y = "Flipper length (mm)" * ) ``` ] .panel2-penguins-final-auto[ <img src="index_files/figure-html/penguins-final_auto_04_output-1.png" width="80%" style="display: block; margin: auto;" /> ] --- count: false ### Refined Plot .panel1-penguins-final-auto[ ```r ggplot( data = penguins, mapping = aes( x = body_mass_g, y = flipper_length_mm) ) + geom_point(alpha = .1) + geom_smooth(method = "lm", se = FALSE) + labs( title = "Relationship between body mass and\nflipper length of a penguin", subtitle = "Sample of 344 penguins", x = "Body mass (g)", y = "Flipper length (mm)" ) + * theme_xaringan( * title_font_size = 18, * text_font_size = 16 * ) ``` ] .panel2-penguins-final-auto[ <img src="index_files/figure-html/penguins-final_auto_05_output-1.png" width="80%" style="display: block; margin: auto;" /> ] <style> .panel1-penguins-final-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-penguins-final-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-penguins-final-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> <!-- notice here geom_point and geom_smooth also have their own arguments that work only for them; for example: if I want to make the points of the scatterplot more faded I use alpha there; or here I wanted to use a specific method to draw the regression line vs. let R selecting it automatically, so I specify the method and I also do se = FALSE as the default is true and shows the standard error labs() and theme_xarigan() control the non-data elements of the plot; lab(): controls things like title, subtitles etc; theme() controls background colors, font size, etc. Notice there is also a geometry that controls titles etc. and it is called geom_text() Note the difference between `labs()` and `geom_text()` Book "`ggplot2`: Elegant Graphics for Data Analysis" https://ggplot2-book.org/annotations.html --> --- ## Takeaway In this course, our focus is primarily on Exploratory Data Analysis (EDA); we won’t delve into Confirmatory Data Analysis or formal hypothesis testing. **Still, your approach to plotting should follow this sequence:** * Begin by creating several quick plots to explore the data: * Pros: save coding time and focus on observing the data. * Avoid adjusting stylistic components at this stage; instead, refine by removing outliers, changing graph types, wrangle data, etc. until you’re satisfied with the plot. * Once you have settled on a plot: * Improve its visual appeal by refining the code. Add elements such as labels, legends, color adjustments, scales, themes, facets, etc. * Use this polished version for reports, presentations, and your final plots for the assignments in this course! <!-- you should approach HW2 with this slide in mind! --> --- class: inverse, middle # Using Graphs for Data Analysis #### We use the `scorecard` data to practice using graphs for data analysis, and specifically: * Display variation & co-variation * Match type of plot to variable type! --- ### The `scorecard` dataset ```r library(tidyverse) library(rcis) data("scorecard") ``` The Department of Education collects annual statistics on colleges and universities in the United States. Data include: universities names, state, type, admission rate, costs, etc. We are going to look at a subset of this data, from 2018-19. --- ### The `scorecard` dataset ```r glimpse(scorecard) ``` ``` ## Rows: 1,732 ## Columns: 14 ## $ unitid <dbl> 100654, 100663, 100706, 100724, 100751, 100830, 100858, 1009… ## $ name <chr> "Alabama A & M University", "University of Alabama at Birmin… ## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", … ## $ type <fct> "Public", "Public", "Public", "Public", "Public", "Public", … ## $ admrate <dbl> 0.9175, 0.7366, 0.8257, 0.9690, 0.8268, 0.9044, 0.8067, 0.53… ## $ satavg <dbl> 939, 1234, 1319, 946, 1261, 1082, 1300, 1230, 1066, NA, 1076… ## $ cost <dbl> 23053, 24495, 23917, 21866, 29872, 19849, 31590, 32095, 3431… ## $ netcost <dbl> 14990, 16953, 15860, 13650, 22597, 13987, 24104, 22107, 2071… ## $ avgfacsal <dbl> 69381, 99441, 87192, 64989, 92619, 71343, 96642, 56646, 5400… ## $ pctpell <dbl> 0.7019, 0.3512, 0.2536, 0.7627, 0.1772, 0.4644, 0.1455, 0.23… ## $ comprate <dbl> 0.2974, 0.6340, 0.5768, 0.3276, 0.7110, 0.3401, 0.7911, 0.69… ## $ firstgen <dbl> 0.3658281, 0.3412237, 0.3101322, 0.3434343, 0.2257127, 0.381… ## $ debt <dbl> 15250, 15085, 14000, 17500, 17671, 12000, 17500, 16000, 1425… ## $ locale <fct> City, City, City, City, City, City, City, City, City, Suburb… ``` --- ### Types of Visualizations and Best Graph Types Do I want to represent variations within... * Single variable * Two variables * Three variables Is or are my variables... * Continuous * Categorical * Other types (discrete, ordinal, nominal, etc.) --- ### Types of Visualizations and Best Graph Types **Single variable (Univariate Analysis) to display how values within a single variable vary** * One continuous variable: histogram * One categorical variable: bar plot **Two variables (Bivariate Analysis) to display how they co-vary** * Two continuous variables: scatter plot * Two categorical variables: (grouped or stacked) bar plot; dot or mosaic plot * One categorical and one continuous variable: box plot; faceted histogram **Three variables (Multivariate Analysis)** * One categorical and two continuous variables: faceted scatterplot; scatterplot with colors * One continuous and two categorical variables: box plot grouped by categorical variables <!-- *We are going review all plots in bold today, both code and interpretation. The plots listed in this slide are the most common options, but there are more possibilities to explore.* *Takeaways: experiment and match the type of plot with the variable types!* --> --- ## Practice! The next slide lists a set of tasks. In small groups, use the `scorecard` dataset (refer to the code on previous slides to load it) to create the most suitable graph for each task. Afterward, we’ll regroup to share code and discuss. * **Before plotting:** Consider the type of variable and the type of variation you need to represent. * **While plotting:** Keep it simple, as you would for an initial EDA. There’s no need to add titles, axis labels, etc. for this exercise. * **After plotting:** Stare at the graph... look for patterns, outliers, or any notable features, and substantively interpret the graph. --- ## Practice! TASK 1: Display the annual total cost of school attendance across the U.S. *Hint: only one variable (`cost`)* TASK 2: Display the total number of schools in the U.S. by school type. *Hint: only one variable (`type`)* TASK 3: Display the annual total cost and net cost of attendance to schools in the U.S. TASK 4: Display the total number of schools in the U.S. by school type (n = 3) and by state (n = 54). *Note: the initial graph you generate here may lack visual appeal. Focus on identifying potential improvements rather than implementing them for now.* TASK 5: Display the annual total cost of attendance by school type (variables `cost` and `type`) TASK 6: Display the annual total cost of attendance and net cost of attendance by school type (variables `cost`, `netcost`, `type`) --- ## Discussion File for sharing solutions: https://codeshare.io/vAzK44 Download today’s class materials from our website for further insights into these tasks and for additional practice exercises! --- class: inverse, middle ### Further insights on these tasks, and more exercises using this datasets are in today's class materials (downloadable from the website)