HW3: in-class demo (from last lecture in-class materials)
Tidy data with tidyr
: theory and practice
Download last lecture in-class materials from the website (scotus-lecture.Rmd
).
Goals: review lecture 6 concepts (importing data, relational data), and example of manipulating HW3 data
tidyr
: theory and practice"Tidy data" is a system or a particular way to organize data.
"Tidy data" is not the only way of organizing data, but useful because it is:
standardized (every tidy dataset must follow rules)
consistent (data formatted in a tidy way can be used without further data manipulation by various packages in the tidyverse
)
clear (data formatted in a tidy way are easier to understand and use)
"Tidy data" is a system or a particular way to organize data.
"Tidy data" is not the only way of organizing data, but useful because it is:
standardized (every tidy dataset must follow rules)
consistent (data formatted in a tidy way can be used without further data manipulation by various packages in the tidyverse
)
clear (data formatted in a tidy way are easier to understand and use)
If you get some data and you know you want to work within the tidyverse
(e.g., ggplot
, dplyr
, etc), put them in a tidy format first!
Illustrations from the blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst
Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst
Illustrations from the blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst
It might sound counterintutive, but most data you will encounter in real life is stored in an untidy format.
Why so?
It might sound counterintutive, but most data you will encounter in real life is stored in an untidy format.
Why so?
Our book suggests two main reasons:
Data is often organized not to facilitate data wrangling and analysis. For example, it’s common for data to be structured to make data entry, not analysis, easy.
Not everyone working with data is familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.
OK, you loaded your data into R and they are untidy. What should you do to tidy them up?
First, stare at the data:
Then:
OK, you loaded your data into R and they are untidy. What should you do to tidy them up?
First, stare at the data:
Then:
Typical issues:
Pivoting
Separating
Uniting
Remember to check the tidyr
documentation: https://tidyr.tidyverse.org/
NB: every function in this package behaves differently (e.g., functions do not have the same arguments!)
We are going to illustrate these tasks using one of the datasets from the readings (Chapter 5 https://r4ds.hadley.nz/data-tidy)
This is a longitudinal dataset with four variables: country
, year
, cases
(number of tubercolosis cases), and population
.
We analyze:
tidyr
functions to tidy each of these datasets: pivoting (longer and wider), separate, and unite functions library(tidyverse)table1
## # A tibble: 6 × 4## country year cases population## <chr> <dbl> <dbl> <dbl>## 1 Afghanistan 1999 745 19987071## 2 Afghanistan 2000 2666 20595360## 3 Brazil 1999 37737 172006362## 4 Brazil 2000 80488 174504898## 5 China 1999 212258 1272915272## 6 China 2000 213766 1280428583
Why is this dataset untidy? How can we tidy it?
library(tidyverse)table4a
## # A tibble: 3 × 3## country `1999` `2000`## <chr> <dbl> <dbl>## 1 Afghanistan 745 2666## 2 Brazil 37737 80488## 3 China 212258 213766
Why is this dataset untidy? How can we tidy it?
library(tidyverse)table4a
## # A tibble: 3 × 3## country `1999` `2000`## <chr> <dbl> <dbl>## 1 Afghanistan 745 2666## 2 Brazil 37737 80488## 3 China 212258 213766
"Each variable must have its own column": column names must be names of variables, not variables' values (1999 and 2000 are values of the year
variable)
"Each observation must have its own row": we have one row for every country, but this is panel data, so we should have the country-year pair to define one observation, not just country.
https://tidyr.tidyverse.org/reference/pivot_longer.html
table4a
## # A tibble: 3 x 3## country `1999` `2000`## * <chr> <int> <int>## 1 Afghanistan 745 2666## 2 Brazil 37737 80488## 3 China 212258 213766
pivot_longer( data = table4a, cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")
## # A tibble: 6 x 3## country year cases## <chr> <chr> <int>## 1 Afghanistan 1999 745## 2 Afghanistan 2000 2666## 3 Brazil 1999 37737## 4 Brazil 2000 80488## 5 China 1999 212258## 6 China 2000 213766
https://tidyr.tidyverse.org/reference/pivot_longer.html
table4a
## # A tibble: 3 x 3## country `1999` `2000`## * <chr> <int> <int>## 1 Afghanistan 745 2666## 2 Brazil 37737 80488## 3 China 212258 213766
pivot_longer( data = table4a, cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")
## # A tibble: 6 x 3## country year cases## <chr> <chr> <int>## 1 Afghanistan 1999 745## 2 Afghanistan 2000 2666## 3 Brazil 1999 37737## 4 Brazil 2000 80488## 5 China 1999 212258## 6 China 2000 213766
Why is this dataset untidy? How can we tidy it?
table2
## # A tibble: 12 × 4## country year type count## <chr> <dbl> <chr> <dbl>## 1 Afghanistan 1999 cases 745## 2 Afghanistan 1999 population 19987071## 3 Afghanistan 2000 cases 2666## 4 Afghanistan 2000 population 20595360## 5 Brazil 1999 cases 37737## 6 Brazil 1999 population 172006362## 7 Brazil 2000 cases 80488## 8 Brazil 2000 population 174504898## 9 China 1999 cases 212258## 10 China 1999 population 1272915272## 11 China 2000 cases 213766## 12 China 2000 population 1280428583
Why is this dataset untidy? How can we tidy it?
table2
## # A tibble: 12 × 4## country year type count## <chr> <dbl> <chr> <dbl>## 1 Afghanistan 1999 cases 745## 2 Afghanistan 1999 population 19987071## 3 Afghanistan 2000 cases 2666## 4 Afghanistan 2000 population 20595360## 5 Brazil 1999 cases 37737## 6 Brazil 1999 population 172006362## 7 Brazil 2000 cases 80488## 8 Brazil 2000 population 174504898## 9 China 1999 cases 212258## 10 China 1999 population 1272915272## 11 China 2000 cases 213766## 12 China 2000 population 1280428583
"Each variable must have its own column". The current values of the type
column are not values, but are variables names.
"Each observation must have its own row". Here an observation is a country in a year, which is correct, but each observation is spread across two rows.
https://tidyr.tidyverse.org/reference/pivot_wider.html
table2
## # A tibble: 12 x 4## country year type count## <chr> <int> <chr> <int>## 1 Afghanistan 1999 cases 745## 2 Afghanistan 1999 population 19987071## 3 Afghanistan 2000 cases 2666## 4 Afghanistan 2000 population 20595360## 5 Brazil 1999 cases 37737## 6 Brazil 1999 population 172006362## 7 Brazil 2000 cases 80488## 8 Brazil 2000 population 174504898## 9 China 1999 cases 212258## 10 China 1999 population 1272915272## 11 China 2000 cases 213766## 12 China 2000 population 1280428583
https://tidyr.tidyverse.org/reference/pivot_wider.html
table2
## # A tibble: 12 x 4## country year type count## <chr> <int> <chr> <int>## 1 Afghanistan 1999 cases 745## 2 Afghanistan 1999 population 19987071## 3 Afghanistan 2000 cases 2666## 4 Afghanistan 2000 population 20595360## 5 Brazil 1999 cases 37737## 6 Brazil 1999 population 172006362## 7 Brazil 2000 cases 80488## 8 Brazil 2000 population 174504898## 9 China 1999 cases 212258## 10 China 1999 population 1272915272## 11 China 2000 cases 213766## 12 China 2000 population 1280428583
pivot_wider( data = table2, names_from = type, values_from = count)
## # A tibble: 6 x 4## country year cases population## <chr> <int> <int> <int>## 1 Afghanistan 1999 745 19987071## 2 Afghanistan 2000 2666 20595360## 3 Brazil 1999 37737 172006362## 4 Brazil 2000 80488 174504898## 5 China 1999 212258 1272915272## 6 China 2000 213766 1280428583
Why is this dataset untidy? How can we tidy it?
table3
## # A tibble: 6 × 3## country year rate ## <chr> <dbl> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272## 6 China 2000 213766/1280428583
https://tidyr.tidyverse.org/reference/separate.html
table3
## # A tibble: 6 x 3## country year rate ## * <chr> <int> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272## 6 China 2000 213766/1280428583
https://tidyr.tidyverse.org/reference/separate.html
table3
## # A tibble: 6 x 3## country year rate ## * <chr> <int> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272## 6 China 2000 213766/1280428583
separate( data = table3, col = rate, into = c( "cases", "population" ), convert = TRUE)
## # A tibble: 6 x 4## country year cases population## <chr> <int> <int> <int>## 1 Afghanistan 1999 745 19987071## 2 Afghanistan 2000 2666 20595360## 3 Brazil 1999 37737 172006362## 4 Brazil 2000 80488 174504898## 5 China 1999 212258 1272915272## 6 China 2000 213766 1280428583
Why is this dataset untidy? How can we tidy it?
table5
## # A tibble: 6 × 4## country century year rate ## <chr> <chr> <chr> <chr> ## 1 Afghanistan 19 99 745/19987071 ## 2 Afghanistan 20 00 2666/20595360 ## 3 Brazil 19 99 37737/172006362 ## 4 Brazil 20 00 80488/174504898 ## 5 China 19 99 212258/1272915272## 6 China 20 00 213766/1280428583
https://tidyr.tidyverse.org/reference/unite.html
table5
## # A tibble: 6 x 4## country century year rate ## * <chr> <chr> <chr> <chr> ## 1 Afghanistan 19 99 745/19987071 ## 2 Afghanistan 20 00 2666/20595360 ## 3 Brazil 19 99 37737/172006362 ## 4 Brazil 20 00 80488/174504898 ## 5 China 19 99 212258/1272915272## 6 China 20 00 213766/1280428583
https://tidyr.tidyverse.org/reference/unite.html
table5
## # A tibble: 6 x 4## country century year rate ## * <chr> <chr> <chr> <chr> ## 1 Afghanistan 19 99 745/19987071 ## 2 Afghanistan 20 00 2666/20595360 ## 3 Brazil 19 99 37737/172006362 ## 4 Brazil 20 00 80488/174504898 ## 5 China 19 99 212258/1272915272## 6 China 20 00 213766/1280428583
unite( data = table5, col = "year", century, year)
## # A tibble: 6 × 3## country year rate ## <chr> <chr> <chr> ## 1 Afghanistan 19_99 745/19987071 ## 2 Afghanistan 20_00 2666/20595360 ## 3 Brazil 19_99 37737/172006362 ## 4 Brazil 20_00 80488/174504898 ## 5 China 19_99 212258/1272915272## 6 China 20_00 213766/1280428583
table5
## # A tibble: 6 x 4## country century year rate ## * <chr> <chr> <chr> <chr> ## 1 Afghanistan 19 99 745/19987071 ## 2 Afghanistan 20 00 2666/20595360 ## 3 Brazil 19 99 37737/172006362 ## 4 Brazil 20 00 80488/174504898 ## 5 China 19 99 212258/1272915272## 6 China 20 00 213766/1280428583
unite( data = table5, col = "year", century, year, # remove underscore sep = "")
## # A tibble: 6 × 3## country year rate ## <chr> <chr> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272## 6 China 2000 213766/1280428583
table5
## # A tibble: 6 × 4## country century year rate ## <chr> <chr> <chr> <chr> ## 1 Afghanistan 19 99 745/19987071 ## 2 Afghanistan 20 00 2666/20595360 ## 3 Brazil 19 99 37737/172006362 ## 4 Brazil 20 00 80488/174504898 ## 5 China 19 99 212258/1272915272## 6 China 20 00 213766/1280428583
unite( data = table5, col = "year", century, year, # remove underscore sep = "") %>% # store as numeric mutate(year = as.integer(year))
## # A tibble: 6 × 3## country year rate ## <chr> <int> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272## 6 China 2000 213766/1280428583
Download today's in-class exercises from the website.
HW3: in-class demo (from last lecture in-class materials)
Tidy data with tidyr
: theory and practice
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |