MACS 30500 LECTURE 7
Topics: Tidying data with tidyr
1 / 24

Agenda

HW3: in-class demo (from last lecture in-class materials)
Tidy data with tidyr: theory and practice

2 / 24

HW3: in-class demo

Download last lecture in-class materials from the website (scotus-lecture.Rmd).

Goals: review lecture 6 concepts (importing data, relational data), and example of manipulating HW3 data

3 / 24

Tidy data with tidyr: theory and practice4 / 24

Tidy data: definition

"Tidy data" is a system or a particular way to organize data.

"Tidy data" is not the only way of organizing data, but useful because it is:

standardized (every tidy dataset must follow rules)
consistent (data formatted in a tidy way can be used without further data manipulation by various packages in the tidyverse)
clear (data formatted in a tidy way are easier to understand and use)

5 / 24

Tidy data: definition

"Tidy data" is a system or a particular way to organize data.

"Tidy data" is not the only way of organizing data, but useful because it is:

standardized (every tidy dataset must follow rules)
consistent (data formatted in a tidy way can be used without further data manipulation by various packages in the tidyverse)
clear (data formatted in a tidy way are easier to understand and use)

If you get some data and you know you want to work within the tidyverse (e.g., ggplot, dplyr, etc), put them in a tidy format first!

5 / 24

Tidy data: three key principles

Illustrations from the blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

6 / 24

Tidy datasets are all alike

Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

7 / 24

Tidy datasets are all alike

Illustrations from the blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

8 / 24

Tidy untidy data

It might sound counterintutive, but most data you will encounter in real life is stored in an untidy format.

Why so?

9 / 24

Tidy untidy data

It might sound counterintutive, but most data you will encounter in real life is stored in an untidy format.

Why so?

Our book suggests two main reasons:

Data is often organized not to facilitate data wrangling and analysis. For example, it’s common for data to be structured to make data entry, not analysis, easy.
Not everyone working with data is familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.

9 / 24

Tidy untidy data

OK, you loaded your data into R and they are untidy. What should you do to tidy them up?

First, stare at the data:

identify the observations and the variables
identify what's untidy about them

Then:

place each observation in a separate row
place each variable in a separate column
ensure that each value has its own cell

10 / 24

Tidy untidy data

OK, you loaded your data into R and they are untidy. What should you do to tidy them up?

First, stare at the data:

identify the observations and the variables
identify what's untidy about them

Then:

place each observation in a separate row
place each variable in a separate column
ensure that each value has its own cell

Typical issues:

A variable is spread across multiple columns or an observation across multiple rows.
Variable values are column names, or variable names are in rows.

10 / 24

Tidying tasks

Pivoting
- Longer: makes the dataset longer by increasing rows
- Wider: makes the dataset wider by increasing columns
Separating
Uniting

Remember to check the tidyr documentation: https://tidyr.tidyverse.org/

NB: every function in this package behaves differently (e.g., functions do not have the same arguments!)

11 / 24

In-class Example: our goals

We are going to illustrate these tasks using one of the datasets from the readings (Chapter 5 https://r4ds.hadley.nz/data-tidy)

This is a longitudinal dataset with four variables: country, year, cases (number of tubercolosis cases), and population.

We analyze:

several ways to organize the same dataset
how to use tidyr functions to tidy each of these datasets: pivoting (longer and wider), separate, and unite functions

12 / 24

In-class Example: the data

library(tidyverse)
table1

## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

13 / 24

Pivot longer

Why is this dataset untidy? How can we tidy it?

library(tidyverse)
table4a

## # A tibble: 3 × 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

14 / 24

Pivot longer

Why is this dataset untidy? How can we tidy it?

library(tidyverse)
table4a

## # A tibble: 3 × 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

"Each variable must have its own column": column names must be names of variables, not variables' values (1999 and 2000 are values of the year variable)

"Each observation must have its own row": we have one row for every country, but this is panel data, so we should have the country-year pair to define one observation, not just country.

14 / 24

Pivot longer

https://tidyr.tidyverse.org/reference/pivot_longer.html

table4a

## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

pivot_longer(
  data = table4a,
  cols = c(`1999`, `2000`),
  names_to = "year",
  values_to = "cases"
)

## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766

15 / 24

Pivot longer

https://tidyr.tidyverse.org/reference/pivot_longer.html

table4a

## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

pivot_longer(
  data = table4a,
  cols = c(`1999`, `2000`),
  names_to = "year",
  values_to = "cases"
)

## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766

15 / 24

Pivot wider

Why is this dataset untidy? How can we tidy it?

table2

## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

16 / 24

Pivot wider

Why is this dataset untidy? How can we tidy it?

table2

## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

"Each variable must have its own column". The current values of the type column are not values, but are variables names.

"Each observation must have its own row". Here an observation is a country in a year, which is correct, but each observation is spread across two rows.

16 / 24

Pivot wider

https://tidyr.tidyverse.org/reference/pivot_wider.html

table2

## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

17 / 24

Pivot wider

https://tidyr.tidyverse.org/reference/pivot_wider.html

table2

## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

pivot_wider(
  data = table2,
  names_from = type,
  values_from = count
)

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

17 / 24

Separating

Why is this dataset untidy? How can we tidy it?

table3

## # A tibble: 6 × 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

18 / 24

Separating

https://tidyr.tidyverse.org/reference/separate.html

table3

## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

19 / 24

Separating

https://tidyr.tidyverse.org/reference/separate.html

table3

## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

separate(
  data = table3,
  col = rate,
  into = c(
    "cases",
    "population"
  ),
  convert = TRUE
)

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

19 / 24

Uniting

Why is this dataset untidy? How can we tidy it?

table5

## # A tibble: 6 × 4
##   country     century year  rate             
##   <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

20 / 24

Uniting

https://tidyr.tidyverse.org/reference/unite.html

table5

## # A tibble: 6 x 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

21 / 24

Uniting

https://tidyr.tidyverse.org/reference/unite.html

table5

## # A tibble: 6 x 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

unite(
  data = table5,
  col = "year", century, year
)

## # A tibble: 6 × 3
##   country     year  rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 19_99 745/19987071     
## 2 Afghanistan 20_00 2666/20595360    
## 3 Brazil      19_99 37737/172006362  
## 4 Brazil      20_00 80488/174504898  
## 5 China       19_99 212258/1272915272
## 6 China       20_00 213766/1280428583

21 / 24

Unitingtable5

## # A tibble: 6 x 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583
unite(
  data = table5,
  col = "year", century, year,
  # remove underscore
  sep = ""
)

## # A tibble: 6 × 3
##   country     year  rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 1999  745/19987071     
## 2 Afghanistan 2000  2666/20595360    
## 3 Brazil      1999  37737/172006362  
## 4 Brazil      2000  80488/174504898  
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583
22 / 24

Unitingtable5

## # A tibble: 6 × 4
##   country     century year  rate             
##   <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583
unite(
  data = table5,
  col = "year", century, year,
  # remove underscore
  sep = ""
) %>%
  # store as numeric
  mutate(year = as.integer(year))

## # A tibble: 6 × 3
##   country      year rate             
##   <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
23 / 24

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help

MACS 30500 LECTURE 7

Topics: Tidying data with tidyr

Agenda

HW3: in-class demo

Tidy data with `tidyr`: theory and practice

Tidy data: definition

Tidy data: definition

Tidy data: three key principles

Tidy datasets are all alike

Tidy datasets are all alike

Tidy untidy data

Tidy untidy data

Tidy untidy data

Tidy untidy data

Tidying tasks

In-class Example: our goals

In-class Example: the data

Pivot longer

Pivot longer

Pivot longer

Pivot longer

Pivot wider

Pivot wider

Pivot wider

Pivot wider

Separating

Separating

Separating

Uniting

Uniting

Uniting

Uniting

Uniting

Practice tidying data

Agenda

Help