+ - 0:00:00
Notes for current slide
Notes for next slide

MACS 30500 LECTURE 7

Topics: Tidying data with tidyr

1 / 24

Agenda

  • HW3: in-class demo (from last lecture in-class materials)

  • Tidy data with tidyr: theory and practice

2 / 24

HW3: in-class demo

Download last lecture in-class materials from the website (scotus-lecture.Rmd).

Goals: review lecture 6 concepts (importing data, relational data), and example of manipulating HW3 data

3 / 24

Tidy data with tidyr: theory and practice

4 / 24

Tidy data: definition

"Tidy data" is a system or a particular way to organize data.

"Tidy data" is not the only way of organizing data, but useful because it is:

  • standardized (every tidy dataset must follow rules)

  • consistent (data formatted in a tidy way can be used without further data manipulation by various packages in the tidyverse)

  • clear (data formatted in a tidy way are easier to understand and use)

5 / 24

Tidy data: definition

"Tidy data" is a system or a particular way to organize data.

"Tidy data" is not the only way of organizing data, but useful because it is:

  • standardized (every tidy dataset must follow rules)

  • consistent (data formatted in a tidy way can be used without further data manipulation by various packages in the tidyverse)

  • clear (data formatted in a tidy way are easier to understand and use)

If you get some data and you know you want to work within the tidyverse (e.g., ggplot, dplyr, etc), put them in a tidy format first!

5 / 24

Tidy data: three key principles

Stylized text providing an overview of Tidy Data. The top reads 'Tidy data is a standard way of mapping the meaning of a dataset to its structure. - Hadley Wickham.' On the left reads 'In tidy data: each variable forms a column; each observation forms a row; each cell is a single measurement.' There is an example table on the lower right with columns ‘id’, ‘name’ and ‘color’ with observations for different cats, illustrating tidy data structure.

Illustrations from the blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

6 / 24

Tidy datasets are all alike

There are two sets of anthropomorphized data tables. The top group of three tables are all rectangular and smiling, with a shared speech bubble reading 'our columns are variables and our rows are observations!'. Text to the left of that group reads 'The standard structure of tidy data means that 'tidy datasets are all alike…' The lower group of four tables are all different shapes, look ragged and concerned, and have different speech bubbles reading (from left to right) 'my column are values and my rows are variables', 'I have variables in columns AND in rows', 'I have multiple variables in a single column', and 'I don’t even KNOW what my deal is.' Next to the frazzled data tables is text '...but every messy dataset is messy in its own way. -Hadley Wickham.'

Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

7 / 24

Tidy datasets are all alike

On the left is a happy cute fuzzy monster holding a rectangular data frame with a tool that fits the data frame shape. On the workbench behind the monster are other data frames of similar rectangular shape, and neatly arranged tools that also look like they would fit those data frames. The workbench looks uncluttered and tidy. The text above the tidy workbench reads 'When working with tidy data, we can use the same tools in similar ways for different datasets…' On the right is a cute monster looking very frustrated, using duct tape and other tools to haphazardly tie data tables together, each in a different way. The monster is in front of a messy, cluttered workbench. The text above the frustrated monster reads '...but working with untidy data often means reinventing the wheel with one-time approaches that are hard to iterate or reuse.'

Illustrations from the blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

8 / 24

Tidy untidy data

It might sound counterintutive, but most data you will encounter in real life is stored in an untidy format.

Why so?

9 / 24

Tidy untidy data

It might sound counterintutive, but most data you will encounter in real life is stored in an untidy format.

Why so?

Our book suggests two main reasons:

  • Data is often organized not to facilitate data wrangling and analysis. For example, it’s common for data to be structured to make data entry, not analysis, easy.

  • Not everyone working with data is familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.

9 / 24

Tidy untidy data

OK, you loaded your data into R and they are untidy. What should you do to tidy them up?

First, stare at the data:

  • identify the observations and the variables
  • identify what's untidy about them

Then:

  • place each observation in a separate row
  • place each variable in a separate column
  • ensure that each value has its own cell
10 / 24

Tidy untidy data

OK, you loaded your data into R and they are untidy. What should you do to tidy them up?

First, stare at the data:

  • identify the observations and the variables
  • identify what's untidy about them

Then:

  • place each observation in a separate row
  • place each variable in a separate column
  • ensure that each value has its own cell

Typical issues:

  1. A variable is spread across multiple columns or an observation across multiple rows.
  2. Variable values are column names, or variable names are in rows.
10 / 24

Tidying tasks

  • Pivoting

    • Longer: makes the dataset longer by increasing rows
    • Wider: makes the dataset wider by increasing columns
  • Separating

  • Uniting

Remember to check the tidyr documentation: https://tidyr.tidyverse.org/

NB: every function in this package behaves differently (e.g., functions do not have the same arguments!)

11 / 24

In-class Example: our goals

We are going to illustrate these tasks using one of the datasets from the readings (Chapter 5 https://r4ds.hadley.nz/data-tidy)

This is a longitudinal dataset with four variables: country, year, cases (number of tubercolosis cases), and population.

We analyze:

  • several ways to organize the same dataset
  • how to use tidyr functions to tidy each of these datasets: pivoting (longer and wider), separate, and unite functions
12 / 24

In-class Example: the data

library(tidyverse)
table1
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
13 / 24

Pivot longer

Why is this dataset untidy? How can we tidy it?

library(tidyverse)
table4a
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
14 / 24

Pivot longer

Why is this dataset untidy? How can we tidy it?

library(tidyverse)
table4a
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766

"Each variable must have its own column": column names must be names of variables, not variables' values (1999 and 2000 are values of the year variable)

"Each observation must have its own row": we have one row for every country, but this is panel data, so we should have the country-year pair to define one observation, not just country.

14 / 24

Pivot longer

https://tidyr.tidyverse.org/reference/pivot_longer.html

table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
pivot_longer(
data = table4a,
cols = c(`1999`, `2000`),
names_to = "year",
values_to = "cases"
)
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
15 / 24

Pivot longer

https://tidyr.tidyverse.org/reference/pivot_longer.html

table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
pivot_longer(
data = table4a,
cols = c(`1999`, `2000`),
names_to = "year",
values_to = "cases"
)
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
15 / 24

Pivot wider

Why is this dataset untidy? How can we tidy it?

table2
## # A tibble: 12 × 4
## country year type count
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
16 / 24

Pivot wider

Why is this dataset untidy? How can we tidy it?

table2
## # A tibble: 12 × 4
## country year type count
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583

"Each variable must have its own column". The current values of the type column are not values, but are variables names.

"Each observation must have its own row". Here an observation is a country in a year, which is correct, but each observation is spread across two rows.

16 / 24

Pivot wider

https://tidyr.tidyverse.org/reference/pivot_wider.html

table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
17 / 24

Pivot wider

https://tidyr.tidyverse.org/reference/pivot_wider.html

table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
pivot_wider(
data = table2,
names_from = type,
values_from = count
)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
17 / 24

Separating

Why is this dataset untidy? How can we tidy it?

table3
## # A tibble: 6 × 3
## country year rate
## <chr> <dbl> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
18 / 24

Separating

https://tidyr.tidyverse.org/reference/separate.html

table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
19 / 24

Separating

https://tidyr.tidyverse.org/reference/separate.html

table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
separate(
data = table3,
col = rate,
into = c(
"cases",
"population"
),
convert = TRUE
)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
19 / 24

Uniting

Why is this dataset untidy? How can we tidy it?

table5
## # A tibble: 6 × 4
## country century year rate
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
20 / 24

Uniting

https://tidyr.tidyverse.org/reference/unite.html

table5
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
21 / 24

Uniting

https://tidyr.tidyverse.org/reference/unite.html

table5
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
unite(
data = table5,
col = "year", century, year
)
## # A tibble: 6 × 3
## country year rate
## <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583
21 / 24

Uniting

table5
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
unite(
data = table5,
col = "year", century, year,
# remove underscore
sep = ""
)
## # A tibble: 6 × 3
## country year rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
22 / 24

Uniting

table5
## # A tibble: 6 × 4
## country century year rate
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
unite(
data = table5,
col = "year", century, year,
# remove underscore
sep = ""
) %>%
# store as numeric
mutate(year = as.integer(year))
## # A tibble: 6 × 3
## country year rate
## <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
23 / 24

Practice tidying data

Download today's in-class exercises from the website.

24 / 24

Agenda

  • HW3: in-class demo (from last lecture in-class materials)

  • Tidy data with tidyr: theory and practice

2 / 24
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow