Advanced tidyverse

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

To learn more about the tidyverse, visit https://tidyverse.org/learn/

Work through the R for Data Science book: https://r4ds.hadley.nz/

Overview

tidyverse is a collection of R packages:

dplyr – for data manipulation
tidyr – for making data “tidy”
readr – for efficiently reading data
stringr – for working with strings
ggplot2 – for data visualization
forcats – utility functions for working with factor data
lubridate – for working with date/time data
purrr – tools for working with functions and vectors
tibble – a ‘modern reimagining’ of the data frame

Round-robin tour of some highlights

We will make use of the penguins data frame for this set of exercises

`dplyr`

The main goal is to make it easier and faster to deal with data and data manipulations

Over a hundred functions – lots to learn!
A few key categories of functions:
mutate amd mutate_*: add columns to a dataframe
filter and slice: grab rows
group_by: group dataframe for grouped operations
summarize: for creating summaries (within groups if appropriate)
*_join: for merging dataframes

library(tidyverse)
library(palmerpenguins)

# Create a new column
penguins |> 
  mutate(bill_ratio = bill_length_mm/bill_depth_mm)

# Express bill length and depth in cm and rename columns
penguins |> 
  mutate(across(ends_with("mm"), \(x) x/10)) |> 
  # (use the function `rename` if you want to use custom names, `rename_with` if you want to use a function)
  rename_with(~str_replace(.x, "_mm", "_cm"))

# Select a single column
penguins |> 
  pull(bill_length_mm)

# Note: Contrast this with select(), which returns a tibble:
penguins |> 
  select(bill_length_mm)



# Group by species and calculate summaries
penguins |> 
  group_by(species) |> 
  summarize(mean_billLength = mean(bill_length_mm, na.rm = T),
            sd_billLength = sd(bill_length_mm, na.rm = T))

# Group by species and island
penguins |> 
  group_by(species, island) |> 
  summarize(mean_billLength = mean(bill_length_mm, na.rm = T))

## FOR DEMO ONLY

## Imagine that we have have another data frame that records attributes of the islands:

island_data <-
  tibble(island = unique(penguins$island)) |> 
  mutate(elevation = runif(3,100,500))

# We can merge this into the penguins dataframe:
penguins |> 
  left_join(island_data, by = "island")

`tidyr`

Main goal is to facilitate tidy data operations (e.g. switching from a wide to long format)


penguins


# Shift to an even longer format, such that each measured aspect is a row

penguins |> 
  mutate(penguin_id = row_number()) |> 
  pivot_longer(cols = c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g))

# Create a new column that unites species and island
penguins |> 
  unite("speciesisland", species, island, remove = F)

# Nesting grouped data frames
# We will see later what we can achieve with these nested data frames
penguins |> 
  group_by(species) |> 
  nest()

penguins |> 
  group_by(species, island) |> 
  nest()


# Unnest...
penguins |> 
  group_by(species) |> 
  nest() |> 
  unnest(data)


# Crossing
# This is especially useful if you are setting up an experiment:
crossing(species = letters[1:5],
         soil = c("site 1", "site 2", "site 3"),
         watering =  c("high", "low"),
         block = 1:10) |> 
  arrange(block) |> 
  group_by(block) |> 
  mutate(pot_position = sample(1:n(), n())) |> 
  arrange(block, pot_position)

`readr`

Main goal is to read in data files

There are already base functions in R to do this, e.g. read.csv
readr provides alternatives that do some things better, e.g. read_csv
But there’s not a strictly “better” option – readr can be nice if are relying on other tidyverse functions

`ggplot2`

Main goal is to provide a consistent “grammar of graphics”

Now a highly used framework for plotting, so lots of extensions available

# install.packages('khroma')
library(khroma)

highcontrast <- color("highcontrast")

penguins |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) + 
  geom_point(aes(color = species), size = 2) + 
  scale_color_manual(values = highcontrast(3)) +
  theme_classic()

## Combining dplyr and ggplot

# Use dplyr to calculate mean and SD of each species
penguins |> 
  group_by(species) |> 
  mutate(mean_bill_length = mean(bill_length_mm, na.rm = T),
         sd_bill_length = sd(bill_length_mm, na.rm = T),
         mean_bill_depth = mean(bill_depth_mm, na.rm = T),
         sd_bill_depth = sd(bill_depth_mm, na.rm = T)) |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + 
  geom_point(size = 1, alpha = 0.6) +
  geom_pointrange(aes(y = mean_bill_depth, 
                      ymin = mean_bill_depth-sd_bill_depth,
                      ymax = mean_bill_depth + sd_bill_depth,
                      x = mean_bill_length, fill = species),
                  size = 3)+ 
  geom_errorbarh(aes(y = mean_bill_depth,
                     x = mean_bill_length,
                      xmin = mean_bill_length-sd_bill_length,
                    xmax = mean_bill_length+sd_bill_length)) +
  scale_color_manual(values = highcontrast(3)) +
  guides(color = guide_legend(override.aes = (list(size = 1)))) + 
  theme_classic()



# Now do it by species and sex
penguins |> 
  na.omit() |> 
  group_by(species, sex) |> 
  mutate(mean_bill_length = mean(bill_length_mm, na.rm = T),
         sd_bill_length = sd(bill_length_mm, na.rm = T),
         mean_bill_depth = mean(bill_depth_mm, na.rm = T),
         sd_bill_depth = sd(bill_depth_mm, na.rm = T)) |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species, shape = sex)) + 
  geom_point(size = 1, alpha = 0.6) +
  geom_pointrange(aes(y = mean_bill_depth, 
                      ymin = mean_bill_depth-sd_bill_depth,
                      ymax = mean_bill_depth + sd_bill_depth,
                      x = mean_bill_length),
                  size = 3)+ 
  geom_errorbarh(aes(y = mean_bill_depth,
                     x = mean_bill_length,
                     xmin = mean_bill_length-sd_bill_length,
                     xmax = mean_bill_length+sd_bill_length)) +
  scale_color_manual(values = highcontrast(3)) +
  guides(color = guide_legend(override.aes = (list(size = 1))),
         shape = guide_legend(override.aes = (list(size = 1)))) + 
  theme_classic()

`forcats`

Designed for helping deal with categorical (factor) data

# By default, factors are leveled alphabetically:
penguins$species |> levels()

# example implication: order of boxplots set alphabetically:

penguins |> 
  na.omit() |> 
  ggplot(aes(y = bill_depth_mm, x = species, fill = sex)) + 
  geom_boxplot(position = position_dodge(width = 0.75)) + 
  scale_fill_manual(values = highcontrast(2)) +
  theme_classic()


# What if we wanted to order this by largest to smallest bill length?

penguins |> 
  na.omit() |>
  # fct_reorder for reordering factors by the bill depth column
  mutate(species = fct_reorder(species, bill_depth_mm, mean)) |> 
  ggplot(aes(y = bill_depth_mm, x = species, fill = sex)) + 
  geom_boxplot(position = position_dodge(width = 0.75)) + 
  scale_fill_manual(values = highcontrast(2)) +
  theme_classic()

# What if we also wanted the order within each species to be male -- female

penguins |> 
  na.omit() |>
  mutate(species = fct_reorder(species, bill_depth_mm, mean)) |> 
  # fct_relevel for manually recoding factor levels
  mutate(sex = fct_relevel(sex, c("male", "female"))) |> 
  ggplot(aes(y = bill_depth_mm, x = species, fill = sex)) + 
  geom_boxplot(position = position_dodge(width = 0.75)) + 
  scale_fill_manual(values = highcontrast(2)) +
  theme_classic()

`lubridate`

For working with temporal information


precip <- read_csv(url("https://gitlab.com/gklab/teaching/reproducible-research-f25/-/raw/main/11-tidyverse/noaa_honolulu_precip.csv"))
precip |> 
  ggplot(aes(x = DATE, y = PRCP)) +
  geom_point()

# Instead, we can get total precipitation in each month and plot that against time
precip |> 
  mutate(year = year(DATE),
         month = month(DATE)) |> 
  group_by(year, month) |> 
  mutate(total_precip = sum(PRCP)) |> 
  mutate(yearmonth = ym(paste0(year, "-", month))) |> 
  ggplot(aes(x = yearmonth, y = total_precip)) + 
  geom_point() + 
  geom_line() + 
  scale_x_date(date_breaks = "6 months", date_labels = "%b %y") +
  theme_classic()

# We could also get mean total precip per month, without reference to year
precip |> 
  mutate(year = year(DATE),
         month = month(DATE)) |> 
  group_by(month,year) |> 
  mutate(monthly_precip = sum(PRCP)) |> 
  ungroup() |> 
  group_by(month) |> 
  mutate(mean_monthly_precip = mean(monthly_precip)) |> 
  ggplot(aes(x = month, y = mean_monthly_precip)) + 
  geom_point() + 
  geom_line() +
  scale_x_continuous(breaks = 1:12) +
  theme_classic()

`purrr`

For facilitating functional programming in R.

The goal of functional programming is to heavily rely on functions to create useful software

library(purrr)

# Imagine you wish to run a linear regression of body mass as a function of bill depth and bill length:
global_mod <- lm(body_mass_g ~ bill_length_mm+bill_depth_mm, data = penguins)
summary(global_mod)
plot(global_mod)
# decent model -- but diagnostics aren't great


# What if we add a species interaction term?
global_mod2 <- lm(body_mass_g ~ (bill_length_mm+bill_depth_mm)*species, data = penguins)
summary(global_mod2)
plot(global_mod2)

# This feels better --the diagnostics look good
# But, this might feel harder to make sense of -- lots of interaction terms
# So you might want to fit a separate model per species.

# One option is to do so one-by-one:

gentoo <- penguins |> filter(species == "Gentoo")
gentoo_model <- lm(body_mass_g ~ bill_length_mm+bill_depth_mm, data = gentoo)
adelie <- penguins |> filter(species == "Adelie")
adelie_model <- lm(body_mass_g ~ bill_length_mm+bill_depth_mm, data = adelie)
chinstrap <- penguins |> filter(species == "Chinstrap")
chinstrap_model <- lm(body_mass_g ~ bill_length_mm+bill_depth_mm, data = chinstrap)

# Now you can look at these models individually...

summary(gentoo_model)
summary(adelie_model)
summary(chinstrap_model)

## But... this mechanism can get tedious

## Mapping to the rescue

penguins |> 
  group_by(species) |> 
  nest() |> 
  mutate(linmod = map(data, \(x)
                      lm(body_mass_g ~ bill_length_mm+bill_depth_mm, data = x))) |> 
  mutate(linmod_summary = map(linmod, summary)) |> 
  pull(linmod_summary)

`stringr`

For working with string data

penguins |> 
  mutate(is_Adelie = str_detect(species, "Adel"))

# Use regular expressions 
penguins |> 
  mutate(species_first_letter = str_extract(species, "."))

Application to an ecological problem.

Use packages from the tidyverse to create the graph on the following slide, showing the abundance distributions of birds in University Lake near LSU.

Note: In addition to the core tidyverse packages, I used the package glue

To acquire the dataset, please use the following command:

ebird_out <- readRDS(url("https://gitlab.com/gklab/teaching/reproducible-research-f25/-/raw/main/11-tidyverse/ebird_dat.rds"))

Important attributes:

Each panel is 1 month in 2023
Each panel shows the abundance of the 20 most abundant species recorded that month at University Lakes
Each panel also shows the total number of species, as part of the title

Challenge

Modify this code to also show Shannon diversity for each month in the title.

Recall that Shannon diversity is calculated as:

\[H' = -\sum_{i=1}^{S} p_i*\text{ln}(p_i)\]

Hint: the following function returns the Shannon diversity for a dataframe (assuming the dataframe has a column named avg_obs). You are welcome to use it in your solution.

calculate_shannon <- function(df) {
  obs_vec <- df$avg_obs
  total_obs = sum(obs_vec)
  pi = obs_vec/total_obs
  log_pi = log(pi)
  round(-1*(sum(pi*log_pi)), 2)
}

Exercises

Lots of resources to learn more:

For more on dplyr: https://r4ds.hadley.nz/data-transform.html (See also Ch. 5)
For more on ggplot2: https://r4ds.hadley.nz/layers.html (See also Ch. 10,11)
For more on tidyr: https://r4ds.hadley.nz/data-tidy.html
For more on forcats: https://r4ds.hadley.nz/factors.html
For more on lubridate: https://r4ds.hadley.nz/datetimes.html
For more on purrr: https://r4ds.hadley.nz/program.html (See chapters within this section)
For more on stringr: https://r4ds.hadley.nz/strings.html