Advanced tidyverse

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

To learn more about the tidyverse, visit https://tidyverse.org/learn/

Work through the R for Data Science book: https://r4ds.hadley.nz/

Overview

tidyverse is a collection of R packages:

  • dplyr – for data manipulation
  • tidyr – for making data “tidy”
  • readr – for efficiently reading data
  • stringr – for working with strings
  • ggplot2 – for data visualization
  • forcats – utility functions for working with factor data
  • lubridate – for working with date/time data
  • purrr – tools for working with functions and vectors
  • tibble – a ‘modern reimagining’ of the data frame

Round-robin tour of some highlights

We will make use of the penguins data frame for this set of exercises

dplyr

The main goal is to make it easier and faster to deal with data and data manipulations

  • Over a hundred functions – lots to learn!

  • A few key categories of functions:

  • mutate amd mutate_*: add columns to a dataframe

  • filter and slice: grab rows

  • group_by: group dataframe for grouped operations

  • summarize: for creating summaries (within groups if appropriate)

  • *_join: for merging dataframes

library(tidyverse)
library(palmerpenguins)

# Create a new column
penguins |> 
  mutate(bill_ratio = bill_length_mm/bill_depth_mm)

# Express bill length and depth in cm and rename columns
penguins |> 
  mutate(across(ends_with("mm"), \(x) x/10)) |> 
  # (use the function `rename` if you want to use custom names, `rename_with` if you want to use a function)
  rename_with(~str_replace(.x, "_mm", "_cm"))

# Select a single column
penguins |> 
  pull(bill_length_mm)

# Note: Contrast this with select(), which returns a tibble:
penguins |> 
  select(bill_length_mm)



# Group by species and calculate summaries
penguins |> 
  group_by(species) |> 
  summarize(mean_billLength = mean(bill_length_mm, na.rm = T),
            sd_billLength = sd(bill_length_mm, na.rm = T))

# Group by species and island
penguins |> 
  group_by(species, island) |> 
  summarize(mean_billLength = mean(bill_length_mm, na.rm = T))

## FOR DEMO ONLY

## Imagine that we have have another data frame that records attributes of the islands:

island_data <-
  tibble(island = unique(penguins$island)) |> 
  mutate(elevation = runif(3,100,500))

# We can merge this into the penguins dataframe:
penguins |> 
  left_join(island_data, by = "island")

tidyr

Main goal is to facilitate tidy data operations (e.g. switching from a wide to long format)


penguins


# Shift to an even longer format, such that each measured aspect is a row

penguins |> 
  mutate(penguin_id = row_number()) |> 
  pivot_longer(cols = c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g))

# Create a new column that unites species and island
penguins |> 
  unite("speciesisland", species, island, remove = F)

# Nesting grouped data frames
# We will see later what we can achieve with these nested data frames
penguins |> 
  group_by(species) |> 
  nest()

penguins |> 
  group_by(species, island) |> 
  nest()


# Unnest...
penguins |> 
  group_by(species) |> 
  nest() |> 
  unnest(data)


# Crossing
# This is especially useful if you are setting up an experiment:
crossing(species = letters[1:5],
         soil = c("site 1", "site 2", "site 3"),
         watering =  c("high", "low"),
         block = 1:10) |> 
  arrange(block) |> 
  group_by(block) |> 
  mutate(pot_position = sample(1:n(), n())) |> 
  arrange(block, pot_position)

readr

Main goal is to read in data files

  • There are already base functions in R to do this, e.g. read.csv
  • readr provides alternatives that do some things better, e.g. read_csv
  • But there’s not a strictly “better” option – readr can be nice if are relying on other tidyverse functions

ggplot2

Main goal is to provide a consistent “grammar of graphics”

  • Now a highly used framework for plotting, so lots of extensions available

# install.packages('khroma')
library(khroma)

highcontrast <- color("highcontrast")

penguins |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) + 
  geom_point(aes(color = species), size = 2) + 
  scale_color_manual(values = highcontrast(3)) +
  theme_classic()

## Combining dplyr and ggplot

# Use dplyr to calculate mean and SD of each species
penguins |> 
  group_by(species) |> 
  mutate(mean_bill_length = mean(bill_length_mm, na.rm = T),
         sd_bill_length = sd(bill_length_mm, na.rm = T),
         mean_bill_depth = mean(bill_depth_mm, na.rm = T),
         sd_bill_depth = sd(bill_depth_mm, na.rm = T)) |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + 
  geom_point(size = 1, alpha = 0.6) +
  geom_pointrange(aes(y = mean_bill_depth, 
                      ymin = mean_bill_depth-sd_bill_depth,
                      ymax = mean_bill_depth + sd_bill_depth,
                      x = mean_bill_length, fill = species),
                  size = 3)+ 
  geom_errorbarh(aes(y = mean_bill_depth,
                     x = mean_bill_length,
                      xmin = mean_bill_length-sd_bill_length,
                    xmax = mean_bill_length+sd_bill_length)) +
  scale_color_manual(values = highcontrast(3)) +
  guides(color = guide_legend(override.aes = (list(size = 1)))) + 
  theme_classic()



# Now do it by species and sex
penguins |> 
  na.omit() |> 
  group_by(species, sex) |> 
  mutate(mean_bill_length = mean(bill_length_mm, na.rm = T),
         sd_bill_length = sd(bill_length_mm, na.rm = T),
         mean_bill_depth = mean(bill_depth_mm, na.rm = T),
         sd_bill_depth = sd(bill_depth_mm, na.rm = T)) |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species, shape = sex)) + 
  geom_point(size = 1, alpha = 0.6) +
  geom_pointrange(aes(y = mean_bill_depth, 
                      ymin = mean_bill_depth-sd_bill_depth,
                      ymax = mean_bill_depth + sd_bill_depth,
                      x = mean_bill_length),
                  size = 3)+ 
  geom_errorbarh(aes(y = mean_bill_depth,
                     x = mean_bill_length,
                     xmin = mean_bill_length-sd_bill_length,
                     xmax = mean_bill_length+sd_bill_length)) +
  scale_color_manual(values = highcontrast(3)) +
  guides(color = guide_legend(override.aes = (list(size = 1))),
         shape = guide_legend(override.aes = (list(size = 1)))) + 
  theme_classic()

forcats

Designed for helping deal with categorical (factor) data

# By default, factors are leveled alphabetically:
penguins$species |> levels()

# example implication: order of boxplots set alphabetically:

penguins |> 
  na.omit() |> 
  ggplot(aes(y = bill_depth_mm, x = species, fill = sex)) + 
  geom_boxplot(position = position_dodge(width = 0.75)) + 
  scale_fill_manual(values = highcontrast(2)) +
  theme_classic()


# What if we wanted to order this by largest to smallest bill length?

penguins |> 
  na.omit() |>
  # fct_reorder for reordering factors by the bill depth column
  mutate(species = fct_reorder(species, bill_depth_mm, mean)) |> 
  ggplot(aes(y = bill_depth_mm, x = species, fill = sex)) + 
  geom_boxplot(position = position_dodge(width = 0.75)) + 
  scale_fill_manual(values = highcontrast(2)) +
  theme_classic()

# What if we also wanted the order within each species to be male -- female

penguins |> 
  na.omit() |>
  mutate(species = fct_reorder(species, bill_depth_mm, mean)) |> 
  # fct_relevel for manually recoding factor levels
  mutate(sex = fct_relevel(sex, c("male", "female"))) |> 
  ggplot(aes(y = bill_depth_mm, x = species, fill = sex)) + 
  geom_boxplot(position = position_dodge(width = 0.75)) + 
  scale_fill_manual(values = highcontrast(2)) +
  theme_classic()

lubridate

For working with temporal information


precip <- read_csv(url("https://gitlab.com/gklab/teaching/reproducible-research-f25/-/raw/main/11-tidyverse/noaa_honolulu_precip.csv"))
precip |> 
  ggplot(aes(x = DATE, y = PRCP)) +
  geom_point()

# Instead, we can get total precipitation in each month and plot that against time
precip |> 
  mutate(year = year(DATE),
         month = month(DATE)) |> 
  group_by(year, month) |> 
  mutate(total_precip = sum(PRCP)) |> 
  mutate(yearmonth = ym(paste0(year, "-", month))) |> 
  ggplot(aes(x = yearmonth, y = total_precip)) + 
  geom_point() + 
  geom_line() + 
  scale_x_date(date_breaks = "6 months", date_labels = "%b %y") +
  theme_classic()

# We could also get mean total precip per month, without reference to year
precip |> 
  mutate(year = year(DATE),
         month = month(DATE)) |> 
  group_by(month,year) |> 
  mutate(monthly_precip = sum(PRCP)) |> 
  ungroup() |> 
  group_by(month) |> 
  mutate(mean_monthly_precip = mean(monthly_precip)) |> 
  ggplot(aes(x = month, y = mean_monthly_precip)) + 
  geom_point() + 
  geom_line() +
  scale_x_continuous(breaks = 1:12) +
  theme_classic()

purrr

For facilitating functional programming in R.

  • The goal of functional programming is to heavily rely on functions to create useful software

library(purrr)

# Imagine you wish to run a linear regression of body mass as a function of bill depth and bill length:
global_mod <- lm(body_mass_g ~ bill_length_mm+bill_depth_mm, data = penguins)
summary(global_mod)
plot(global_mod)
# decent model -- but diagnostics aren't great


# What if we add a species interaction term?
global_mod2 <- lm(body_mass_g ~ (bill_length_mm+bill_depth_mm)*species, data = penguins)
summary(global_mod2)
plot(global_mod2)

# This feels better --the diagnostics look good
# But, this might feel harder to make sense of -- lots of interaction terms
# So you might want to fit a separate model per species.

# One option is to do so one-by-one:

gentoo <- penguins |> filter(species == "Gentoo")
gentoo_model <- lm(body_mass_g ~ bill_length_mm+bill_depth_mm, data = gentoo)
adelie <- penguins |> filter(species == "Adelie")
adelie_model <- lm(body_mass_g ~ bill_length_mm+bill_depth_mm, data = adelie)
chinstrap <- penguins |> filter(species == "Chinstrap")
chinstrap_model <- lm(body_mass_g ~ bill_length_mm+bill_depth_mm, data = chinstrap)

# Now you can look at these models individually...

summary(gentoo_model)
summary(adelie_model)
summary(chinstrap_model)

## But... this mechanism can get tedious

## Mapping to the rescue

penguins |> 
  group_by(species) |> 
  nest() |> 
  mutate(linmod = map(data, \(x)
                      lm(body_mass_g ~ bill_length_mm+bill_depth_mm, data = x))) |> 
  mutate(linmod_summary = map(linmod, summary)) |> 
  pull(linmod_summary)

stringr

For working with string data

penguins |> 
  mutate(is_Adelie = str_detect(species, "Adel"))

# Use regular expressions 
penguins |> 
  mutate(species_first_letter = str_extract(species, "."))

Application to an ecological problem.

Use packages from the tidyverse to create the graph on the following slide, showing the abundance distributions of birds in University Lake near LSU.

Note: In addition to the core tidyverse packages, I used the package glue

To acquire the dataset, please use the following command:

ebird_out <- readRDS(url("https://gitlab.com/gklab/teaching/reproducible-research-f25/-/raw/main/11-tidyverse/ebird_dat.rds"))

Important attributes:

  • Each panel is 1 month in 2023
  • Each panel shows the abundance of the 20 most abundant species recorded that month at University Lakes
  • Each panel also shows the total number of species, as part of the title

Challenge

Modify this code to also show Shannon diversity for each month in the title.

Recall that Shannon diversity is calculated as:

\[H' = -\sum_{i=1}^{S} p_i*\text{ln}(p_i)\]

Hint: the following function returns the Shannon diversity for a dataframe (assuming the dataframe has a column named avg_obs). You are welcome to use it in your solution.

calculate_shannon <- function(df) {
  obs_vec <- df$avg_obs
  total_obs = sum(obs_vec)
  pi = obs_vec/total_obs
  log_pi = log(pi)
  round(-1*(sum(pi*log_pi)), 2)
}

Exercises

Lots of resources to learn more: