Interacting with open-source databases through R

Accessing data through R

  • Option 1: Download from a repository, read into R.
# Navigate to website:
# https://figshare.com/articles/dataset/When_and_where_plant-soil_feedback_may_promote_plant_coexistence_A_meta-analysis_Supplementary_Table_1_File_containing_raw_data_and_citations_for_the_meta-analysis/7985195/1
# Download and save file


library(readr)
data <- read_csv("/path/to/file.csv")

Accessing data through R

  • Option 2: Use R to download from a repository, then proceed with your analysis

  • The package curl has several functions that can “see” URLs and download relevant files

# E.g. download a single file:
curl::curl_download("https://figshare.com/ndownloader/files/14874749", 
                    destfile = "abc.xlsx")

# Or download a whole zip file
curl::curl_download("https://figshare.com/ndownloader/articles/13140146/versions/1", 
                    destfile = "xyz.zip")
  • To get a figshare ndownloader link, navigate to your favorite paper’s figshare directory and copy the download link
  • This can also be done for Zenodo
  • Useful if you are using a published dataset for your analysis and don’t want to duplicate it in your project archive.
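
Once downloaded, the files are read in as usual. A minimal sketch (readxl is one option for the .xlsx file; zip archives can be unpacked with base R):

# Read the spreadsheet downloaded above (assuming readxl is installed)
dat <- readxl::read_excel("abc.xlsx")

# Unpack the zip archive before reading its contents
unzip("xyz.zip", exdir = "xyz_files")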

Accessing data through R

Web-scraping through R

library(tidyverse)
library(rvest)

url <- "https://rvest.tidyverse.org/articles/starwars.html"
html <- read_html(url)

section <- html |> html_elements("section")
section[[1]]
## {html_node}
## <section>
## [1] <h2 data-id="1">\nThe Phantom Menace\n</h2>
## [2] <p>\nReleased: 1999-05-19\n</p>
## [3] <p>\nDirector: <span class="director">George Lucas</span>\n</p>
## [4] <div class="crawl">\n<p>\nTurmoil has engulfed the Galactic Republic. The ...

We can now extract data from the HTML output.

movies <- section |> html_element("h2") |> html_text2()
movies
## [1] "The Phantom Menace"      "Attack of the Clones"   
## [3] "Revenge of the Sith"     "A New Hope"             
## [5] "The Empire Strikes Back" "Return of the Jedi"     
## [7] "The Force Awakens"

directors <- section |> html_element(".director") |> html_text2()
directors
## [1] "George Lucas"     "George Lucas"     "George Lucas"     "George Lucas"    
## [5] "Irvin Kershner"   "Richard Marquand" "J. J. Abrams"

crawl <- section |> html_element(".crawl") |> html_text2()
crawl
## [1] "Turmoil has engulfed the Galactic Republic. The taxation of trade routes to outlying star systems is in dispute.\n\nHoping to resolve the matter with a blockade of deadly battleships, the greedy Trade Federation has stopped all shipping to the small planet of Naboo.\n\nWhile the Congress of the Republic endlessly debates this alarming chain of events, the Supreme Chancellor has secretly dispatched two Jedi Knights, the guardians of peace and justice in the galaxy, to settle the conflict…."           
## [2] "There is unrest in the Galactic Senate. Several thousand solar systems have declared their intentions to leave the Republic.\n\nThis separatist movement, under the leadership of the mysterious Count Dooku, has made it difficult for the limited number of Jedi Knights to maintain peace and order in the galaxy.\n\nSenator Amidala, the former Queen of Naboo, is returning to the Galactic Senate to vote on the critical issue of creating an ARMY OF THE REPUBLIC to assist the overwhelmed Jedi…."             
## [3] "War! The Republic is crumbling under attacks by the ruthless Sith Lord, Count Dooku. There are heroes on both sides. Evil is everywhere.\n\nIn a stunning move, the fiendish droid leader, General Grievous, has swept into the Republic capital and kidnapped Chancellor Palpatine, leader of the Galactic Senate.\n\nAs the Separatist Droid Army attempts to flee the besieged capital with their valuable hostage, two Jedi Knights lead a desperate mission to rescue the captive Chancellor…."                     
## [4] "It is a period of civil war. Rebel spaceships, striking from a hidden base, have won their first victory against the evil Galactic Empire.\n\nDuring the battle, Rebel spies managed to steal secret plans to the Empire’s ultimate weapon, the DEATH STAR, an armored space station with enough power to destroy an entire planet.\n\nPursued by the Empire’s sinister agents, Princess Leia races home aboard her starship, custodian of the stolen plans that can save her people and restore freedom to the galaxy…."
## [5] "It is a dark time for the Rebellion. Although the Death Star has been destroyed, Imperial troops have driven the Rebel forces from their hidden base and pursued them across the galaxy.\n\nEvading the dreaded Imperial Starfleet, a group of freedom fighters led by Luke Skywalker has established a new secret base on the remote ice world of Hoth.\n\nThe evil lord Darth Vader, obsessed with finding young Skywalker, has dispatched thousands of remote probes into the far reaches of space…."                 
## [6] "Luke Skywalker has returned to his home planet of Tatooine in an attempt to rescue his friend Han Solo from the clutches of the vile gangster Jabba the Hutt.\n\nLittle does Luke know that the GALACTIC EMPIRE has secretly begun construction on a new armored space station even more powerful than the first dreaded Death Star.\n\nWhen completed, this ultimate weapon will spell certain doom for the small band of rebels struggling to restore freedom to the galaxy…"                                          
## [7] "Luke Skywalker has vanished. In his absence, the sinister FIRST ORDER has risen from the ashes of the Empire and will not rest until Skywalker, the last Jedi, has been destroyed. With the support of the REPUBLIC, General Leia Organa leads a brave RESISTANCE. She is desperate to find her brother Luke and gain his help in restoring peace and justice to the galaxy. Leia has sent her most daring pilot on a secret mission to Jakku, where an old ally has discovered a clue to Luke’s whereabouts…."

Accessing data through R

Potential limitations of the preceding options:

  • Static data product
  • Run into some limitations with large files
  • If data download is happening outside of a script, limited reproducibility
  • For more detail, see Web Scraping in R for Data Science

Alternative: Request data from dynamic databases

  • The data product need not be static (e.g. a regularly updated database of all records in iNaturalist)
  • Queries are happening through systems designed for data, so better for large files
  • Reproducible data provenance, as the exact query can be recreated

Accessing databases using an API

Application programming interfaces (APIs) are mechanisms that let two computers communicate

  • You use APIs on a daily basis
  • e.g. the weather app on your phone connects to a remote database through an API
  • There are several types of APIs; we’ll focus on “REST” APIs
  • APIs are an alternative to GUIs (graphical user interfaces, where you point and click on options)
  • To interact with an API, we need a properly formatted HTTP request (see the sketch below)
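
A REST request is ultimately just a URL: the endpoint, then ?, then &-separated key=value parameters. As a preview, the Census query built in the next example corresponds to the URL:

https://api.census.gov/data/2018/acs/acs5?get=NAME,B19013_001E&for=county:*&in=state:55

httr’s GET() assembles this URL from a named list of parameters, as shown next.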

Example

Accessing US demographic data from the US Census Bureau

Using R to access an API, with the httr and jsonlite packages.

# If needed, install two useful packages 
# pak::pak("httr")
# pak::pak("jsonlite")

library("tidyverse")
library("httr")
library("jsonlite")

# Path to US Census API
path <- 'https://api.census.gov/data/2018/acs/acs5'

# To use APIs, we need a good handle on the expected query parameters
# Here, we are building a query for variable B19013_001E (median household income)
# and asking for data from all counties in state ID 55 (Wisconsin)
# State codes are enumerated at
# https://www.census.gov/library/reference/code-lists/ansi/ansi-codes-for-states.html
query_params <- list('get' = 'NAME,B19013_001E', 
                     'for' = 'county:*',
                     'in' = 'state:55') 

# UNCOMMENT the following lines to run the API search and cache the result
# response <- GET(path, query = query_params)
# saveRDS(response, file = "mdresponse.rds")
response <- readRDS("mdresponse.rds")
response |>
  content(as = 'text') |> 
  fromJSON() |> 
  as_tibble()
## # A tibble: 73 × 4
##    V1                           V2          V3    V4    
##    <chr>                        <chr>       <chr> <chr> 
##  1 NAME                         B19013_001E state county
##  2 Iron County, Wisconsin       40801       55    051   
##  3 Clark County, Wisconsin      51872       55    019   
##  4 St. Croix County, Wisconsin  81124       55    109   
##  5 Oconto County, Wisconsin     57105       55    083   
##  6 Lincoln County, Wisconsin    56086       55    069   
##  7 Waupaca County, Wisconsin    57680       55    135   
##  8 Barron County, Wisconsin     50903       55    005   
##  9 Sawyer County, Wisconsin     44555       55    113   
## 10 Green Lake County, Wisconsin 53260       55    047   
## # ℹ 63 more rows

# We can now progress with this analysis as we choose.
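
Note that fromJSON() returns the column names in the first row (NAME, B19013_001E, …). A minimal sketch of promoting that row to the header with janitor::row_to_names() (the same step appears in the multi-state example below; assumes janitor is installed):

census_tbl <- response |>
  content(as = 'text') |> 
  fromJSON() |> 
  as_tibble() |> 
  janitor::row_to_names(1)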

Programmatically download data

# NOTE that this code block is not executed.
# Retrieve data for four states (ID 1, 12, 32, 55)
multi_state <- 
  tibble(state_ids = sprintf("%02d", c(1, 12, 32, 55))) |>
  mutate(path = "https://api.census.gov/data/2018/acs/acs5",
         query_params = map(state_ids, \(x) 
                     list('get' = 'NAME,B19013_001E', 
                          'for' = 'county:*',
                          'in' = paste0('state:',x)))) |> 
  mutate(response = map2(path, query_params,
                         \(x,y)
                         GET(x, query = y)))
# saveRDS(multi_state, file = "multistate-database.rds")
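
When looping over many requests like this, it is good practice to pace them so the server isn’t hammered. A sketch with purrr’s rate helpers (the one-second delay is an assumption, not a documented Census requirement):

# Wrap GET() so successive calls pause for one second
slow_get <- purrr::slowly(\(x, y) GET(x, query = y),
                          rate = purrr::rate_delay(1))
# Use slow_get in place of the anonymous GET() call inside map2() above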

Programmatically download data

multi_state <- readRDS("multistate-database.rds")

multi_state <-
  multi_state |> 
  mutate(data = map(response, \(x)
                    x |> 
                      content(as = "text") |> 
                      fromJSON() |> 
                      as_tibble() |> 
                      janitor::row_to_names(1))) |> 
  unnest(data) |> 
  mutate(state_name = str_extract(NAME, "\\s(\\w+)$")) # note: the match keeps a leading space

Programmatically download data

multi_state |> 
  count(state_name)
## # A tibble: 4 × 2
##   state_name       n
##   <chr>        <int>
## 1 " Alabama"      67
## 2 " Florida"      67
## 3 " Nevada"       17
## 4 " Wisconsin"    72

multi_state |> 
  rename(medinc = B19013_001E) |> 
  mutate(medinc = as.numeric(medinc)) |> 
  ggplot(aes(x = medinc)) + 
  geom_density() + 
  facet_wrap(.~state_name)

Another example

  • GBIF has an incredible database for species occurrence data
  • Collates information from various sources, e.g. eBird, iNaturalist, and much more (see here for the complete list)
  • Data are accessible through GBIF’s API
  • Different APIs for accessing occurrence data, retrieving individual records, etc.

Another example

  • Using GBIF to look for all plants in the family Melastomataceae recorded in Ecuador between 500 and 750 m elevation
  • See documentation for details on how to use this API.
path <- 'https://api.gbif.org/v1/occurrence/search'

# Query parameters: country code, elevation range (m), and taxon key
query_params <- list('country' = 'EC', 
                     'elevation' = '500,750',
                     'acceptedTaxonKey' = '6683') # See https://www.gbif.org/species/ to find the taxonKey for your group

# NOTE: Uncomment the following lines to run the API search and cache the result
# response <- GET(path, query = query_params)
# saveRDS(response, file = "melastome-database.rds")

View the output (the search API returns twenty records by default)

readRDS(file = "melastome-database.rds") |>
 content(as = 'text') |> 
 fromJSON() |> 
  pluck("results") |> 
  as_tibble() |> 
  select(where(is.character)) |>
  select(12,27,33,40, 43) |> 
  knitr::kable() |> 
  kableExtra::kable_styling(font_size = 10) |> 
  kableExtra::scroll_box(width = "1000px", height = "500px") 
acceptedScientificName recordedBy publishedByGbifRegion municipality language
Melastomataceae T. L. P. Couvreur LATIN_AMERICA Desconocido es
Melastomataceae Thomas B. Croat|Geneviève Ferry|David Scherberich|Claudia L. Henríquez R. NORTH_AMERICA NA NA
Melastomataceae Carmen Ulloa|J. M. Carrión|M. Bustamante NORTH_AMERICA NA NA
Melastomataceae Carmen Ulloa|J. M. Carrión|M. Bustamante NORTH_AMERICA NA NA
Melastomataceae Carmen Ulloa|J. M. Carrión|M. Bustamante NORTH_AMERICA NA NA
Melastomataceae Carmen Ulloa|J. M. Carrión|M. Bustamante NORTH_AMERICA NA NA
Melastomataceae L.Torres LATIN_AMERICA Ibarra ES
Melastomataceae J. Homeier LATIN_AMERICA Desconocido es
Melastomataceae Walter A. Palacios NORTH_AMERICA NA NA
Melastomataceae Walter A. Palacios NORTH_AMERICA NA NA
Melastomataceae W. Palacios LATIN_AMERICA NA es
Melastomataceae Abel Wisum NORTH_AMERICA NA NA
Melastomataceae Abel Wisum NORTH_AMERICA NA NA
Melastomataceae Camilo Kajekai|Abel Wisum NORTH_AMERICA NA NA
Melastomataceae Camilo Kajekai|Abel Wisum NORTH_AMERICA NA NA
Melastomataceae Camilo Kajekai|Abel Wisum NORTH_AMERICA NA NA
Melastomataceae Camilo Kajekai|Abel Wisum NORTH_AMERICA NA NA
Melastomataceae Camilo Kajekai|Abel Wisum NORTH_AMERICA NA NA
Melastomataceae Camilo Kajekai|Abel Wisum NORTH_AMERICA NA NA
Melastomataceae Camilo Kajekai|Abel Wisum NORTH_AMERICA NA NA
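
The search endpoint is paginated: twenty records per request by default. GBIF’s API documents limit and offset parameters for retrieving more (a sketch, not run here):

# Request up to 300 records per call (the documented maximum),
# then step 'offset' by 'limit' to page through the full result set
query_params$limit <- 300
query_params$offset <- 0
# response <- GET(path, query = query_params)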

Guidelines for APIs

  • Well-maintained APIs have documentation on query parameters - get familiar with these!
  • e.g. US EPA API guidelines here


  • Many APIs require that you sign up for a key before making requests (e.g. for downloading records from GBIF)
  • This is to ensure that API servers aren’t “spammed” with requests (see the sketch below)
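
A common pattern is to store the key in ~/.Renviron so it never appears in your scripts. A hypothetical sketch (the parameter name 'key' varies by API; check the documentation):

# In ~/.Renviron (then restart R):  MY_API_KEY=abc123
# In your script ('key' is a hypothetical parameter name):
response <- GET(path, query = list(key = Sys.getenv("MY_API_KEY")))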

R Packages for working with APIs

  • You don’t always have to hand-code API queries

  • In some cases, others have built packages to facilitate API searches.

  • e.g. tidycensus for accessing the US Census API; rgbif for the GBIF (Global Biodiversity Information Facility) API (see the sketch after this list)

  • Browse a large list of packages for accessing eco-evo-related APIs through R here
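
For instance, the hand-coded Melastomataceae query above can be written with rgbif (a minimal sketch, assuming rgbif is installed):

# occ_search() accepts taxonKey and country arguments
melastomes <- rgbif::occ_search(taxonKey = 6683, country = "EC")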

Other sources of data that can be accessed through R

  • PRISM network for Climate data in continental US

Each day, PRISM collects data from over 25,000 precipitation and temperature stations. These include all of NOAA’s major networks; those operated by the US Forest Service, Bureau of Reclamation, and USDA; the large CoCoRaHS precipitation network; state and regional systems operating in many parts of the country; and Environment Canada stations. More networks are being added all the time.

  • Directly accessible through R with the prism package

Other sources of data that can be accessed through R

  • WorldClim has global climate data from 1970–2000

  • Accessible in R through the geodata package

    • The package is also useful for other spatial ecology work: human footprint index; population density; land cover; soil properties; OpenStreetMap layers (see the sketch below)
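
A minimal sketch of retrieving the WorldClim bioclimatic layers with geodata (assuming geodata is installed):

# Download the 10-arc-minute bioclim rasters to a local path
bio <- geodata::worldclim_global(var = "bio", res = 10, path = tempdir())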

Other sources of data that can be accessed through R

  • Sea Surface Temperature data through Copernicus (also has global atmospheric and land use data)
  • Accessible through the CopernicusMarine package

Exercise

  • Working with long-term databases requires a great deal of care

  • Data cleaning, maintaining “good hygiene”, not spamming APIs with your requests, etc.

  • As practice, work through the rgbif tutorial, which will introduce you to various useful packages for biodiversity informatics in R (direct link https://training.gbif.org/en/data-use/lemur-catta-tutorial)