Interacting with open-source databases through R

Accessing data through R

  • Option 1: Download from a repository, read into R.
# Navigate to website:
# https://figshare.com/articles/dataset/When_and_where_plant-soil_feedback_may_promote_plant_coexistence_A_meta-analysis_Supplementary_Table_1_File_containing_raw_data_and_citations_for_the_meta-analysis/7985195/1
# Download and save file


library(readr)
data <- read_csv("/path/to/file.csv")

Accessing data through R

  • Option 2: Use R to download from a repository, then proceed with your analysis

  • The package curl has several functions that can “see” URLs and download relevant files

# E.g. download a single file:
curl::curl_download("https://figshare.com/ndownloader/files/14874749", 
                    destfile = "abc.xlsx")

# Or download a whole zip file
curl::curl_download("https://figshare.com/ndownloader/articles/13140146/versions/1", 
                    destfile = "xyz.zip")
  • To get a figshare ndownloader link, navigate to your favorite paper’s figshare directory and copy the download link
  • This can also be done for Zenodo
  • Useful if you are using a published dataset for your analysis and don’t want to duplicate it in your project archive.
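
Once downloaded, the files are read in as usual. A minimal sketch (readxl is one option for the .xlsx file; zip archives can be unpacked with base R):

# Read the spreadsheet downloaded above (assuming readxl is installed)
dat <- readxl::read_excel("abc.xlsx")

# Unpack the zip archive before reading its contents
unzip("xyz.zip", exdir = "xyz_files")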

Accessing data through R

Web-scraping through R

library(tidyverse)
library(rvest)

url <- "https://rvest.tidyverse.org/articles/starwars.html"
html <- read_html(url)

section <- html |> html_elements("section")
section[[1]]
## {html_node}
## <section>
## [1] <h2 data-id="1">\nThe Phantom Menace\n</h2>
## [2] <p>\nReleased: 1999-05-19\n</p>
## [3] <p>\nDirector: <span class="director">George Lucas</span>\n</p>
## [4] <div class="crawl">\n<p>\nTurmoil has engulfed the Galactic Republic. The ...

We can now extract data from the HTML output.

movies <- section |> html_element("h2") |> html_text2()
movies
## [1] "The Phantom Menace"      "Attack of the Clones"   
## [3] "Revenge of the Sith"     "A New Hope"             
## [5] "The Empire Strikes Back" "Return of the Jedi"     
## [7] "The Force Awakens"

directors <- section |> html_element(".director") |> html_text2()
directors
## [1] "George Lucas"     "George Lucas"     "George Lucas"     "George Lucas"    
## [5] "Irvin Kershner"   "Richard Marquand" "J. J. Abrams"

crawl <- section |> html_element(".crawl") |> html_text2()
crawl
## [1] "Turmoil has engulfed the Galactic Republic. The taxation of trade routes to outlying star systems is in dispute.\n\nHoping to resolve the matter with a blockade of deadly battleships, the greedy Trade Federation has stopped all shipping to the small planet of Naboo.\n\nWhile the Congress of the Republic endlessly debates this alarming chain of events, the Supreme Chancellor has secretly dispatched two Jedi Knights, the guardians of peace and justice in the galaxy, to settle the conflict…."           
## [2] "There is unrest in the Galactic Senate. Several thousand solar systems have declared their intentions to leave the Republic.\n\nThis separatist movement, under the leadership of the mysterious Count Dooku, has made it difficult for the limited number of Jedi Knights to maintain peace and order in the galaxy.\n\nSenator Amidala, the former Queen of Naboo, is returning to the Galactic Senate to vote on the critical issue of creating an ARMY OF THE REPUBLIC to assist the overwhelmed Jedi…."             
## [3] "War! The Republic is crumbling under attacks by the ruthless Sith Lord, Count Dooku. There are heroes on both sides. Evil is everywhere.\n\nIn a stunning move, the fiendish droid leader, General Grievous, has swept into the Republic capital and kidnapped Chancellor Palpatine, leader of the Galactic Senate.\n\nAs the Separatist Droid Army attempts to flee the besieged capital with their valuable hostage, two Jedi Knights lead a desperate mission to rescue the captive Chancellor…."                     
## [4] "It is a period of civil war. Rebel spaceships, striking from a hidden base, have won their first victory against the evil Galactic Empire.\n\nDuring the battle, Rebel spies managed to steal secret plans to the Empire’s ultimate weapon, the DEATH STAR, an armored space station with enough power to destroy an entire planet.\n\nPursued by the Empire’s sinister agents, Princess Leia races home aboard her starship, custodian of the stolen plans that can save her people and restore freedom to the galaxy…."
## [5] "It is a dark time for the Rebellion. Although the Death Star has been destroyed, Imperial troops have driven the Rebel forces from their hidden base and pursued them across the galaxy.\n\nEvading the dreaded Imperial Starfleet, a group of freedom fighters led by Luke Skywalker has established a new secret base on the remote ice world of Hoth.\n\nThe evil lord Darth Vader, obsessed with finding young Skywalker, has dispatched thousands of remote probes into the far reaches of space…."                 
## [6] "Luke Skywalker has returned to his home planet of Tatooine in an attempt to rescue his friend Han Solo from the clutches of the vile gangster Jabba the Hutt.\n\nLittle does Luke know that the GALACTIC EMPIRE has secretly begun construction on a new armored space station even more powerful than the first dreaded Death Star.\n\nWhen completed, this ultimate weapon will spell certain doom for the small band of rebels struggling to restore freedom to the galaxy…"                                          
## [7] "Luke Skywalker has vanished. In his absence, the sinister FIRST ORDER has risen from the ashes of the Empire and will not rest until Skywalker, the last Jedi, has been destroyed. With the support of the REPUBLIC, General Leia Organa leads a brave RESISTANCE. She is desperate to find her brother Luke and gain his help in restoring peace and justice to the galaxy. Leia has sent her most daring pilot on a secret mission to Jakku, where an old ally has discovered a clue to Luke’s whereabouts…."

Accessing data through R

Potential limitations of the preceding options:

  • Static data product
  • Run into some limitations with large files
  • If data download is happening outside of a script, limited reproducibility
  • For more detail, see Web Scraping in R for Data Science

Alternative: Request data from dynamic databases

  • The data product need not be static (e.g. a regularly updated database of all records in iNaturalist)
  • Queries are happening through systems designed for data, so better for large files
  • Reproducible data provenance, as the exact query can be recreated

Accessing databases using an API

Application programming interfaces (APIs) are mechanisms that let two computers communicate

  • You use APIs on a daily basis
  • e.g. the weather app on your phone connects to a remote database through an API
  • There are several types of APIs; we’ll focus on “REST” APIs
  • APIs are an alternative to GUIs (graphical user interfaces, where you point and click on options)
  • To interact with an API, we need a properly formatted HTTP request (see the sketch below)
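
A REST request is ultimately just a URL: the endpoint, then ?, then &-separated key=value parameters. As a preview, the Census query built in the next example corresponds to the URL:

https://api.census.gov/data/2018/acs/acs5?get=NAME,B19013_001E&for=county:*&in=state:55

httr’s GET() assembles this URL from a named list of parameters, as shown next.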

Example

Accessing US demographic data from the US Census Bureau

Using R to access an API, with the httr and jsonlite packages.

# If needed, install two useful packages 
# pak::pak("httr")
# pak::pak("jsonlite")

library("tidyverse")
library("httr")
library("jsonlite")

# Path to US Census API
path <- 'https://api.census.gov/data/2018/acs/acs5'

# To use APIs, we need a good handle on the expected query parameters
# Here, we are building a query for variable B19013_001E (median household income)
# and asking for data from all counties in state ID 55 (Wisconsin)
# State codes are enumerated at
# https://www.census.gov/library/reference/code-lists/ansi/ansi-codes-for-states.html
query_params <- list('get' = 'NAME,B19013_001E', 
                     'for' = 'county:*',
                     'in' = 'state:55') 

# UNCOMMENT the following lines to run the API search and cache the result
# response <- GET(path, query = query_params)
# saveRDS(response, file = "mdresponse.rds")
response <- readRDS("mdresponse.rds")
response |>
  content(as = 'text') |> 
  fromJSON() |> 
  as_tibble()
## # A tibble: 73 × 4
##    V1                           V2          V3    V4    
##    <chr>                        <chr>       <chr> <chr> 
##  1 NAME                         B19013_001E state county
##  2 Iron County, Wisconsin       40801       55    051   
##  3 Clark County, Wisconsin      51872       55    019   
##  4 St. Croix County, Wisconsin  81124       55    109   
##  5 Oconto County, Wisconsin     57105       55    083   
##  6 Lincoln County, Wisconsin    56086       55    069   
##  7 Waupaca County, Wisconsin    57680       55    135   
##  8 Barron County, Wisconsin     50903       55    005   
##  9 Sawyer County, Wisconsin     44555       55    113   
## 10 Green Lake County, Wisconsin 53260       55    047   
## # ℹ 63 more rows

# We can now progress with this analysis as we choose.
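
Note that fromJSON() returns the column names in the first row (NAME, B19013_001E, …). A minimal sketch of promoting that row to the header with janitor::row_to_names() (the same step appears in the multi-state example below; assumes janitor is installed):

census_tbl <- response |>
  content(as = 'text') |> 
  fromJSON() |> 
  as_tibble() |> 
  janitor::row_to_names(1)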

Programmatically download data

# NOTE that this code block is not executed.
# Retrieve data for four states (ID 1, 12, 32, 55)
multi_state <- 
  tibble(state_ids = sprintf("%02d", c(1, 12, 32, 55))) |>
  mutate(path = "https://api.census.gov/data/2018/acs/acs5",
         query_params = map(state_ids, \(x) 
                     list('get' = 'NAME,B19013_001E', 
                          'for' = 'county:*',
                          'in' = paste0('state:',x)))) |> 
  mutate(response = map2(path, query_params,
                         \(x,y)
                         GET(x, query = y)))
# saveRDS(multi_state, file = "multistate-database.rds")
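
When looping over many requests like this, it is good practice to pace them so the server isn’t hammered. A sketch with purrr’s rate helpers (the one-second delay is an assumption, not a documented Census requirement):

# Wrap GET() so successive calls pause for one second
slow_get <- purrr::slowly(\(x, y) GET(x, query = y),
                          rate = purrr::rate_delay(1))
# Use slow_get in place of the anonymous GET() call inside map2() above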

Programmatically download data

multi_state <- readRDS("multistate-database.rds")

multi_state <-
  multi_state |> 
  mutate(data = map(response, \(x)
                    x |> 
                      content(as = "text") |> 
                      fromJSON() |> 
                      as_tibble() |> 
                      janitor::row_to_names(1))) |> 
  unnest(data) |> 
  mutate(state_name = str_extract(NAME, "\\s(\\w+)$")) # note: the match keeps a leading space

Programmatically download data

multi_state |> 
  count(state_name)
## # A tibble: 4 × 2
##   state_name       n
##   <chr>        <int>
## 1 " Alabama"      67
## 2 " Florida"      67
## 3 " Nevada"       17
## 4 " Wisconsin"    72

multi_state |> 
  rename(medinc = B19013_001E) |> 
  mutate(medinc = as.numeric(medinc)) |> 
  ggplot(aes(x = medinc)) + 
  geom_density() + 
  facet_wrap(.~state_name)

Another example

  • GBIF has an incredible database for species occurrence data
  • Collates information from various sources, e.g. eBird, iNaturalist, and much more (see here for the complete list)
  • Data are accessible through GBIF’s API
  • Different APIs for accessing occurrence data, retrieving individual records, etc.

Another example

  • Using GBIF to look for all plants in the family Melastomataceae recorded in Ecuador between 500 and 750 m elevation
  • See documentation for details on how to use this API.
path <- 'https://api.gbif.org/v1/occurrence/search'

# Query parameters: country code, elevation range (m), and taxon key
query_params <- list('country' = 'EC', 
                     'elevation' = '500,750',
                     'acceptedTaxonKey' = '6683') # See https://www.gbif.org/species/ to find the taxonKey for your group

# NOTE: Uncomment the following lines to run the API search and cache the result
# response <- GET(path, query = query_params)
# saveRDS(response, file = "melastome-database.rds")

View the output (the search API returns twenty records by default)

readRDS(file = "melastome-database.rds") |>
 content(as = 'text') |> 
 fromJSON() |> 
  pluck("results") |> 
  as_tibble() |> 
  select(where(is.character)) |>
  select(12,27,33,40, 43) |> 
  knitr::kable() |> 
  kableExtra::kable_styling(font_size = 10) |> 
  kableExtra::scroll_box(width = "1000px", height = "500px") 
acceptedScientificName recordedBy publishedByGbifRegion municipality language
Melastomataceae T. L. P. Couvreur LATIN_AMERICA Desconocido es
Melastomataceae Thomas B. Croat|Geneviève Ferry|David Scherberich|Claudia L. Henríquez R. NORTH_AMERICA NA NA
Melastomataceae Carmen Ulloa|J. M. Carrión|M. Bustamante NORTH_AMERICA NA NA
Melastomataceae Carmen Ulloa|J. M. Carrión|M. Bustamante NORTH_AMERICA NA NA
Melastomataceae Carmen Ulloa|J. M. Carrión|M. Bustamante NORTH_AMERICA NA NA
Melastomataceae Carmen Ulloa|J. M. Carrión|M. Bustamante NORTH_AMERICA NA NA
Melastomataceae L.Torres LATIN_AMERICA Ibarra ES
Melastomataceae J. Homeier LATIN_AMERICA Desconocido es
Melastomataceae Walter A. Palacios NORTH_AMERICA NA NA
Melastomataceae Walter A. Palacios NORTH_AMERICA NA NA
Melastomataceae W. Palacios LATIN_AMERICA NA es
Melastomataceae Abel Wisum NORTH_AMERICA NA NA
Melastomataceae Abel Wisum NORTH_AMERICA NA NA
Melastomataceae Camilo Kajekai|Abel Wisum NORTH_AMERICA NA NA
Melastomataceae Camilo Kajekai|Abel Wisum NORTH_AMERICA NA NA
Melastomataceae Camilo Kajekai|Abel Wisum NORTH_AMERICA NA NA
Melastomataceae Camilo Kajekai|Abel Wisum NORTH_AMERICA NA NA
Melastomataceae Camilo Kajekai|Abel Wisum NORTH_AMERICA NA NA
Melastomataceae Camilo Kajekai|Abel Wisum NORTH_AMERICA NA NA
Melastomataceae Camilo Kajekai|Abel Wisum NORTH_AMERICA NA NA
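
The search endpoint is paginated: twenty records per request by default. GBIF’s API documents limit and offset parameters for retrieving more (a sketch, not run here):

# Request up to 300 records per call (the documented maximum),
# then step 'offset' by 'limit' to page through the full result set
query_params$limit <- 300
query_params$offset <- 0
# response <- GET(path, query = query_params)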

Guidelines for APIs

  • Well-maintained APIs have documentation on query parameters - get familiar with these!
  • e.g. US EPA API guidelines here


  • Many APIs require that you sign up for a key before making requests (e.g. for downloading records from GBIF)
  • This is to ensure that API servers aren’t “spammed” with requests (see the sketch below)
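
A common pattern is to store the key in ~/.Renviron so it never appears in your scripts. A hypothetical sketch (the parameter name 'key' varies by API; check the documentation):

# In ~/.Renviron (then restart R):  MY_API_KEY=abc123
# In your script ('key' is a hypothetical parameter name):
response <- GET(path, query = list(key = Sys.getenv("MY_API_KEY")))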

R Packages for working with APIs

  • You don’t always have to hand-code API queries

  • In some cases, others have built packages to facilitate API searches.

  • e.g. tidycensus for accessing the US Census API; rgbif for the GBIF (Global Biodiversity Information Facility) API (see the sketch after this list)

  • Browse a large list of packages for accessing eco-evo-related APIs through R here
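
For instance, the hand-coded Melastomataceae query above can be written with rgbif (a minimal sketch, assuming rgbif is installed):

# occ_search() accepts taxonKey and country arguments
melastomes <- rgbif::occ_search(taxonKey = 6683, country = "EC")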

Other sources of data that can be accessed through R

  • PRISM network for Climate data in continental US

Each day, PRISM collects data from over 25,000 precipitation and temperature stations. These include all of NOAA’s major networks; those operated by the US Forest Service, Bureau of Reclamation, and USDA; the large CoCoRaHS precipitation network; state and regional systems operating in many parts of the country; and Environment Canada stations. More networks are being added all the time.

  • Directly accessible through R with the prism package

Other sources of data that can be accessed through R

  • WorldClim has global climate data from 1970–2000

  • Accessible in R through the geodata package

    • The package is also useful for other spatial ecology work: human footprint index; population density; land cover; soil properties; OpenStreetMap layers (see the sketch below)
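
A minimal sketch of retrieving the WorldClim bioclimatic layers with geodata (assuming geodata is installed):

# Download the 10-arc-minute bioclim rasters to a local path
bio <- geodata::worldclim_global(var = "bio", res = 10, path = tempdir())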

Other sources of data that can be accessed through R

  • Sea Surface Temperature data through Copernicus (also has global atmospheric and land use data)
  • Accessible through the CopernicusMarine package

Exercise

  • Working with long-term databases requires a great deal of care

  • Data cleaning, maintaining “good hygiene”, not spamming APIs with your requests, etc.

  • As practice, work through the rgbif tutorial, which will introduce you to various useful packages for biodiversity informatics in R (direct link https://training.gbif.org/en/data-use/lemur-catta-tutorial)