Option 2: Use R to download from a repository, and proceed with your analysis
The package curl has several functions that can “see” URLs and download relevant files, for example from a figshare ndownloader
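For instance, a minimal sketch (the file ID in the URL below is a hypothetical placeholder, not a real figshare record):
# A sketch: download a file from a figshare ndownloader link and read it in
# (the URL is a hypothetical placeholder; substitute the link you copied)
library(curl)
url <- "https://ndownloader.figshare.com/files/XXXXXXX"
curl_download(url, destfile = "downloaded_data.csv")
dat <- readr::read_csv("downloaded_data.csv")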
To find the ndownloader link, navigate to your favorite paper’s figshare directory and copy the download link.

Option 3: Conduct web-scraping through R
General principle: retrieve html content, and “scrape” it to extract information.
Example: “read” data from https://rvest.tidyverse.org/articles/starwars.html into R
library(tidyverse)
library(rvest)
url <- "https://rvest.tidyverse.org/articles/starwars.html"
# Download the page and select every <section> element (one per film)
html <- read_html(url)
section <- html |> html_elements("section")
# Inspect the first film's section
section[[1]]
## {html_node}
## <section>
## [1] <h2 data-id="1">\nThe Phantom Menace\n</h2>
## [2] <p>\nReleased: 1999-05-19\n</p>
## [3] <p>\nDirector: <span class="director">George Lucas</span>\n</p>
## [4] <div class="crawl">\n<p>\nTurmoil has engulfed the Galactic Republic. The ...
We can now extract data from the html output.
# Film titles live in the <h2> heading of each section
movies <- section |> html_element("h2") |> html_text2()
movies
## [1] "The Phantom Menace" "Attack of the Clones"
## [3] "Revenge of the Sith" "A New Hope"
## [5] "The Empire Strikes Back" "Return of the Jedi"
## [7] "The Force Awakens"
# Directors are tagged with the CSS class "director"
directors <- section |> html_element(".director") |> html_text2()
directors
## [1] "George Lucas" "George Lucas" "George Lucas" "George Lucas"
## [5] "Irvin Kershner" "Richard Marquand" "J. J. Abrams"
# Opening crawls sit in a <div class="crawl"> within each section
crawl <- section |> html_element(".crawl") |> html_text2()
crawl
## [1] "Turmoil has engulfed the Galactic Republic. The taxation of trade routes to outlying star systems is in dispute.\n\nHoping to resolve the matter with a blockade of deadly battleships, the greedy Trade Federation has stopped all shipping to the small planet of Naboo.\n\nWhile the Congress of the Republic endlessly debates this alarming chain of events, the Supreme Chancellor has secretly dispatched two Jedi Knights, the guardians of peace and justice in the galaxy, to settle the conflict…."
## [2] "There is unrest in the Galactic Senate. Several thousand solar systems have declared their intentions to leave the Republic.\n\nThis separatist movement, under the leadership of the mysterious Count Dooku, has made it difficult for the limited number of Jedi Knights to maintain peace and order in the galaxy.\n\nSenator Amidala, the former Queen of Naboo, is returning to the Galactic Senate to vote on the critical issue of creating an ARMY OF THE REPUBLIC to assist the overwhelmed Jedi…."
## [3] "War! The Republic is crumbling under attacks by the ruthless Sith Lord, Count Dooku. There are heroes on both sides. Evil is everywhere.\n\nIn a stunning move, the fiendish droid leader, General Grievous, has swept into the Republic capital and kidnapped Chancellor Palpatine, leader of the Galactic Senate.\n\nAs the Separatist Droid Army attempts to flee the besieged capital with their valuable hostage, two Jedi Knights lead a desperate mission to rescue the captive Chancellor…."
## [4] "It is a period of civil war. Rebel spaceships, striking from a hidden base, have won their first victory against the evil Galactic Empire.\n\nDuring the battle, Rebel spies managed to steal secret plans to the Empire’s ultimate weapon, the DEATH STAR, an armored space station with enough power to destroy an entire planet.\n\nPursued by the Empire’s sinister agents, Princess Leia races home aboard her starship, custodian of the stolen plans that can save her people and restore freedom to the galaxy…."
## [5] "It is a dark time for the Rebellion. Although the Death Star has been destroyed, Imperial troops have driven the Rebel forces from their hidden base and pursued them across the galaxy.\n\nEvading the dreaded Imperial Starfleet, a group of freedom fighters led by Luke Skywalker has established a new secret base on the remote ice world of Hoth.\n\nThe evil lord Darth Vader, obsessed with finding young Skywalker, has dispatched thousands of remote probes into the far reaches of space…."
## [6] "Luke Skywalker has returned to his home planet of Tatooine in an attempt to rescue his friend Han Solo from the clutches of the vile gangster Jabba the Hutt.\n\nLittle does Luke know that the GALACTIC EMPIRE has secretly begun construction on a new armored space station even more powerful than the first dreaded Death Star.\n\nWhen completed, this ultimate weapon will spell certain doom for the small band of rebels struggling to restore freedom to the galaxy…"
## [7] "Luke Skywalker has vanished. In his absence, the sinister FIRST ORDER has risen from the ashes of the Empire and will not rest until Skywalker, the last Jedi, has been destroyed. With the support of the REPUBLIC, General Leia Organa leads a brave RESISTANCE. She is desperate to find her brother Luke and gain his help in restoring peace and justice to the galaxy. Leia has sent her most daring pilot on a secret mission to Jakku, where an old ally has discovered a clue to Luke’s whereabouts…."
Potential limitations of the preceding options: downloaded files and scraped pages are static snapshots, and scraping code is brittle, breaking whenever the page layout changes.
Alternative: Request data from dynamic databases
Application programming interfaces (APIs) are mechanisms for communicating between two computers, typically via an http request.

Accessing US Demographic data from the US Census Bureau
Using R to access an API, with the httr and jsonlite packages.
# If needed, install two useful packages
# pak::pak("httr")
# pak::pak("jsonlite")
library("tidyverse")
library("httr")
library("jsonlite")
# Path to US Census API
path <- 'https://api.census.gov/data/2018/acs/acs5'
# To use APIs, we need a good handle on the expected query parameters
# Here, we are building a query for the B19013_001E variable
# (median household income), asking for data from all counties
# in state ID 55 (Wisconsin)
# State codes are enumerated at
# https://www.census.gov/library/reference/code-lists/ansi/ansi-codes-for-states.html
query_params <- list('get' = 'NAME,B19013_001E',
                     'for' = 'county:*',
                     'in' = 'state:55')
# UNCOMMENT the following lines to run the API search and cache the result
# response <- GET(path, query = query_params)
# saveRDS(response, file = "../10-databases/mdresponse.rds")
response <- readRDS("mdresponse.rds")
# Parse the JSON payload into a tibble
response |>
content(as = 'text') |>
fromJSON() |>
as_tibble()
## # A tibble: 73 × 4
## V1 V2 V3 V4
## <chr> <chr> <chr> <chr>
## 1 NAME B19013_001E state county
## 2 Iron County, Wisconsin 40801 55 051
## 3 Clark County, Wisconsin 51872 55 019
## 4 St. Croix County, Wisconsin 81124 55 109
## 5 Oconto County, Wisconsin 57105 55 083
## 6 Lincoln County, Wisconsin 56086 55 069
## 7 Waupaca County, Wisconsin 57680 55 135
## 8 Barron County, Wisconsin 50903 55 005
## 9 Sawyer County, Wisconsin 44555 55 113
## 10 Green Lake County, Wisconsin 53260 55 047
## # ℹ 63 more rows
# We can now progress with this analysis as we choose.
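Note that the Census API returns the column names as the first row of data (see V1–V4 above). One way to tidy this, sketched here with set_names() from purrr (loaded with the tidyverse):
# A sketch: promote the first row to column names, then drop it
census_tbl <- response |>
  content(as = 'text') |>
  fromJSON() |>
  as_tibble()
census_clean <- census_tbl |>
  set_names(unlist(census_tbl[1, ])) |>
  slice(-1) |>
  mutate(B19013_001E = as.numeric(B19013_001E))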
# NOTE that this code block is not executed.
# Retrieve data for four states (ID 1, 12, 32, 55)
multi_state <-
  tibble(state_ids = sprintf("%02d", c(1, 12, 32, 55))) |>
  mutate(path = "https://api.census.gov/data/2018/acs/acs5",
         query_params = map(state_ids, \(x)
           list('get' = 'NAME,B19013_001E',
                'for' = 'county:*',
                'in' = paste0('state:', x)))) |>
  mutate(response = map2(path, query_params,
                         \(x, y) GET(x, query = y)))
# saveRDS(multi_state, file = "10-databases/multistate-database.rds")
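The pipeline above stops at the raw responses; the parsing step that produces the state_name and B19013_001E columns used below is not shown. A sketch of what it might look like (an assumption, not necessarily the original code):
# Assumed parsing step (a sketch): convert each response into a tibble,
# unnest into one long table, and split NAME into county and state
multi_state <- multi_state |>
  mutate(parsed = map(response, \(r) {
    m <- fromJSON(content(r, as = 'text'))
    colnames(m) <- m[1, ]
    as_tibble(m[-1, , drop = FALSE])
  })) |>
  select(state_ids, parsed) |>
  unnest(parsed) |>
  separate(NAME, into = c('county_name', 'state_name'), sep = ',')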
multi_state |>
count(state_name)
## # A tibble: 4 × 2
## state_name n
## <chr> <int>
## 1 " Alabama" 67
## 2 " Florida" 67
## 3 " Nevada" 17
## 4 " Wisconsin" 72
multi_state |>
  rename(medinc = B19013_001E) |>
  mutate(medinc = as.numeric(medinc)) |>
  ggplot(aes(x = medinc)) +
  geom_density() +
  facet_wrap(~ state_name)
The same pattern works for other APIs. For example, here we query the GBIF occurrence-search endpoint:
path <- 'https://api.gbif.org/v1/occurrence/search'
# Query GBIF occurrence records from Ecuador ('EC') at 500-750 m elevation
query_params <- list('country' = 'EC',
                     'elevation' = '500,750',
                     'acceptedTaxonKey' = '6683') # Melastomataceae; see https://www.gbif.org/species/ to find the taxonKey for your group
# NOTE: Uncomment the following line to run the API search
# response <- GET(path, query = query_params)
# saveRDS(response, file = "melastome-database.rds")
View the output (restricted to twenty records for now)
readRDS(file = "melastome-database.rds") |>
  content(as = 'text') |>
  fromJSON() |>
  pluck("results") |>
  as_tibble() |>
  select(where(is.character)) |>
  # Keep a handful of character columns, selected by position
  select(12, 27, 33, 40, 43) |>
  knitr::kable() |>
  kableExtra::kable_styling(font_size = 10) |>
  kableExtra::scroll_box(width = "1000px", height = "500px")
| acceptedScientificName | recordedBy | publishedByGbifRegion | municipality | language |
|---|---|---|---|---|
| Melastomataceae | T. L. P. Couvreur | LATIN_AMERICA | Desconocido | es |
| Melastomataceae | Thomas B. Croat\|Geneviève Ferry\|David Scherberich\|Claudia L. Henríquez R. | NORTH_AMERICA | NA | NA |
| Melastomataceae | Carmen Ulloa\|J. M. Carrión\|M. Bustamante | NORTH_AMERICA | NA | NA |
| Melastomataceae | Carmen Ulloa\|J. M. Carrión\|M. Bustamante | NORTH_AMERICA | NA | NA |
| Melastomataceae | Carmen Ulloa\|J. M. Carrión\|M. Bustamante | NORTH_AMERICA | NA | NA |
| Melastomataceae | Carmen Ulloa\|J. M. Carrión\|M. Bustamante | NORTH_AMERICA | NA | NA |
| Melastomataceae | L.Torres | LATIN_AMERICA | Ibarra | ES |
| Melastomataceae | J. Homeier | LATIN_AMERICA | Desconocido | es |
| Melastomataceae | Walter A. Palacios | NORTH_AMERICA | NA | NA |
| Melastomataceae | Walter A. Palacios | NORTH_AMERICA | NA | NA |
| Melastomataceae | W. Palacios | LATIN_AMERICA | NA | es |
| Melastomataceae | Abel Wisum | NORTH_AMERICA | NA | NA |
| Melastomataceae | Abel Wisum | NORTH_AMERICA | NA | NA |
| Melastomataceae | Camilo Kajekai\|Abel Wisum | NORTH_AMERICA | NA | NA |
| Melastomataceae | Camilo Kajekai\|Abel Wisum | NORTH_AMERICA | NA | NA |
| Melastomataceae | Camilo Kajekai\|Abel Wisum | NORTH_AMERICA | NA | NA |
| Melastomataceae | Camilo Kajekai\|Abel Wisum | NORTH_AMERICA | NA | NA |
| Melastomataceae | Camilo Kajekai\|Abel Wisum | NORTH_AMERICA | NA | NA |
| Melastomataceae | Camilo Kajekai\|Abel Wisum | NORTH_AMERICA | NA | NA |
| Melastomataceae | Camilo Kajekai\|Abel Wisum | NORTH_AMERICA | NA | NA |
Note: some APIs require you to obtain an API key before making requests (e.g. for downloading records from GBIF).

You don’t always have to “hand-code” API queries
In some cases, others have built packages to facilitate API searches, e.g. tidycensus for accessing the US Census API, and rgbif for the GBIF (Global Biodiversity Information Facility) API. Browse a large list of packages for accessing eco-evo-related APIs through R here.
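For example, the GBIF query above can be written with rgbif (a sketch, assuming the package is installed):
# The same GBIF search via rgbif, without hand-building the http request
library(rgbif)
res <- occ_search(taxonKey = 6683, country = "EC", limit = 20)
res$data # a tibble of occurrence records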
Each day, PRISM collects data from over 25,000 precipitation and temperature stations. These include all of NOAA’s major networks; those operated by the US Forest Service, Bureau of Reclamation, and USDA; the large CoCoRaHS precipitation network; state and regional systems operating in many parts of the country; and Environment Canada stations. More networks are being added all the time.
Accessible in R with the prism package.

Worldclim has global climate data, from 1970–2000

Accessible in R through the geodata package.
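For example, a minimal sketch with geodata (check the available variables and resolutions in the package documentation):
# Download 10-arc-minute WorldClim average-temperature rasters (1970-2000)
library(geodata)
tavg <- worldclim_global(var = "tavg", res = 10, path = tempdir())
tavg # a SpatRaster with one layer per month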
Copernicus marine data are accessible in R through the copernicusMarine package.

Working with long-term databases requires a great deal of care
Data cleaning, maintaining “good hygiene”, not spamming APIs with your requests, etc.
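The examples above already show one such habit: caching responses with saveRDS() so each query hits the API only once. Another is pausing between successive calls, sketched below (query_list is a hypothetical list of query-parameter sets like those built earlier):
# A sketch: rate-limit a batch of API requests
responses <- map(query_list, \(q) {
  Sys.sleep(1) # pause one second between requests
  GET(path, query = q)
})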
As practice, work through the rgbif tutorial, which will introduce you to various useful packages for biodiversity informatics in R (direct link: https://training.gbif.org/en/data-use/lemur-catta-tutorial).