The life cycle of data, from generation/collection to archival

Activity 2 round robin

Plant Physiology

You may self-archive versions of your work on your own webpages, on institutional webpages, and in other repositories under different conditions and at different times, depending on the version archived. If you want more information about the reuse rights you retain if you publish with us, please visit our Author Self-Archiving Policy page.

Implementation

Paper 1

Supplementary data for enzyme parameters, temperature correction factors, AI segmentation settings, subcellular volume calculations, and metabolite concentrations (link). R Shiny app code used for the Michaelis-Menten kinetic simulations (link)

Paper 2

Some data provided in Supplement; No code or scripts were explicitly linked

Paper 3

Summary data, figures, and accession numbers/sequences in supplementary data; no analysis code or statistical scripts were explicitly linked

Paper 4

Some data provided in Supplement; No code or scripts were explicitly linked

Paper 5

Some data provided in Supplement; No code or scripts were explicitly linked

Journal of Geophysical Research Planets

Journal policy

AGU requires that the underlying data needed to understand, evaluate, and build upon the reported research be available at the time of peer review and publication. For more information and guidance, visit the Data and Software for Authors page.

Implementation

Paper 1

All instrument data are released through the NASA Planetary Data System; derived regions-of-interest spectra used in the parameter analysis are available under the supporting information tab; code availability not applicable

Paper 2

Data (and code?) available upon request

Paper 3

Spectral data and derived elemental maps are archived at the PDS Geosciences Node; analysis outputs (but no code?) are published in the paper

Paper 4

Mastcam-Z Science Calibrated Data Records are archived at the PDS Imaging Node; analysis detailed in the manuscript (but no code?)

Paper 5

CRISM datasets are available through the NASA Planetary Data System; no specific analysis code or scripts are explicitly linked. Methodology is detailed in the manuscript

Science

Journal policy

Code and Data Statements: All data used in the analysis must be available to any researcher for purposes of reproducing or extending the analysis. Data must be available in the paper or deposited in a community special-purpose repository or a general-purpose repository such as Dryad (see Data and Code Deposition). Exceptional circumstances requiring special treatment, such as protection of personal privacy or purchase of datasets from third-party sources, should be discussed with the editor as early as possible (no later than at the manuscript revision stage) and spelled out explicitly in the acknowledgments. Problems in obtaining access to published data are taken seriously by the Science journals and can be reported at science_data@aaas.org.

Analytic methods (code) transparency: In general, all computer code central to the findings being reported should be available to readers to ensure reproducibility. If the software used is commercially available or the source code is already publicly archived, it should be referenced in an appropriately formatted citation (with the version included, if necessary for accurate replication). Author-written source code that is not yet publicly available should be archived in a permanent public repository prior to publication and likewise cited (see Data and Code Deposition). In exceptional cases where, for example, security concerns, legal restrictions, or proprietary hardware preclude sharing of custom code, an alternate means of ensuring reproducibility must be arranged with the editor. Our preferred option in such exceptional cases is for the Materials and methods section to include pseudocode that fully clarifies the underlying algorithms; this pseudocode will be subject to peer review and may require further elaboration in accord with reviewers’ feedback. As with data, the reason(s) for the code restrictions in these special cases should be explained clearly in the acknowledgments.

  • Note that this is the policy across Science journals run by AAAS, but some details may differ by journal (e.g. Science vs. Science Advances)

Implementation

Paper 1

Genomic data at the European Nucleotide Archive (akin to NCBI); genome assemblies on Dryad; CENevo analysis on GitHub; some software available on BitBucket; some code from GitHub archived on Zenodo (but “some files hidden”)

Paper 2

Field measurements, morphodynamic modeling outputs, etc. hosted in an institutional repository (University in Italy). Some data are in the supplement; some data were downloaded from governmental databases in Italy.

Paper 3

Some field and modeling data on US Governmental website (USGS Sciencebase.gov); no mention of code?

Paper 4

Genomic data deposited to NCBI (US Government/NIH-run repo); some data in supplement; code generated for analysis but doesn’t seem to be archived?

Paper 5

Museum specimens, archival numbers mentioned; R code for RevBayes archived in Dryad; 3D fossil models in Morphosource (but unable to find the relevant entries).

Nature

Journal Policy

A condition of publication in a Nature Portfolio journal is that authors are required to make materials, data, code, and associated protocols promptly available to readers without undue qualifications. Any restrictions on the availability of materials or information must be disclosed to the editors at the time of submission. Any restrictions must also be disclosed in the submitted manuscript. Authors of research articles in the life sciences, behavioural & social sciences, and ecology, evolution & environmental sciences are required to provide details about elements of experimental and analytical design that are frequently poorly reported, in a reporting summary that will be made available to editors and reviewers during manuscript assessment. The reporting summary will be published with all accepted manuscripts.

Nature journals undertake peer review of custom code, mathematical algorithms, and software when they are central to the manuscript. For code/algorithm peer review, we require sharing of code/algorithm during the peer review process with editors and reviewers, and inclusion of a Code Availability section in the manuscript upon publication. Journals offer an optional free service to support authors in sharing their code for peer review and publication via Code Ocean.

Implementation

Paper 1

Some data files available on Figshare, some available from authors upon “reasonable” request; some code on Figshare; some code not?

Paper 2

Lots of publicly available data (NDVI, global saltmarsh extent, tide gauge, etc.) from SoilGrids and Coastal Carbon Research Coordination Network repositories; some data in supplement; R and Python code on GitHub

Paper 3

Meta-analysis; meta-dataset and analysis code on Figshare

Paper 4

Genomic data on NCBI; some additional data archived on Zenodo; code archived on Zenodo

Paper 5

Data archived on WSU Research Exchange (but link not loading?)

Frontiers

Journal policy

“Frontiers is a gold open access publisher. At the point of publication, all articles from our portfolio of journals are immediately and permanently accessible online free of charge. Frontiers’ articles, Research Topics, and ebooks are published under the CC-BY license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and the source are credited. Gold open access publishing services, which include end-to-end support – from research integrity and peer review, through to innovations in technology and artificial intelligence as well as global dissemination – are supported through Article Processing Charges (APCs). APCs enable the long-term stability of our program, and facilitate equal opportunity to seek, share and create knowledge that benefits all society without restriction. Quality is maintained through rigorous peer review and support through technological innovations, such as our Artificial Intelligence Review Assistant (AIRA). Open access to the results of publicly funded research is of huge value, offering significant social and economic benefits. The open access publishing model, defined by the Berlin Agreement in 2003, improves the pace, efficiency, and value of research to society. By its very nature, it improves the visibility of authors’ work and enables better scholarly exchange, and therefore, the potential impact of that work. By eliminating the barriers that block the free distribution of knowledge, open access enables scientists to collaborate better, innovate faster, and deliver the solutions we need for healthy lives on a healthy planet.”

  • Note that this is Frontiers’ open-access publication statement; it says nothing about data or code availability.

Implementation

Paper 1

PDF File with observational data; no code made available

Paper 2

Supplemental data file; no code made available

Paper 3

In this paper, the authors provided a statement saying that the data will be available in the near future. No code made available.

Paper 4

Only images in PDF format were made public. Each image contained each figure used in the paper. No code made available

Paper 5

No data was available, but a statement from the authors was published: requests to access the datasets should be directed to [email address]. Code not available.

Fish and Fisheries

Journal policy

Fish and Fisheries expects that data supporting the results in the paper will be archived in an appropriate public repository. Authors are required to provide a data availability statement to describe the availability or the absence of shared data. When data have been shared, authors are required to include in their data availability statement a link to the repository they have used, and to cite the data they have shared. Whenever possible the scripts and other artefacts used to generate the analyses presented in the paper should also be publicly archived. If sharing data compromises ethical standards or legal requirements then authors are not expected to share it.

Implementation

Paper 1

Data archived on Centre for Environment Fisheries and Aquaculture Science

Paper 2

Code and data on GitHub

Paper 3

Code and data on Figshare

Paper 4

Data on Figshare, code on GitHub

Paper 5

Data and code on GitHub

Fisheries Research

Journal policy

To foster transparency, you are required to state the availability of any data at submission. Ensuring data is available may be a requirement of your funding body or institution. If your data is unavailable to access or unsuitable to post, you can state the reason why (e.g., your research data includes sensitive or confidential information such as patient data) during the submission process. This statement will appear with your published article on ScienceDirect.

Implementation

Paper 1

Code and data available as supplemental files to the paper

Paper 2

Supplemental results in Supplement; no code made available?

Paper 3

Code and data available on Dryad

Paper 4

Data will be made available on request (presumably this applies to Code?)

Paper 5

Data will be made available on request (presumably this applies to Code?)

Proceedings B

Journal policy

We require supporting data and information, including source code and other digital research materials, to be made available at the time of submission so that reviewers and Data Editors can assess your work and confirm that the archive is useful and complete. This is in line with our policies to promote greater openness in scientific research and to allow, as well as encourage, other researchers to perform full replications of published studies. For more information please refer to our data sharing policies. In order to make it as easy as possible to comply with this policy, the Proceedings B submission system is fully integrated with the Dryad data repository. We also cover the cost of submitting data to Dryad.

Implementation

Paper 1

Data and code on Dryad

Paper 2

Data and code on Dryad

Paper 3

Data on Dryad; no code found

Paper 4

Data on Figshare; no code found

Paper 5

Data on Dryad; no code found

Integrative and Comparative Biology

Journal policy

Where ethically feasible, Integrative & Comparative Biology encourages authors to make all data and software code on which the conclusions of the paper rely available to readers. Data availability allows verification of results, re- and meta-analysis, and allows other researchers to build on existing results without duplicating effort unnecessarily.

We suggest that summary data are presented in the main manuscript with more detailed information available as additional supporting files (Supplementary Data), or deposited in a public repository whenever possible. For information on general repositories for all data types, please see a list of recommended repositories by subject area.

Where specialised, subject-specific public repositories are available, we encourage authors to deposit data in these facilities and make the availability of data clear in the published paper. Otherwise, authors can upload their datasets as ‘supplementary data’ with their paper for publication. Where neither of these options is feasible, authors are required to make data available upon reasonable request.

Implementation

Paper 1

Data and code archived on Zenodo

Paper 2

Data and code archived on Dryad

Paper 3

Data available in supplement; no code available

Paper 4

Data on Figshare; no code available

Paper 5

Data and code available upon request

Journal of Mammalogy

Journal policy

Where ethically feasible, the Journal strongly encourages authors to make all data and software code on which the conclusions of the paper rely available to readers. Whenever possible, data should be presented in the main manuscript or additional supporting files or deposited in a public repository. Visit OUP’s Research data page for information on general repositories for all data types, and resources for selecting repositories by subject area. When data and software underlying the research article are available in an online source, authors should include a full citation in their reference list. For details of the minimum information to be included in data and software citations see the OUP guidance on citing research data and software.

All DNA sequence data must be submitted to GenBank before the paper can be published. In addition, all alignments (regardless of whether or not they include insertion/deletion events or manual adjustments) must be submitted to GenBank, Dryad, TreeBASE, or included as Supplementary Data to be archived with the published manuscript. Museum catalog numbers for all voucher specimens (including associated tissue) examined must be included in the manuscript (in an Appendix if numerous). Any newly collected raw qualitative or quantitative morphological data must be submitted as either Supplementary Data or deposited to a third-party public repository. Summary statistics, e.g., means, must also include sample sizes, standard deviations, range, etc., to facilitate future analyses.

Implementation

Paper 1

Morphological data in supplement; DNA sequences in GenBank; UCEs on Zenodo; no code available

Paper 2

DNA sequences on GenBank; morphological data not available; code not available

Paper 3

Sequence data on GenBank and Zenodo; no code available

Paper 4

Climate data sourced from PRISM but no statement regarding availability of novel data/code

Paper 5

Data available on MorphoSource; no code available

Journal of Molecular Ecology

Journal policy

Molecular Ecology Resources supports open research and therefore, as a condition of publication, requires that the data supporting the results in the paper be archived in an appropriate public repository. Authors are required to adhere to the guidelines outlined in this viewpoint article when archiving their data.

Upon submission, the journal requires authors to provide all data, metadata and code for review by editors and referees. A data accessibility statement will be required during submission, which clearly outlines where data has been deposited. Authors are encouraged to use Private for Peer Review features, if offered by the repository, for the review process. At acceptance, data must be formally archived and the Data Accessibility Statement completed with permanent links to all data from the manuscript.

Implementation

Paper 1

Code and data on Dryad

Paper 2

Genomic data available on NCBI and some as supplemental; no mention of code in availability statement

Paper 3

Genomic data on NCBI; some data on Zenodo; no mention of code in availability statement

Paper 4

Annotated genome on NCBI; no mention of code in availability statement

Paper 5

Some genomic data on the European Nucleotide Archive; some ASV data on GBIF; code on Zenodo

Summary

State of the field 10 years ago:

We surveyed 100 datasets associated with nonmolecular studies in journals that commonly publish ecological and evolutionary research and have a strong PDA policy. Out of these datasets, 56% were incomplete, and 64% were archived in a way that partially or entirely prevented reuse.

Anyone want to use their methodology for last 10 years?

Principles guiding data archival

“FAIR” Data principles

“FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals”

“This article describes four foundational principles – Findability, Accessibility, Interoperability, and Reusability – that serve to guide data producers and publishers as they navigate around these obstacles, thereby helping to maximize the added-value gained by contemporary, formal scholarly digital publishing.

Importantly, it is our intent that the principles apply not only to ‘data’ in the conventional sense, but also to the algorithms, tools, and workflows that led to that data.”

FAIR guiding principles

To be Findable:

F1. (meta)data are assigned a globally unique and persistent identifier

F2. data are described with rich metadata (defined by R1 below)

F3. metadata clearly and explicitly include the identifier of the data it describes

F4. (meta)data are registered or indexed in a searchable resource

To be Accessible:

A1. (meta)data are retrievable by their identifier using a standardized communications protocol

A1.1 the protocol is open, free, and universally implementable

A1.2 the protocol allows for an authentication and authorization procedure, where necessary

A2. metadata are accessible, even when the data are no longer available

To be Interoperable:

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

I2. (meta)data use vocabularies that follow FAIR principles

I3. (meta)data include qualified references to other (meta)data

To be Reusable:

R1. meta(data) are richly described with a plurality of accurate and relevant attributes

R1.1. (meta)data are released with a clear and accessible data usage license

R1.2. (meta)data are associated with detailed provenance

R1.3. (meta)data meet domain-relevant community standards

These high-level FAIR Guiding Principles precede implementation choices, and do not suggest any specific technology, standard, or implementation-solution; moreover, the Principles are not, themselves, a standard or a specification.

  • Another way to think about it: the outcome isn’t a binary “FAIR” or “not FAIR” archive; rather, archives are more or less “FAIR”

FAIR principles are commonly adapted

  • Many journal policies explicitly point to these

  • NSF’s Data Management Plans encourage defining plans relative to FAIR principles

Findability

To be Findable:

F1. (meta)data are assigned a globally unique and persistent identifier

F2. data are described with rich metadata (defined by R1 below)

F3. metadata clearly and explicitly include the identifier of the data it describes

F4. (meta)data are registered or indexed in a searchable resource

  • What is a “Searchable resource”?
  • How to assign a globally unique and persistent identifier?
  • What is “sufficient” meta-data?

Databases for archiving

From our Activity 2 submissions, we saw that ecologists and evolutionary biologists frequently upload their data (including code) in a variety of places – supplement/appendix to the paper, Dryad, Figshare, Zenodo, institutional repositories, GitHub, etc.

  • Not all of these are equal.
  • Following the FAIR principles, the dataset should be assigned a globally unique and persistent identifier.
    • GitHub and a paper’s supplementary materials don’t achieve this – and they aren’t designed to; e.g., a repository owner can delete a repository at any time. Persistence is not guaranteed.
    • Each dataset should be assigned a DOI (Digital Object Identifier)
    • Each DOI is a unique link which, across the whole of the internet, points only to this one place.
    • DOIs are minted by the archiving repository and are persistent, meaning that anything assigned a DOI is more or less “permanently” available at that address.
    • The DOI system is an ISO standard (ISO 26324) administered by the International DOI Foundation; registration agencies such as DataCite and Crossref register DOIs on behalf of repositories.
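Because every DOI resolves through the global doi.org resolver, the persistent link can be built directly from the identifier. A minimal sketch (the DOI below is a hypothetical example, not a real dataset):

```python
# Hypothetical Zenodo-style DOI, for illustration only.
doi = "10.5281/zenodo.1234567"

# Any DOI resolves through the doi.org resolver, so the
# persistent link is simply the resolver URL plus the DOI.
url = f"https://doi.org/{doi}"
print(url)  # https://doi.org/10.5281/zenodo.1234567
```

This is why a DOI in a data availability statement outlives any particular repository URL scheme: the resolver redirects to wherever the archive currently lives.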

Databases for archiving

  • Within ecology and evolution, Dryad, Figshare, and Zenodo are commonly used archival repositories (for non-sequence data).

    • Institutional repositories, Open Science Framework, etc. are also common
  • For certain types of data (e.g. long sequence reads, individual barcode sequences, protein sequences, or 3D structures of proteins), there are established databases that you should become familiar with.

  • If you use any of these, you’re doing great.

  • If you are worried about which to use, look to your journal to understand the norms in your field.

  • Don’t use GitHub (or GitLab) as an “Archive”

Accessibility

Data, once archived, should be easy to access

  • If you are using one of the archival databases, this isn’t a concern.

Interoperable

  • Humans (and computers) should be able to exchange and interpret each other’s data.
  • Use file formats that are general across all computers, operating systems, etc. and are freely available
    • e.g. use CSV files instead of XLS files for spreadsheets (and especially not PDF files!)
  • Store data in reasonable units – if there are “field standards”, use those; otherwise, have clear metadata
  • For “Big data” in ecology - consider using the Ecological Metadata Language to document your work.
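As a concrete illustration of the CSV recommendation, here is a short sketch using only Python’s standard library; the site names and temperatures are made-up values:

```python
import csv

# Toy field measurements (hypothetical values, for illustration).
rows = [
    {"site": "A1", "date": "2024-06-01", "temp_c": 21.4},
    {"site": "A2", "date": "2024-06-01", "temp_c": 19.8},
]

# Write a plain-text CSV: readable by any language or operating
# system, unlike a proprietary .xls workbook.
with open("measurements.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["site", "date", "temp_c"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back. Note that csv returns strings, and the units
# (degrees C) live in the column name -- document them in your
# metadata rather than burying them in formatting.
with open("measurements.csv", newline="") as f:
    back = list(csv.DictReader(f))
print(back[0]["temp_c"])  # 21.4
```

Putting units in column names (`temp_c`, not `temp`) is a small choice that makes the file self-describing even before any metadata file is read.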

Reusable

Practical reusability

  • For someone to reuse your data, they need to know its provenance
  • i.e., how the data were generated in the first place: who, how, when, why, where, etc.
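One lightweight way to record provenance is a small machine-readable metadata file archived alongside the data. A minimal sketch; the field names and values here are illustrative, not a formal standard like EML or the DataCite metadata schema:

```python
import json

# Hypothetical provenance record answering who/how/when/why/where.
provenance = {
    "title": "Stream temperature survey",
    "creators": ["J. Doe"],
    "collected": {
        "when": "2024-06",
        "where": "Example watershed",
        "how": "temperature loggers, 15-min interval",
    },
    "why": "Baseline for restoration monitoring",
    "license": "CC-BY-4.0",
}

# JSON is plain text, so the record stays readable anywhere.
with open("provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)
```

For larger or community-facing datasets, the same who/how/when/why/where content would go into a standard schema instead of ad hoc keys, so machines can harvest it (FAIR principles F2 and R1.2).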

Updates to FAIR data standards

  • CARE data standards for Indigenous Data Governance
  • Frontiers journals are piloting “FAIR2” data practices to allow datasets to be more easily incorporated into AI pipelines

For this class

  • At the end of the semester, you will create a permanent archive of your semester project replication on Zenodo.