File system and project management

Reproducible research in the news

Direct link

Pre-class checkins:

  • Software Carpentry exercise on using the Bash Shell: link. (Complete Lessons 1–3).

  • GitLab account and repository setup

    • Your GitLab account should have a publicly viewable repository for your in-class notes.
    • This repository should contain an .Rproj file, along with a .qmd file for taking notes in class.

Further reading on today’s material

  • Chapter 6 of R for Data Science, available here.

  • Communicate Data With R chapter on RProjects, available here

File paths

We start with the basics: understanding how files are stored on your machine.

Why start here?

Most analysis scripts begin with something like:

my_data <- read.csv("~/Desktop/gklab/research/psf-kadumane/data/biomass.csv")
  • If your script starts this way, it is unlikely to run seamlessly on any other computer!

Understanding file management

  • Start thinking in terms of nested hierarchies

Alternative visualization

# install.packages("fs") # if needed
library(fs)
dir_tree(path = "~", recurse = 0)
~
├── 3D Objects
├── AppData
├── Application Data
├── Contacts
├── Cookies
├── Desktop
├── Documents
├── Downloads
├── Favorites
├── Links
├── Local Settings
├── Music
├── My Documents
├── NetHood
├── NTUSER.DAT
├── ntuser.dat.LOG1
├── ntuser.dat.LOG2
├── NTUSER.DAT{53b39e88-18c4-11ea-a811-000d3aa4692b}.TM.blf
├── NTUSER.DAT{53b39e88-18c4-11ea-a811-000d3aa4692b}.TMContainer00000000000000000001.regtrans-ms
├── NTUSER.DAT{53b39e88-18c4-11ea-a811-000d3aa4692b}.TMContainer00000000000000000002.regtrans-ms
├── ntuser.ini
├── OneDrive
├── Pictures
├── PrintHood
├── Recent
├── Saved Games
├── Searches
├── SendTo
├── Start Menu
├── Sti_Trace.log
├── Templates
└── Videos

Gaurav’s setup

fs::dir_tree("~/Desktop/gklab", 0)
~/Desktop/gklab
├── admin
├── gaurav
├── gauravsk.gitlab.io
├── gklab.org
├── grants
├── lab-handbook
├── lab-notebooks
├── lab-resources
├── letters
├── mentees
├── quartoutils
├── research
├── talks
├── teaching
└── todo

Gaurav’s setup

fs::dir_tree("~/Desktop/gklab/research", 0)
~/Desktop/gklab/research
├── archive
├── coffee-psmi
├── dragnet
├── ebird-mtn
├── fastplants
├── fire-pheno
├── grants
├── pmi-globalchange
├── primer-soilmicrobes
├── psf-delay
├── psf-kadumane
├── qcb-survey
├── reproducing-analyses
├── sculptors
├── sculptors-psf
├── sculptors_with_julia
├── sentiment-survey
├── trait-psf-metaanalysis
└── tropical-phenology

Gaurav’s setup

fs::dir_tree("~/Desktop/gklab/research", 1)
~/Desktop/gklab/research
├── archive
│   ├── ajb-specialissue
│   ├── ajb-synthesis
│   ├── caring
│   ├── DemographicReviewPSF
│   ├── ecoevoapps
│   ├── ecoevoapps-survey-analysis
│   ├── revised-Krishnadas et al-Intro-special issue_am.docx
│   └── spsf-3-wrapups.txt
├── coffee-psmi
│   ├── coffee-psmi.Rproj
│   └── data
├── dragnet
│   ├── aboretum.jpeg
│   ├── archive
│   ├── code
│   ├── data
│   ├── datasheets
│   ├── design
│   ├── dragnet.Rproj
│   ├── site-setup
│   └── _quarto.yml
├── ebird-mtn
│   ├── ebd-datafile-SAMPLE
│   ├── ebd-datafile-SAMPLE.zip
│   ├── ebd_norbob_201901_201912_relFeb-2020
│   ├── ebd_norbob_201901_201912_relFeb-2020.zip
│   ├── ebird-mtn.Rproj
│   ├── Gmba_Inventory_v2.0_Selection_Tool_20200330.xlsx
│   ├── GMBA_Inventory_v2.0_standard_300
│   ├── GMBA_Inventory_v2.0_standard_300.zip
│   ├── renv
│   ├── renv.lock
│   ├── surv_haz.docx
│   └── where-are-mountains.R
├── fastplants
│   ├── code
│   ├── code-outputs
│   ├── data
│   ├── fastplants.Rproj
│   ├── gen1-labels.csv
│   ├── gen2-pot_maps.csv
│   ├── gen2-pot_tags-arranged.csv
│   ├── gen2-pot_tags.csv
│   ├── literature
│   └── Rplots.pdf
├── fire-pheno
│   └── explore_plants-gsk.pdf
├── grants
│   └── lagniappe-proposal
├── pmi-globalchange
│   ├── code
│   ├── data
│   ├── literature
│   ├── manuscript
│   ├── pmi-globalchange.Rproj
│   └── README.md
├── primer-soilmicrobes
│   ├── 09302024-primer-proposal.pdf
│   ├── admin-files
│   ├── Community_assembly_diagram.png
│   ├── glossary.html
│   ├── glossary.qmd
│   ├── manuscript.docx
│   ├── manuscript.pdf
│   ├── manuscript.qmd
│   ├── Microbes_mediate_community_assembly.png
│   ├── microbes_mediate_env_filters.png
│   ├── primer-soilmicrobes.Rproj
│   ├── references.bib
│   ├── versions-for-feedback
│   └── _quarto.yml
├── psf-delay
│   ├── acfe_gk.png
│   ├── acpl_gk.png
│   ├── acpl_imm_gk.png
│   ├── analysis.R
│   ├── archive
│   ├── data
│   ├── fepl_gk.png
│   ├── fepl_leg_gk.png
│   ├── figs
│   ├── figs-for-talk.R
│   ├── PSFeta.Rproj
│   └── README.md
├── psf-kadumane
│   ├── alpha_plot_with_significance_bars.png
│   ├── combined_IGR_outcomes_gk_plot.png
│   ├── data
│   ├── diagnostic_plots_Litsea_floribunda_Conspecific.png
│   ├── diagnostic_plots_Litsea_floribunda_Heterospecific.png
│   ├── diagnostic_plots_Litsea_floribunda_NA.png
│   ├── diagnostic_plots_Symplocos_racemosa_Conspecific.png
│   ├── diagnostic_plots_Symplocos_racemosa_Heterospecific.png
│   ├── diagnostic_plots_Symplocos_racemosa_NA.png
│   ├── esa-data-exploration.R
│   ├── experimental-setup
│   ├── figures
│   ├── plots_models_coexistence.Rmd
│   ├── plots_models_competition.Rmd
│   ├── psf-kadumane.Rproj
│   ├── readme.md
│   ├── scripts
│   ├── supp_field_ref.Rmd
│   ├── verify-plots_models_coexistence.pdf
│   ├── verify-plots_models_coexistence.qmd
│   └── verify-plot_models_competition.R
├── qcb-survey
│   ├── data
│   ├── data-summary.R
│   ├── densitygrams.pdf
│   ├── explore-s25-data.R
│   ├── for-siarm.r
│   ├── qcb-survey-questions.docx
│   ├── qcb-survey.html
│   ├── qcb-survey.pdf
│   ├── qcb-survey.qmd
│   ├── qcb-survey.Rproj
│   ├── responsdens.png
│   ├── results-7dec24-genderdiff.png
│   ├── results-7dec24.png
│   └── _quarto.yml
├── reproducing-analyses
│   ├── box2-rescaling.pdf
│   ├── crawford-2020
│   ├── DGRP_16S_MaleLH-main
│   ├── jiang-2024
│   ├── ke-levine-2021
│   ├── Krishnadas-etal-fig1.png
│   ├── lau-2012
│   ├── microbwater.png
│   ├── one-off
│   ├── psf-time
│   ├── reproducing-analyses.Rproj
│   ├── sedg-drought.png
│   ├── sedg.R
│   ├── sedgwick-CWM-demography
│   └── wateronly.png
├── sculptors
│   ├── admin
│   ├── pau-geb13747-sup-0001-supinfo.zip
│   ├── plot.pdf
│   ├── plot.png
│   ├── README.md
│   ├── refs
│   ├── Rplot.png
│   ├── Rplot01.png
│   ├── sculptors.Rproj
│   ├── simulation
│   ├── site selection
│   ├── writing
│   └── _quarto.yml
├── sculptors-psf
│   ├── experiment-setup
│   └── sculptors-psf.Rproj
├── sculptors_with_julia
│   ├── figs
│   ├── initial-exploration
│   ├── julia-code
│   ├── sculptors_with_julia.Rproj
│   ├── simulation-papers
│   └── writing
├── sentiment-survey
│   └── qualtrics-survey-updated.pdf
├── trait-psf-metaanalysis
│   ├── admin-files
│   ├── carmona-data
│   ├── code
│   ├── data
│   ├── figures
│   ├── manuscript
│   ├── models
│   ├── predict_hfa.csv
│   ├── stats
│   ├── trait-psf-metaAnalysis.Rproj
│   └── _quarto.yml
└── tropical-phenology
    ├── bib.library
    ├── biotropica_lab_project.Rproj
    ├── Biotropica_phenology.Rproj
    ├── code
    ├── data
    ├── inat.html
    ├── Observations · iNaturalist.html
    ├── README.md
    ├── shapefiles
    ├── sp.list.R
    └── urbanness.R

Exercise

Consider the following directory structure:

Home/
├── coursework
│   ├── course1
│   ├── course2
│   └── course3
└── research
    ├── project1
    ├── project2
    └── project3

You are currently working within the directory ~/research/project1, but you realize it is time to review your notes for Course 2. What command can you use to navigate into the course2 directory?
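One possible answer, sketched in a throwaway copy of the tree so that nothing in your real home directory is touched. The `scratch` variable and the `mktemp` location are assumptions of this sketch, not part of the exercise.

```shell
# Build a scratch copy of the exercise tree (the mktemp location is an assumption of this sketch).
scratch="$(mktemp -d)"
for n in 1 2 3; do
  mkdir -p "$scratch/Home/coursework/course$n" "$scratch/Home/research/project$n"
done

# Start where the exercise says we are.
cd "$scratch/Home/research/project1"

# Relative path: up to research, up again to Home, then down into coursework/course2.
cd ../../coursework/course2
pwd
```

On the real tree, where Home is your home directory, the absolute form `cd ~/coursework/course2` reaches the same place from anywhere.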

Exercise

Consider the following directory structure:

Home/
├── coursework
│   ├── course1
│   ├── course2
│   └── course3
└── research
    ├── project1
    ├── project2
    └── project3

You are currently working within the project1 directory, but your collaborator asks about a file related to your Project 3. Without changing directories, what command could you run to list all the files in the project3 directory?
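One possible answer, again in a scratch copy of the tree. The file `field-notes.txt` is a hypothetical stand-in so that the listing has something to show.

```shell
# Build a scratch copy of the exercise tree (the mktemp location is an assumption of this sketch).
scratch="$(mktemp -d)"
for n in 1 2 3; do
  mkdir -p "$scratch/Home/coursework/course$n" "$scratch/Home/research/project$n"
done
# Hypothetical file so that project3 is not empty.
touch "$scratch/Home/research/project3/field-notes.txt"

cd "$scratch/Home/research/project1"

# .. is the parent directory (research), so ../project3 is the sibling of project1.
ls ../project3

# We never left project1.
pwd
```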

Exercise

Consider the following directory structure:

Home/
├── coursework
│   ├── course1
│   ├── course2
│   └── course3
└── research
    ├── project1
    ├── project2
    └── project3

You had written a script “cool-analysis.R” for your research Project 2 which you now realize is relevant for the Fall 2025 Reproducible Research course. Assuming you are currently in the Home directory, what command could you use to copy cool-analysis.R from the project2 directory into the f25-repro-res directory?
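One possible answer, under a loudly stated assumption: the f25-repro-res directory does not appear in the tree above, so this sketch places it under coursework. If yours lives elsewhere, only the destination path changes.

```shell
# Build a scratch copy of the exercise tree (the mktemp location is an assumption of this sketch).
scratch="$(mktemp -d)"
for n in 1 2 3; do
  mkdir -p "$scratch/Home/coursework/course$n" "$scratch/Home/research/project$n"
done
# ASSUMPTION: f25-repro-res is not shown in the tree; this sketch puts it under coursework.
mkdir -p "$scratch/Home/coursework/f25-repro-res"
touch "$scratch/Home/research/project2/cool-analysis.R"

cd "$scratch/Home"

# cp SOURCE DESTINATION: both paths are written relative to Home.
# The original file stays in project2; a copy lands in f25-repro-res.
cp research/project2/cool-analysis.R coursework/f25-repro-res/
ls coursework/f25-repro-res
```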

Exercise

Consider the following directory structure:

Home/
├── coursework
│   ├── course1
│   ├── course2
│   └── course3
└── research
    ├── project1
    ├── project2
    └── project3

You got back 252 fastq sequence files from the first big sequencing run of your dissertation. Congrats! In your excitement, you downloaded and stored these under project3 even though they have to do with Project 1. From within the project1 directory, how could you move all 252 of these files over from project3?
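One possible answer, using a wildcard so that all 252 files move with a single command. The `sample_N.fastq` names are hypothetical stand-ins, and the sketch assumes the files end in `.fastq`; compressed files (e.g. `.fastq.gz`) would need a different pattern.

```shell
# Build a scratch copy of the exercise tree (the mktemp location is an assumption of this sketch).
scratch="$(mktemp -d)"
for n in 1 2 3; do
  mkdir -p "$scratch/Home/coursework/course$n" "$scratch/Home/research/project$n"
done
# Create 252 empty stand-in files in the wrong place (project3); the names are hypothetical.
i=1
while [ "$i" -le 252 ]; do
  touch "$scratch/Home/research/project3/sample_$i.fastq"
  i=$((i + 1))
done

cd "$scratch/Home/research/project1"

# *.fastq matches every file ending in .fastq; the final . means "the current directory".
mv ../project3/*.fastq .

# Count what arrived in project1.
ls | wc -l
```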

Why are we starting the semester with file paths?

  • Many analysis scripts begin with something like this:
my_data <- read.csv("~/Desktop/gklab/research/psf-kadumane/data/biomass.csv")
  • This line will not run on your computer without first editing the path!

  • In a reproducible analysis context, if you can’t read in the dataset, you will probably give up on the reproduction.

Better project management within RStudio

  • RStudio’s Projects feature helps maintain workflows, especially as you accumulate many parallel projects.

  • Inside an RStudio Project, you should only ever use relative paths, i.e. paths written relative to the project’s home directory.

    • This means that you can simply share an RStudio Project directory with collaborators (or across your own computers) and rerun code without having to worry about paths.
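A minimal shell sketch of why relative paths travel well. The `alice` and `bob` directories are hypothetical stand-ins for two different computers; the project and file names are taken from the `read.csv()` example earlier.

```shell
scratch="$(mktemp -d)"

# "alice" and "bob" are hypothetical stand-ins for two different machines.
mkdir -p "$scratch/alice/psf-kadumane/data" "$scratch/bob"
echo "plot,biomass" > "$scratch/alice/psf-kadumane/data/biomass.csv"

# Share the whole project directory, then remove the original to mimic a separate computer.
cp -r "$scratch/alice/psf-kadumane" "$scratch/bob/"
rm -r "$scratch/alice"

cd "$scratch/bob/psf-kadumane"

# The relative path still resolves, because it is anchored at the project directory.
cat data/biomass.csv

# The absolute path was tied to the original location and no longer exists.
cat "$scratch/alice/psf-kadumane/data/biomass.csv" 2>/dev/null || echo "absolute path not found"
```

Inside an RStudio Project the same idea would read `read.csv("data/biomass.csv")`, since opening the project sets the working directory to the project's home directory.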

Better project management with RStudio

  • Live coding example
    • Creating new projects
    • Anatomy of a project: .Rproj file, home directory, etc.
    • Switching between projects

Other aspects of project management

  • Internal organization
  • File names
  • README files

Internal organization

  • For any given project, conduct a “project audit” to think through the types of files that this work might entail

    • e.g. for even a “simple” plant ecology project, you might have text files for project brainstorming/literature reviews, spatial data files for locating field plots, scripts for planning experimental/sampling design, flat data files for trait measurements, sequence files, analysis scripts, figures, tables, text files for manuscripts, presentation files, and likely more.

    • Your project’s internal structure should be set up to accommodate the complexity of that particular project.

More on this on Thursday

File names

  • As the previous slide hinted at…
    there will be a lot of files!

  • File naming as a strategy to manage the chaos

  • Principles for good file names:

    • Human readable
    • Machine readable
    • Play well with default ordering

File names: Human readable

JW7d^(2sl@*.csv

✔️plot1 fire temps 03 Feb 2025.csv

File names: Machine readable

plot 1 fire temps. csv

✔️plot1-fire-temps-3Feb2025.csv
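A small sketch of what "machine readable" buys you at the command line, using the two file names from this slide in a scratch directory (the `mktemp` location is an assumption of this sketch).

```shell
cd "$(mktemp -d)"
touch "plot 1 fire temps. csv" plot1-fire-temps-3Feb2025.csv

# Unquoted, the shell splits the first name at every space into five separate arguments,
# none of which is an existing file, so ls fails.
ls plot 1 fire temps. csv 2>/dev/null || echo "not found without quotes"

# The hyphenated name needs no quoting.
ls plot1-fire-temps-3Feb2025.csv

# The hyphens also act as field separators: extract the first field (the plot).
echo plot1-fire-temps-3Feb2025.csv | cut -d- -f1
```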

File names: Play well with default ordering

plot1-fire-temps-3Feb2025.csv

✔️ 2025-02-03-plot1-fire-temps.csv
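To see the ordering difference, the sketch below lists three files under each naming scheme. Only 3 Feb 2025 comes from the slide; the other two dates and the `date-last`/`date-first` directory names are hypothetical.

```shell
cd "$(mktemp -d)"
mkdir date-last date-first

# Same three measurement dates under both schemes (two of the dates are hypothetical).
touch date-last/plot1-fire-temps-20Dec2024.csv \
      date-last/plot1-fire-temps-12Jan2025.csv \
      date-last/plot1-fire-temps-3Feb2025.csv
touch date-first/2024-12-20-plot1-fire-temps.csv \
      date-first/2025-01-12-plot1-fire-temps.csv \
      date-first/2025-02-03-plot1-fire-temps.csv

# LC_ALL=C pins the sort to plain byte order, so the listing is the same on every machine.
# Names are compared character by character, so 12Jan2025 lands before 20Dec2024.
LC_ALL=C ls date-last

# With year-month-day first, alphabetical order is also chronological order.
LC_ALL=C ls date-first
```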

Whole-class exercise

  • At this point, you should each have a functional “remote” git repository (i.e. on GitLab) for your in-class notes, connected to a directory that lives on your “local” drive (i.e. your laptop).

  • This repository should be structured as an RStudio Project, i.e. it should contain an .Rproj file.

  • It should have a human- and machine-readable name that helps orient any reader to its general contents (e.g. gkandlikar-reprores-notes).

  • Within it, files should have human- and machine-readable names, e.g. week2-notes.qmd.

  • Exercise: Use the bash commands you have learned so far to (re)structure your git repository to be maximally functional.

  • If you need to rename the git repository from your original name, you can follow the steps in this video
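A minimal sketch of the kind of cleanup the exercise above asks for, run in a scratch directory. The starting state (a notes file called `Week 2 notes.qmd`) is hypothetical; your own repository will differ.

```shell
cd "$(mktemp -d)"
mkdir gkandlikar-reprores-notes
cd gkandlikar-reprores-notes

# Hypothetical starting state: a notes file with spaces in its name, next to the .Rproj file.
touch gkandlikar-reprores-notes.Rproj "Week 2 notes.qmd"

# Rename to the human- and machine-readable form; the old name must be quoted because of its spaces.
mv "Week 2 notes.qmd" week2-notes.qmd
ls
```

Inside an actual git repository, `git mv` performs the same rename and stages it for the next commit.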