Skip to content

elgabbas/global_sampling_efforts

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

29 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Global Sampling Effort Dataset from GBIF

This repository provides a reproducible R-based workflow for quantifying, analysing, and visualising spatial sampling effort derived from GBIF occurrence records at a global scale.


Citation

This repository accompanies the following manuscript:

El-Gabbas, A. (2026) A global, taxon-stratified, high-resolution sampling-effort dataset from GBIF for bias-aware ecological modelling. Diversity and Distributions. (accepted)

If you use the code, pre-computed raster products, or derived outputs in your work, please cite the publication above.


What this repository provides

This repository delivers two complementary components:

  1. Pre-computed raster grids of global sampling effort — summarising species counts and record counts per grid cell. These are accessible programmatically via the ecokit::get_sampling_effort() function (see Examples below). These rasters quantify spatial sampling effort across taxonomic groups, spatial grains, and temporal subsets.
  2. A fully reproducible analysis pipeline for regenerating the sampling effort products from raw GBIF data.

Most users will only need to download and analyse the pre-computed rasters rather than re-run the full processing workflow.


Repository structure

File Description
run_bias.R Entry-point script for the full analysis pipeline
run_bias.slurm Example SLURM batch script for HPC execution on a high-performance computing (HPC) cluster
process_effort.R Core function that reproduces the paper’s main results (spatial aggregation and bias metrics)
helper_functions.R Supporting utilities for data parsing, aggregation, masking, and plotting
example_*.qmd Worked examples demonstrating data access and visualisation

Running the full analysis pipeline (HPC)

The full workflow is computationally intensive — it performs global-scale rasterisation across multiple taxonomic groups, spatial resolutions, and years — and is designed to run on an HPC cluster.

To execute the pipeline using SLURM:

sbatch run_bias.slurm

This will:

  • process raw occurrence data,
  • aggregate records across spatial resolutions,
  • compute sampling effort and bias metrics, and
  • generate summary outputs.

Note: The pipeline is provided for full reproducibility. Most users will prefer to work directly with the pre-computed raster products (see below).


Accessing pre-computed sampling effort data

The ecokit::get_sampling_effort() function downloads pre-computed raster products from the associated OSF project. It validates all inputs and retrieves the corresponding GeoTIFF file(s) into a local directory.

Parameters:

Parameter Description
group Taxonomic group (e.g., "aves", "mammalia"). See Table 1 in the manuscript for all available groups.
descendants "all" or a specific descendant clade. See Table 1 in the manuscript for available options per group.
metric Effort metric: "n_sp" (number of species) or "n_obs" (number of observations).
years "total" for all years combined, or specific year(s) in 1980–2025.
resolution Spatial resolution in km (1, 5, 10, or 20).
out_dir Local directory for downloaded files.

The function validates all inputs (group, descendant, metric, temporal subset, and resolution) and retrieves the corresponding GeoTIFF file(s) from the OSF archive into a local directory.

Available taxonomic groups

Group group value Example descendants
Amphibia "amphibia" Anura, Caudata, Gymnophiona
Arachnida "arachnida" Araneae, Scorpiones, Opiliones, …
Aves (birds) "aves" Passeriformes, Accipitriformes, Anseriformes, …
Fungi "fungi" Agaricomycetes, Lecanoromycetes, …
Insecta "insecta" Coleoptera, Lepidoptera, Hymenoptera, Diptera, …
Mammalia "mammalia" Carnivora, Chiroptera, Rodentia, Cetacea, …
Mollusca "mollusca" Gastropoda, Bivalvia, Cephalopoda, …
Reptilia "reptilia" Squamata, Testudines, Crocodylia, Sphenodontia
Tracheophyta (vascular plants) "tracheophyta" Magnoliopsida, Liliopsida, Polypodiopsida, …

Use descendants = "all" for the full group, or specify a descendant name (e.g., descendants = "coleoptera"). See Table 1 in the manuscript for the complete list of descendants per group.

Examples

Example Description Group Metric Resolution
Example 1 Total recorded bird species Aves n_sp 10 km
Example 2 Total bird observations Aves n_obs 10 km
Example 3 Cumulative recorded species richness (all groups) All n_sp 5 km
Example 4 Cumulative observations (all groups) All n_obs 5 km
Example 5 Bird observations for 2015–2024 Aves n_obs 5 km
Example 6 Descendant-Level Access Insects n_obs 10 km
Example 7 Percentage-based spatial coverage analysis Aves/All n_obs 20 km

Intended use

The pre-computed raster products and accompanying code are designed to support:

  • Bias quantification — measuring spatial sampling effort bias in GBIF-derived datasets
  • Bias-aware modelling — incorporating effort surfaces as covariates in ecological models
  • Coverage assessment — identifying spatial and temporal gaps in biodiversity monitoring
  • Reproducibility — regenerating or extending the sampling effort products for new taxonomic scopes or time periods

The pre-computed raster products are intended to serve as reusable infrastructure for macroecological, biogeographical, and biodiversity modelling analyses wherever explicit quantification and comparison of spatial sampling effort are needed.


Scope and limitations

This README focuses on accessing and using the published raster products. A full processing pipeline is also provided to ensure reproducibility; however, re-running it requires substantial computational resources and direct access to raw GBIF occurrence data.

The workflow is therefore intended for two complementary use cases: (i) direct use of the pre-computed sampling-effort rasters, and (ii) methodological extension or full reproduction under comparable computational environments.

Users who wish to extend the workflow — for example by adding taxonomic groups, updating GBIF downloads, or modifying spatial/temporal aggregation settings — should consult the Methods section of the manuscript together with process_effort.R and helper_functions.R.

Raster interpretation

The pre-computed raster products are provided on a geographic latitude-longitude grid (WGS84; EPSG:4326). As a result, grid-cell area varies with latitude, with cells becoming progressively smaller toward the poles. This has important implications for interpretation: raw values are not directly comparable across latitude as spatial densities without appropriate correction for cell area.

The rasters represent raw counts of (not area-standardised products):

  • n_obs: number of GBIF occurrences per grid cell
  • n_sp: number of recorded species per grid cell

The datasets are intended for relative assessment of sampling effort, bias-aware modelling, and exploratory analysis of spatial structure in sampling intensity. They should not be interpreted as direct measures of biodiversity density unless additional processing is applied.

Interpretation differs between metrics:

  • Observation counts (n_obs) can be converted into area-standardised measures (e.g., observations per km²) and are often useful as proxies for sampling intensity.
  • Species counts (n_sp) should not be normalised by area in a mechanistic sense, as species richness does not scale linearly with area at fine spatial resolutions and is strongly influenced by sampling completeness and detectability.

These rasters should therefore not be interpreted as equal-area representations unless transformed or standardised using appropriate geodetic methods.

Computing grid-cell area and area-standardised metrics

To support density-based analyses (particularly for observation counts), grid-cell areas can be computed using geodetic calculations on the WGS84 ellipsoid. These areas can then be used to normalise raster values and derive area-standardised sampling intensity. This consists of three steps:

  1. Load sampling-effort raster and verify coordinate system: The code below loads the pre-computed raster for bird observations (n_obs) at 10 km resolution. Each cell contains the number of GBIF records aggregated within that spatial unit. It is important to confirm that the raster is in a geographic coordinate reference system (WGS84), since area calculations depend on ellipsoidal geometry.
library(terra)

# Load sampling-effort raster
effort_birds <- ecokit::get_sampling_effort( 
  group = "aves", descendants = "all", metric = "n_obs", 
  years = "total", resolution = 10, out_dir = "effort_maps")
effort_birds_r <- terra::rast(effort_birds$local_path)
sf::st_crs(effort_birds_r)
# Coordinate Reference System:
#   User input: WGS 84 
#   wkt:
# GEOGCRS["WGS 84",
#     ENSEMBLE["World Geodetic System 1984 ensemble",
#         MEMBER["World Geodetic System 1984 (Transit)"],
#         MEMBER["World Geodetic System 1984 (G730)"],
#         MEMBER["World Geodetic System 1984 (G873)"],
#         MEMBER["World Geodetic System 1984 (G1150)"],
#         MEMBER["World Geodetic System 1984 (G1674)"],
#         MEMBER["World Geodetic System 1984 (G1762)"],
#         MEMBER["World Geodetic System 1984 (G2139)"],
#         MEMBER["World Geodetic System 1984 (G2296)"],
#         ELLIPSOID["WGS 84",6378137,298.257223563,
#             LENGTHUNIT["metre",1]],
#         ENSEMBLEACCURACY[2.0]],
#     PRIMEM["Greenwich",0,
#         ANGLEUNIT["degree",0.0174532925199433]],
#     CS[ellipsoidal,2],
#         AXIS["geodetic latitude (Lat)",north,
#             ORDER[1],
#             ANGLEUNIT["degree",0.0174532925199433]],
#         AXIS["geodetic longitude (Lon)",east,
#             ORDER[2],
#             ANGLEUNIT["degree",0.0174532925199433]],
#     USAGE[
#         SCOPE["Horizontal component of 3D system."],
#         AREA["World."],
#         BBOX[-90,-180,90,180]],
#     ID["EPSG",4326]]
  1. Compute grid-cell surface area: This step calculates the true surface area of each raster grid cell on the ellipsoid, expressed in square kilometres. Because geographic (latitude–longitude) grids are not area-preserving, cell size decreases with increasing latitude. This correction is therefore necessary for any density-based interpretation.
# Compute geodesic grid-cell area (km²) on WGS84 ellipsoid
effort_birds_area <- terra::cellSize(effort_birds_r, unit = "km")
  1. Compute area-standardised observation intensity: Finally, raw observation counts are divided by grid-cell area to obtain an estimate of observation density per square kilometre. This transformation is appropriate for analyses focusing on sampling intensity rather than raw counts.
# Compute observations per km²
effort_birds_stand <- effort_birds_r / effort_birds_area

Last update: 2026-05-06

About

A global, taxon-stratified, high-resolution sampling-effort dataset from GBIF for bias-aware ecological modelling — reproducible R-based workflow for generating global sampling-effort rasters from GBIF occurrence records across spatial, temporal, and taxonomic scales. For more details, see El-Gabbas (2026), Diversity and distributions (accepted).

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors