This repository provides a reproducible R-based workflow for quantifying, analysing, and visualising spatial sampling effort derived from GBIF occurrence records at a global scale.
This repository accompanies the following manuscript:
El-Gabbas, A. (2026) A global, taxon-stratified, high-resolution sampling-effort dataset from GBIF for bias-aware ecological modelling. Diversity and Distributions. (accepted)
If you use the code, pre-computed raster products, or derived outputs in your work, please cite the publication above.
This repository delivers two complementary components:
- Pre-computed raster grids of global sampling effort —
summarising species counts and record counts per grid cell. These
are accessible programmatically via the
ecokit::get_sampling_effort()function (see Examples below). These rasters quantify spatial sampling effort across taxonomic groups, spatial grains, and temporal subsets. - A fully reproducible analysis pipeline for regenerating the sampling effort products from raw GBIF data.
Most users will only need to download and analyse the pre-computed rasters rather than re-run the full processing workflow.
| File | Description |
|---|---|
run_bias.R |
Entry-point script for the full analysis pipeline |
run_bias.slurm |
Example SLURM batch script for HPC execution on a high-performance computing (HPC) cluster |
process_effort.R |
Core function that reproduces the paper’s main results (spatial aggregation and bias metrics) |
helper_functions.R |
Supporting utilities for data parsing, aggregation, masking, and plotting |
example_*.qmd |
Worked examples demonstrating data access and visualisation |
The full workflow is computationally intensive — it performs global-scale rasterisation across multiple taxonomic groups, spatial resolutions, and years — and is designed to run on an HPC cluster.
To execute the pipeline using SLURM:
sbatch run_bias.slurmThis will:
- process raw occurrence data,
- aggregate records across spatial resolutions,
- compute sampling effort and bias metrics, and
- generate summary outputs.
Note: The pipeline is provided for full reproducibility. Most users will prefer to work directly with the pre-computed raster products (see below).
The ecokit::get_sampling_effort() function downloads pre-computed
raster products from the associated OSF project. It validates all inputs
and retrieves the corresponding GeoTIFF file(s) into a local directory.
Parameters:
| Parameter | Description |
|---|---|
group |
Taxonomic group (e.g., "aves", "mammalia"). See Table 1 in the manuscript for all available groups. |
descendants |
"all" or a specific descendant clade. See Table 1 in the manuscript for available options per group. |
metric |
Effort metric: "n_sp" (number of species) or "n_obs" (number of observations). |
years |
"total" for all years combined, or specific year(s) in 1980–2025. |
resolution |
Spatial resolution in km (1, 5, 10, or 20). |
out_dir |
Local directory for downloaded files. |
The function validates all inputs (group, descendant, metric, temporal subset, and resolution) and retrieves the corresponding GeoTIFF file(s) from the OSF archive into a local directory.
| Group | group value |
Example descendants |
|---|---|---|
| Amphibia | "amphibia" |
Anura, Caudata, Gymnophiona |
| Arachnida | "arachnida" |
Araneae, Scorpiones, Opiliones, … |
| Aves (birds) | "aves" |
Passeriformes, Accipitriformes, Anseriformes, … |
| Fungi | "fungi" |
Agaricomycetes, Lecanoromycetes, … |
| Insecta | "insecta" |
Coleoptera, Lepidoptera, Hymenoptera, Diptera, … |
| Mammalia | "mammalia" |
Carnivora, Chiroptera, Rodentia, Cetacea, … |
| Mollusca | "mollusca" |
Gastropoda, Bivalvia, Cephalopoda, … |
| Reptilia | "reptilia" |
Squamata, Testudines, Crocodylia, Sphenodontia |
| Tracheophyta (vascular plants) | "tracheophyta" |
Magnoliopsida, Liliopsida, Polypodiopsida, … |
Use descendants = "all" for the full group, or specify a descendant
name (e.g., descendants = "coleoptera"). See Table 1 in the manuscript
for the complete list of descendants per group.
| Example | Description | Group | Metric | Resolution |
|---|---|---|---|---|
| Example 1 | Total recorded bird species | Aves | n_sp |
10 km |
| Example 2 | Total bird observations | Aves | n_obs |
10 km |
| Example 3 | Cumulative recorded species richness (all groups) | All | n_sp |
5 km |
| Example 4 | Cumulative observations (all groups) | All | n_obs |
5 km |
| Example 5 | Bird observations for 2015–2024 | Aves | n_obs |
5 km |
| Example 6 | Descendant-Level Access | Insects | n_obs |
10 km |
| Example 7 | Percentage-based spatial coverage analysis | Aves/All | n_obs |
20 km |
The pre-computed raster products and accompanying code are designed to support:
- Bias quantification — measuring spatial sampling effort bias in GBIF-derived datasets
- Bias-aware modelling — incorporating effort surfaces as covariates in ecological models
- Coverage assessment — identifying spatial and temporal gaps in biodiversity monitoring
- Reproducibility — regenerating or extending the sampling effort products for new taxonomic scopes or time periods
The pre-computed raster products are intended to serve as reusable infrastructure for macroecological, biogeographical, and biodiversity modelling analyses wherever explicit quantification and comparison of spatial sampling effort are needed.
This README focuses on accessing and using the published raster products. A full processing pipeline is also provided to ensure reproducibility; however, re-running it requires substantial computational resources and direct access to raw GBIF occurrence data.
The workflow is therefore intended for two complementary use cases: (i) direct use of the pre-computed sampling-effort rasters, and (ii) methodological extension or full reproduction under comparable computational environments.
Users who wish to extend the workflow — for example by adding taxonomic
groups, updating GBIF downloads, or modifying spatial/temporal
aggregation settings — should consult the Methods section of the
manuscript together with process_effort.R and helper_functions.R.
The pre-computed raster products are provided on a geographic latitude-longitude grid (WGS84; EPSG:4326). As a result, grid-cell area varies with latitude, with cells becoming progressively smaller toward the poles. This has important implications for interpretation: raw values are not directly comparable across latitude as spatial densities without appropriate correction for cell area.
The rasters represent raw counts of (not area-standardised products):
n_obs: number of GBIF occurrences per grid celln_sp: number of recorded species per grid cell
The datasets are intended for relative assessment of sampling effort, bias-aware modelling, and exploratory analysis of spatial structure in sampling intensity. They should not be interpreted as direct measures of biodiversity density unless additional processing is applied.
Interpretation differs between metrics:
- Observation counts (
n_obs) can be converted into area-standardised measures (e.g., observations per km²) and are often useful as proxies for sampling intensity. - Species counts (
n_sp) should not be normalised by area in a mechanistic sense, as species richness does not scale linearly with area at fine spatial resolutions and is strongly influenced by sampling completeness and detectability.
These rasters should therefore not be interpreted as equal-area representations unless transformed or standardised using appropriate geodetic methods.
To support density-based analyses (particularly for observation counts), grid-cell areas can be computed using geodetic calculations on the WGS84 ellipsoid. These areas can then be used to normalise raster values and derive area-standardised sampling intensity. This consists of three steps:
- Load sampling-effort raster and verify coordinate system: The
code below loads the pre-computed raster for bird observations
(
n_obs) at 10 km resolution. Each cell contains the number of GBIF records aggregated within that spatial unit. It is important to confirm that the raster is in a geographic coordinate reference system (WGS84), since area calculations depend on ellipsoidal geometry.
library(terra)
# Load sampling-effort raster
effort_birds <- ecokit::get_sampling_effort(
group = "aves", descendants = "all", metric = "n_obs",
years = "total", resolution = 10, out_dir = "effort_maps")
effort_birds_r <- terra::rast(effort_birds$local_path)sf::st_crs(effort_birds_r)
# Coordinate Reference System:
# User input: WGS 84
# wkt:
# GEOGCRS["WGS 84",
# ENSEMBLE["World Geodetic System 1984 ensemble",
# MEMBER["World Geodetic System 1984 (Transit)"],
# MEMBER["World Geodetic System 1984 (G730)"],
# MEMBER["World Geodetic System 1984 (G873)"],
# MEMBER["World Geodetic System 1984 (G1150)"],
# MEMBER["World Geodetic System 1984 (G1674)"],
# MEMBER["World Geodetic System 1984 (G1762)"],
# MEMBER["World Geodetic System 1984 (G2139)"],
# MEMBER["World Geodetic System 1984 (G2296)"],
# ELLIPSOID["WGS 84",6378137,298.257223563,
# LENGTHUNIT["metre",1]],
# ENSEMBLEACCURACY[2.0]],
# PRIMEM["Greenwich",0,
# ANGLEUNIT["degree",0.0174532925199433]],
# CS[ellipsoidal,2],
# AXIS["geodetic latitude (Lat)",north,
# ORDER[1],
# ANGLEUNIT["degree",0.0174532925199433]],
# AXIS["geodetic longitude (Lon)",east,
# ORDER[2],
# ANGLEUNIT["degree",0.0174532925199433]],
# USAGE[
# SCOPE["Horizontal component of 3D system."],
# AREA["World."],
# BBOX[-90,-180,90,180]],
# ID["EPSG",4326]]- Compute grid-cell surface area: This step calculates the true surface area of each raster grid cell on the ellipsoid, expressed in square kilometres. Because geographic (latitude–longitude) grids are not area-preserving, cell size decreases with increasing latitude. This correction is therefore necessary for any density-based interpretation.
# Compute geodesic grid-cell area (km²) on WGS84 ellipsoid
effort_birds_area <- terra::cellSize(effort_birds_r, unit = "km")- Compute area-standardised observation intensity: Finally, raw observation counts are divided by grid-cell area to obtain an estimate of observation density per square kilometre. This transformation is appropriate for analyses focusing on sampling intensity rather than raw counts.
# Compute observations per km²
effort_birds_stand <- effort_birds_r / effort_birds_area