Global Sampling Effort Dataset from GBIF

This repository provides a reproducible R-based workflow for quantifying, analysing, and visualising spatial sampling effort derived from GBIF occurrence records at a global scale.

Citation

This repository accompanies the following manuscript:

El-Gabbas, A. (2026) A global, taxon-stratified, high-resolution sampling-effort dataset from GBIF for bias-aware ecological modelling. Diversity and Distributions. (accepted)

If you use the code, pre-computed raster products, or derived outputs in your work, please cite the publication above.

What this repository provides

This repository delivers two complementary components:

Pre-computed raster grids of global sampling effort — summarising species counts and record counts per grid cell. These are accessible programmatically via the ecokit::get_sampling_effort() function (see Examples below). These rasters quantify spatial sampling effort across taxonomic groups, spatial grains, and temporal subsets.
A fully reproducible analysis pipeline for regenerating the sampling effort products from raw GBIF data.

Most users will only need to download and analyse the pre-computed rasters rather than re-run the full processing workflow.

Repository structure

File	Description
`run_bias.R`	Entry-point script for the full analysis pipeline
`run_bias.slurm`	Example SLURM batch script for HPC execution on a high-performance computing (HPC) cluster
`process_effort.R`	Core function that reproduces the paper’s main results (spatial aggregation and bias metrics)
`helper_functions.R`	Supporting utilities for data parsing, aggregation, masking, and plotting
`example_*.qmd`	Worked examples demonstrating data access and visualisation

Running the full analysis pipeline (HPC)

The full workflow is computationally intensive — it performs global-scale rasterisation across multiple taxonomic groups, spatial resolutions, and years — and is designed to run on an HPC cluster.

To execute the pipeline using SLURM:

sbatch run_bias.slurm

This will:

process raw occurrence data,
aggregate records across spatial resolutions,
compute sampling effort and bias metrics, and
generate summary outputs.

Note: The pipeline is provided for full reproducibility. Most users will prefer to work directly with the pre-computed raster products (see below).

Accessing pre-computed sampling effort data

The ecokit::get_sampling_effort() function downloads pre-computed raster products from the associated OSF project. It validates all inputs and retrieves the corresponding GeoTIFF file(s) into a local directory.

Parameters:

Parameter	Description
`group`	Taxonomic group (e.g., `"aves"`, `"mammalia"`). See Table 1 in the manuscript for all available groups.
`descendants`	`"all"` or a specific descendant clade. See Table 1 in the manuscript for available options per group.
`metric`	Effort metric: `"n_sp"` (number of species) or `"n_obs"` (number of observations).
`years`	`"total"` for all years combined, or specific year(s) in 1980–2025.
`resolution`	Spatial resolution in km (`1`, `5`, `10`, or `20`).
`out_dir`	Local directory for downloaded files.

The function validates all inputs (group, descendant, metric, temporal subset, and resolution) and retrieves the corresponding GeoTIFF file(s) from the OSF archive into a local directory.

Available taxonomic groups

Group	`group` value	Example descendants
Amphibia	`"amphibia"`	Anura, Caudata, Gymnophiona
Arachnida	`"arachnida"`	Araneae, Scorpiones, Opiliones, …
Aves (birds)	`"aves"`	Passeriformes, Accipitriformes, Anseriformes, …
Fungi	`"fungi"`	Agaricomycetes, Lecanoromycetes, …
Insecta	`"insecta"`	Coleoptera, Lepidoptera, Hymenoptera, Diptera, …
Mammalia	`"mammalia"`	Carnivora, Chiroptera, Rodentia, Cetacea, …
Mollusca	`"mollusca"`	Gastropoda, Bivalvia, Cephalopoda, …
Reptilia	`"reptilia"`	Squamata, Testudines, Crocodylia, Sphenodontia
Tracheophyta (vascular plants)	`"tracheophyta"`	Magnoliopsida, Liliopsida, Polypodiopsida, …

Use descendants = "all" for the full group, or specify a descendant name (e.g., descendants = "coleoptera"). See Table 1 in the manuscript for the complete list of descendants per group.

Examples

Example	Description	Group	Metric	Resolution
Example 1	Total recorded bird species	Aves	`n_sp`	10 km
Example 2	Total bird observations	Aves	`n_obs`	10 km
Example 3	Cumulative recorded species richness (all groups)	All	`n_sp`	5 km
Example 4	Cumulative observations (all groups)	All	`n_obs`	5 km
Example 5	Bird observations for 2015–2024	Aves	`n_obs`	5 km
Example 6	Descendant-Level Access	Insects	`n_obs`	10 km
Example 7	Percentage-based spatial coverage analysis	Aves/All	`n_obs`	20 km

Intended use

The pre-computed raster products and accompanying code are designed to support:

Bias quantification — measuring spatial sampling effort bias in GBIF-derived datasets
Bias-aware modelling — incorporating effort surfaces as covariates in ecological models
Coverage assessment — identifying spatial and temporal gaps in biodiversity monitoring
Reproducibility — regenerating or extending the sampling effort products for new taxonomic scopes or time periods

The pre-computed raster products are intended to serve as reusable infrastructure for macroecological, biogeographical, and biodiversity modelling analyses wherever explicit quantification and comparison of spatial sampling effort are needed.

Scope and limitations

This README focuses on accessing and using the published raster products. A full processing pipeline is also provided to ensure reproducibility; however, re-running it requires substantial computational resources and direct access to raw GBIF occurrence data.

The workflow is therefore intended for two complementary use cases: (i) direct use of the pre-computed sampling-effort rasters, and (ii) methodological extension or full reproduction under comparable computational environments.

Users who wish to extend the workflow — for example by adding taxonomic groups, updating GBIF downloads, or modifying spatial/temporal aggregation settings — should consult the Methods section of the manuscript together with process_effort.R and helper_functions.R.

Raster interpretation

The pre-computed raster products are provided on a geographic latitude-longitude grid (WGS84; EPSG:4326). As a result, grid-cell area varies with latitude, with cells becoming progressively smaller toward the poles. This has important implications for interpretation: raw values are not directly comparable across latitude as spatial densities without appropriate correction for cell area.

The rasters represent raw counts of (not area-standardised products):

n_obs: number of GBIF occurrences per grid cell
n_sp: number of recorded species per grid cell

The datasets are intended for relative assessment of sampling effort, bias-aware modelling, and exploratory analysis of spatial structure in sampling intensity. They should not be interpreted as direct measures of biodiversity density unless additional processing is applied.

Interpretation differs between metrics:

Observation counts (n_obs) can be converted into area-standardised measures (e.g., observations per km²) and are often useful as proxies for sampling intensity.
Species counts (n_sp) should not be normalised by area in a mechanistic sense, as species richness does not scale linearly with area at fine spatial resolutions and is strongly influenced by sampling completeness and detectability.

These rasters should therefore not be interpreted as equal-area representations unless transformed or standardised using appropriate geodetic methods.

Computing grid-cell area and area-standardised metrics

To support density-based analyses (particularly for observation counts), grid-cell areas can be computed using geodetic calculations on the WGS84 ellipsoid. These areas can then be used to normalise raster values and derive area-standardised sampling intensity. This consists of three steps:

Load sampling-effort raster and verify coordinate system: The code below loads the pre-computed raster for bird observations (n_obs) at 10 km resolution. Each cell contains the number of GBIF records aggregated within that spatial unit. It is important to confirm that the raster is in a geographic coordinate reference system (WGS84), since area calculations depend on ellipsoidal geometry.

library(terra)

# Load sampling-effort raster
effort_birds <- ecokit::get_sampling_effort( 
  group = "aves", descendants = "all", metric = "n_obs", 
  years = "total", resolution = 10, out_dir = "effort_maps")
effort_birds_r <- terra::rast(effort_birds$local_path)

sf::st_crs(effort_birds_r)
# Coordinate Reference System:
#   User input: WGS 84 
#   wkt:
# GEOGCRS["WGS 84",
#     ENSEMBLE["World Geodetic System 1984 ensemble",
#         MEMBER["World Geodetic System 1984 (Transit)"],
#         MEMBER["World Geodetic System 1984 (G730)"],
#         MEMBER["World Geodetic System 1984 (G873)"],
#         MEMBER["World Geodetic System 1984 (G1150)"],
#         MEMBER["World Geodetic System 1984 (G1674)"],
#         MEMBER["World Geodetic System 1984 (G1762)"],
#         MEMBER["World Geodetic System 1984 (G2139)"],
#         MEMBER["World Geodetic System 1984 (G2296)"],
#         ELLIPSOID["WGS 84",6378137,298.257223563,
#             LENGTHUNIT["metre",1]],
#         ENSEMBLEACCURACY[2.0]],
#     PRIMEM["Greenwich",0,
#         ANGLEUNIT["degree",0.0174532925199433]],
#     CS[ellipsoidal,2],
#         AXIS["geodetic latitude (Lat)",north,
#             ORDER[1],
#             ANGLEUNIT["degree",0.0174532925199433]],
#         AXIS["geodetic longitude (Lon)",east,
#             ORDER[2],
#             ANGLEUNIT["degree",0.0174532925199433]],
#     USAGE[
#         SCOPE["Horizontal component of 3D system."],
#         AREA["World."],
#         BBOX[-90,-180,90,180]],
#     ID["EPSG",4326]]

Compute grid-cell surface area: This step calculates the true surface area of each raster grid cell on the ellipsoid, expressed in square kilometres. Because geographic (latitude–longitude) grids are not area-preserving, cell size decreases with increasing latitude. This correction is therefore necessary for any density-based interpretation.

# Compute geodesic grid-cell area (km²) on WGS84 ellipsoid
effort_birds_area <- terra::cellSize(effort_birds_r, unit = "km")

Compute area-standardised observation intensity: Finally, raw observation counts are divided by grid-cell area to obtain an estimate of observation density per square kilometre. This transformation is appropriate for analyses focusing on sampling intensity rather than raw counts.

# Compute observations per km²
effort_birds_stand <- effort_birds_r / effort_birds_area

— Last update: 2026-05-06

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
figures		figures
.gitignore		.gitignore
.lintr		.lintr
README.md		README.md
README.qmd		README.qmd
example_1.md		example_1.md
example_1.qmd		example_1.qmd
example_2.md		example_2.md
example_2.qmd		example_2.qmd
example_3.md		example_3.md
example_3.qmd		example_3.qmd
example_4.md		example_4.md
example_4.qmd		example_4.qmd
example_5.md		example_5.md
example_5.qmd		example_5.qmd
example_6.md		example_6.md
example_6.qmd		example_6.qmd
example_7.md		example_7.md
example_7.qmd		example_7.qmd
global_sampling_efforts.Rproj		global_sampling_efforts.Rproj
global_sampling_efforts.code-workspace		global_sampling_efforts.code-workspace
helper_functions.R		helper_functions.R
process_efforts.R		process_efforts.R
run_bias.R		run_bias.R
run_bias.slurm		run_bias.slurm

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Global Sampling Effort Dataset from GBIF

Citation

What this repository provides

Repository structure

Running the full analysis pipeline (HPC)

Accessing pre-computed sampling effort data

Available taxonomic groups

Examples

Intended use

Scope and limitations

Raster interpretation

Computing grid-cell area and area-standardised metrics

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Global Sampling Effort Dataset from GBIF

Citation

What this repository provides

Repository structure

Running the full analysis pipeline (HPC)

Accessing pre-computed sampling effort data

Available taxonomic groups

Examples

Intended use

Scope and limitations

Raster interpretation

Computing grid-cell area and area-standardised metrics

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages