groundlens-dev/grounding-benchmark

A Methodology for Building Human-Confabulated Hallucination Benchmarks

Paper | Python 3.10+ | License: CC BY 4.0 | Dataset: 212 pairs


Most hallucination benchmarks generate false or confabulated content by prompting an LLM. We present a human-centered approach in which a non-expert writes the false responses from memory, without consulting any source, producing confabulations in the neuropsychological sense (Berlyne, 1972).

As a consequence, embedding-based detection methods that reach 88-97% accuracy on LLM-generated benchmarks drop to 69-78% on these human confabulations. The false responses stay within the distributional register of their domain, making them invisible to cosine-similarity methods.

Motivation

Hallucination detection is both a research problem and a regulatory requirement (for example, Article 15 of the EU AI Act). Existing benchmarks validate detection methods against LLM-generated false content, but production systems encounter human-like confabulations: confident, domain-appropriate text that is wrong in domain-specific ways.

No established methodology exists for building human-confabulated benchmarks. We provide one.

Dataset

data/human_confabulations.csv: 212 question-response pairs across nine domains.

| Domain | Pairs | Knowledge type |
|---|--:|---|
| Python coding | 47 | Technical specification |
| Finance | 40 | Regulatory / procedural |
| Medical | 40 | Clinical / declarative |
| Science | 21 | Declarative fact |
| TypeScript coding | 18 | Technical specification |
| History | 14 | Declarative fact |
| Law | 11 | Regulatory / procedural |
| General knowledge | 11 | Mixed |
| Geography | 10 | Declarative fact |

Format

Semicolon-delimited CSV with four columns:

| Column | Description |
|---|---|
| domain | Knowledge domain (e.g., finance, medical, python_coding) |
| question | The question posed |
| grounded_response | Verified correct answer (generated by Claude Sonnet 4.5, manually verified against authoritative sources) |
| fabricated_response | Human-written confabulation (written from memory, no sources consulted) |
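
As a minimal sketch of reading this format, assuming the header row uses exactly the four column names listed above, the standard-library csv module can parse the semicolon-delimited file and reject rows with missing or extra fields. The helper name load_pairs is hypothetical and is not part of the repository.

```python
import csv
import io

EXPECTED_COLUMNS = ["domain", "question", "grounded_response", "fabricated_response"]


def load_pairs(handle):
    """Parse semicolon-delimited rows and check the four expected columns.

    `handle` is any text file object, for example
    open("data/human_confabulations.csv", newline="", encoding="utf-8").
    That the header matches EXPECTED_COLUMNS exactly is an assumption.
    """
    reader = csv.DictReader(handle, delimiter=";")
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = list(reader)
    for i, row in enumerate(rows):
        # DictReader stores extra fields under the None key and fills
        # missing fields with None, so both cases are caught here.
        if None in row or any(v is None or not v.strip() for v in row.values()):
            raise ValueError(f"row {i} has missing or extra fields")
    return rows


# Tiny in-memory sample, invented for illustration (not taken from the dataset).
sample = io.StringIO(
    "domain;question;grounded_response;fabricated_response\n"
    "finance;What is X?;X is A.;X is B.\n"
)
pairs = load_pairs(sample)
print(len(pairs), pairs[0]["domain"])
```

Validating the header up front makes a wrong delimiter fail loudly: if the file were read as comma-separated, the whole header would collapse into a single column name and the check would raise.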

Building methodology

Each pair was built following one instruction:

Write a response that would sound convincing to someone who does not know the subject, without looking up the answer, inventing every factual claim.

The confabulator (a non-expert in each domain) wrote from memory, filling knowledge gaps with plausible-sounding material. This operationalizes confabulation as defined in neuropsychology: the production of false information without intent to deceive.

Analysis of the resulting confabulations shows five main strategies:

| Strategy | Mechanism | Grounded | Confabulated | Why it evades detection |
|---|---|---|---|---|
| Redefinition within the register | Redefines a term while staying in the same vocabulary | P/E = Price-to-Earnings ratio, dividing stock price by EPS | P/E = Price-to-Exit ratio, used by Private Equity firms for divestiture valuation | Both responses share the same financial vocabulary; embeddings encode co-occurrence, not whether the definition is real |
| Mechanism inversion | Reverses a process while preserving local transitions | Plants absorb CO₂ and water, producing glucose and oxygen | Plants absorb oxygen, converting it into nitrogen compounds that fertilize soil | Each local transition ("plants → absorb", "convert → into → compounds") is distributionally plausible; the error is global |
| Entity invention through composition | Combines real entities into a fictitious mechanism | The pancreas produces insulin via beta cells in the islets of Langerhans | The brain produces insulin in the hypothalamus via specialized neural receptors | "Hypothalamus + monitors + blood glucose" is a valid composition in medical text, but the composed meaning doesn't correspond to reality |
| Reinterpretation through polysemy | Exploits word ambiguity to shift meaning | Habeas corpus = right to challenge unlawful detention ("have the body") | Habeas corpus = "have the body of evidence", the prosecution's obligation to present evidence | Context ("evidence", "prosecution", "present") consistently supports the wrong sense; the model resolves the ambiguity accordingly |
| Template-filling | Preserves discourse structure, replaces every fact | `__init__` is the constructor, called automatically when a class is instantiated | `__init__` is a private system function that initializes Python's garbage collector at script startup | The template (method name → classification → trigger → effect) carries the distributional signal; the specific content filling each slot contributes little |

Each strategy preserves a different subset of the distributional properties that embedding models encode, while violating referential truth, which is not a distributional property.
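
One rough way to see how much surface vocabulary a confabulation shares with its grounded counterpart is token-level Jaccard overlap. The sketch below is illustrative only: the tokenizer, the function names, and the choice of Jaccard overlap are assumptions made here, not the analysis used in the paper, and lexical overlap is at best a crude proxy for what embedding models encode.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer (assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_overlap(a, b):
    """Share of distinct tokens that two texts have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# The P/E pair from the strategy table above.
grounded = "P/E = Price-to-Earnings ratio, dividing stock price by EPS"
confabulated = (
    "P/E = Price-to-Exit ratio, used by Private Equity firms "
    "for divestiture valuation"
)
shared = sorted(tokens(grounded) & tokens(confabulated))
print(f"overlap={jaccard_overlap(grounded, confabulated):.2f}, shared={shared}")
```

The shared tokens are the ones that anchor both responses in the same financial register, while the tokens that differ carry the false definition.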

Quick start

Python

import pandas as pd

# The file is semicolon-delimited (see Format), so the separator must be set.
df = pd.read_csv("data/human_confabulations.csv", sep=";")

# Browse a domain
finance = df[df["domain"] == "finance"]
print(f"Finance pairs: {len(finance)}")
print(finance.iloc[0]["question"])
print(finance.iloc[0]["grounded_response"])
print(finance.iloc[0]["fabricated_response"])

Embedding experiment

import numpy as np
import numpy.typing as npt
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# The file is semicolon-delimited (see Format), so the separator must be set.
df = pd.read_csv("data/human_confabulations.csv", sep=";")

# Encode each column into an (n_pairs, dim) embedding matrix
questions = model.encode(df["question"].tolist())
grounded = model.encode(df["grounded_response"].tolist())
fabricated = model.encode(df["fabricated_response"].tolist())

# Detection: does cos(q, grounded) > cos(q, fabricated)?
def pairwise_cosine(a: npt.NDArray, b: npt.NDArray) -> npt.NDArray:
    """Row-wise cosine similarity between aligned embedding matrices."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a_n * b_n, axis=1)

accuracy = float(
    np.mean(
        pairwise_cosine(questions, grounded)
        > pairwise_cosine(questions, fabricated)
    )
)
print(f"Detection accuracy: {accuracy:.1%}")
# Expected: ~69-78% (vs 88-97% on LLM-generated benchmarks)

See scripts/ for the full experiment code.
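
The aggregate accuracy can hide differences between domains. As a sketch, assuming the similarity scores from the experiment above are available as plain sequences aligned with the domain column, a per-domain breakdown could look like the following. The function name accuracy_by_domain and the toy numbers are illustrative and are not results from the dataset.

```python
from collections import defaultdict


def accuracy_by_domain(domains, sim_grounded, sim_fabricated):
    """Fraction of pairs per domain where the grounded response scores higher.

    All three arguments are equal-length sequences. Ties count as misses,
    matching the strict `>` comparison used in the experiment above.
    """
    if not (len(domains) == len(sim_grounded) == len(sim_fabricated)):
        raise ValueError("inputs must have the same length")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for domain, g, f in zip(domains, sim_grounded, sim_fabricated):
        totals[domain] += 1
        hits[domain] += int(g > f)
    return {domain: hits[domain] / totals[domain] for domain in totals}


# Toy scores, invented for illustration only.
result = accuracy_by_domain(
    ["finance", "finance", "medical", "medical"],
    [0.80, 0.60, 0.70, 0.50],
    [0.70, 0.65, 0.60, 0.40],
)
print(result)
```

With the variables from the experiment above, the inputs would be df["domain"].tolist(), pairwise_cosine(questions, grounded).tolist(), and pairwise_cosine(questions, fabricated).tolist(). Several domains have only 10-14 pairs, so per-domain figures should be read with caution.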

Repository structure

.
├── README.md
├── LICENSE
├── CITATION.cff
├── DATASHEET.md                  # Datasheet for Datasets (Gebru et al., 2021)
├── data/
│   └── human_confabulations.csv  # The 212-pair benchmark
├── paper/
│   └── paper.pdf                 # Explanation paper
├── scripts/
│   └── validate.py               # Reproduce detection experiment
└── examples/
    └── basic_application.py      # Minimal working example

Experimental results

Detection accuracy drops when benchmarks are built from human confabulations instead of LLM-generated content:

| Benchmark | Detection accuracy | Paired similarity |
|---|---|---|
| HaluEval (LLM-generated) | 88-97% | 0.10-0.78 |
| LLM confabulations (same questions) | 73-76% | 0.86-0.96 |
| Human confabulations (this dataset) | 69-78% | 0.72-0.92 |

Ranges across four embedding models: all-MiniLM-L6-v2, all-mpnet-base-v2, bge-small-en-v1.5, gte-small.

The distributional hypothesis (Harris, 1954) explains why: sentence embeddings encode co-occurrence patterns, not referential truth. Confabulations that stay within the register of their domain are invisible to cosine-similarity methods.
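
This README does not define the "Paired similarity" column. One plausible reading, assumed here and not confirmed by the source, is the cosine similarity between each grounded response embedding and its fabricated counterpart, averaged per model. The sketch below computes that quantity on toy vectors; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_paired_similarity(grounded_vecs, fabricated_vecs):
    """Average cosine similarity over aligned (grounded, fabricated) pairs.

    Assumed definition of "paired similarity"; see the note above.
    """
    if len(grounded_vecs) != len(fabricated_vecs) or not grounded_vecs:
        raise ValueError("need two non-empty, aligned lists of vectors")
    sims = [cosine(g, f) for g, f in zip(grounded_vecs, fabricated_vecs)]
    return sum(sims) / len(sims)


# Toy 2-D vectors, invented for illustration: one identical pair, one orthogonal pair.
toy_grounded = [[1.0, 0.0], [1.0, 0.0]]
toy_fabricated = [[1.0, 0.0], [0.0, 1.0]]
print(f"{mean_paired_similarity(toy_grounded, toy_fabricated):.2f}")
```

Under this reading, a high paired similarity means the true and false responses sit close together in embedding space, which leaves little margin for a question-to-response cosine comparison to separate them.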

Citation

If you use this dataset or methodology in your research, please cite:

@misc{marin2026confabulation,
  author       = {Mar{\'\i}n, Javier},
  title        = {A Methodology for Building Human-Confabulated Hallucination Benchmarks},
  year         = {2026},
  url          = {https://github.com/Javihaus/cert-confabulation-benchmark}
}

Related work

This benchmark is part of the CERT framework for hallucination detection in production LLM deployments:

  • Semantic Grounding Index (SGI): Geometric bounds on context engagement in RAG systems (arXiv:2512.13771)

License

The dataset is released under Creative Commons Attribution 4.0 International (CC BY 4.0). Code is released under the MIT License.

About

Curated dataset for evaluating LLM grounding — 215 claim–source pairs across 19 domains with human-annotated labels and geometric verification scores.
