Most hallucination benchmarks generate false or confabulated content by prompting an LLM. We present a human-centered approach in which a non-expert writes the false responses from memory, without consulting any source, producing confabulations in the neuropsychological sense (Berlyne, 1972).
On these human confabulations, detection methods that reach 88--97% accuracy on LLM-generated benchmarks drop to 69--78%. The false responses stay within the distributional register of their domain, making them invisible to cosine-similarity methods.
Hallucination detection is both a research problem and a regulatory requirement (for example, Article 15 of the EU AI Act). Existing benchmarks validate detection methods against LLM-generated false content, but production systems encounter human-like confabulations: confident, domain-appropriate text that is wrong in domain-specific ways.
No established methodology exists for building human-confabulated benchmarks. We provide one.
data/human_confabulations.csv contains 212 question-response pairs across nine domains.

| Domain | Pairs | Knowledge type |
|---|--:|---|
| Python coding | 47 | Technical specification |
| Finance | 40 | Regulatory / procedural |
| Medical | 40 | Clinical / declarative |
| Science | 21 | Declarative fact |
| TypeScript coding | 18 | Technical specification |
| History | 14 | Declarative fact |
| Law | 11 | Regulatory / procedural |
| General knowledge | 11 | Mixed |
| Geography | 10 | Declarative fact |

Semicolon-delimited CSV with four columns:

| Column | Description |
|---|---|
| domain | Knowledge domain (e.g., finance, medical, python_coding) |
| question | The question posed |
| grounded_response | Verified correct answer (generated by Claude Sonnet 4.5, manually verified against authoritative sources) |
| fabricated_response | Human-written confabulation (written from memory, no sources consulted) |

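
The four-column layout described above can be checked at load time. The following is a minimal sketch using only the Python standard library. The helper name `load_pairs` and the one-row in-memory sample are illustrative assumptions, not part of the repository's scripts.

```python
import csv
import io

EXPECTED_COLUMNS = ["domain", "question", "grounded_response", "fabricated_response"]


def load_pairs(handle):
    """Read semicolon-delimited rows and check the four expected columns."""
    reader = csv.DictReader(handle, delimiter=";")
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Hypothetical one-row sample in the same layout; the real file is
# data/human_confabulations.csv.
sample = io.StringIO(
    "domain;question;grounded_response;fabricated_response\n"
    "finance;What does P/E stand for?;Price-to-Earnings ratio;Price-to-Exit ratio\n"
)
pairs = load_pairs(sample)
print(len(pairs), pairs[0]["domain"])
```

For the real file, open it with `open("data/human_confabulations.csv", newline="")` and pass the handle to `load_pairs`; the file encoding is not stated here, so set `encoding=` as needed.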
Each pair was built following one instruction:
Write a response that would sound convincing to someone who does not know the subject, without looking up the answer, inventing every factual claim.
The confabulator (a non-expert in each domain) wrote from memory, filling knowledge gaps with plausible-sounding material. This operationalizes confabulation as defined in neuropsychology: the production of false information without intent to deceive.
Analysis of the resulting confabulations shows five main strategies:
| Strategy | Mechanism | Grounded | Confabulated | Why it evades detection |
|---|---|---|---|---|
| Redefinition within the register | Redefines a term while staying in the same vocabulary | P/E = Price-to-Earnings ratio, dividing stock price by EPS | P/E = Price-to-Exit ratio, used by Private Equity firms for divestiture valuation | Both responses share the same financial vocabulary; embeddings encode co-occurrence, not whether the definition is real |
| Mechanism inversion | Reverses a process while preserving local transitions | Plants absorb CO₂ and water, producing glucose and oxygen | Plants absorb oxygen, converting it into nitrogen compounds that fertilize soil | Each local transition ("plants → absorb", "convert → into → compounds") is distributionally plausible; the error is global |
| Entity invention through composition | Combines real entities into a fictitious mechanism | The pancreas produces insulin via beta cells in the islets of Langerhans | The brain produces insulin in the hypothalamus via specialized neural receptors | "Hypothalamus + monitors + blood glucose" is a valid composition in medical text — the composed meaning doesn't correspond to reality |
| Reinterpretation through polysemy | Exploits word ambiguity to shift meaning | Habeas corpus = right to challenge unlawful detention ("have the body") | Habeas corpus = "have the body of evidence", the prosecution's obligation to present evidence | Context ("evidence", "prosecution", "present") consistently supports the wrong sense; the model resolves the ambiguity accordingly |
| Template-filling | Preserves discourse structure, replaces every fact | `__init__` is the constructor, called automatically when a class is instantiated | `__init__` is a private system function that initializes Python's garbage collector at script startup | The template (method name → classification → trigger → effect) carries the distributional signal; the specific content filling each slot contributes little |

Each strategy preserves a different subset of the distributional properties that embedding models encode, while violating referential truth, which is not a distributional property.
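
To make this concrete, the toy sketch below measures vocabulary overlap for the entity-invention example from the table, using a bag-of-words cosine. This is a deliberately crude stand-in for a sentence embedding, not the method used in the experiments, and the function names are illustrative. It only shows that a score built from word statistics has no access to whether a statement is true.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a crude proxy for distributional content."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


grounded = "The pancreas produces insulin via beta cells in the islets of Langerhans"
confabulated = (
    "The brain produces insulin in the hypothalamus via specialized neural receptors"
)

shared = sorted(bag_of_words(grounded).keys() & bag_of_words(confabulated).keys())
similarity = cosine(bag_of_words(grounded), bag_of_words(confabulated))
print("shared tokens:", shared)
print(f"bag-of-words cosine: {similarity:.2f}")
```

The score depends only on which words the two sentences share, so swapping "pancreas" for "brain" changes it no more than any other single-word substitution would.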
import pandas as pd

# The CSV is semicolon-delimited, so the separator must be given explicitly
df = pd.read_csv("data/human_confabulations.csv", sep=";")

# Browse a domain
finance = df[df["domain"] == "finance"]
print(f"Finance pairs: {len(finance)}")
print(finance.iloc[0]["question"])
print(finance.iloc[0]["grounded_response"])
print(finance.iloc[0]["fabricated_response"])

import numpy as np
import numpy.typing as npt
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# The CSV is semicolon-delimited, so the separator must be given explicitly
df = pd.read_csv("data/human_confabulations.csv", sep=";")

# Embed the questions and both response columns
questions = model.encode(df["question"].tolist())
grounded = model.encode(df["grounded_response"].tolist())
fabricated = model.encode(df["fabricated_response"].tolist())

# Detection: does cos(q, grounded) > cos(q, fabricated)?
def pairwise_cosine(a: npt.NDArray, b: npt.NDArray) -> npt.NDArray:
    """Row-wise cosine similarity between aligned embedding matrices."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a_n * b_n, axis=1)

accuracy = float(
    np.mean(
        pairwise_cosine(questions, grounded)
        > pairwise_cosine(questions, fabricated)
    )
)
print(f"Detection accuracy: {accuracy:.1%}")
# Expected: ~69-78% (vs 88-97% on LLM-generated benchmarks)

See scripts/ for the full experiment code.
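
The same comparison can be broken down by domain. The sketch below assumes the two cosine similarities per pair have already been computed as in the snippet above. The function name `accuracy_by_domain` and the toy numbers are hypothetical and are not results from the dataset.

```python
from collections import defaultdict


def accuracy_by_domain(domains, sim_grounded, sim_fabricated):
    """Fraction of pairs per domain where the grounded response scores higher."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for domain, g, f in zip(domains, sim_grounded, sim_fabricated):
        totals[domain] += 1
        hits[domain] += int(g > f)
    return {d: hits[d] / totals[d] for d in totals}


# Toy inputs for illustration only (not measured values).
domains = ["finance", "finance", "medical", "medical"]
sim_grounded = [0.80, 0.55, 0.70, 0.62]
sim_fabricated = [0.60, 0.65, 0.50, 0.40]

per_domain = accuracy_by_domain(domains, sim_grounded, sim_fabricated)
print(per_domain)
```

With the real data, the inputs would be `df["domain"].tolist()` and the two `pairwise_cosine(...)` arrays converted with `.tolist()`.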
.
├── README.md
├── LICENSE
├── CITATION.cff
├── DATASHEET.md                     # Datasheet for Datasets (Gebru et al., 2021)
├── data/
│   └── human_confabulations.csv     # The 212-pair benchmark
├── paper/
│   └── paper.pdf                    # Explanation paper
├── scripts/
│   └── validate.py                  # Reproduce detection experiment
└── examples/
    └── basic_application.py         # Minimal working example

Detection accuracy drops when benchmarks are built from human confabulations instead of LLM-generated content:
| Benchmark | Detection accuracy | Paired similarity |
|---|---|---|
| HaluEval (LLM-generated) | 88--97% | 0.10--0.78 |
| LLM confabulations (same questions) | 73--76% | 0.86--0.96 |
| Human confabulations (this dataset) | 69--78% | 0.72--0.92 |

Ranges span four embedding models: all-MiniLM-L6-v2, all-mpnet-base-v2, bge-small-en-v1.5, and gte-small.
The distributional hypothesis (Harris, 1954) explains why: sentence embeddings encode co-occurrence patterns, not referential truth. Confabulations that stay within the register of their domain are invisible to cosine-similarity methods.
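
The "Paired similarity" column is not defined in this README. One plausible reading, assumed here, is the mean cosine similarity between each grounded response and its confabulated counterpart, reported as a min-max range across models. The sketch below illustrates that reading with toy 2-D vectors; the model names and vectors are hypothetical, not real model output.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_paired_similarity(grounded_vecs, fabricated_vecs):
    """Mean cosine between aligned grounded and fabricated embeddings."""
    sims = [cosine(g, f) for g, f in zip(grounded_vecs, fabricated_vecs)]
    return sum(sims) / len(sims)


# Toy 2-D "embeddings" per hypothetical model: (grounded, fabricated).
toy_embeddings = {
    "model_a": ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),
    "model_b": ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
}
scores = {
    name: mean_paired_similarity(g, f) for name, (g, f) in toy_embeddings.items()
}
low, high = min(scores.values()), max(scores.values())
print(f"range across models: {low:.2f}--{high:.2f}")
```

Under this reading, a high paired similarity means the two responses sit close together in embedding space, which leaves little margin for a question-to-response cosine comparison to separate them.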
If you use this dataset or methodology in your research, please cite:
@misc{marin2026confabulation,
  author = {Mar{\'\i}n, Javier},
  title  = {A Methodology for Building Human-Confabulated Hallucination Benchmarks},
  year   = {2026},
  url    = {https://github.com/Javihaus/cert-confabulation-benchmark}
}

This benchmark is part of the CERT framework for hallucination detection in production LLM deployments:
- Semantic Grounding Index (SGI): Geometric bounds on context engagement in RAG systems (arXiv:2512.13771)
The dataset is released under Creative Commons Attribution 4.0 International (CC BY 4.0). Code is released under the MIT License.