A Methodology for Building Human-Confabulated Hallucination Benchmarks

Most hallucination benchmarks generate their false or confabulated content by prompting an LLM. We present a human-centered approach in which a non-expert writes the false responses from memory, without consulting any source, producing confabulations (in the neuropsychological sense; Berlyne, 1972).

As a result, detection methods that reach 88--97% accuracy on LLM-generated benchmarks drop to 69--78% on these human confabulations. The false responses stay within the distributional register of their domain, which makes them invisible to cosine-similarity methods.

Motivation

Hallucination detection is both a research problem and a regulatory requirement (for example, Article 15 of the EU AI Act). Existing benchmarks validate detection methods against LLM-generated false content, but production systems encounter human-like confabulations: confident, domain-appropriate text that is wrong in domain-specific ways.

No specific methodology existed for building human-confabulated benchmarks. We provide one.

Dataset

data/human_confabulations.csv contains 212 question--response pairs across nine domains.

| Domain | Pairs | Knowledge type |
|---|--:|---|
| Python coding | 47 | Technical specification |
| Finance | 40 | Regulatory / procedural |
| Medical | 40 | Clinical / declarative |
| Science | 21 | Declarative fact |
| TypeScript coding | 18 | Technical specification |
| History | 14 | Declarative fact |
| Law | 11 | Regulatory / procedural |
| General knowledge | 11 | Mixed |
| Geography | 10 | Declarative fact |

Format

Semicolon-delimited CSV with four columns:

Column	Description
`domain`	Knowledge domain (e.g., `finance`, `medical`, `python_coding`)
`question`	The question posed
`grounded_response`	Verified correct answer (generated by Claude Sonnet 4.5, manually verified against authoritative sources)
`fabricated_response`	Human-written confabulation (written from memory, no sources consulted)
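
The format above can be checked before running any experiment. The following is a minimal sketch using only the standard library; `load_pairs` and `EXPECTED_COLUMNS` are hypothetical names introduced here, the two sample rows are toy rows written for illustration (not copied from the dataset), and the check assumes the header uses exactly the four documented column names.

```python
import csv
import io
from collections import Counter
from typing import Dict, List, TextIO

# Column names as documented in the Format table
EXPECTED_COLUMNS = ["domain", "question", "grounded_response", "fabricated_response"]


def load_pairs(stream: TextIO) -> List[Dict[str, str]]:
    """Parse semicolon-delimited rows and check the four documented columns."""
    reader = csv.DictReader(stream, delimiter=";")
    if reader.fieldnames is None or set(reader.fieldnames) != set(EXPECTED_COLUMNS):
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = list(reader)
    for i, row in enumerate(rows):
        # DictReader stores extra cells under the key None and fills
        # missing cells with the value None
        if None in row or any(v is None or not v.strip() for v in row.values()):
            raise ValueError(f"row {i} has a missing, empty, or extra field")
    return rows


# Toy input standing in for data/human_confabulations.csv
sample = io.StringIO(
    "domain;question;grounded_response;fabricated_response\n"
    "finance;What is P/E?;Price-to-Earnings ratio;Price-to-Exit ratio\n"
    "medical;Where is insulin made?;In the pancreas;In the hypothalamus\n"
)
pairs = load_pairs(sample)
domain_counts = Counter(row["domain"] for row in pairs)
print(domain_counts)
```

On the real file, passing `open("data/human_confabulations.csv", newline="", encoding="utf-8")` as the stream should work the same way, assuming the file is UTF-8 encoded; the resulting domain counts can then be compared against the table in the Dataset section.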

Building methodology

Each pair was built following one instruction:

Write a response that would sound convincing to someone who does not know the subject, without looking up the answer, inventing every factual claim.

The confabulator (a non-expert in each domain) wrote from memory, filling knowledge gaps with plausible-sounding material. This operationalizes confabulation as defined in neuropsychology: the production of false information without intent to deceive.

Analysis of the resulting confabulations shows five main strategies:

Strategy	Mechanism	Grounded	Confabulated	Why it evades detection
Redefinition within the register	Redefines a term while staying in the same vocabulary	P/E = Price-to-Earnings ratio, dividing stock price by EPS	P/E = Price-to-Exit ratio, used by Private Equity firms for divestiture valuation	Both responses share the same financial vocabulary; embeddings encode co-occurrence, not whether the definition is real
Mechanism inversion	Reverses a process while preserving local transitions	Plants absorb CO₂ and water, producing glucose and oxygen	Plants absorb oxygen, converting it into nitrogen compounds that fertilize soil	Each local transition ("plants → absorb", "convert → into → compounds") is distributionally plausible; the error is global
Entity invention through composition	Combines real entities into a fictitious mechanism	The pancreas produces insulin via beta cells in the islets of Langerhans	The brain produces insulin in the hypothalamus via specialized neural receptors	"Hypothalamus + monitors + blood glucose" is a valid composition in medical text — the composed meaning doesn't correspond to reality
Reinterpretation through polysemy	Exploits word ambiguity to shift meaning	Habeas corpus = right to challenge unlawful detention ("have the body")	Habeas corpus = "have the body of evidence", the prosecution's obligation to present evidence	Context ("evidence", "prosecution", "present") consistently supports the wrong sense; the model resolves the ambiguity accordingly
Template-filling	Preserves discourse structure, replaces every fact	`__init__` is the constructor, called automatically when a class is instantiated	`__init__` is a private system function that initializes Python's garbage collector at script startup	The template (method name → classification → trigger → effect) carries the distributional signal; the specific content filling each slot contributes little

Each strategy preserves a different subset of the distributional properties that embedding models encode while violating referential truth, which is not a distributional property.

Quick start

Python

import pandas as pd

# The file is semicolon-delimited (see Format), so pass sep explicitly
df = pd.read_csv("data/human_confabulations.csv", sep=";")

# Browse a domain
finance = df[df["domain"] == "finance"]
print(f"Finance pairs: {len(finance)}")
print(finance.iloc[0]["question"])
print(finance.iloc[0]["grounded_response"])
print(finance.iloc[0]["fabricated_response"])

Embedding experiment

import numpy as np
import numpy.typing as npt
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# The file is semicolon-delimited (see Format), so pass sep explicitly
df = pd.read_csv("data/human_confabulations.csv", sep=";")

# Embed each column; row i of every matrix belongs to the same pair
questions = model.encode(df["question"].tolist())
grounded = model.encode(df["grounded_response"].tolist())
fabricated = model.encode(df["fabricated_response"].tolist())

# Detection: does cos(q, grounded) > cos(q, fabricated)?
def pairwise_cosine(
    a: npt.NDArray,
    b: npt.NDArray,
) -> npt.NDArray:
    """Row-wise cosine similarity between aligned embedding matrices."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a_n * b_n, axis=1)

accuracy = float(
    np.mean(
        pairwise_cosine(questions, grounded)
        > pairwise_cosine(questions, fabricated)
    )
)
print(f"Detection accuracy: {accuracy:.1%}")
# Expected: ~69-78% (vs 88-97% on LLM-generated benchmarks)

See scripts/ for the full experiment code.
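
Because the dataset spans nine domains, it can be useful to split the single accuracy figure by domain. The sketch below is one way to do that in plain Python; `accuracy_by_domain` is a hypothetical helper introduced here, not part of the repository, and the toy inputs are made up for illustration rather than taken from any experiment.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def accuracy_by_domain(
    domains: Iterable[str],
    correct: Iterable[bool],
) -> Dict[str, float]:
    """Fraction of correctly detected pairs within each domain."""
    # tallies[domain] = [number of hits, number of pairs]
    tallies: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    for domain, ok in zip(domains, correct):
        tallies[domain][0] += int(ok)
        tallies[domain][1] += 1
    return {domain: hits / total for domain, (hits, total) in tallies.items()}


# Toy inputs (illustrative only)
toy_domains = ["finance", "finance", "medical", "medical", "medical"]
toy_correct = [True, False, True, True, False]
toy_result = accuracy_by_domain(toy_domains, toy_correct)
print(toy_result)

# With the arrays from the embedding experiment above, the call would be:
#   correct = pairwise_cosine(questions, grounded) > pairwise_cosine(questions, fabricated)
#   accuracy_by_domain(df["domain"], correct)
```

Small domains such as Geography (10 pairs) will give coarse estimates, so per-domain numbers are best read as indicative.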

Repository structure

.
├── README.md
├── LICENSE
├── CITATION.cff
├── DATASHEET.md                  # Datasheet for Datasets (Gebru et al., 2021)
├── data/
│   └── human_confabulations.csv   # The 212-pair benchmark
├── paper/
│   └── paper.pdf                 # Explanation paper
├── scripts/
│   └── validate.py               # Reproduce detection experiment
└── examples/
    └── basic_application.py      # Minimal working example

Experimental Results

Detection accuracy drops when benchmarks are built from human confabulations instead of LLM-generated content:

Benchmark	Detection accuracy	Paired similarity
HaluEval (LLM-generated)	88--97%	0.10--0.78
LLM confabulations (same questions)	73--76%	0.86--0.96
Human confabulations (this dataset)	69--78%	0.72--0.92

Ranges span four embedding models: all-MiniLM-L6-v2, all-mpnet-base-v2, bge-small-en-v1.5, gte-small.

The distributional hypothesis (Harris, 1954) explains why: sentence embeddings encode co-occurrence patterns, not referential truth. Confabulations that stay within the register of their domain are invisible to cosine-similarity methods.
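
The relationship between the two table columns can be illustrated with a toy example. This sketch assumes that "paired similarity" means the cosine similarity between the grounded and the fabricated response of the same pair, which this README does not define explicitly. The three-dimensional vectors are hand-picked for illustration and do not come from any embedding model; `cosine`, `detected`, and `paired_similarity` are hypothetical names.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def detected(q: Sequence[float], g: Sequence[float], f: Sequence[float]) -> bool:
    """Detection rule from the experiment: cos(q, grounded) > cos(q, fabricated)."""
    return cosine(q, g) > cosine(q, f)


def paired_similarity(g: Sequence[float], f: Sequence[float]) -> float:
    """Assumed definition: cosine between grounded and fabricated responses."""
    return cosine(g, f)


# Hand-picked toy vectors, not real embeddings
question = [1.0, 0.0, 0.0]
grounded_vec = [0.8, 0.6, 0.0]
off_register = [0.0, 0.0, 1.0]   # fabrication far from the domain register
in_register = [0.85, 0.5, 0.1]   # fabrication that stays in the register

for name, fab in [("off-register", off_register), ("in-register", in_register)]:
    print(
        name,
        "detected:", detected(question, grounded_vec, fab),
        "paired similarity:", round(paired_similarity(grounded_vec, fab), 3),
    )
```

In this toy setup the off-register fabrication is easy to separate from the grounded answer, while the in-register one sits so close to it that the comparison against the question can go either way. That is consistent with the pattern in the table, where higher paired similarity accompanies lower detection accuracy.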

Citation

If you use this dataset or methodology in your research, please cite:

@misc{marin2026confabulation,
  author       = {Mar{\'\i}n, Javier},
  title        = {A Methodology for Building Human-Confabulated Hallucination Benchmarks},
  year         = {2026},
  url          = {https://github.com/Javihaus/cert-confabulation-benchmark}
}

Related work

This benchmark is part of the CERT framework for hallucination detection in production LLM deployments:

Semantic Grounding Index (SGI): Geometric bounds on context engagement in RAG systems (arXiv:2512.13771)

License

The dataset is released under Creative Commons Attribution 4.0 International (CC BY 4.0). Code is released under the MIT License.
