OrgMemBench

A benchmark for long-horizon organizational memory in AI agents.

Paper · Data · Reproduce · Leaderboard · Add your system

Can an AI agent answer what an organization decided, by whom, when, and why, across many channels, many authors, and years of accumulated history? OrgMemBench measures exactly that. It is a synthetic, multi-author, multi-channel, bi-temporal corpus whose ground truth is projected deterministically from a known source-of-truth graph, not read back out of free text.

Evaluating four publicly available memory systems under one uniform harness, we find organizational-scale memory far from solved: the strongest system reaches only ~0.43 (mean rubric score on the medium tier; ~0.39 on small), the field clusters between 0.14 and 0.43, and every system collapses on the temporal, contradiction, and justification-chain reasoning that defines organizational memory.

Leaderboard

Mean rubric-weighted score by tier (higher is better, range [0,1]), over the four publicly available memory systems, each run end-to-end under one fixed harness with an identical judge. Ordered by medium-tier mean; gbrain is best in both columns.

System	Small mean	Medium mean
gbrain	0.394	0.428
mem0-platform	0.221	0.303
zep-cloud	0.252	0.213
graphify-oss	0.206	0.142

gbrain uses its native synthesizer; the retrieval-only systems share one fixed neutral answerer (Claude Sonnet 4.6), so differences among them reflect the memory layer, not the generator. Even the strongest system leaves roughly six-tenths of the achievable credit on the table. The breakdown shows where: provenance and supersession hold up, but bi-temporal, justification-chain, and contradiction reasoning sit near the floor.

Bar chart of gbrain's mean rubric score across the six categories (C1-C6) for the small and medium tiers. C1 supersession and C2 provenance are highest (~0.5-0.76); C3 bi-temporal, C5 justification chains, and C6 contradiction sit near the floor (0.00-0.31).

Where even the front-runner fails. gbrain mean rubric score by category, small vs. medium tier. Full per-system, per-category results are in the paper.

Results come from a single seed and small question sets (11 questions in the small tier, 73 in medium); we attach no confidence intervals and read between-system gaps below 0.1 as indicative, not significant.
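
As a rough illustration of why small gaps are hard to call at these sample sizes, the sketch below bounds the standard error of a tier mean. It assumes only that per-question scores lie in [0,1] (so their standard deviation is at most 0.5) and that questions are treated as independent draws. This is a worst-case bound, not an analysis from the paper, and `max_standard_error` is a name made up for this example.

```python
import math


def max_standard_error(n_questions: int) -> float:
    """Worst-case standard error of a mean of n scores bounded in [0, 1]."""
    # A variable confined to [0, 1] has variance <= 1/4, hence sd <= 0.5.
    return 0.5 / math.sqrt(n_questions)


for tier, n in {"small": 11, "medium": 73}.items():
    print(f"{tier}: n={n}, worst-case SE <= {max_standard_error(n):.3f}")
# small: n=11, worst-case SE <= 0.151
# medium: n=73, worst-case SE <= 0.059
```

With 11 questions the standard error of one system's mean can be as large as about 0.15, so a difference of less than 0.1 between two systems may be within noise.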

Why it's hard

AI agents are increasingly deployed as operational participants inside companies. In that setting the binding failure is no longer "the agent doesn't know" — it is "the agent confidently acts on a fact that is no longer true."

Existing memory benchmarks don't exercise this regime. LoCoMo and LongMemEval model a single narrator's history and are near-saturated; BEAM pushes volume to millions of tokens but still varies how much one narrator said, not who decided what across an organization. Organizational memory is structurally different along three axes:

Graph-shaped — a decision is only actionable alongside the conversation that prompted it, the estimate that bounded it, and the sign-off that ratified it.
Bi-temporal — companies revise decisions, so memory must separate what was true then (valid time) from when it was recorded (ingestion time), preserving supersession instead of overwriting it.
Decision-traced — the why behind a fact is often worth more than the fact, and is how organizations avoid re-litigating settled questions.

The benchmark

The seed organization is Helix Logistics, a small-to-mid SaaS/freight-tech company with a coherent 2020–2026 arc. The substrate is a bi-temporal source-of-truth graph in which every fact carries four timestamps (valid_at, invalid_at, ingested_at, expired_at); facts are invalidated, never deleted, supersession links carry an explicit reason, and as_of date-travel reconstructs any past belief state. The data is synthetic by design — a freshly generated company can't appear in any system's pretraining, so a high score can't be explained by memorization.
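
A minimal sketch of how a four-timestamp fact and an as_of lookup could fit together. The field names follow the paragraph above; everything else is an assumption made for illustration and is not the benchmark's implementation. That includes the `Fact` class, the `as_of` signature, the half-open interval convention, and the vendor data.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    valid_at: date               # when the fact became true in the world
    invalid_at: Optional[date]   # when it stopped being true (None = still true)
    ingested_at: date            # when this record entered the graph
    expired_at: Optional[date]   # when this record was superseded (None = still believed)


def as_of(facts: List[Fact], key: str, valid_on: date, known_on: date) -> List[str]:
    """Values for `key` that held on `valid_on`, judged by what was recorded on `known_on`."""
    return [
        f.value
        for f in facts
        if f.key == key
        and f.valid_at <= valid_on
        and (f.invalid_at is None or valid_on < f.invalid_at)
        and f.ingested_at <= known_on
        and (f.expired_at is None or known_on < f.expired_at)
    ]


# Made-up facts (not taken from the Helix corpus): a vendor switch that took
# effect on 2023-06-01 but was only recorded on 2023-06-10.
facts = [
    Fact("vendor", "VendorA", date(2021, 3, 1), None, date(2021, 3, 5), date(2023, 6, 10)),
    Fact("vendor", "VendorA", date(2021, 3, 1), date(2023, 6, 1), date(2023, 6, 10), None),
    Fact("vendor", "VendorB", date(2023, 6, 1), None, date(2023, 6, 10), None),
]

# What the organization believed on 2023-06-05 about that same day:
print(as_of(facts, "vendor", date(2023, 6, 5), date(2023, 6, 5)))  # ['VendorA']
# What it knows later about 2023-06-05:
print(as_of(facts, "vendor", date(2023, 6, 5), date(2024, 1, 1)))  # ['VendorB']
```

The first record is expired rather than removed, so the earlier belief stays reconstructible even after the correction is ingested.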

Answers are scored by rubric-weighted facet coverage in [0,1] (partial credit per sub-criterion), not exact match. Each tier spans six categories:

Code	Category	What it tests
C1	Supersession	The current value of a fact plus what it replaced, when, and why.
C2	Decision provenance	Who decided, when, which alternatives were weighed, and the deciding rationale.
C3	Bi-temporal (as-of)	What the organization believed as of a past date, distinct from now.
C4	Audit replay	Reconstruct the knowledge state at a past date and flag what has since changed.
C5	Justification chain	The evidence supporting a conclusion, each item classed as direct testimony vs. inference.
C6	Contradiction	Detect that two artifacts conflict; report both sides and whether it was resolved.
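
To make the C1 row concrete, here is a hedged sketch of recovering the current value plus what it replaced, when, and why, by walking supersession links backwards. The `Version` class, its field names, and the retention-policy data are hypothetical and are not drawn from the corpus or the harness.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Version:
    fact_id: str
    value: str
    valid_at: date
    supersedes: Optional[str] = None   # id of the version this one replaced
    reason: Optional[str] = None       # why the replacement happened


def supersession_history(versions: Dict[str, Version], current_id: str) -> List[Version]:
    """Walk supersession links from the current version back to the original."""
    history: List[Version] = []
    seen = set()
    cursor: Optional[str] = current_id
    while cursor is not None and cursor not in seen:  # `seen` guards against cycles
        seen.add(cursor)
        version = versions[cursor]
        history.append(version)
        cursor = version.supersedes
    return history


# Made-up example data.
versions = {
    "v1": Version("v1", "30-day retention", date(2021, 1, 1)),
    "v2": Version("v2", "90-day retention", date(2022, 5, 1),
                  supersedes="v1", reason="audit requirement"),
}

current, *replaced = supersession_history(versions, "v2")
print(current.value, "since", current.valid_at, "because", current.reason)
print("replaced:", [v.value for v in replaced])
```

A system that overwrites the old value in place can still report the current value, but has nothing left to fill the "what it replaced, when, and why" part of the answer.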

This release ships two tiers (larger tiers are planned):

Tier	Artifacts	Tokens	Questions
Small	121	~60K	11
Medium	443	~200K	73

What a question looks like

C2 · Decision provenance — "Who led the decision to pivot Helix's architecture, when did they decide, what other options did they consider, and what was the main reason for the choice?"

A correct answer must recover a structured gold record — here decision, deciders, decision_date, alternatives, and deciding_factor — each scored as an independent rubric facet with partial credit. One sub-criterion, for instance, requires naming all three deciders (Maya Patel, Luis Hernandez, Arjun Mehta); paraphrase is accepted, omission is penalized deterministically. The answer is licensed by three corpus artifacts spanning the meeting where it was decided and the threads that confirmed it — so it can only be answered by stitching evidence across channels, not by recalling one document.
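
The facet scoring described above can be sketched as a weighted mean of per-facet credit. This is an illustration under stated assumptions, not the harness's judge: the equal weights, the proportional credit for partially named deciders, and the hard-coded credit values standing in for judge output are all hypothetical. The real judge accepts paraphrase, whereas this sketch uses plain substring matching.

```python
from typing import Dict, List


def named_fraction(answer: str, required: List[str]) -> float:
    """Share of required names that appear in the answer (case-insensitive)."""
    text = answer.lower()
    return sum(name.lower() in text for name in required) / len(required)


def rubric_score(credit: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-facet credit; a facet missing from `credit` earns 0."""
    total = sum(weights.values())
    return sum(w * credit.get(facet, 0.0) for facet, w in weights.items()) / total


# Equal weights are an assumption; the real rubric weights are not given here.
weights = {"decision": 1.0, "deciders": 1.0, "decision_date": 1.0,
           "alternatives": 1.0, "deciding_factor": 1.0}

answer = "Maya Patel and Arjun Mehta led the decision."
credit = {
    "decision": 1.0,        # hypothetical judge output
    "deciders": named_fraction(answer, ["Maya Patel", "Luis Hernandez", "Arjun Mehta"]),
    "decision_date": 1.0,   # hypothetical judge output
    "alternatives": 0.5,    # hypothetical judge output
    # "deciding_factor" is absent from the answer, so it earns 0
}

print(round(rubric_score(credit, weights), 3))  # 0.633
```

Omitting one of the three deciders and the deciding factor costs credit on exactly those facets, while the facets that were answered correctly keep theirs.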

Quickstart

The harness runs in a container; each memory system under test runs as its own service.

# Build the harness image
docker build -t orgmembench .

# Corpus + question stats for a tier (free — no token spend)
docker run --rm orgmembench stats --tier small

# List the available system adapters
docker run --rm orgmembench list

Self-hosted systems need their backing infrastructure — bring up only what you run:

docker compose up -d qdrant      # mem0 vector store
docker compose up -d neo4j       # zep / graphiti temporal graph
docker compose up -d gbrain-pg   # gbrain Postgres backend

Hosted systems (mem0-platform, zep-cloud) need only an API key in your environment (see env.example; never commit keys). Runs are dry-run by default; pass --execute with the keys/services in place to run for real:

docker run --rm --env-file .env orgmembench run --system gbrain --tier small --execute
docker run --rm --env-file .env orgmembench leaderboard   # render results/ -> leaderboard.md

Per-system stand-up notes live in docker/; full reproduction detail is in docs/REPRODUCIBILITY.md.

Evaluate your own system

OrgMemBench is built to be extended, and the leaderboard is open.

Write an adapter — one small file in orgmembench/adapters/ (ingest, retrieve, and an optional native answerer). Guide: docs/CONTRIBUTING-AN-ADAPTER.md.
Run it — orgmembench run --system <name> --tier medium --execute.
Or score externally — already have predictions? Score them directly: orgmembench judge-submission --file preds.jsonl --system <name> --tier medium.

Open a PR with your adapter and results, and we'll add your system to the leaderboard.
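
For orientation, a hypothetical sketch of the three-part adapter shape mentioned above (ingest, retrieve, optional native answerer). The class names, method signatures, and the `"text"` artifact field are assumptions made for this example; the actual interface is defined in docs/CONTRIBUTING-AN-ADAPTER.md and may differ.

```python
from typing import Dict, List, Optional, Protocol


class MemoryAdapter(Protocol):
    """Assumed adapter shape; not the harness's real base class."""

    def ingest(self, artifacts: List[Dict[str, str]]) -> None: ...
    def retrieve(self, question: str, k: int = 5) -> List[str]: ...
    def answer(self, question: str) -> Optional[str]: ...


class KeywordAdapter:
    """Toy in-memory adapter: ranks artifacts by word overlap with the question."""

    def __init__(self) -> None:
        self._texts: List[str] = []

    def ingest(self, artifacts: List[Dict[str, str]]) -> None:
        self._texts.extend(a["text"] for a in artifacts)

    def retrieve(self, question: str, k: int = 5) -> List[str]:
        query = set(question.lower().split())
        ranked = sorted(
            self._texts,
            key=lambda t: len(query & set(t.lower().split())),
            reverse=True,
        )
        return ranked[:k]

    def answer(self, question: str) -> Optional[str]:
        # No native answerer: a retrieval-only system returns nothing here.
        return None


adapter: MemoryAdapter = KeywordAdapter()
adapter.ingest([
    {"text": "The pricing decision was approved in March"},
    {"text": "Lunch menu for Friday"},
])
print(adapter.retrieve("who approved the pricing decision", k=1))
```

Returning `None` from `answer` models a retrieval-only system, which in the leaderboard runs is paired with the shared neutral answerer instead of a native one.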

Repository layout

OrgMemBench/
├── orgmembench/       evaluation harness: loaders, answerer, judge, runner,
│                      metrics, leaderboard, CLI + adapters/ (one per system)
├── datasets/helix/    the corpus: small/ and medium/ tiers (CC BY 4.0)
├── generation/        the open corpus-generation pipeline (helix_corpus)
├── config/            pinned, vendor-recommended config per system
├── docker/            per-system stand-up notes + docker-compose.yml
├── docs/              REPRODUCIBILITY.md, METHODOLOGY.md, adapter guide
├── paper/             the paper: LaTeX source + OrgMemBench.pdf
├── results/           where run output lands (empty until you run)
└── tests/             free smoke tests (no token spend)

License

Code (harness, adapters, generation): MIT — LICENSE.
Dataset (everything under datasets/): CC BY 4.0 — datasets/LICENSE. Attribution required; commercial use permitted.

Citation

@misc{gardner2026orgmembench,
  title        = {OrgMemBench: A Benchmark for Long-Horizon Organizational Memory in AI Agents},
  author       = {Gardner, Jack},
  year         = {2026},
  note         = {Preprint},
  howpublished = {\url{https://github.com/JackCGardner/OrgMemBench}}
}
