engram

Persistent memory for AI agents. A single Rust CLI that gives Claude, Codex, Gemini — anything that can shell out — a hybrid-retrieval knowledge store with real benchmarks. No MCP server. No web service. No cloud dependency for the store itself.

git clone https://github.com/199-biotechnologies/engram-2
cd engram-2
cargo install --path crates/engram-cli --locked
engram skill install          # tells Claude/Codex/Gemini it exists
engram config set keys.gemini "$GEMINI_API_KEY"
engram remember "Rapamycin extends mouse lifespan via mTORC1 inhibition."
engram recall "what drug extends lifespan"    # finds it

The problem engram solves

Every LLM chat forgets everything when the window closes. The community's answer has been MCP servers: long-lived processes your agent connects to over a structured protocol. The problem is that MCP tool discovery costs ~44,000 tokens per session per server, the server has to be running, and every chat replays the whole thing.

engram takes the opposite bet: the binary is the interface. Your agent runs engram agent-info once (~1,400 tokens, roughly 31× cheaper than the ~44,000-token MCP figure) to learn every command, then shells out to engram recall / engram remember / engram ingest exactly as it already uses gh and jq. Nothing to start, nothing to keep alive, nothing to crash.

The cost of this bet is that engram has to be demonstrably better at retrieval than the MCP alternatives. So we benchmarked it.

Benchmarks

Retrieval — LongMemEval S (500 questions, 96% distractors)

Full 500-question LongMemEval S split — 48 sessions per question, 96% distractors. Same dataset MemPalace reports against.

Pipeline	R@1	R@5	R@10	MRR
MemPalace (published `hybrid_v4`)	—	0.984	0.998	—
engram — hybrid only (Gemini Embed 2 + FTS5 + RRF)	0.910	0.990	0.998	0.946
engram — hybrid + Cohere Rerank (first 100 Qs)	0.930	0.980	1.000	0.957

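engram beats MemPalace on R@5 by 0.6 points (0.990 vs. 0.984) on retrieval alone: no reranking, no graph traversal, no AAAK compression, no PageRank. Adding Cohere rerank is reported to gain another ~4 points on R@1. Note that the rerank row above covers only the first 100 questions, so it is not directly comparable with the full 500-question hybrid row, against which it is 2 points higher on R@1 and 1 point lower on R@5.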
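The R@k and MRR columns can be computed from per-question ranked results. The following is a minimal sketch, assuming a question counts as a hit at k when any gold ID appears in the top k results; the actual harness in engram-bench may define these differently. The `Question` type, the sample data, and all function names are hypothetical.

```rust
/// One benchmark question: the ranked IDs a retriever returned and the gold IDs.
/// (Hypothetical type for illustration, not engram's own.)
struct Question {
    ranked: Vec<&'static str>,
    gold: Vec<&'static str>,
}

/// 1-based rank of the first gold hit, if any.
fn first_hit_rank(q: &Question) -> Option<usize> {
    q.ranked
        .iter()
        .position(|id| q.gold.contains(id))
        .map(|i| i + 1)
}

/// Fraction of questions with at least one gold ID in the top `k`.
fn recall_at_k(qs: &[Question], k: usize) -> f64 {
    if qs.is_empty() {
        return 0.0;
    }
    let hits = qs
        .iter()
        .filter(|q| first_hit_rank(q).map_or(false, |r| r <= k))
        .count();
    hits as f64 / qs.len() as f64
}

/// Mean reciprocal rank; a question with no hit contributes 0.
fn mrr(qs: &[Question]) -> f64 {
    if qs.is_empty() {
        return 0.0;
    }
    let total: f64 = qs
        .iter()
        .map(|q| first_hit_rank(q).map_or(0.0, |r| 1.0 / r as f64))
        .sum();
    total / qs.len() as f64
}

/// Made-up sample: hits at rank 1, rank 2, a miss, and rank 1.
fn sample() -> Vec<Question> {
    vec![
        Question { ranked: vec!["s1", "s2", "s3"], gold: vec!["s1"] },
        Question { ranked: vec!["s4", "s5", "s6"], gold: vec!["s5"] },
        Question { ranked: vec!["s7", "s8", "s9"], gold: vec!["s0"] },
        Question { ranked: vec!["s2", "s1"], gold: vec!["s2", "s9"] },
    ]
}

fn main() {
    let qs = sample();
    println!("R@1 = {:.3}", recall_at_k(&qs, 1)); // 0.500
    println!("R@5 = {:.3}", recall_at_k(&qs, 5)); // 0.750
    println!("MRR = {:.3}", mrr(&qs)); // 0.625
}
```

Because MRR only looks at the first hit, it can stay high while R@1 moves, which is one reason the table reports both.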

End-to-end QA (retrieve → LLM answer → LLM judge)

Retrieval numbers alone hide the real bottleneck. @parcadei tested MemPalace with an actual LLM answering questions using MemPalace's retrieved context, and got only 17% correct answers — despite the published R@5 of 0.984.

We implemented the same end-to-end evaluation for engram: retrieve top-k → pass to openai/gpt-5.4 to answer → judge correctness with openai/gpt-5.4. Per-question results, token counts, and cost are saved to benchmarks/.

Suite	Sample	Correct	Accuracy	R@5	MRR	Notes
LongMemEval-QA	2	2	100%	1.00	1.00	Easy single-session questions
LongMemEval-QA	3	1	33%	1.00	1.00	Retrieval perfect, 1 interpretation error + 1 false refusal
LoCoMo-QA	5	2	40%	—	—	Short multi-session test
LoCoMo-QA	50	14	28%	—	—	First stable QA number on a harder dataset

The gap between retrieval and answer accuracy behind that 17% figure is real for everyone, not just MemPalace. Our own retrieval is perfect on the small LongMemEval-QA samples above (MRR = 1.00), but the answerer LLM:

Interprets "daily commute" as round-trip (90 min) when the reference is one-way (45 min)
Refuses to answer with "I don't know" even when the answer is in the retrieved context
Fails on LoCoMo's harder multi-session reasoning

These aren't engram bugs; they're the state of the art. Retrieval R@5 ≠ answer accuracy. Measuring only retrieval, as MemPalace did, hides the real problem.

What this shows about MemPalace's claims: their published 0.984 R@5 is probably real as a retrieval number, but the claim that "MemPalace is the best agent memory system" rests on conflating retrieval with end-to-end correctness. The critical thread from Han Xiao (Jina AI) dissects this further.

RAGAS metrics (LLM-as-judge, four orthogonal dimensions)

Run engram bench longmemeval-qa --ragas to compute four additional metrics on top of correctness: faithfulness (no hallucination), answer relevance (on-topic), context precision (retrieved chunks are all useful), and context recall (every fact in the gold answer is in the retrieved chunks). The flag adds 4 LLM calls per question, so run it sparingly.

Reproducing

# Retrieval only (fast, no LLM judge):
engram bench longmemeval --json                          # full 500
engram bench longmemeval --limit 50 --json               # first 50
engram bench mini --json                                 # 10-question smoke

# End-to-end QA (requires OPENROUTER_API_KEY for answerer + judge):
engram bench longmemeval-qa --limit 20 --json            # ~50 minutes on free Gemini tier
engram bench longmemeval-qa --limit 20 --ragas --json    # + 4 extra LLM calls/question
engram bench locomo-qa --limit 50 --json                 # ~3 minutes

# Every run saves a timestamped report to benchmarks/
ls benchmarks/

All runs are logged with full per-question detail, token counts, and model IDs to benchmarks/ so you can audit failures or rerun the judge with a different prompt without re-embedding. See benchmarks/README.md for the report schema.

Install

# Prerequisite: Rust 1.80+ (install via rustup.rs if needed)
git clone https://github.com/199-biotechnologies/engram-2
cd engram-2
cargo install --path crates/engram-cli --locked

One binary at ~/.cargo/bin/engram. No runtime, no Python, no Docker, no services. engram --version should print engram 0.1.0.

Configure keys

# Required for real hybrid retrieval. Free tier at https://aistudio.google.com/apikey
engram config set keys.gemini "$GEMINI_API_KEY"

# Optional — adds ~4 R@1 points via reranking. https://dashboard.cohere.com/api-keys
engram config set keys.cohere "$COHERE_API_KEY"

engram config check
# -> { "gemini": "configured", "cohere": "configured (optional)", "ok": true }

Keys are resolved in order: explicit env var → ~/.config/engram/config.toml → none. Config file is written with 0600 perms (user-only). Without Gemini, recall falls back to a deterministic offline stub — useful for CI, unusable for real quality.
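The resolution order can be sketched as a small pure function. This is an illustration rather than engram's actual code: the `KeySource` type and `resolve_key` function are hypothetical, the config file is stood in for by an in-memory map, reading the variable name `GEMINI_API_KEY` is an assumption, and treating an empty value as missing is also an assumption.

```rust
use std::collections::HashMap;

/// Where a key was found, mirroring the "env var, then config file, then none" order.
/// (Hypothetical type for illustration.)
#[derive(Debug, PartialEq)]
enum KeySource {
    Env(String),
    ConfigFile(String),
    Missing,
}

/// Resolve a key: a non-empty environment value wins, then the config value.
/// Assumption: empty strings count as "not set".
fn resolve_key(
    env_value: Option<&str>,
    config: &HashMap<String, String>,
    config_name: &str,
) -> KeySource {
    if let Some(v) = env_value.filter(|v| !v.is_empty()) {
        return KeySource::Env(v.to_string());
    }
    match config.get(config_name) {
        Some(v) if !v.is_empty() => KeySource::ConfigFile(v.clone()),
        _ => KeySource::Missing,
    }
}

fn main() {
    // In-memory stand-in for ~/.config/engram/config.toml.
    let mut config = HashMap::new();
    config.insert("keys.gemini".to_string(), "from-config".to_string());

    // Assumed variable name; the real lookup may use a different one.
    let env = std::env::var("GEMINI_API_KEY").ok();
    match resolve_key(env.as_deref(), &config, "keys.gemini") {
        KeySource::Env(_) => println!("gemini: using environment variable"),
        KeySource::ConfigFile(_) => println!("gemini: using config file"),
        KeySource::Missing => println!("gemini: not configured, offline stub"),
    }
}
```

Passing the environment value in as an argument keeps the function testable without touching the process environment.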

Tell your agents about it

engram skill install

This writes a SKILL.md signpost to ~/.claude/skills/engram/, ~/.codex/skills/engram/, and ~/.gemini/skills/engram/. Any agent that reads those directories will discover engram, learn the memory loop pattern, and start using it autonomously.

The memory loop (how agents should use engram)

The installed skill teaches your agent to do this on every task:

# 1. LOAD — recall anything relevant before answering
engram recall "user's task in 4-6 words" --top-k 5 --json

# 2. WORK — do the task, citing recalled chunks when they matter

# 3. SAVE — whatever the user told you that will matter later
engram remember "Boris prefers Rust over Go for CLI tools."           --importance 7 --tag preference
engram remember "Decision 2026-04-08: use BLOB embeddings in SQLite." --importance 9 --tag decision

Rule of thumb: save preferences, explicit decisions with rationale, stable facts, and corrections. Don't save task-local state or conversation filler.
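For an agent harness written in Rust, the LOAD and SAVE steps might be driven as follows. This is a sketch under stated assumptions: it uses only the flags shown in this README (`--top-k`, `--json`, `--importance`, `--tag`), the helper names `recall_args` and `remember_args` are hypothetical, and clamping importance to 10 is a choice made here, not documented behavior.

```rust
use std::process::Command;

/// Arguments for step 1 (LOAD), using only flags shown in this README.
fn recall_args(query: &str, top_k: u32) -> Vec<String> {
    vec![
        "recall".to_string(),
        query.to_string(),
        "--top-k".to_string(),
        top_k.to_string(),
        "--json".to_string(),
    ]
}

/// Arguments for step 3 (SAVE). Importance is clamped to the documented 0-10 range.
fn remember_args(content: &str, importance: u8, tags: &[&str]) -> Vec<String> {
    let mut args = vec![
        "remember".to_string(),
        content.to_string(),
        "--importance".to_string(),
        importance.min(10).to_string(),
    ];
    for tag in tags {
        args.push("--tag".to_string());
        args.push(tag.to_string());
    }
    args
}

fn main() {
    let args = recall_args("rust cli tool preferences", 5);
    // Passing arguments as a list (no shell) avoids quoting problems in the query.
    match Command::new("engram").args(&args).output() {
        Ok(out) if out.status.success() => {
            println!("{}", String::from_utf8_lossy(&out.stdout));
        }
        Ok(out) => eprintln!("engram exited with {:?}", out.status.code()),
        Err(e) => eprintln!("could not run engram: {e}"),
    }
    let save = remember_args("Boris prefers Rust over Go for CLI tools.", 7, &["preference"]);
    println!("{save:?}");
}
```

If engram is not on the PATH, the `Err` branch reports it and the program still exits normally.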

Scientific papers workflow

engram is purpose-built for ingesting and querying research papers with real citations.

# Drop PDFs in a directory
curl -sL -o paper.pdf https://arxiv.org/pdf/2405.14831.pdf   # HippoRAG
curl -sL -o bert.pdf  https://arxiv.org/pdf/1810.04805.pdf   # BERT

# Ingest. This runs pdf-extract -> section-aware chunking (preserves
# "Methods > Cell Culture" breadcrumbs) -> Gemini Embedding 2 (batched,
# token-budgeted) -> SQLite BLOBs. Embeddings persist forever.
engram ingest . --mode papers

# Ask questions. Returns the exact chunks with scores and sources.
engram recall "personalized pagerank for multi-hop retrieval" --top-k 3 --json

# Browse what engram extracted from the corpus
engram entities list --limit 10
# -> BERT (58), HippoRAG (56), LightRAG (52), LLM (39), RAG (36), ...

Each result has chunk_id, score, content, and sources: ["dense","lexical","reranker"]. Your agent should quote the content and cite the chunk_id so you can always re-run engram recall to verify a claim.

Tested on 5 arXiv papers (Attention, BERT, HippoRAG, LightRAG, RAG — 1,171 chunks) in 21 seconds end-to-end.

Architecture

        query
          │
 ┌────────┴────────┐
 │                 │
 ▼                 ▼
Dense          Lexical
(Gemini        (FTS5
 Embed 2        BM25 over
 batched +      chunks.content
 cached)        in SQLite)
 │                 │
 └────────┬────────┘
          │
          ▼
 Reciprocal Rank Fusion
 (k=60, deterministic tiebreak)
          │
          ▼
 (optional) Cohere Rerank 4 Pro
 reranks the top 50 candidates
          │
          ▼
 Memory layer budgeting
 (L0 identity / L1 critical /
  L2 topic / L3 deep)
          │
          ▼
 JSON envelope on stdout,
 errors on stderr,
 exit codes 0-4
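The fusion step in the diagram can be sketched in a few lines. Reciprocal Rank Fusion gives each ID a score of 1 / (k + rank) per list and sums across lists. The sketch below assumes 1-based ranks and breaks score ties by ID; the README only says the tiebreak is deterministic, so the ID ordering is an assumption, and `rrf_fuse` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Fuse ranked ID lists with Reciprocal Rank Fusion.
/// Each list contributes 1 / (k + rank) per ID, with 1-based ranks.
fn rrf_fuse(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (i, id) in list.iter().enumerate() {
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Sort by score descending; break ties by ID (assumed rule) so the output
    // does not depend on HashMap iteration order.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}

fn main() {
    // Made-up ranked lists standing in for the dense and lexical retrievers.
    let dense = vec!["a", "b", "c"];
    let lexical = vec!["b", "d", "a"];
    for (id, score) in rrf_fuse(&[dense, lexical], 60.0) {
        println!("{id}: {score:.5}");
    }
}
```

With these inputs "b" ranks first because it sits near the top of both lists, ahead of "a", which is first in one list but third in the other.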

SQLite is the source of truth. Chunks store their embedding as a little-endian f32 BLOB plus an embed_model tag.
FTS5 is the lexical index, included in the same database file.
No separate vector server — at personal scale (<100K vectors) brute-force cosine in Rust is fast enough. We skipped Qdrant and LanceDB on purpose.
Deterministic everything: UUID v5 for IDs, stable sort tiebreak in fusion, reproducible bench runs.
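The storage and search points above can be illustrated together: embeddings serialized as little-endian f32 bytes, then a brute-force cosine scan over every stored vector. This is a minimal sketch, not the code in engram-storage; the function names are hypothetical, the store is an in-memory slice rather than SQLite, and the tie-by-ID rule is an assumption.

```rust
/// Encode an embedding as little-endian f32 bytes (4 bytes per dimension).
fn to_blob(v: &[f32]) -> Vec<u8> {
    v.iter().flat_map(|x| x.to_le_bytes()).collect()
}

/// Decode a BLOB back into f32s; None if the length is not a multiple of 4.
fn from_blob(blob: &[u8]) -> Option<Vec<f32>> {
    if blob.len() % 4 != 0 {
        return None;
    }
    Some(
        blob.chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}

/// Cosine similarity; 0.0 for mismatched lengths or zero-norm vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Brute-force top-k: score every stored vector, sort descending, ties by ID.
/// Undecodable BLOBs are skipped.
fn top_k<'a>(query: &[f32], store: &'a [(&'a str, Vec<u8>)], k: usize) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = store
        .iter()
        .filter_map(|(id, blob)| from_blob(blob).map(|v| (*id, cosine(query, &v))))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(b.0)));
    scored.truncate(k);
    scored
}

fn main() {
    // Tiny made-up store; real embeddings have many more dimensions.
    let store = vec![
        ("a", to_blob(&[1.0, 0.0])),
        ("b", to_blob(&[0.0, 1.0])),
        ("c", to_blob(&[1.0, 1.0])),
    ];
    for (id, score) in top_k(&[1.0, 0.0], &store, 2) {
        println!("{id}: {score:.3}");
    }
}
```

A full scan is linear in the number of stored vectors, which is why it stays practical at the personal scale described above and would not at much larger ones.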

Cargo workspace layout:

Crate	Purpose
`engram-core`	Pure types, fusion (RRF), memory layers, AAAK compression, temporal validity. Zero I/O.
`engram-storage`	SQLite source of truth + FTS5 + chunk-embedding BLOBs.
`engram-embed`	`Embedder` trait + Gemini Embed 2 (batch + single) + deterministic offline stub.
`engram-rerank`	`Reranker` trait + Cohere Rerank 4 Pro + passthrough.
`engram-ingest`	Mining modes: papers (PDF + section-aware), conversations, repos, general, auto.
`engram-graph`	Deterministic entity extraction + graph scaffolding.
`engram-bench`	LongMemEval harness + inline mini bench.
`engram-cli`	The single `engram` binary and the shared hybrid retrieval pipeline.

Framework compliance

engram follows the agent-cli-framework verbatim:

agent-info returns a raw JSON manifest (not enveloped) so agents can discover every command in one call
JSON envelope on every other stdout path (version, status, data, metadata)
Errors on stderr with code, message, suggestion, exit_code
Semantic exit codes: 0 success, 1 transient (retry), 2 config (fix setup), 3 bad input (fix args), 4 rate limited (back off)
No interactive prompts. Destructive ops like forget require --confirm
XDG paths everywhere (~/.config/engram/, ~/.local/share/engram/, ~/.cache/engram/)
Skill file embedded in the binary as a compile-time constant and deployed via engram skill install
Secrets resolved in order: env var → config file → none. Always masked on display (AIzaSy...DW58)
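A caller can turn the semantic exit codes above into a retry policy. The sketch below maps the five documented codes to an enum and decides whether a retry makes sense. The `Outcome` type and function names are hypothetical, and the delay values are illustrative only: the list above says "retry" and "back off" without giving timings.

```rust
/// The five exit codes listed above, as a caller might model them.
/// (Hypothetical type for illustration.)
#[derive(Debug, PartialEq, Clone, Copy)]
enum Outcome {
    Success,
    Transient,
    Config,
    BadInput,
    RateLimited,
    Unknown(i32),
}

/// Map a process exit code to its documented meaning.
fn classify(code: i32) -> Outcome {
    match code {
        0 => Outcome::Success,
        1 => Outcome::Transient,
        2 => Outcome::Config,
        3 => Outcome::BadInput,
        4 => Outcome::RateLimited,
        other => Outcome::Unknown(other),
    }
}

/// Seconds to wait before retrying, or None if retrying the same command
/// cannot help. Delays are assumptions: a fixed 1 s for transient errors and
/// exponential backoff capped at 64 s for rate limits.
fn retry_delay_secs(outcome: Outcome, attempt: u32) -> Option<u64> {
    match outcome {
        Outcome::Transient => Some(1),
        Outcome::RateLimited => Some(2u64.pow(attempt.min(6))),
        _ => None,
    }
}

fn main() {
    for code in 0..=5 {
        let outcome = classify(code);
        println!(
            "exit {code}: {:?}, retry after {:?}",
            outcome,
            retry_delay_secs(outcome, 1)
        );
    }
}
```

Codes 2 and 3 return None because the fix is a change to setup or arguments, not another attempt.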

All the commands (`engram agent-info` for the full manifest)


Command	Description
`engram remember <content>`	Store a memory. Flags: `--importance 0-10`, `--tag` (repeatable), `--diary`
`engram recall <query>`	Hybrid search. Flags: `--top-k`, `--layer identity\|critical\|topic\|deep`, `--diary`, `--since`, `--until`
`engram ingest <path>`	Mine a file or directory. `--mode papers\|conversations\|repos\|general\|auto`
`engram edit <id>`	Update memory content or importance
`engram forget <id> --confirm`	Soft-delete (destructive, requires `--confirm`)
`engram entities list \| show <name>`	Browse extracted entities
`engram export` / `engram import <file>`	JSON backup / restore
`engram bench <mini\|mini-fts\|longmemeval\|longmemeval-qa\|locomo-qa>`	Run benchmarks. Flags: `--limit`, `--json`, `--ragas` (shown with `longmemeval-qa`)
`engram config show \| set \| check`	Configuration
`engram skill install \| uninstall`	Deploy agent skill signpost
`engram agent-info`	Self-describing manifest (start here)

Development

cargo build --release                         # build
cargo test                                    # 27 unit + 18 integration tests
./target/release/engram bench mini --json     # fast smoke bench (<1s)
./target/release/engram bench longmemeval     # real benchmark (~5 min with Cohere)

Research direction for contributors: program.md — enumerates the hyperparameters and architecture experiments worth running via autoresearch loops. Design rationale: docs/superpowers/specs/2026-04-07-engram-v2-design.md.

Roadmap

Shipped (v0.1.0)

Single-binary install, hybrid Gemini + FTS5 + RRF retrieval
Persistent SQLite store with chunk-embedding BLOBs
Full CRUD (remember, recall, edit, forget, export, import)
Mining modes for papers, conversations, repos, general
PDF ingestion via pdf-extract
Section-aware chunking, AAAK compression prototype
Cohere Rerank 4 Pro wired as optional lift
Memory layers (L0–L3) with token budgeting
Diary namespaces for specialist agents
Entity extraction and browsing
LongMemEval harness (Oracle + S splits)
45 unit + integration tests

Next up

GitHub Actions CI releasing prebuilt macOS + Linux binaries
cargo install engram-cli from crates.io
engram update --check wired to real GitHub Releases
Local embedding fallback via candle + bge-small-en-v1.5 (zero API, p95 < 10 ms)
ENGRAM_RERANK_TOP_N knob to cut Cohere cost ~60% with minimal quality loss
Graph expansion on retrieval (deterministic edges already extracted)

Credits

Inspired by:

MemPalace — spatial memory + AAAK compression philosophy
HippoRAG 2 — "return verbatim passages, don't paraphrase"
LongMemEval — the benchmark we aimed at
agent-cli-framework — the principles engram follows verbatim

License

MIT — see LICENSE.

Built by 199 Biotechnologies. Questions? Open an issue. Pull requests welcome.
