Description
Problem
Temporal reasoning is our weakest benchmark category:
- BM: 59.1% R@5 on temporal queries
- Backboard: 91.9% accuracy on temporal (best in class)
- Zep: 79.8%, Memobase: 85.1%
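We trail Backboard by roughly 33 points on time-based questions like 'When did X happen?' or 'What changed between session 2 and session 3?', and Memobase and Zep by about 26 and 21 points. Our figure is R@5 while Backboard reports accuracy, so treat the gap as approximate.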
Root Cause (NOT recency bias)
The problem is NOT that old content ranks too high. Old content can be highly relevant — a founding document from a year ago may matter more than yesterday's chat. Time-decay scoring is the wrong approach.
The real issues:
- Time isn't a searchable dimension: dates exist in the database (created_at, updated_at, frontmatter date fields) but aren't used as retrieval signals
- Can't answer 'when' questions: 'When did X happen?' requires finding the note about X and extracting its date, not boosting recent notes
- No temporal range queries — 'before the trip', 'after May', 'between session 2 and 3' can't be expressed as search filters
- Session boundaries aren't explicit — multi-session conversations (like LoCoMo) need session IDs as searchable metadata
Proposed Improvements
1. Date as a first-class search dimension
The data is already in the database. Surface it as search filters:
- after_date / before_date parameters on search
- date_range filter (between two dates)
- session_id filter for conversation-scoped queries
- These should compose with existing text/semantic search, not replace it
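A minimal sketch of how these filters could compose with a text match. The notes table, its columns, the search_notes function, and the LIKE match (standing in for the real text/semantic search) are illustrative assumptions, not the project's actual schema or API. Dates are assumed to be stored as ISO 8601 strings, which compare correctly as text.

```python
import sqlite3


def search_notes(conn, text, after_date=None, before_date=None, session_id=None):
    """Text match narrowed by optional date and session filters (hypothetical API)."""
    clauses = ["content LIKE ?"]
    params = [f"%{text}%"]
    if after_date is not None:
        clauses.append("created_at >= ?")  # inclusive lower bound
        params.append(after_date)
    if before_date is not None:
        clauses.append("created_at < ?")  # exclusive upper bound
        params.append(before_date)
    if session_id is not None:
        clauses.append("session_id = ?")
        params.append(session_id)
    sql = (
        "SELECT title, created_at, session_id FROM notes WHERE "
        + " AND ".join(clauses)
        + " ORDER BY created_at"
    )
    return conn.execute(sql, params).fetchall()


def demo_connection():
    """Build an in-memory database with a few made-up notes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE notes (title TEXT, content TEXT, created_at TEXT, session_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO notes VALUES (?, ?, ?, ?)",
        [
            ("Trip plan", "planning the trip to Lisbon", "2024-04-20", "s1"),
            ("Budget", "quarterly budget", "2024-05-15", "s2"),
            ("Trip recap", "notes after the trip", "2024-06-02", "s2"),
        ],
    )
    return conn
```

A date range is just both bounds together, for example search_notes(conn, "trip", after_date="2024-05-01", before_date="2024-07-01"). Omitting every filter leaves the text search unchanged, which is the "compose, not replace" behavior.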
2. Temporal entity extraction
When notes contain temporal references, index them:
- Frontmatter date, created, started, completed fields → indexed as temporal metadata
- Conversation timestamps (### HH:MM headers) → queryable
- Calendar dates mentioned in content → extracted and indexed (stretch goal)
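The first two items could be sketched roughly as follows. This assumes frontmatter has already been parsed into a dict and that the note's calendar day is known when reading ### HH:MM headers. The function names and the TEMPORAL_KEYS tuple are hypothetical, and free-text calendar dates (the stretch goal) are not handled.

```python
import re
from datetime import date, datetime, time

# Frontmatter keys treated as temporal metadata (taken from the list above).
TEMPORAL_KEYS = ("date", "created", "started", "completed")

# Matches conversation headers such as "### 14:30" on their own line.
HEADER_RE = re.compile(r"^###\s+(\d{1,2}):(\d{2})\s*$", re.MULTILINE)


def extract_frontmatter_dates(frontmatter: dict) -> dict:
    """Return {key: date} for temporal keys whose values can be read as dates."""
    found = {}
    for key in TEMPORAL_KEYS:
        value = frontmatter.get(key)
        if isinstance(value, datetime):  # check datetime first: it subclasses date
            found[key] = value.date()
        elif isinstance(value, date):
            found[key] = value
        elif isinstance(value, str):
            try:
                found[key] = date.fromisoformat(value)
            except ValueError:
                continue  # skip values that are not plain ISO dates
    return found


def extract_header_times(body: str, day: date) -> list:
    """Return a datetime for each valid ### HH:MM header, anchored to `day`."""
    stamps = []
    for match in HEADER_RE.finditer(body):
        hour, minute = int(match.group(1)), int(match.group(2))
        if hour < 24 and minute < 60:
            stamps.append(datetime.combine(day, time(hour, minute)))
    return stamps
```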
3. Temporal query detection and routing
Recognize temporal intent in queries and auto-apply filters:
- 'When did X happen?' → find notes about X, return with dates
- 'What did we discuss last week?' → after_date: 7d ago filter
- 'Before/after [event]' → resolve event date, apply range filter
- Could be simple heuristics ('when did', 'last time', 'before', 'after', 'in May') or lightweight classification
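As a starting point, the heuristic route might look like the sketch below. The pattern list, the intent labels, and both function names are assumptions for illustration. Event resolution for 'before/after [event]' is left out, and plain keyword matching will produce some false positives.

```python
import re
from datetime import date, timedelta

MONTHS = (
    "january|february|march|april|may|june|july|"
    "august|september|october|november|december"
)

# Ordered (label, pattern) pairs; the first match wins.
PATTERNS = [
    ("when", re.compile(r"\bwhen did\b|\blast time\b", re.IGNORECASE)),
    ("relative", re.compile(r"\blast (week|month)\b|\byesterday\b", re.IGNORECASE)),
    ("range", re.compile(r"\b(before|after|between)\b", re.IGNORECASE)),
    ("month", re.compile(rf"\bin ({MONTHS})\b", re.IGNORECASE)),
]


def detect_temporal_intent(query: str):
    """Return a coarse intent label, or None when no temporal cue is found."""
    for label, pattern in PATTERNS:
        if pattern.search(query):
            return label
    return None


def filters_for(query: str, today: date) -> dict:
    """Map relative phrases to an after_date filter (ISO string)."""
    lowered = query.lower()
    if "last week" in lowered:
        return {"after_date": (today - timedelta(days=7)).isoformat()}
    if "yesterday" in lowered:
        return {"after_date": (today - timedelta(days=1)).isoformat()}
    return {}
```

Passing today explicitly keeps the function deterministic, which makes it easier to cover in benchmark and unit tests.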
4. Temporal result enrichment
When returning results for temporal queries, include the temporal context:
- Surface created_at / date in search results
- Order by date when query has temporal intent (not by relevance score)
- Group by session/date when multiple results span time
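The ordering and grouping rules above can be sketched as follows. The result shape (dicts with score, created_at, and session_id keys) and the function names are assumptions, not the current search response format.

```python
from itertools import groupby


def order_results(results, temporal_intent):
    """Sort chronologically for temporal queries, otherwise by relevance score."""
    if temporal_intent:
        return sorted(results, key=lambda r: r["created_at"])
    return sorted(results, key=lambda r: r["score"], reverse=True)


def group_by_session(results):
    """Group results by session_id, oldest first within each session."""
    ordered = sorted(results, key=lambda r: (r["session_id"], r["created_at"]))
    return {
        session_id: list(items)
        for session_id, items in groupby(ordered, key=lambda r: r["session_id"])
    }
```

Because every result keeps its created_at field, the caller can read the date straight off the response when answering a 'when' question.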
What Backboard Does Right
Their benchmark passes custom_timestamp metadata with every ingested turn. Their system indexes on it. When LoCoMo asks 'when did Sarah mention the restaurant?', they do a targeted temporal lookup — not recency boosting. Result: 91.9% temporal accuracy.
What We Already Have
- created_at, updated_at on every entity in SQLite
- Frontmatter date fields parsed and stored
- after_date parameter already partially supported in search API
- Conversation notes with timestamp headers
What We Need
- Wire the existing date fields into the search scoring/filtering pipeline
- Add temporal query detection (can be simple heuristics to start)
- Ensure search results include temporal metadata in the response
- Benchmark temporal queries specifically to measure improvement
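For the last item, a per-category recall breakdown could be as simple as the sketch below. It assumes recall@k means the fraction of a query's relevant notes that appear in the top k results, averaged per category. The case format and function names are hypothetical, and the existing benchmark may define R@5 differently.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of relevant ids that appear among the top-k ranked ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def recall_by_category(cases, k=5):
    """Average recall@k per category, e.g. 'temporal' versus everything else."""
    scores = {}
    for case in cases:
        score = recall_at_k(case["ranked"], case["relevant"], k)
        scores.setdefault(case["category"], []).append(score)
    return {category: sum(vals) / len(vals) for category, vals in scores.items()}
```

Tracking the temporal category on its own makes it possible to confirm that date filters move that number without regressing the overall score.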
References
- Backboard benchmark: github.com/Backboard-io/Backboard-Locomo-Benchmark (91.9% temporal)
- Our benchmark results: R@5 59.1% temporal, 86.0% overall
- Time-decay discussion: Feature: Time-decay and usage-weighted relevance scoring for search #603 (separate concern — may still be useful as opt-in but not the fix for temporal retrieval)
Milestone
v0.19.0