Summary
The current RAG system uses FTS5 (BM25) + vector cosine similarity in parallel,
but the results are not re-ranked and chunking is fixed-size. This leads to
irrelevant context being injected and relevant context being missed on long
documents.
Improvements
1. Semantic chunking
Replace fixed-size chunking with sentence/paragraph-aware splitting:
- Split on sentence boundaries rather than arbitrary token counts
- Respect markdown headers as natural chunk boundaries
- Configurable overlap between chunks
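The three points above can be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: `ChunkOptions`, `Chunk`, `pack`, and `splitSentences` are hypothetical names, the size budget is counted in whitespace-separated words rather than model tokens, and the sentence splitter is deliberately naive (it will mis-split abbreviations such as "e.g.").

```go
package main

import "strings"

// ChunkOptions holds hypothetical knobs for this sketch.
type ChunkOptions struct {
	MaxWords         int // soft size budget per chunk, in whitespace-separated words
	OverlapSentences int // trailing sentences repeated at the start of the next chunk
}

// splitSentences ends a sentence at any word whose last character is '.', '!' or '?'.
func splitSentences(text string) []string {
	var out, cur []string
	for _, w := range strings.Fields(text) {
		cur = append(cur, w)
		if strings.ContainsAny(w[len(w)-1:], ".!?") {
			out = append(out, strings.Join(cur, " "))
			cur = nil
		}
	}
	if len(cur) > 0 {
		out = append(out, strings.Join(cur, " "))
	}
	return out
}

// pack groups units (a header or a sentence) into chunks without cutting a unit.
func pack(units []string, opt ChunkOptions) []string {
	var chunks, cur []string
	words := 0
	for _, u := range units {
		n := len(strings.Fields(u))
		if len(cur) > 0 && words+n > opt.MaxWords {
			chunks = append(chunks, strings.Join(cur, " "))
			keep := opt.OverlapSentences
			if keep < 0 {
				keep = 0
			}
			if keep > len(cur) {
				keep = len(cur)
			}
			// Carry the last `keep` units over as overlap.
			cur = append([]string(nil), cur[len(cur)-keep:]...)
			words = 0
			for _, c := range cur {
				words += len(strings.Fields(c))
			}
		}
		cur = append(cur, u)
		words += n
	}
	if len(cur) > 0 {
		chunks = append(chunks, strings.Join(cur, " "))
	}
	return chunks
}

// Chunk splits a markdown document so that no chunk crosses a header line.
func Chunk(doc string, opt ChunkOptions) []string {
	var chunks, units, body []string
	flushSection := func() {
		units = append(units, splitSentences(strings.Join(body, " "))...)
		chunks = append(chunks, pack(units, opt)...)
		units, body = nil, nil
	}
	for _, line := range strings.Split(doc, "\n") {
		if strings.HasPrefix(line, "#") { // naive header detection
			flushSection()
			units = append(units, strings.TrimSpace(line))
			continue
		}
		body = append(body, line)
	}
	flushSection()
	return chunks
}
```

Because the budget is soft, a chunk can exceed `MaxWords` when the carried-over overlap plus one sentence is already larger than the budget. A real chunker would likely count tokens with the embedding model's tokenizer instead.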
2. Hybrid search scoring (RRF)
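Keep running both retrievers, but merge their results with Reciprocal Rank
Fusion instead of taking a plain union. RRF uses each document's rank in
each list, not the raw BM25 or cosine scores:
rrf_score = 1/(k + bm25_rank) + 1/(k + vector_rank)
Here k is a smoothing constant (60 is a common choice). This produces a single
ranked list in which documents ranked highly by both retrievers come first,
which should improve precision over the unranked union.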
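A minimal sketch of the fusion step, assuming each retriever returns a list of document IDs ordered best first with no duplicates inside a list. `FuseRRF` is a hypothetical name, and a document missing from one list simply gets no contribution from that list.

```go
package main

import "sort"

// FuseRRF merges two ranked ID lists (best first) using Reciprocal Rank Fusion.
// k is the smoothing constant from the formula above.
func FuseRRF(bm25, vector []string, k float64) []string {
	scores := map[string]float64{}
	for _, list := range [][]string{bm25, vector} {
		for i, id := range list {
			scores[id] += 1.0 / (k + float64(i+1)) // ranks are 1-based
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(a, b int) bool {
		if scores[ids[a]] != scores[ids[b]] {
			return scores[ids[a]] > scores[ids[b]]
		}
		return ids[a] < ids[b] // deterministic tie-break on ID
	})
	return ids
}
```

For example, with `bm25 = [a, b, c]`, `vector = [c, a, d]`, and `k = 60`, the fused order is `a, c, b, d`: `a` and `c` appear in both lists, and `a` wins because its two ranks (1 and 2) beat `c`'s (3 and 1).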
3. Re-ranking (optional cross-encoder)
When a cross-encoder model is available locally (e.g. via Ollama),
use it to re-rank the top-N candidates before injecting them into context.
When no such model is available, keep the fused order unchanged.
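One way to keep this step optional is to put the model behind a small interface and fall back to the incoming order when it is absent or fails. The sketch below assumes that shape. `Reranker`, `RerankFunc`, and `Rerank` are hypothetical names, and the actual call to a local model is intentionally left out because its API is not specified here.

```go
package main

import "sort"

// Reranker is a hypothetical interface: higher score means more relevant.
type Reranker interface {
	Score(query, passage string) (float64, error)
}

// RerankFunc adapts a plain function to the Reranker interface.
type RerankFunc func(query, passage string) (float64, error)

func (f RerankFunc) Score(query, passage string) (float64, error) {
	return f(query, passage)
}

// Rerank re-orders the first topN candidates by cross-encoder score and leaves
// the tail untouched. With a nil Reranker, or on any scoring error, it returns
// the candidates in their original order.
func Rerank(r Reranker, query string, candidates []string, topN int) []string {
	if r == nil || topN <= 0 {
		return candidates
	}
	if topN > len(candidates) {
		topN = len(candidates)
	}
	type scored struct {
		text  string
		score float64
	}
	head := make([]scored, topN)
	for i := 0; i < topN; i++ {
		s, err := r.Score(query, candidates[i])
		if err != nil {
			return candidates // fall back to the incoming order
		}
		head[i] = scored{candidates[i], s}
	}
	sort.SliceStable(head, func(a, b int) bool { return head[a].score > head[b].score })
	out := make([]string, 0, len(candidates))
	for _, h := range head {
		out = append(out, h.text)
	}
	return append(out, candidates[topN:]...)
}
```

Limiting the cross-encoder to the top-N keeps latency bounded, since each candidate costs one model call.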
4. Memory TTL and auto-pruning
Add a configurable TTL for memory entries so that old, rarely accessed records
are pruned automatically, keeping the vector store lean.
5. Namespace isolation
Ensure RAG queries are always scoped to the current session namespace
to prevent cross-session memory leakage.
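One way to make scoping hard to forget is to bind the namespace once and expose search only on the scoped handle. The sketch below assumes that design. `Store`, `ScopedStore`, `Doc`, and the substring match standing in for real retrieval are all hypothetical.

```go
package main

import (
	"errors"
	"strings"
)

// Doc is a hypothetical stored record tagged with its session namespace.
type Doc struct {
	Namespace string
	Text      string
}

// Store holds all records; it deliberately has no Search method of its own.
type Store struct {
	docs []Doc
}

// ScopedStore is the only way to read or write, so every query carries a namespace.
type ScopedStore struct {
	store     *Store
	namespace string
}

// Scope binds a namespace to the store and rejects an empty one.
func (s *Store) Scope(namespace string) (*ScopedStore, error) {
	if namespace == "" {
		return nil, errors.New("rag: empty namespace")
	}
	return &ScopedStore{store: s, namespace: namespace}, nil
}

// Add stores text under the bound namespace.
func (s *ScopedStore) Add(text string) {
	s.store.docs = append(s.store.docs, Doc{Namespace: s.namespace, Text: text})
}

// Search returns matching texts from the bound namespace only.
// The substring match is a stand-in for the real hybrid search.
func (s *ScopedStore) Search(term string) []string {
	var out []string
	for _, d := range s.store.docs {
		if d.Namespace == s.namespace && strings.Contains(d.Text, term) {
			out = append(out, d.Text)
		}
	}
	return out
}
```

With this shape, a query that omits the namespace does not compile, which is a stronger guarantee than remembering to add a filter at each call site.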
Acceptance criteria
- Semantic chunking in internal/rag/chunker.go
- RRF scoring in internal/rag/search.go, replacing the current union approach
- TTL on the vector_records table, pruning job runs on store open
- docs/ updated (RAG section in architecture.md or new docs/rag.md)