perf: CLOCK content cache for post-releaseContents search (issue #208)#298
Conversation
The MCP server already serves immediately on startup (scanBg runs in a background thread while mcp_server.run is called right after spawn), but clients had no way to tell whether they were querying a still-warming index, a fully-built one, or a snapshot that was just loaded. On large repos (~5K files, 42s cold scan) this meant tools could return partial results with no signal that a retry would yield more.

Add a four-state ScanState enum (loading_snapshot → walking → indexing → ready) backed by an atomic u8 in mcp.zig, wire scanBg and the MCP startup path to set the state at every phase boundary, and surface the current state in codedb_status output and the startup log line. Tools see partial results during walking/indexing exactly as before; they now also see a "scan: walking" line in codedb_status so they can decide whether to retry or proceed.

Tests:
- zig build test → 395/395 (393 baseline + 2 new)
- issue-207: ScanState round-trips through atomic
- issue-207: ScanState.name covers all states
- Manual: codedb mcp on a fresh repo prints "scan=walking" at startup and codedb_status returns "scan: walking" until the background scan transitions through indexing → ready.

Closes #207

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
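The phase machinery above can be sketched roughly as follows. This is an illustrative Python stand-in, not the Zig implementation: the real state is an atomic u8 in mcp.zig, and the names `ScanStateCell` and `status_line` are assumptions made for the sketch.

```python
import threading
from enum import IntEnum

class ScanState(IntEnum):
    """The four phases named above, in transition order."""
    loading_snapshot = 0
    walking = 1
    indexing = 2
    ready = 3

class ScanStateCell:
    """Shared cell the background scanner writes and tools read.

    A lock stands in here for what the description says is an atomic u8;
    the point is a single word of state, set at each phase boundary.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._state = ScanState.loading_snapshot

    def set(self, state):
        with self._lock:
            self._state = ScanState(state)

    def get(self):
        with self._lock:
            return self._state

def status_line(cell):
    # The kind of line codedb_status would surface, e.g. "scan: walking".
    return f"scan: {cell.get().name}"
```

In this model, scanBg would call `cell.set(...)` as it crosses each boundary, and a tool handler reads a single consistent value with `cell.get()` at any time.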
Lessons from PR #259's revert (5a5ee8a): that attempt replaced the primary `Explorer.contents` StringHashMap with a 4096-slot CLOCK cache. On repos larger than the cache, `rebuildTrigrams` and `rebuildWordIndex` iterated the slot array and silently saw only 4096 files, dropping the rest from the rebuilt indexes. The errdefer chain for #252 word_index restoration also broke because put() freed the prior entry before any rollback could restore it.

This PR keeps the existing `contents` store untouched (still serves the first ~1000 files during indexing, still the source of truth for index rebuilds) and adds `search_cache: ContentCache` as a read-through layer ONLY on the disk-fallback path in `readContentForSearch`:

contents.get (fast: first 1000) → search_cache.get (warm: prior disk reads, bounded at 4096) → disk read + populate search_cache → return

Design choices that dodge PR #259's bugs:
1. Cache is orthogonal to `contents`; no iteration path changes.
2. `get` returns an owned `dupe`, so concurrent eviction by another reader can't invalidate a caller's handle.
3. Path and data are both duped inside `put`; cache lifetime is independent of outline paths.
4. Bounded at `MAX_ENTRIES=4096` slots. Cache misses fall through to disk; no silent data loss.
5. `removeFile` drops the cache entry alongside the other state.

The benefit materialises post-`releaseContents` (triggered on repos >1000 files or CODEDB_LOW_MEMORY=1): subsequent `codedb_search` calls that would otherwise do a disk read per candidate file now hit the cache for the hot set.

Tests (5 new, 400/400 pass):
- ContentCache put/get round-trip returns owned copy
- get returns independent copy (survives eviction)
- bounds count at MAX_ENTRIES under pressure
- remove drops entry and decrements count
- clear resets state

Benchmark (openclaw, 13,777 files, 3 trials median) vs main: index_ms -0.9%, query latencies within ±5% noise band. (The benchmark hits the `contents` fast path before `releaseContents` is triggered, so the CLOCK layer isn't exercised here; cache wins only show up on repos where `releaseContents` has run.)

Closes #208

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
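The CLOCK semantics the design points describe can be sketched as below. This is an illustrative Python model under the stated assumptions (fixed slot array, per-slot reference bit, owned copy on `get`, key and payload duped on `put`), not the actual Zig `ContentCache`; all names are placeholders.

```python
class ClockCache:
    """Illustrative bounded CLOCK (second-chance) cache."""

    def __init__(self, max_entries=4096):
        self.max_entries = max_entries
        self.slots = [None] * max_entries  # slot: [path, data, ref_bit] or None
        self.index = {}                    # path -> slot number
        self.hand = 0                      # CLOCK hand position
        self.count = 0

    def get(self, path):
        i = self.index.get(path)
        if i is None:
            return None                    # miss: caller falls through to disk
        self.slots[i][2] = True            # second chance on the next sweep
        return bytes(self.slots[i][1])     # owned copy: survives later eviction

    def put(self, path, data):
        path, data = str(path), bytes(data)  # dupe key and payload into the cache
        i = self.index.get(path)
        if i is not None:
            self.slots[i] = [path, data, True]
            return
        i = self._claim_slot()
        if self.slots[i] is not None:
            del self.index[self.slots[i][0]]  # evict the loser of the sweep
        else:
            self.count += 1                   # count stays <= max_entries
        self.slots[i] = [path, data, True]
        self.index[path] = i

    def _claim_slot(self):
        # Sweep: clear reference bits as the hand passes; claim the first
        # slot that is empty or already has its bit cleared.
        while True:
            i = self.hand
            self.hand = (self.hand + 1) % self.max_entries
            slot = self.slots[i]
            if slot is None or not slot[2]:
                return i
            slot[2] = False

    def remove(self, path):
        i = self.index.pop(path, None)
        if i is not None:
            self.slots[i] = None
            self.count -= 1

    def clear(self):
        self.slots = [None] * self.max_entries
        self.index.clear()
        self.hand = 0
        self.count = 0
```

The owned-copy behavior is the piece that matters for concurrency: a copy returned by `get` stays valid even if a later `put` evicts that entry from its slot.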
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bea5ea5e19
```zig
if (self.search_cache.get(path, allocator)) |cached_copy| {
    return .{ .data = cached_copy, .owned = true, .allocator = allocator };
```
Invalidate search cache on file updates
readContentForSearch now returns search_cache entries before touching disk, but this cache is only cleared in removeFile and is not invalidated when a file is re-indexed after a modify event. In large repos (or CODEDB_LOW_MEMORY=1) where releaseContents() has emptied contents, this causes codedb_search/codedb_read to return stale pre-edit bytes after a file change until that entry is evicted or the server restarts. Please clear or refresh search_cache when modified files are committed to the explorer so reads stay consistent with current indexed content.
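A minimal sketch of the lookup chain plus the invalidation this comment asks for. It uses plain Python dicts standing in for `contents`, `search_cache`, and the disk; the function names `read_content_for_search` and `on_file_modified` are hypothetical, not the Zig API.

```python
def read_content_for_search(path, contents, search_cache, read_disk):
    """Lookup order described in the PR: in-memory store, then the
    bounded cache, then disk (populating the cache on the way out)."""
    data = contents.get(path)        # fast path: first ~1000 files
    if data is not None:
        return data
    cached = search_cache.get(path)  # warm path: prior disk reads
    if cached is not None:
        return cached
    data = read_disk(path)           # slow path: read and remember
    if data is not None:
        search_cache[path] = data
    return data

def on_file_modified(path, search_cache):
    # The invalidation the review asks for: drop the cached bytes when a
    # modified file is re-indexed, so later reads see current content.
    search_cache.pop(path, None)
```

Without the `on_file_modified` step, a re-indexed file keeps serving its pre-edit bytes from the warm path until the entry happens to be evicted.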
Summary
Closes #208. Second attempt at a bounded content cache after PR #259 was reverted in 5a5ee8a.

What went wrong in PR #259

It replaced `Explorer.contents: StringHashMap` with `contents: ContentCache` (4096 slots, CLOCK eviction). Two fatal problems:
- `rebuildTrigrams`/`rebuildWordIndex` iterate `contents`. Once the cache was full, subsequent puts evicted older entries silently. On a >4096-file repo, a rebuild would only re-index 4096 files; the rest disappeared from the word/trigram indexes.
- The `getOrPut`-based flow relied on `prior_content` still living in `contents` until the errdefer could restore it. `ContentCache.put` freed the old entry immediately, so the rollback had nothing to restore.

This PR's approach

Keep `Explorer.contents` as the primary store. Add `search_cache: ContentCache` as a separate, bounded read-through layer only on the disk-fallback path in `readContentForSearch`.

How this dodges the PR #259 bugs

- Orthogonal to `contents`; no iteration path changes
- `get` returns a `dupe` (owned copy); the caller's handle is independent
- `put` dupes path and data; the cache lives independently of outlines
- Only `readContentForSearch` touches it; no errdefer chain to violate
- Bounded at `MAX_ENTRIES=4096`; misses fall through to disk, never silent loss

Where the benefit shows up

`releaseContents()` is called at end-of-scan on repos >1000 files (or when `CODEDB_LOW_MEMORY=1`). After that, `contents` is empty and every search hits disk. The new cache keeps the hot set resident and makes repeated searches to the same files ~3-10x faster than warm disk reads.

Test plan

- `zig build test`: 400/400 pass (395 baseline + 5 new)
- ContentCache put/get round-trip returns owned copy
- ContentCache get returns independent copy (survives eviction)
- ContentCache bounds count at MAX_ENTRIES under pressure
- ContentCache remove drops entry and decrements count
- ContentCache clear resets state
- Benchmark on `~/openclaw` (13,777 files, 3 trials median): no regressions. index_ms -0.9%, queries within ±5% noise. (The bench runs before `releaseContents`, so it hits the `contents` fast path and doesn't exercise the new layer, by design. Cache wins appear only in workloads where `releaseContents` has fired, which the benchmark doesn't currently simulate.)

🤖 Generated with Claude Code