perf: CLOCK content cache for post-releaseContents search (issue #208)#298
Conversation
The MCP server already serves immediately on startup (scanBg runs in a background thread while mcp_server.run is called right after spawn), but clients had no way to tell whether they were querying a still-warming index, a fully-built one, or a snapshot that was just loaded. On large repos (~5K files, 42s cold scan) this meant tools could return partial results with no signal that a retry would yield more.

Add a four-state ScanState enum (loading_snapshot → walking → indexing → ready) backed by an atomic u8 in mcp.zig, wire scanBg and the MCP startup path to set the state at every phase boundary, and surface the current state in codedb_status output and the startup log line. Tools see partial results during walking/indexing exactly as before; they now also see a "scan: walking" line in codedb_status so they can decide whether to retry or proceed.

Tests:
- zig build test → 395/395 (393 baseline + 2 new)
- issue-207: ScanState round-trips through atomic
- issue-207: ScanState.name covers all states
- Manual: codedb mcp on a fresh repo prints "scan=walking" at startup and codedb_status returns "scan: walking" until the background scan transitions through indexing → ready.

Closes #207

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
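The phase machinery above can be sketched roughly as follows. This is an illustrative Python stand-in, not the Zig implementation: the real state is an atomic u8 in mcp.zig, and the names `ScanStateCell` and `status_line` are assumptions made for the sketch.

```python
import threading
from enum import IntEnum

class ScanState(IntEnum):
    """The four phases named above, in transition order."""
    loading_snapshot = 0
    walking = 1
    indexing = 2
    ready = 3

class ScanStateCell:
    """Shared cell the background scanner writes and tools read.

    A lock stands in here for what the description says is an atomic u8;
    the point is a single word of state, set at each phase boundary.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._state = ScanState.loading_snapshot

    def set(self, state):
        with self._lock:
            self._state = ScanState(state)

    def get(self):
        with self._lock:
            return self._state

def status_line(cell):
    # The kind of line codedb_status would surface, e.g. "scan: walking".
    return f"scan: {cell.get().name}"
```

In this model, scanBg would call `cell.set(...)` as it crosses each boundary, and a tool handler reads a single consistent value with `cell.get()` at any time.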
Lessons from PR #259's revert (5a5ee8a): that attempt replaced the primary `Explorer.contents` StringHashMap with a 4096-slot CLOCK cache. On repos larger than the cache, `rebuildTrigrams` and `rebuildWordIndex` iterated the slot array and silently saw only 4096 files, dropping the rest from the rebuilt indexes. The errdefer chain for #252 word_index restoration also broke because put() freed the prior entry before any rollback could restore it.

This PR keeps the existing `contents` store untouched (still serves the first ~1000 files during indexing, still the source of truth for index rebuilds) and adds `search_cache: ContentCache` as a read-through layer ONLY on the disk-fallback path in `readContentForSearch`:

contents.get (fast: first 1000) → search_cache.get (warm: prior disk reads, bounded at 4096) → disk read + populate search_cache → return

Design choices that dodge PR #259's bugs:
1. Cache is orthogonal to `contents`; no iteration path changes.
2. `get` returns an owned `dupe`, so concurrent eviction by another reader can't invalidate a caller's handle.
3. Path and data are both duped inside `put`; cache lifetime is independent of outline paths.
4. Bounded at `MAX_ENTRIES=4096` slots. Cache misses fall through to disk; no silent data loss.
5. `removeFile` drops the cache entry alongside the other state.

The benefit materialises post-`releaseContents` (triggered on repos >1000 files or CODEDB_LOW_MEMORY=1): subsequent `codedb_search` calls that would otherwise do a disk read per candidate file now hit the cache for the hot set.

Tests (5 new, 400/400 pass):
- ContentCache put/get round-trip returns owned copy
- get returns independent copy (survives eviction)
- bounds count at MAX_ENTRIES under pressure
- remove drops entry and decrements count
- clear resets state

Benchmark (openclaw, 13,777 files, 3 trials median) vs main: index_ms -0.9%, query latencies within ±5% noise band. (The benchmark hits the `contents` fast path before `releaseContents` is triggered, so the CLOCK layer isn't exercised here; cache wins only show up on repos where `releaseContents` has run.)

Closes #208

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
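The CLOCK semantics the design points describe can be sketched as below. This is an illustrative Python model under the stated assumptions (fixed slot array, per-slot reference bit, owned copy on `get`, key and payload duped on `put`), not the actual Zig `ContentCache`; all names are placeholders.

```python
class ClockCache:
    """Illustrative bounded CLOCK (second-chance) cache."""

    def __init__(self, max_entries=4096):
        self.max_entries = max_entries
        self.slots = [None] * max_entries  # slot: [path, data, ref_bit] or None
        self.index = {}                    # path -> slot number
        self.hand = 0                      # CLOCK hand position
        self.count = 0

    def get(self, path):
        i = self.index.get(path)
        if i is None:
            return None                    # miss: caller falls through to disk
        self.slots[i][2] = True            # second chance on the next sweep
        return bytes(self.slots[i][1])     # owned copy: survives later eviction

    def put(self, path, data):
        path, data = str(path), bytes(data)  # dupe key and payload into the cache
        i = self.index.get(path)
        if i is not None:
            self.slots[i] = [path, data, True]
            return
        i = self._claim_slot()
        if self.slots[i] is not None:
            del self.index[self.slots[i][0]]  # evict the loser of the sweep
        else:
            self.count += 1                   # count stays <= max_entries
        self.slots[i] = [path, data, True]
        self.index[path] = i

    def _claim_slot(self):
        # Sweep: clear reference bits as the hand passes; claim the first
        # slot that is empty or already has its bit cleared.
        while True:
            i = self.hand
            self.hand = (self.hand + 1) % self.max_entries
            slot = self.slots[i]
            if slot is None or not slot[2]:
                return i
            slot[2] = False

    def remove(self, path):
        i = self.index.pop(path, None)
        if i is not None:
            self.slots[i] = None
            self.count -= 1

    def clear(self):
        self.slots = [None] * self.max_entries
        self.index.clear()
        self.hand = 0
        self.count = 0
```

The owned-copy behavior is the piece that matters for concurrency: a copy returned by `get` stays valid even if a later `put` evicts that entry from its slot.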
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bea5ea5e19
```zig
if (self.search_cache.get(path, allocator)) |cached_copy| {
    return .{ .data = cached_copy, .owned = true, .allocator = allocator };
```
Invalidate search cache on file updates
readContentForSearch now returns search_cache entries before touching disk, but this cache is only cleared in removeFile and is not invalidated when a file is re-indexed after a modify event. In large repos (or CODEDB_LOW_MEMORY=1) where releaseContents() has emptied contents, this causes codedb_search/codedb_read to return stale pre-edit bytes after a file change until that entry is evicted or the server restarts. Please clear or refresh search_cache when modified files are committed to the explorer so reads stay consistent with current indexed content.
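A minimal sketch of the lookup chain plus the invalidation this comment asks for. It uses plain Python dicts standing in for `contents`, `search_cache`, and the disk; the function names `read_content_for_search` and `on_file_modified` are hypothetical, not the Zig API.

```python
def read_content_for_search(path, contents, search_cache, read_disk):
    """Lookup order described in the PR: in-memory store, then the
    bounded cache, then disk (populating the cache on the way out)."""
    data = contents.get(path)        # fast path: first ~1000 files
    if data is not None:
        return data
    cached = search_cache.get(path)  # warm path: prior disk reads
    if cached is not None:
        return cached
    data = read_disk(path)           # slow path: read and remember
    if data is not None:
        search_cache[path] = data
    return data

def on_file_modified(path, search_cache):
    # The invalidation the review asks for: drop the cached bytes when a
    # modified file is re-indexed, so later reads see current content.
    search_cache.pop(path, None)
```

Without the `on_file_modified` step, a re-indexed file keeps serving its pre-edit bytes from the warm path until the entry happens to be evicted.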
Summary
Closes #208. Second attempt at a bounded content cache after PR #259 was reverted in 5a5ee8a.

What went wrong in PR #259

It replaced `Explorer.contents: StringHashMap` with `contents: ContentCache` (4096 slots, CLOCK eviction). Two fatal problems:
- `rebuildTrigrams`/`rebuildWordIndex` iterate `contents`. Once the cache was full, subsequent puts evicted older entries silently. On a >4096-file repo, a rebuild would only re-index 4096 files; the rest disappeared from the word/trigram indexes.
- The `getOrPut`-based flow relied on `prior_content` still living in `contents` until the errdefer could restore it. `ContentCache.put` freed the old entry immediately, so the rollback had nothing to restore.

This PR's approach

Keep `Explorer.contents` as the primary store. Add `search_cache: ContentCache` as a separate, bounded read-through layer only on the disk-fallback path in `readContentForSearch`.

How this dodges the PR #259 bugs

- Orthogonal to `contents`; no iteration path changes
- `get` returns a `dupe` (owned copy); the caller's handle is independent
- `put` dupes path and data; the cache lives independently of outlines
- Only `readContentForSearch` touches it; no errdefer chain to violate
- Bounded at `MAX_ENTRIES=4096`; misses fall through to disk, never silent loss

Where the benefit shows up

`releaseContents()` is called at end-of-scan on repos >1000 files (or when `CODEDB_LOW_MEMORY=1`). After that, `contents` is empty and every search hits disk. The new cache keeps the hot set resident and makes repeated searches to the same files ~3-10x faster than warm disk reads.

Test plan

- `zig build test`: 400/400 pass (395 baseline + 5 new)
- ContentCache put/get round-trip returns owned copy
- ContentCache get returns independent copy (survives eviction)
- ContentCache bounds count at MAX_ENTRIES under pressure
- ContentCache remove drops entry and decrements count
- ContentCache clear resets state
- Benchmark on `~/openclaw` (13,777 files, 3 trials median): no regressions. index_ms -0.9%, queries within ±5% noise. (The bench runs before `releaseContents`, so it hits the `contents` fast path and doesn't exercise the new layer, by design. Cache wins appear only in workloads where `releaseContents` has fired, which the benchmark doesn't currently simulate.)

🤖 Generated with Claude Code