
perf: CLOCK content cache for post-releaseContents search (issue #208) #298

Open
justrach wants to merge 2 commits into main from fix/208-clock-content-cache

Conversation

@justrach
Owner

Summary

Closes #208. Second attempt at a bounded content cache after PR #259 was reverted in 5a5ee8a.

What went wrong in PR #259

It replaced Explorer.contents: StringHashMap with contents: ContentCache (4096 slots, CLOCK eviction). Two fatal problems:

  1. rebuildTrigrams / rebuildWordIndex iterate contents. Once the cache was full, subsequent puts evicted older entries silently. On a >4096-file repo, a rebuild would only re-index 4096 files; the rest disappeared from the word/trigram indexes.
  2. The errdefer restoration for #252 (explore: commitParsedFileOwnedOutline can leave word_index and trigram_index diverged on OOM) broke. The old getOrPut-based flow relied on prior_content still living in contents until the errdefer could restore it, but ContentCache.put freed the old entry immediately, so the rollback had nothing to restore.

This PR's approach

Keep Explorer.contents as the primary store. Add search_cache: ContentCache as a separate, bounded read-through layer only on the disk-fallback path in readContentForSearch:

contents.get (fast: first ~1000 files)
  → search_cache.get (warm: prior disk reads, bounded at 4096)
    → disk read → populate search_cache → return
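The cascade above can be sketched as follows. This is a hedged illustration of the shape of `readContentForSearch`, not the actual code: the `SearchContent` return type, `max_file_size`, and the exact `contents`/`search_cache` method signatures are assumptions.

```zig
const std = @import("std");

// Hypothetical return type: who owns the bytes, and which allocator frees them.
const SearchContent = struct {
    data: []const u8,
    owned: bool,
    allocator: std.mem.Allocator,
};

const max_file_size = 16 * 1024 * 1024; // assumed cap on a single file read

fn readContentForSearch(self: *Explorer, allocator: std.mem.Allocator, path: []const u8) !SearchContent {
    // 1. Primary store (populated for the first ~1000 files; source of truth).
    if (self.contents.get(path)) |data| {
        return .{ .data = data, .owned = false, .allocator = allocator };
    }
    // 2. Bounded CLOCK cache of prior disk reads; get returns an owned dupe.
    if (self.search_cache.get(path, allocator)) |cached_copy| {
        return .{ .data = cached_copy, .owned = true, .allocator = allocator };
    }
    // 3. Disk fallback; populate the cache on the way out.
    const data = try std.fs.cwd().readFileAlloc(allocator, path, max_file_size);
    self.search_cache.put(path, data) catch {}; // a cache miss/failure is never fatal
    return .{ .data = data, .owned = true, .allocator = allocator };
}
```

The key property is that steps 1 and 2 are pure reads of independent stores; only step 3 mutates the cache, which is why the errdefer chains elsewhere in explore.zig are untouched.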

How this dodges the PR #259 bugs

| Bug | Mitigation |
| --- | --- |
| Silent index drop on rebuild | Cache is orthogonal to `contents`; no iteration path changes |
| Dangling slice on concurrent eviction | `get` returns a dupe (owned copy); the caller's handle is independent |
| Path lifetime coupling | `put` dupes both path and data; the cache lives independently of outlines |
| errdefer rollback broken | Cache is only touched inside `readContentForSearch`; no errdefer chain to violate |
| Unbounded cap mismatch | Bounded at `MAX_ENTRIES=4096`; misses fall through to disk, never silent loss |
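For reference, CLOCK ("second chance") eviction over a fixed slot array looks roughly like this. The names here (`Slot`, `hand`, `referenced`) are illustrative and not taken from the PR's `ContentCache`:

```zig
const std = @import("std");

const MAX_ENTRIES = 4096;

const Slot = struct {
    path: ?[]u8 = null,      // duped key; null means the slot is free
    data: []u8 = undefined,  // duped value, valid only when path != null
    referenced: bool = false, // set on get; gives the entry a second chance
};

// Sweep the clock hand until an unreferenced slot is found, then reclaim it.
// Returns the index of the now-free slot.
fn evictOne(slots: []Slot, hand: *usize, allocator: std.mem.Allocator) usize {
    while (true) {
        const i = hand.*;
        hand.* = (hand.* + 1) % slots.len;
        const slot = &slots[i];
        if (slot.path == null) return i; // already free
        if (slot.referenced) {
            slot.referenced = false; // second chance: clear the bit, keep sweeping
            continue;
        }
        allocator.free(slot.path.?);
        allocator.free(slot.data);
        slot.path = null;
        return i;
    }
}
```

Because every slot's reference bit is cleared at most once per sweep, the loop terminates within two passes of the array; entries read since the last sweep survive one extra round.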

Where the benefit shows up

releaseContents() is called at end-of-scan on repos >1000 files (or when CODEDB_LOW_MEMORY=1). After that, contents is empty and every search hits disk. The new cache keeps the hot set resident and makes repeated searches to the same files ~3-10x faster than warm disk reads.

Test plan

  • zig build test → 400/400 pass (395 baseline + 5 new)
    • ContentCache put/get round-trip returns owned copy
    • ContentCache get returns independent copy (survives eviction)
    • ContentCache bounds count at MAX_ENTRIES under pressure
    • ContentCache remove drops entry and decrements count
    • ContentCache clear resets state
  • Benchmark on ~/openclaw (13,777 files, 3 trials median) — no regressions: index_ms -0.9%, queries within ±5% noise. (The bench runs before releaseContents, so it hits the contents fast path and doesn't exercise the new layer — by design. Cache wins appear only in workloads where releaseContents has fired, which the benchmark doesn't currently simulate.)
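The ownership test in the list above can be sketched like this — a hedged illustration assuming a `ContentCache` API of `init`/`deinit`/`put`/`get`/`remove` as described in this PR, not the literal test from the suite:

```zig
const std = @import("std");

test "ContentCache get returns independent copy (survives eviction)" {
    const allocator = std.testing.allocator;
    var cache = ContentCache.init(allocator);
    defer cache.deinit();

    try cache.put("a.zig", "hello");

    // get must hand back an owned dupe, not a slice into the slot.
    const copy = cache.get("a.zig", allocator).?;
    defer allocator.free(copy);

    // Removing (or evicting) the entry must not invalidate the caller's copy.
    cache.remove("a.zig");
    try std.testing.expectEqualStrings("hello", copy);
}
```

This is exactly the property that closes the dangling-slice bug from PR #259: eviction by another reader can free the slot, but never a handle already returned.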

🤖 Generated with Claude Code

justrach and others added 2 commits April 19, 2026 02:43
The MCP server already serves immediately on startup — scanBg runs in a
background thread while mcp_server.run is called right after spawn — but
clients had no way to tell whether they were querying a still-warming
index, a fully-built one, or a snapshot that was just loaded. On large
repos (~5K files, 42s cold scan) this meant tools could return partial
results with no signal that a retry would yield more.

Add a four-state ScanState enum (loading_snapshot → walking → indexing
→ ready) backed by an atomic u8 in mcp.zig, wire scanBg + the MCP
startup path to set the state at every phase boundary, surface the
current state in codedb_status output and the startup log line.
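The enum-plus-atomic arrangement the commit describes could look like this sketch; the state names match the commit message, but the function names and memory orderings are assumptions, not the actual mcp.zig code:

```zig
const std = @import("std");

// Four-phase scan lifecycle, stored as a u8 so it fits in one atomic.
const ScanState = enum(u8) {
    loading_snapshot,
    walking,
    indexing,
    ready,

    fn name(self: ScanState) []const u8 {
        return @tagName(self);
    }
};

var scan_state = std.atomic.Value(u8).init(@intFromEnum(ScanState.loading_snapshot));

// Called by scanBg at each phase boundary.
fn setScanState(s: ScanState) void {
    scan_state.store(@intFromEnum(s), .release);
}

// Called by codedb_status and the startup log line.
fn getScanState() ScanState {
    return @enumFromInt(scan_state.load(.acquire));
}
```

release/acquire ordering ensures a reader that observes `ready` also observes every index write the scan thread made before publishing it.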

Tools see partial results during walking/indexing exactly as before;
they now also see a "scan: walking" line in codedb_status so they can
decide whether to retry or proceed.

Tests:
- zig build test → 395/395 (393 baseline + 2 new)
  - issue-207: ScanState round-trips through atomic
  - issue-207: ScanState.name covers all states
- Manual: codedb mcp on a fresh repo prints "scan=walking" at startup
  and codedb_status returns "scan: walking" until the background
  scan transitions through indexing → ready.

Closes #207

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lessons from PR #259's revert (5a5ee8a): that attempt replaced the
primary `Explorer.contents` StringHashMap with a 4096-slot CLOCK cache.
On repos larger than the cache, `rebuildTrigrams` and `rebuildWordIndex`
iterated the slot array and silently saw only 4096 files, dropping the
rest from the rebuilt indexes. The errdefer chain for #252 word_index
restoration also broke because put() freed the prior entry before any
rollback could restore it.

This PR keeps the existing `contents` store untouched (still serves the
first ~1000 files during indexing, still the source of truth for index
rebuilds) and adds `search_cache: ContentCache` as a read-through layer
ONLY on the disk-fallback path in `readContentForSearch`:

  contents.get (fast: first 1000)
    → search_cache.get (warm: prior disk reads, bounded at 4096)
      → disk read + populate search_cache
        → return

Design choices that dodge PR #259's bugs:
 1. Cache is orthogonal to `contents`; no iteration path changes.
 2. `get` returns an owned `dupe`, so concurrent eviction by another
    reader can't invalidate a caller's handle.
 3. Path and data both duped inside `put`; cache lifetime is independent
    of outline paths.
 4. Bounded at `MAX_ENTRIES=4096` slots. Missing callers fall through
    to disk — no silent data loss.
 5. `removeFile` drops the cache entry alongside the other state.

Benefit materialises post-`releaseContents` (triggered on repos >1000
files or CODEDB_LOW_MEMORY=1): subsequent `codedb_search` calls that
would otherwise do a disk read per candidate file now hit the cache
for the hot set.

Tests (5 new, 400/400 pass):
- ContentCache put/get round-trip returns owned copy
- get returns independent copy (survives eviction)
- bounds count at MAX_ENTRIES under pressure
- remove drops entry and decrements count
- clear resets state

Benchmark (openclaw, 13,777 files, 3 trials median) vs main:
  index_ms: -0.9%, query latencies within ±5% noise band.
  (The benchmark hits the `contents` fast path before `releaseContents`
  is triggered, so the CLOCK layer isn't exercised here — cache wins
  only show up on repos where `releaseContents` has run.)

Closes #208

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bea5ea5e19


Comment thread: src/explore.zig, lines +1250 to +1251

```zig
if (self.search_cache.get(path, allocator)) |cached_copy| {
    return .{ .data = cached_copy, .owned = true, .allocator = allocator };
```


P1: Invalidate search cache on file updates

readContentForSearch now returns search_cache entries before touching disk, but this cache is only cleared in removeFile and is not invalidated when a file is re-indexed after a modify event. In large repos (or CODEDB_LOW_MEMORY=1) where releaseContents() has emptied contents, this causes codedb_search/codedb_read to return stale pre-edit bytes after a file change until that entry is evicted or the server restarts. Please clear or refresh search_cache when modified files are committed to the explorer so reads stay consistent with current indexed content.
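One possible shape of the fix the reviewer asks for, sketched under assumptions: the function name `commitModifiedFile` and the surrounding flow are hypothetical, and only the final `remove` call is the point being made.

```zig
// Hypothetical commit path for a file that changed on disk.
fn commitModifiedFile(self: *Explorer, path: []const u8) !void {
    // ... existing re-parse / re-index work for the modified file ...

    // Keep the read-through layer consistent with the freshly indexed
    // bytes: without this, a stale cached copy would be served after
    // releaseContents() until the entry happens to be evicted.
    self.search_cache.remove(path);
}
```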

