Problem
Files larger than 64KB skip trigram indexing (watcher.zig:446). They land in Explorer.skip_trigram_files and are only reachable via Tier 3 of searchContent, which runs AFTER Tier 1 (trigram candidates) and only if Tier 1 has not already filled the max_results quota.
When a common identifier is mentioned widely in small files and has its canonical definition site in a large source file, Tier 1 saturates the quota with incidental small-file hits and Tier 3 never runs. The canonical file is completely invisible in search results.
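The starvation can be reproduced in a toy model. The sketch below is not the codedb implementation: it is a simplified Python stand-in that assumes a hypothetical two-tier search_content function, plain substring matching in place of trigram lookup, and an early return once the quota is full.

```python
def search_content(query, small_files, large_files, max_results):
    """Toy two-tier search: returns a list of (path, line_number) hits."""
    results = []

    # Tier 1 stand-in: scan the trigram-indexed (small) files first.
    for path, lines in small_files.items():
        for lineno, text in enumerate(lines, start=1):
            if query in text:
                results.append((path, lineno))
                if len(results) >= max_results:
                    # Quota filled: return before the large-file tier runs.
                    return results

    # Tier 3 stand-in: files that skipped trigram indexing.
    for path, lines in large_files.items():
        for lineno, text in enumerate(lines, start=1):
            if query in text:
                results.append((path, lineno))
                if len(results) >= max_results:
                    return results
    return results


small = {
    "src/bench.zig": ['const Explorer = @import("explore.zig").Explorer;'],
    "src/benchmark.zig": ['const Explorer = @import("explore.zig").Explorer;'],
}
large = {"src/explore.zig": ["pub const Explorer = struct {"]}

hits = search_content("Explorer", small, large, max_results=2)
print([path for path, _ in hits])  # ['src/bench.zig', 'src/benchmark.zig']
```

With max_results=2 the definition in the large file never appears. Raising the quota to 3 lets the second tier run, which mirrors why the problem only shows up for identifiers common enough to fill the quota from small files.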
Real-world repro on this repo
$ codedb search Explorer --max-results 27
✓ 27 results for "Explorer"
src/adversarial_tests.zig:4 const Explorer = @import("explore.zig").Explorer;
src/bench.zig:4 const Explorer = @import("explore.zig").Explorer;
src/benchmark.zig:17 const Explorer = @import("explore.zig").Explorer;
... (24 more, none from src/explore.zig)
pub const Explorer = struct lives at src/explore.zig:495. The word index has 85 hits in that file. Tier 4 (word-index scan) would surface it — but only runs if Tier 1 didn't fill the quota, which it always does for common terms.
Affects every file >64KB:
src/explore.zig (233KB)
src/mcp.zig (182KB)
src/tests.zig (457KB)
src/index.zig (124KB)
For agents looking at the codedb codebase via codedb itself, the canonical definition is always missing.
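To check which files in another checkout fall on the wrong side of the limit, a short script is enough. This is a sketch under assumptions: it takes "64KB" to mean 64 * 1024 bytes and a strict greater-than comparison, neither of which is confirmed here against watcher.zig, and the files_over_limit helper is hypothetical.

```python
import os

TRIGRAM_SIZE_LIMIT = 64 * 1024  # assumption: "64KB" means 65536 bytes


def files_over_limit(root, limit=TRIGRAM_SIZE_LIMIT):
    """Return (relative_path, size_in_bytes) for files larger than limit, largest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            size = os.path.getsize(full)
            if size > limit:
                found.append((os.path.relpath(full, root), size))
    return sorted(found, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Prints nothing if there is no src directory under the current one.
    for path, size in files_over_limit("src"):
        print(f"{path} ({size // 1024}KB)")
```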
Failing Test
Committed on issue-447-failing-test (da5e646).
Expected
searchContent should give files in skip_trigram_files with high per-file word-hit counts a chance to be in the result set even when Tier 1 has many small-file candidates. Sketch of fixes (any one would close):
1. [...] skip_trigram_files into Tier 1's candidate pool, sorted by per-file word-hit count desc alongside trigram candidates. The existing #427 sort ("explore: searchContent Tier 1 sort buries the definition-dense file behind unrelated small files") already prioritizes definition-dense files; this just lets large files participate.
2. Reserve a slot quota in Tier 1 for skip_trigram_files (e.g. top 1-2 hits from skip_trigram_files always make it through, then Tier 1 fills the rest).
3. Run Tier 4 (word-index) BEFORE Tier 3, and unconditionally check it before the small-file fill. The word index already has the canonical file's hit count.
(1) is structurally cleanest: the trigram-vs-word distinction shouldn't leak into search ranking.
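The option of ranking skip_trigram_files entries in the same candidate pool as the trigram candidates, sorted by per-file word-hit count, can be illustrated with a small model. This is a sketch only, written in Python rather than Zig: fill_results and its parameters are hypothetical, it ranks whole files rather than individual matching lines, and the hit counts other than the 85 quoted for src/explore.zig are made up for illustration.

```python
def fill_results(trigram_candidates, skip_trigram_files, word_hits, max_results):
    """Rank small and large files in one pool by per-file word-hit count (desc)."""
    pool = list(trigram_candidates)
    # Large files join the pool only if the word index records a hit in them.
    pool += [p for p in skip_trigram_files if word_hits.get(p, 0) > 0]
    # Highest hit count first; the path breaks ties so the order is deterministic.
    pool.sort(key=lambda p: (-word_hits.get(p, 0), p))
    return pool[:max_results]


word_hits = {
    "src/explore.zig": 85,   # figure quoted in this issue
    "src/bench.zig": 1,      # hypothetical count
    "src/benchmark.zig": 1,  # hypothetical count
}
top = fill_results(
    ["src/bench.zig", "src/benchmark.zig"],
    ["src/explore.zig", "src/mcp.zig"],
    word_hits,
    max_results=2,
)
print(top)  # ['src/explore.zig', 'src/bench.zig']
```

Because the quota is applied after the combined sort, the definition-dense large file takes the first slot even though small-file candidates alone would have filled max_results.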