explore: searchContent invisibility for canonical definition sites in files >64KB #447

@justrach

Problem

Files larger than 64KB skip trigram indexing (watcher.zig:446). They land in Explorer.skip_trigram_files and are only reachable via Tier 3 of searchContent — which runs AFTER Tier 1 (trigram candidates) fills the max_results quota.

For a common identifier that appears in many small files and whose canonical definition lives in a large source file, Tier 1 saturates the quota with incidental small-file hits and Tier 3 never runs. The canonical file never appears in the search results.
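The starvation described above can be reproduced with a toy model. This is only a sketch: `Hit` and `tieredSearch` are hypothetical names, not codedb's API, and the two loops stand in for Tier 1 and the later skip-trigram tier.

```zig
const std = @import("std");

/// Hypothetical stand-in for a search hit; not codedb's real result type.
const Hit = struct { path: []const u8 };

/// Toy model of the tier ordering: Tier 1 candidates are consumed first,
/// and skip-trigram files are only visited if quota remains.
fn tieredSearch(
    tier1: []const Hit,
    skip_trigram: []const Hit,
    out: []Hit, // out.len plays the role of max_results
) usize {
    var n: usize = 0;
    for (tier1) |h| {
        if (n == out.len) return n; // quota full: later tiers never run
        out[n] = h;
        n += 1;
    }
    for (skip_trigram) |h| {
        if (n == out.len) return n;
        out[n] = h;
        n += 1;
    }
    return n;
}

test "tier 1 saturation hides the large file" {
    const small = [_]Hit{
        .{ .path = "small_0.zig" },
        .{ .path = "small_1.zig" },
        .{ .path = "small_2.zig" },
    };
    const large = [_]Hit{.{ .path = "canonical.zig" }};

    var out: [2]Hit = undefined;
    const n = tieredSearch(&small, &large, &out);
    try std.testing.expectEqual(@as(usize, 2), n);
    for (out[0..n]) |h| {
        try std.testing.expect(!std.mem.eql(u8, h.path, "canonical.zig"));
    }
}
```

With three small-file hits and a quota of two, the second loop is never reached, which mirrors the behaviour reported for `codedb search Explorer`.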

Real-world repro on this repo

$ codedb search Explorer --max-results 27
✓ 27 results for "Explorer"
  src/adversarial_tests.zig:4   const Explorer = @import("explore.zig").Explorer;
  src/bench.zig:4                const Explorer = @import("explore.zig").Explorer;
  src/benchmark.zig:17           const Explorer = @import("explore.zig").Explorer;
  ... (24 more, none from src/explore.zig)

The definition, pub const Explorer = struct, lives at src/explore.zig:495, and the word index has 85 hits in that file. Tier 4 (word-index scan) would surface it, but it only runs if Tier 1 didn't fill the quota, which it always does for common terms.

Affects every file >64KB:

  • src/explore.zig (233KB)
  • src/mcp.zig (182KB)
  • src/tests.zig (457KB)
  • src/index.zig (124KB)

For agents looking at the codedb codebase via codedb itself, the canonical definition is always missing.

Failing Test

test "issue-447: searchContent surfaces large (>64KB) skip-trigram files for common identifiers" {
    var arena = std.heap.ArenaAllocator.init(testing.allocator);
    defer arena.deinit();
    var explorer = Explorer.init(arena.allocator());

    var i: usize = 0;
    while (i < 12) : (i += 1) {
        var path_buf: [32]u8 = undefined;
        const path = try std.fmt.bufPrint(&path_buf, "small_{d}.zig", .{i});
        try explorer.indexFile(path, "fn s() void { _ = widgetX; }\n");
    }

    const canonical_content =
        "fn canonical() void {\n" ++
        ("    _ = widgetX;\n" ** 5) ++ // five hits in the canonical file
        "}\n";
    try explorer.indexFileSkipTrigram("canonical.zig", canonical_content);

    const results = try explorer.searchContent("widgetX", testing.allocator, 5);
    var found_canonical = false;
    for (results) |r| {
        if (std.mem.eql(u8, r.path, "canonical.zig")) found_canonical = true;
    }
    try testing.expect(found_canonical);
}

Committed on issue-447-failing-test (da5e646).

Expected

searchContent should give files in skip_trigram_files with high per-file word-hit counts a chance to appear in the result set even when Tier 1 has many small-file candidates. Possible fixes (any one would close this issue):

  1. Merge skip_trigram_files into Tier 1's candidate pool, sorted by per-file word-hit count desc alongside trigram candidates. The existing #427 sort (explore: searchContent Tier 1 sort buries the definition-dense file behind unrelated small files) already prioritizes definition-dense files; this just lets large files participate.
  2. Reserve a slot quota in Tier 1 for skip_trigram_files (e.g. top 1-2 hits from skip_trigram_files always make it through, then Tier 1 fills the rest).
  3. Run Tier 4 (word-index) BEFORE Tier 3, and unconditionally check it before the small-file fill. The word index already has the canonical file's hit count.
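Option 2 in the list above could look roughly like the following. It is a minimal sketch under stated assumptions: `fillWithReserve` and its parameters are hypothetical, paths stand in for full results, and the skip-trigram list is assumed to arrive already sorted by word-hit count, descending.

```zig
const std = @import("std");

/// Writes up to `reserved` skip-trigram paths into `out` first, then tops up
/// from Tier 1 until `out` is full. Returns the number of entries written.
fn fillWithReserve(
    tier1: []const []const u8,
    skip_trigram: []const []const u8, // assumed pre-sorted by hit count, descending
    reserved: usize,
    out: [][]const u8, // out.len plays the role of max_results
) usize {
    var n: usize = 0;
    const take = @min(reserved, @min(skip_trigram.len, out.len));
    for (skip_trigram[0..take]) |p| {
        out[n] = p;
        n += 1;
    }
    for (tier1) |p| {
        if (n == out.len) break;
        out[n] = p;
        n += 1;
    }
    return n;
}

test "one reserved slot keeps the large file visible" {
    const tier1 = [_][]const u8{ "small_0.zig", "small_1.zig", "small_2.zig" };
    const skip = [_][]const u8{"canonical.zig"};
    var out: [2][]const u8 = undefined;
    const n = fillWithReserve(&tier1, &skip, 1, &out);
    try std.testing.expectEqual(@as(usize, 2), n);
    try std.testing.expectEqualStrings("canonical.zig", out[0]);
}
```

The trade-off is that a reserved slot is spent even when the large file is a weak match, so the reserve would probably need a minimum hit-count threshold in practice.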

(1) is structurally cleanest — the trigram-vs-word distinction shouldn't leak into search ranking.
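A minimal sketch of option 1, assuming each candidate already carries its per-file word-hit count. `Candidate`, `moreHits`, and `mergedTop` are illustrative names, not the existing #427 sort or any codedb type.

```zig
const std = @import("std");

/// Hypothetical candidate record; the field names are illustrative only.
const Candidate = struct {
    path: []const u8,
    word_hits: u32,
};

/// Orders candidates by word-hit count, highest first.
fn moreHits(_: void, a: Candidate, b: Candidate) bool {
    return a.word_hits > b.word_hits;
}

/// Copies both pools into `pool`, sorts by hit count descending, and
/// returns the top `max_results` entries as a slice of `pool`.
fn mergedTop(
    trigram: []const Candidate,
    skip_trigram: []const Candidate,
    pool: []Candidate, // scratch space, must hold both inputs
    max_results: usize,
) []Candidate {
    const total = trigram.len + skip_trigram.len;
    std.debug.assert(pool.len >= total);
    @memcpy(pool[0..trigram.len], trigram);
    @memcpy(pool[trigram.len..total], skip_trigram);
    std.mem.sort(Candidate, pool[0..total], {}, moreHits);
    return pool[0..@min(max_results, total)];
}

test "large file participates in ranking" {
    const trigram = [_]Candidate{
        .{ .path = "small_0.zig", .word_hits = 1 },
        .{ .path = "small_1.zig", .word_hits = 1 },
        .{ .path = "small_2.zig", .word_hits = 1 },
    };
    const skip = [_]Candidate{.{ .path = "canonical.zig", .word_hits = 5 }};
    var pool: [4]Candidate = undefined;
    const top = mergedTop(&trigram, &skip, &pool, 2);
    try std.testing.expectEqual(@as(usize, 2), top.len);
    try std.testing.expectEqualStrings("canonical.zig", top[0].path);
}
```

Because both kinds of file go through the same comparator, the ranking code no longer needs to know whether a file was trigram-indexed.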

Labels: bug (Something isn't working), priority:p2 (Medium priority)