explore: searchContent invisibility for canonical definition sites in files >64KB

## Problem

Files larger than 64KB skip trigram indexing (`watcher.zig:446`). They land in `Explorer.skip_trigram_files` and are only reachable via **Tier 3** of `searchContent` — which runs AFTER Tier 1 (trigram candidates) fills the `max_results` quota.

For a common identifier that is mentioned widely in small files and has its canonical definition site in a large source file, Tier 1 saturates the quota with incidental small-file hits and Tier 3 never runs. The canonical file never appears in the search results.
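The starvation can be reproduced with a toy model of the tiered fill. This is only a sketch: the `Tiers` struct and `tieredSearch` function are hypothetical stand-ins rather than codedb's real types, and the model assumes each tier simply appends paths until the quota is reached.

```zig
const std = @import("std");

// Hypothetical toy model, not codedb's real types: a candidate is just a path.
const Tiers = struct {
    tier1_trigram: []const []const u8, // small files found via the trigram index
    tier3_skip_trigram: []const []const u8, // files >64KB, consulted only later
};

/// Fills `out` tier by tier and returns the number of results written.
/// A later tier contributes only while quota (out.len) remains.
fn tieredSearch(tiers: Tiers, out: [][]const u8) usize {
    var n: usize = 0;
    for (tiers.tier1_trigram) |path| {
        if (n == out.len) return n; // quota full: Tier 3 is never reached
        out[n] = path;
        n += 1;
    }
    for (tiers.tier3_skip_trigram) |path| {
        if (n == out.len) return n;
        out[n] = path;
        n += 1;
    }
    return n;
}

test "tier 1 saturation hides the large file" {
    const small = [_][]const u8{ "a.zig", "b.zig", "c.zig" };
    const large = [_][]const u8{"explore.zig"};
    var out: [2][]const u8 = undefined;
    const n = tieredSearch(.{ .tier1_trigram = &small, .tier3_skip_trigram = &large }, &out);
    for (out[0..n]) |p| try std.testing.expect(!std.mem.eql(u8, p, "explore.zig"));
}
```

In this model the large file becomes reachable only when Tier 1 yields fewer candidates than the quota, which matches the behavior described above.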

## Real-world repro on this repo

```
$ codedb search Explorer --max-results 27
✓ 27 results for "Explorer"
  src/adversarial_tests.zig:4   const Explorer = @import("explore.zig").Explorer;
  src/bench.zig:4                const Explorer = @import("explore.zig").Explorer;
  src/benchmark.zig:17           const Explorer = @import("explore.zig").Explorer;
  ... (24 more, none from src/explore.zig)
```

`pub const Explorer = struct` lives at `src/explore.zig:495`. The word index has **85 hits** in that file. Tier 4 (word-index scan) would surface it — but only runs if Tier 1 didn't fill the quota, which it always does for common terms.

Affects every file >64KB:
- `src/explore.zig` (233KB)
- `src/mcp.zig` (182KB)
- `src/tests.zig` (457KB)
- `src/index.zig` (124KB)

For agents exploring the codedb codebase through codedb itself, the canonical definition of any common identifier defined in these files is missing from the results.

## Failing Test

```zig
test "issue-447: searchContent surfaces large (>64KB) skip-trigram files for common identifiers" {
    var arena = std.heap.ArenaAllocator.init(testing.allocator);
    defer arena.deinit();
    var explorer = Explorer.init(arena.allocator());

    var i: usize = 0;
    while (i < 12) : (i += 1) {
        var path_buf: [32]u8 = undefined;
        const path = try std.fmt.bufPrint(&path_buf, "small_{d}.zig", .{i});
        try explorer.indexFile(path, "fn s() void { _ = widgetX; }\n");
    }

    // Five references, so the canonical file has the highest per-file hit count.
    const canonical_content =
        "fn canonical() void {\n" ++
        ("    _ = widgetX;\n" ** 5) ++
        "}\n";
    try explorer.indexFileSkipTrigram("canonical.zig", canonical_content);

    const results = try explorer.searchContent("widgetX", testing.allocator, 5);
    var found_canonical = false;
    for (results) |r| {
        if (std.mem.eql(u8, r.path, "canonical.zig")) found_canonical = true;
    }
    try testing.expect(found_canonical);
}
```

Committed on `issue-447-failing-test` (da5e646).

## Expected

`searchContent` should give files in `skip_trigram_files` with high per-file word-hit counts a chance to appear in the result set even when Tier 1 has many small-file candidates. Sketch of fixes (any one of them would close this issue):

1. Merge `skip_trigram_files` into Tier 1's candidate pool, sorted by per-file word-hit count desc alongside trigram candidates. The existing #427 sort already prioritizes definition-dense files; this just lets large files participate.
2. Reserve a slot quota in Tier 1 for `skip_trigram_files` (e.g. the top 1-2 hits from `skip_trigram_files` always make it through, then Tier 1 fills the rest).
3. Run Tier 4 (word-index) BEFORE Tier 3, and unconditionally check it before the small-file fill. The word index already has the canonical file's hit count.

(1) is structurally cleanest — the trigram-vs-word distinction shouldn't leak into search ranking.
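A minimal sketch of option (1), assuming a per-file hit count is already available for every candidate. The `Candidate` struct, its fields, and `rankMerged` are hypothetical names for illustration and do not reflect codedb's actual data structures. It uses `std.mem.sort`, so it assumes Zig 0.11 or later.

```zig
const std = @import("std");

// Hypothetical candidate record; field names are illustrative only.
const Candidate = struct {
    path: []const u8,
    word_hits: u32, // per-file hit count from the word index
    skip_trigram: bool, // true for files >64KB; deliberately unused in ranking
};

// Comparator: a candidate with more hits sorts earlier (descending order).
fn moreHits(_: void, a: Candidate, b: Candidate) bool {
    return a.word_hits > b.word_hits;
}

/// Sorts the merged pool in place by hit count and returns the top
/// `max_results` entries as a slice into `pool`.
fn rankMerged(pool: []Candidate, max_results: usize) []Candidate {
    std.mem.sort(Candidate, pool, {}, moreHits);
    return pool[0..@min(pool.len, max_results)];
}

test "large file with the most hits ranks first" {
    var pool = [_]Candidate{
        .{ .path = "small_0.zig", .word_hits = 1, .skip_trigram = false },
        .{ .path = "small_1.zig", .word_hits = 1, .skip_trigram = false },
        .{ .path = "canonical.zig", .word_hits = 5, .skip_trigram = true },
    };
    const top = rankMerged(&pool, 2);
    try std.testing.expectEqualStrings("canonical.zig", top[0].path);
}
```

Because the comparator never reads `skip_trigram`, large and small files compete on the same footing, which is the point of option (1).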
