Problem
Tier 0 direct word-index lookup has the code-first/doc-second diversity logic from #430, but it only runs when:
word_hits.len > 0 and word_hits.len <= max_results * 2
For popular identifiers, word_hits.len exceeds that gate, so Tier 0 is skipped entirely. Search falls through to Tier 1, where candidate ordering is driven by per-file hit count before rerank has a chance to apply the doc-language penalty. Markdown files with many mentions can fill max_results before the canonical source file is scanned.
This is distinct from #447: the source file can be small and fully trigram-indexed; it is still starved because the Tier 0 gate bypasses the code/doc ordering.
Repro
test "popular markdown should not disable Tier 0 code-first behavior" {
    var arena = std.heap.ArenaAllocator.init(testing.allocator);
    defer arena.deinit();
    var explorer = Explorer.init(arena.allocator());

    // Each markdown file mentions fooBar on five lines.
    const md_block =
        "fooBar mentioned here.\n" ++
        "fooBar mentioned here.\n" ++
        "fooBar mentioned here.\n" ++
        "fooBar mentioned here.\n" ++
        "fooBar mentioned here.\n";

    // Index ten such files so doc hits far outnumber code hits.
    var i: usize = 0;
    while (i < 10) : (i += 1) {
        var path_buf: [64]u8 = undefined;
        const path = try std.fmt.bufPrint(&path_buf, "docs/notes_{d}.md", .{i});
        try explorer.indexFile(path, md_block);
    }

    // The canonical source file: one definition plus three call sites.
    try explorer.indexFile("src/foo.zig",
        "pub fn fooBar() void {}\n" ++
        "pub fn caller1() void { fooBar(); }\n" ++
        "pub fn caller2() void { fooBar(); }\n" ++
        "pub fn caller3() void { fooBar(); }\n");

    const results = try explorer.searchContent("fooBar", testing.allocator, 10);

    // The source file must appear somewhere in the top 10 results.
    var found_source = false;
    for (results) |r| {
        if (std.mem.eql(u8, r.path, "src/foo.zig")) found_source = true;
    }
    try testing.expect(found_source);
}
Current observed results for max_results=10 are all markdown:
docs/notes_0.md:1
docs/notes_1.md:1
docs/notes_2.md:1
docs/notes_3.md:1
docs/notes_4.md:1
docs/notes_5.md:1
docs/notes_6.md:1
docs/notes_7.md:1
docs/notes_8.md:1
docs/notes_9.md:1
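The arithmetic behind this outcome can be checked in isolation. The sketch below restates the gate from the Problem section as a standalone function; `tier0Runs` is a hypothetical name, and the hit counts assume the word index records one hit per line that mentions the identifier, which this issue does not confirm.

```zig
const std = @import("std");

/// Hypothetical restatement of the Tier 0 gate quoted in the Problem section.
fn tier0Runs(word_hits_len: usize, max_results: usize) bool {
    return word_hits_len > 0 and word_hits_len <= max_results * 2;
}

test "repro corpus exceeds the Tier 0 gate" {
    // Assumption: one word-index hit per line that mentions fooBar.
    const md_hits: usize = 10 * 5; // 10 markdown files, 5 lines each
    const src_hits: usize = 4; // 4 lines in src/foo.zig

    // 54 hits against a gate of 10 * 2 = 20, so Tier 0 is skipped.
    try std.testing.expect(!tier0Runs(md_hits + src_hits, 10));
    // The source file's hits alone would have passed the gate.
    try std.testing.expect(tier0Runs(src_hits, 10));
}
```

Under that assumption, the markdown files alone push the posting list past the gate, whatever the size of the source file.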
Expected
Popular identifiers should still get the Tier 0 code/doc diversity behavior. Possible shape: avoid gating on total posting-list length, or run a bounded pass that collects per-file code hits first before allowing docs to fill the remaining quota.
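One way to read the second option is a two-pass selection over the posting list, independent of its total length. The following is a minimal sketch, not the existing Tier 0 code: `Hit`, `isDoc`, `seen`, and `codeFirst` are hypothetical names, docs are assumed to be identified by the `.md` extension alone, and results are capped at one hit per file to keep the example short.

```zig
const std = @import("std");

/// Hypothetical hit record; the real word-index entry type is not shown in this issue.
const Hit = struct {
    path: []const u8,
    line: u32,
};

/// Assumption: a doc file is anything ending in ".md".
fn isDoc(path: []const u8) bool {
    return std.mem.endsWith(u8, path, ".md");
}

/// Returns true if `path` already appears among the selected hits.
fn seen(selected: []const Hit, path: []const u8) bool {
    for (selected) |h| {
        if (std.mem.eql(u8, h.path, path)) return true;
    }
    return false;
}

/// Fills `out` with at most one hit per file: code files first, then docs.
/// Returns the number of slots written.
fn codeFirst(hits: []const Hit, out: []Hit) usize {
    var n: usize = 0;
    // Pass 0 takes code hits; pass 1 lets docs fill whatever quota is left.
    const passes = [_]bool{ false, true };
    for (passes) |want_doc| {
        for (hits) |h| {
            if (n == out.len) return n;
            if (isDoc(h.path) != want_doc) continue;
            if (seen(out[0..n], h.path)) continue;
            out[n] = h;
            n += 1;
        }
    }
    return n;
}

test "code file survives a flood of markdown hits" {
    const hits = [_]Hit{
        .{ .path = "docs/notes_0.md", .line = 1 },
        .{ .path = "docs/notes_0.md", .line = 2 },
        .{ .path = "docs/notes_1.md", .line = 1 },
        .{ .path = "src/foo.zig", .line = 1 },
    };
    var out: [2]Hit = undefined;
    const n = codeFirst(&hits, &out);
    try std.testing.expectEqual(@as(usize, 2), n);
    try std.testing.expectEqualStrings("src/foo.zig", out[0].path);
    try std.testing.expectEqualStrings("docs/notes_0.md", out[1].path);
}
```

This sketch walks the full posting list on the code pass and deduplicates with a linear scan. A bounded variant would need a scan budget and some way to make sure code hits are reached within it, for example by partitioning postings by file type at index time.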