explore: popular identifiers bypass Tier 0 code/doc diversity #449

@justrach

Problem

Tier 0 direct word-index lookup has the code-first/doc-second diversity logic from #430, but it only runs when:

word_hits.len > 0 and word_hits.len <= max_results * 2

For popular identifiers, word_hits.len exceeds that gate, so Tier 0 is skipped entirely. Search falls through to Tier 1, where candidates are ordered by per-file hit count before rerank can apply the doc-language penalty. Markdown files with many mentions can therefore fill max_results before the canonical source file is ever scanned.

This is distinct from #447: the source file can be small and fully trigram-indexed; it is still starved because the Tier 0 gate bypasses the code/doc ordering.
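To make the numbers concrete, here is a small sketch that applies the gate and the hit-count ordering to the repro in this issue. The helper names `tier0Runs` and `filesRankedAhead` are hypothetical and not taken from the codebase. It also assumes the word index records one hit per occurrence rather than per file, and that Tier 1 visits files in descending hit-count order.

```zig
const std = @import("std");

/// Hypothetical helper mirroring the gate quoted above.
fn tier0Runs(word_hits_len: usize, max_results: usize) bool {
    return word_hits_len > 0 and word_hits_len <= max_results * 2;
}

/// Counts files whose hit count is strictly greater than `target`.
/// Under a hit-count-descending order (an assumption), all of these
/// are visited before the target file.
fn filesRankedAhead(hit_counts: []const usize, target: usize) usize {
    var ahead: usize = 0;
    for (hit_counts) |c| {
        if (c > target) ahead += 1;
    }
    return ahead;
}

test "repro numbers: gate fails and docs fill the quota" {
    const max_results: usize = 10;
    // Ten markdown files with five matching lines each.
    const md_counts = [_]usize{5} ** 10;
    // src/foo.zig: one definition plus three call sites.
    const src_count: usize = 4;

    // Assumption: one word-index hit per occurrence.
    const total_hits = md_counts.len * 5 + src_count;
    try std.testing.expect(!tier0Runs(total_hits, max_results));

    // Every markdown file outranks the source file, which is already
    // enough to fill max_results on its own.
    try std.testing.expect(filesRankedAhead(&md_counts, src_count) >= max_results);
}
```

With these assumptions the repro produces 54 hits against a gate of 20, and all ten markdown files sort ahead of the four-hit source file.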

Repro

test "popular markdown should not disable Tier 0 code-first behavior" {
    var arena = std.heap.ArenaAllocator.init(testing.allocator);
    defer arena.deinit();
    var explorer = Explorer.init(arena.allocator());

    const md_block =
        "fooBar mentioned here.\n" ++
        "fooBar mentioned here.\n" ++
        "fooBar mentioned here.\n" ++
        "fooBar mentioned here.\n" ++
        "fooBar mentioned here.\n";

    var i: usize = 0;
    while (i < 10) : (i += 1) {
        var path_buf: [64]u8 = undefined;
        const path = try std.fmt.bufPrint(&path_buf, "docs/notes_{d}.md", .{i});
        try explorer.indexFile(path, md_block);
    }

    try explorer.indexFile("src/foo.zig",
        "pub fn fooBar() void {}\n" ++
            "pub fn caller1() void { fooBar(); }\n" ++
            "pub fn caller2() void { fooBar(); }\n" ++
            "pub fn caller3() void { fooBar(); }\n");

    const results = try explorer.searchContent("fooBar", testing.allocator, 10);

    var found_source = false;
    for (results) |r| {
        if (std.mem.eql(u8, r.path, "src/foo.zig")) found_source = true;
    }
    try testing.expect(found_source);
}

Current observed results for max_results=10 are all markdown:

docs/notes_0.md:1
docs/notes_1.md:1
docs/notes_2.md:1
docs/notes_3.md:1
docs/notes_4.md:1
docs/notes_5.md:1
docs/notes_6.md:1
docs/notes_7.md:1
docs/notes_8.md:1
docs/notes_9.md:1

Expected

Popular identifiers should still get the Tier 0 code/doc diversity behavior. Two possible shapes: stop gating on total posting-list length, or run a bounded pass that collects per-file code hits first and only then lets docs fill the remaining quota.
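A minimal sketch of the second option, not the project's implementation. `Hit`, `isDoc`, and `selectCodeFirst` are hypothetical names, and classifying docs by a `.md` suffix is an assumption made only for illustration. The output is bounded by the caller's buffer: the first pass admits one hit per code file, and the second pass lets docs take whatever slots remain.

```zig
const std = @import("std");

/// Hypothetical hit record; the real type in the codebase may differ.
const Hit = struct {
    path: []const u8,
    line: u32,
};

/// Assumption: docs are recognized by extension only.
fn isDoc(path: []const u8) bool {
    return std.mem.endsWith(u8, path, ".md");
}

/// Returns true if `path` already appears in `selected`.
fn containsPath(selected: []const Hit, path: []const u8) bool {
    for (selected) |h| {
        if (std.mem.eql(u8, h.path, path)) return true;
    }
    return false;
}

/// Two passes over the posting list: code files first (one hit per
/// file), then docs fill the remaining quota. Writes into `out` and
/// returns the filled prefix, so the result never exceeds out.len.
fn selectCodeFirst(hits: []const Hit, out: []Hit) []Hit {
    var n: usize = 0;
    const passes = [_]bool{ false, true }; // false = code pass, true = doc pass
    for (passes) |want_doc| {
        for (hits) |h| {
            if (n == out.len) return out[0..n];
            if (isDoc(h.path) != want_doc) continue;
            if (containsPath(out[0..n], h.path)) continue;
            out[n] = h;
            n += 1;
        }
    }
    return out[0..n];
}

test "code file survives when docs dominate the posting list" {
    const hits = [_]Hit{
        .{ .path = "docs/notes_0.md", .line = 1 },
        .{ .path = "docs/notes_1.md", .line = 1 },
        .{ .path = "docs/notes_2.md", .line = 1 },
        .{ .path = "src/foo.zig", .line = 1 },
    };
    var out: [2]Hit = undefined;
    const picked = selectCodeFirst(&hits, &out);
    try std.testing.expectEqual(@as(usize, 2), picked.len);
    try std.testing.expectEqualStrings("src/foo.zig", picked[0].path);
    try std.testing.expectEqualStrings("docs/notes_0.md", picked[1].path);
}
```

This version rescans the whole posting list in each pass and deduplicates with a linear search, which keeps the sketch short. A real fix would likely cap the scan and track seen files with a set.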

Labels: bug (Something isn't working), priority:p2 (Medium priority)
