diff --git a/CHANGELOG.md b/CHANGELOG.md
index ce230ce..7ed1f5f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -19,6 +19,7 @@
 - **Tier 0 processes code before docs.** With `max_results=50` and the per-file cap of 10, five markdown files mentioning the query 10+ times each could collectively saturate the quota before the canonical source file was reached, leaving the source file completely absent from results. A new `isDocLanguage(Language)` predicate gates a two-pass loop: code-language hits first, doc-language hits second. Same per-file cap, same dedup, same early-return — only iteration order changes. Source files now win the recall race.
 - **Multi-signal rerank.** The post-pass rerank counted per-line query occurrences only and broke ties on path-asc + line-asc, which buried symbol-definition lines under alphabetically-earlier comment mentions, ranked `examples/foo.zig` above `src/foo.zig`, and lost basename-match intent entirely. New `rerankSignalScore` composes per-line occurrence count, a symbol-definition boost (+5 when the hit line is a defined symbol whose name matches the query, looked up via outlines), a basename-match boost (+15 exact stem, +8 substring, case-insensitive), a path-segment match boost (+6 for queries like `parser` matching `src/parser/foo.zig`), and a path-prior penalty (×0.6 for `tests/`, `examples/`; ×0.4 for `vendor/`, `node_modules/`, `third_party/`). Constants are tuned so a 5x-higher per-line frequency still wins on its own, while each signal individually flips alphabetic ties.
 - **Rerank applies on every return path.** Pre-fix the multi-signal rerank only ran on fall-through to the final return; Tier 0 and Tier 1 early-returns at `max_results` bypassed it entirely. Lifting the rerank into a `rerankAndFinalize` helper called from every searchContent return point gives the symbol-def / basename / path-prior signals consistent coverage regardless of which tier filled the quota.
+- **Doc-language penalty in rerank.** Live-binary testing showed CHANGELOG and benchmark `.md` files with 4-6 mentions of an identifier on one line outranking actual code call sites under per-line frequency. The reranker now caps doc-language scores at 1.0 and then halves them (so a doc hit scores at most 0.5), which means any code hit with `score >= 1` outranks any markdown / json / yaml / unknown-language hit. Like the path-prior penalty, it is applied as a multiplier on the composed score.
 
 ### Validation
 
diff --git a/src/explore.zig b/src/explore.zig
index 07b7c0c..d880bf1 100644
--- a/src/explore.zig
+++ b/src/explore.zig
@@ -1778,6 +1778,15 @@ pub const Explorer = struct {
         if (pathHasSegment(r.path, "examples") or pathHasSegment(r.path, "example")) score *= 0.6;
         if (pathHasSegment(r.path, "vendor") or pathHasSegment(r.path, "node_modules") or pathHasSegment(r.path, "third_party")) score *= 0.4;
 
+        // Doc-language penalty: markdown / data files (CHANGELOG.md, design
+        // docs, benchmark logs) often mention an identifier many times in a
+        // single line, which lets per-line frequency dwarf code call sites.
+        // For doc files, more mentions don't reflect more code-relevance —
+        // they reflect prose density. Cap at 1.0 then halve so any code hit
+        // (score >= 1) outranks any doc hit. Symmetric with path-prior.
+        if (isDocLanguage(detectLanguage(r.path))) {
+            score = @min(score, 1.0) * 0.5;
+        }
         return score;
     }
 
diff --git a/src/tests.zig b/src/tests.zig
index e481071..8cf63d1 100644
--- a/src/tests.zig
+++ b/src/tests.zig
@@ -10789,3 +10789,36 @@ test "issue-429-d: searchContent rerank boosts path-segment match" {
     try testing.expect(results.len >= 2);
     try testing.expectEqualStrings("src/parser/foo.zig", results[0].path);
 }
+
+test "issue-429-e: searchContent rerank penalises doc-language files so code beats markdown noise" {
+    // CHANGELOG.md and benchmark docs often mention an identifier many times
+    // in a single line, which under per-line frequency outscores any single
+    // code call site. The reranker now caps doc-language scores at 1.0 and
+    // halves them, so a code call site with one occurrence still wins.
+    var arena = std.heap.ArenaAllocator.init(testing.allocator);
+    defer arena.deinit();
+    var explorer = Explorer.init(arena.allocator());
+
+    // Doc file with the identifier mentioned four times on one line —
+    // pre-fix this scores 4 on per-line frequency.
+    try explorer.indexFile(
+        "CHANGELOG.md",
+        "# Changelog\n\nfooBar — fooBar fooBar fooBar in the changelog.\n",
+    );
+    // Code call site with the identifier mentioned once.
+    try explorer.indexFile(
+        "src/caller.zig",
+        "pub fn caller() void {\n    fooBar();\n}\n",
+    );
+
+    const results = try explorer.searchContent("fooBar", testing.allocator, 10);
+    defer {
+        for (results) |r| {
+            testing.allocator.free(r.path);
+            testing.allocator.free(r.line_text);
+        }
+        testing.allocator.free(results);
+    }
+    try testing.expect(results.len >= 2);
+    try testing.expectEqualStrings("src/caller.zig", results[0].path);
+}
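
Reviewer note: the arithmetic behind the cap-then-halve rule can be checked in isolation. The sketch below is not the code from `src/explore.zig`; `isDocPath` and `applyDocPenalty` are hypothetical stand-ins (the real check goes through `isDocLanguage(detectLanguage(r.path))`, whose extension list is not shown in this diff), and the three extensions are assumed for illustration only. It shows why halving alone would not be enough: a doc line with four mentions would still score 2.0 and beat a code hit at 1.0, whereas capping first bounds every doc hit at 0.5.

```zig
const std = @import("std");

// Hypothetical stand-in for isDocLanguage(detectLanguage(path)); the
// extension list here is an assumption made for this sketch only.
fn isDocPath(path: []const u8) bool {
    const doc_exts = [_][]const u8{ ".md", ".json", ".yaml" };
    for (doc_exts) |ext| {
        if (std.mem.endsWith(u8, path, ext)) return true;
    }
    return false;
}

// Mirrors the penalty expression added in the diff: cap at 1.0, then halve.
fn applyDocPenalty(score: f32, path: []const u8) f32 {
    if (isDocPath(path)) return @min(score, 1.0) * 0.5;
    return score;
}

test "cap-then-halve keeps a single code hit above a dense doc line" {
    const doc = applyDocPenalty(4.0, "CHANGELOG.md"); // four mentions on one line
    const code = applyDocPenalty(1.0, "src/caller.zig"); // one mention

    try std.testing.expectEqual(@as(f32, 0.5), doc);
    try std.testing.expectEqual(@as(f32, 1.0), code);
    try std.testing.expect(code > doc);

    // Halving without the cap would leave the doc line ahead of the code hit.
    const halved_only: f32 = 4.0 * 0.5;
    try std.testing.expect(halved_only > code);
}
```

Saved as a standalone file, this should run under `zig test`. Note that the guarantee only covers code hits whose score is still at least 1 after the path-prior multipliers; a single-occurrence code hit under `vendor/` (×0.4) would land below the 0.5 doc ceiling.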