1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -19,6 +19,7 @@
- **Tier 0 processes code before docs.** With `max_results=50` and the per-file cap of 10, five markdown files mentioning the query 10+ times each could collectively saturate the quota before the canonical source file was reached, leaving the source file completely absent from results. A new `isDocLanguage(Language)` predicate gates a two-pass loop: code-language hits first, doc-language hits second. Same per-file cap, same dedup, same early-return — only iteration order changes. Source files now win the recall race.
- **Multi-signal rerank.** The post-pass rerank counted per-line query occurrences only and broke ties on path-asc + line-asc, which buried symbol-definition lines under alphabetically-earlier comment mentions, ranked `examples/foo.zig` above `src/foo.zig`, and lost basename-match intent entirely. New `rerankSignalScore` composes per-line occurrence count, a symbol-definition boost (+5 when the hit line is a defined symbol whose name matches the query, looked up via outlines), a basename-match boost (+15 exact stem, +8 substring, case-insensitive), a path-segment match boost (+6 for queries like `parser` matching `src/parser/foo.zig`), and a path-prior penalty (×0.6 for `tests/`, `examples/`; ×0.4 for `vendor/`, `node_modules/`, `third_party/`). Constants are tuned so a 5x-higher per-line frequency still wins on its own, while each signal individually flips alphabetic ties.
- **Rerank applies on every return path.** Before this fix, the multi-signal rerank ran only on fall-through to the final return; Tier 0 and Tier 1 early returns at `max_results` bypassed it entirely. Lifting the rerank into a `rerankAndFinalize` helper called from every `searchContent` return point gives the symbol-def / basename / path-prior signals consistent coverage regardless of which tier filled the quota.
- **Doc-language penalty in rerank.** Live-binary testing showed CHANGELOG and benchmark `.md` files with 4-6 mentions of an identifier on one line outranking actual code call sites under per-line frequency. The reranker now caps doc-language scores at 1.0 and then halves them, so a doc hit scores at most 0.5 and any code hit scoring at least 1 outranks any markdown / json / yaml / unknown-language hit. Symmetric with the path-prior penalty.
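
The Tier 0 ordering change described above can be pictured with a small standalone sketch. This is not the code from `src/explore.zig`: the `Language` enum, the `Hit` struct, `collectCodeFirst`, and the body of `isDocLanguage` are hypothetical stand-ins, and only the two-pass iteration order, the per-file cap, and the early return at the quota are meant to mirror the changelog entry.

```zig
const std = @import("std");

// Hypothetical stand-ins for the real types; only the iteration order matters here.
const Language = enum { zig, markdown, json, yaml, unknown };
const Hit = struct { path: []const u8, lang: Language };

// Assumed shape of the predicate named in the changelog.
fn isDocLanguage(lang: Language) bool {
    return switch (lang) {
        .markdown, .json, .yaml, .unknown => true,
        else => false,
    };
}

// Counts how many already-collected hits come from `path`.
fn countForPath(taken: []const Hit, path: []const u8) usize {
    var c: usize = 0;
    for (taken) |t| {
        if (std.mem.eql(u8, t.path, path)) c += 1;
    }
    return c;
}

/// Copies hits into `out`: code-language hits in the first pass, doc-language
/// hits in the second. Returns as soon as `out` (the quota) is full.
fn collectCodeFirst(hits: []const Hit, out: []Hit, per_file_cap: usize) usize {
    var n: usize = 0;
    for ([_]bool{ false, true }) |want_doc| {
        for (hits) |h| {
            if (isDocLanguage(h.lang) != want_doc) continue;
            if (countForPath(out[0..n], h.path) >= per_file_cap) continue;
            if (n == out.len) return n;
            out[n] = h;
            n += 1;
        }
    }
    return n;
}

pub fn main() void {
    const hits = [_]Hit{
        .{ .path = "CHANGELOG.md", .lang = .markdown },
        .{ .path = "docs/design.md", .lang = .markdown },
        .{ .path = "src/foo.zig", .lang = .zig },
    };
    var out: [2]Hit = undefined; // quota of 2, smaller than the number of hits
    const n = collectCodeFirst(&hits, &out, 10);
    for (out[0..n]) |h| std.debug.print("{s}\n", .{h.path});
}
```

With a quota of two, the source file is collected first even though both markdown files precede it in the input, which is the recall behaviour the entry describes.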

### Validation

9 changes: 9 additions & 0 deletions src/explore.zig
@@ -1778,6 +1778,15 @@ pub const Explorer = struct {
        if (pathHasSegment(r.path, "examples") or pathHasSegment(r.path, "example")) score *= 0.6;
        if (pathHasSegment(r.path, "vendor") or pathHasSegment(r.path, "node_modules") or
            pathHasSegment(r.path, "third_party")) score *= 0.4;
        // Doc-language penalty: markdown / data files (CHANGELOG.md, design
        // docs, benchmark logs) often mention an identifier many times in a
        // single line, which lets per-line frequency dwarf code call sites.
        // For doc files, more mentions don't reflect more code-relevance;
        // they reflect prose density. Cap at 1.0 then halve so any code hit
        // (score >= 1) outranks any doc hit. Symmetric with path-prior.
        if (isDocLanguage(detectLanguage(r.path))) {
            score = @min(score, 1.0) * 0.5;
        }

        return score;
    }
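
To make the arithmetic concrete, here is a minimal standalone sketch of how the signals might compose. It is not the body of `rerankSignalScore`: the `Signals` struct and `sketchScore` are hypothetical, and treating the boosts as additive and applied before the path-prior multipliers is an assumption. The constants come from the changelog entry, and the order of path-prior then doc-language penalty follows the hunk above.

```zig
const std = @import("std");

// Hypothetical per-hit signals; the real code derives these from outlines and paths.
const Signals = struct {
    occurrences: f32,
    is_symbol_def: bool = false,
    stem_exact: bool = false,
    stem_substring: bool = false,
    path_segment: bool = false,
    in_tests_or_examples: bool = false,
    in_vendored: bool = false,
    is_doc: bool = false,
};

// Assumed composition: additive boosts, then multiplicative path priors,
// then the doc-language cap-and-halve as the last step.
fn sketchScore(s: Signals) f32 {
    var score: f32 = s.occurrences;
    if (s.is_symbol_def) score += 5;
    if (s.stem_exact) {
        score += 15;
    } else if (s.stem_substring) {
        score += 8;
    }
    if (s.path_segment) score += 6;
    if (s.in_tests_or_examples) score *= 0.6;
    if (s.in_vendored) score *= 0.4;
    if (s.is_doc) score = @min(score, 1.0) * 0.5;
    return score;
}

pub fn main() void {
    // Doc line with four mentions: capped to 1.0, then halved.
    const doc = sketchScore(.{ .occurrences = 4, .is_doc = true });
    // Code call site with a single mention.
    const code = sketchScore(.{ .occurrences = 1 });
    // Single mention in vendored code: scaled by the path prior only.
    const vendored = sketchScore(.{ .occurrences = 1, .in_vendored = true });
    std.debug.print("doc={d:.2} code={d:.2} vendored={d:.2}\n", .{ doc, code, vendored });
}
```

Under these assumptions the four-mention doc line lands at 0.5 and the single code call site at 1.0, so the code hit ranks first. A code hit that a path prior has already scaled below 0.5, such as the vendored case, would not clear a capped doc hit in this sketch, which is why the guarantee is stated for code hits with `score >= 1`.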
33 changes: 33 additions & 0 deletions src/tests.zig
@@ -10789,3 +10789,36 @@ test "issue-429-d: searchContent rerank boosts path-segment match" {
    try testing.expect(results.len >= 2);
    try testing.expectEqualStrings("src/parser/foo.zig", results[0].path);
}

test "issue-429-e: searchContent rerank penalises doc-language files so code beats markdown noise" {
    // CHANGELOG.md and benchmark docs often mention an identifier many times
    // in a single line, which under per-line frequency outscores any single
    // code call site. The reranker now caps doc-language scores at 1.0 and
    // halves them, so a code call site with one occurrence still wins.
    var arena = std.heap.ArenaAllocator.init(testing.allocator);
    defer arena.deinit();
    var explorer = Explorer.init(arena.allocator());

    // Doc file with the identifier mentioned four times on one line;
    // pre-fix this scores 4 on per-line frequency.
    try explorer.indexFile(
        "CHANGELOG.md",
        "# Changelog\n\nfooBar — fooBar fooBar fooBar in the changelog.\n",
    );
    // Code call site with the identifier mentioned once.
    try explorer.indexFile(
        "src/caller.zig",
        "pub fn caller() void {\n fooBar();\n}\n",
    );

    const results = try explorer.searchContent("fooBar", testing.allocator, 10);
    defer {
        for (results) |r| {
            testing.allocator.free(r.path);
            testing.allocator.free(r.line_text);
        }
        testing.allocator.free(results);
    }
    try testing.expect(results.len >= 2);
    try testing.expectEqualStrings("src/caller.zig", results[0].path);
}