hotfix: doc-language penalty in rerankSignalScore (v0.2.5807) #433
Merged
…tfix)
Live-binary testing of v0.2.5807 showed CHANGELOG.md and benchmark *.md files with 4-6 mentions of an identifier on one line outranking actual code call sites under per-line frequency scoring. The Tier 0 code-first ordering retrieves source files, but the rerank's per-line frequency then re-promotes the high-density doc lines.
Cap doc-language scores at 1.0 then halve, so any code hit (score >= 1) outranks any markdown / json / yaml / unknown-language hit. Symmetric with the existing path-prior penalty for tests/examples/vendor.
Adds issue-429-e regression test demonstrating a 4-mention markdown line no longer outranks a 1-mention code call site.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benchmark Regression Report
Thresholds: 10.00% and 50,000 ns absolute delta
Summary
Hotfix on top of v0.2.5807. Live-binary testing showed CHANGELOG.md and benchmark *.md files with 4-6 mentions of an identifier on one line outranking actual code call sites for queries like searchContent, handleCallers, pathHasSegment. The Tier 0 code-first ordering correctly retrieves source files, but the rerank's per-line frequency then re-promotes any markdown line with high mention density.
Same release tag v0.2.5807: version not bumped per request. Assets will be re-uploaded with --clobber after merge.
Fix
In rerankSignalScore (src/explore.zig), cap doc-language scores at 1.0, then halve.
For doc files, more mentions don't reflect more code-relevance; they reflect prose density. Cap+halve ensures any code hit (score >= 1) outranks any markdown / json / yaml / unknown-language hit. Symmetric with the existing path-prior penalty for tests/, examples/, vendor/, node_modules/, third_party/.
Test
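As a concrete illustration of the cap-and-halve rule described under Fix, and of the scenario the regression test covers, here is a minimal Zig sketch. It is not the code from src/explore.zig: the Lang enum, isDocLanguage, and penalizeDocScore are hypothetical names introduced for illustration, and raw_score stands in for the per-line mention count, with a code hit assumed to score at least 1.

```zig
const std = @import("std");

/// Hypothetical language tag; how the real project represents languages is not shown in this PR.
const Lang = enum { zig, markdown, json, yaml, unknown };

/// Assumption: markdown, json, yaml, and unknown are the "doc languages" named in the PR text.
fn isDocLanguage(lang: Lang) bool {
    return switch (lang) {
        .markdown, .json, .yaml, .unknown => true,
        .zig => false,
    };
}

/// Sketch of cap-and-halve: code scores pass through unchanged,
/// doc-language scores are capped at 1.0 and then halved.
fn penalizeDocScore(raw_score: f32, lang: Lang) f32 {
    if (!isDocLanguage(lang)) return raw_score;
    // Cap first so mention density cannot inflate the score, then halve.
    return @min(raw_score, @as(f32, 1.0)) / 2.0;
}

test "4-mention markdown line ranks below 1-mention code call site" {
    const doc = penalizeDocScore(4.0, .markdown);
    const code = penalizeDocScore(1.0, .zig);
    try std.testing.expectEqual(@as(f32, 0.5), doc);
    try std.testing.expect(code > doc);
}
```

Capping before halving is what matters here: halving alone would leave a 4-mention markdown line at 2.0, still above a 1-mention code hit at 1.0, whereas a capped doc score can never exceed 0.5.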
test "issue-429-e: searchContent rerank penalises doc-language files so code beats markdown noise": a 4-mention markdown line vs a 1-mention code call site. Pre-fix the markdown line ranks #1; post-fix the code call site ranks #1.
zig build test exit 0 (461 tests).
Test plan
- zig build test
- src/explore.zig rerankSignalScore: confirm cap-and-halve uses @min not multiplication
- searchContent, handleCallers, pathHasSegment, BenchContext: markdown files should rank below code
🤖 Generated with Claude Code