hotfix: doc-language penalty in rerankSignalScore (v0.2.5807)#433

Merged
justrach merged 1 commit into main from hotfix-0.2.5807-doc-penalty
May 6, 2026
justrach commented May 6, 2026

Summary

Hotfix on top of v0.2.5807. Live-binary testing showed CHANGELOG.md and benchmark *.md files with 4-6 mentions of an identifier on one line outranking actual code call sites for queries like searchContent, handleCallers, pathHasSegment. The Tier 0 code-first ordering correctly retrieves source files, but the rerank's per-line frequency then re-promotes any markdown line with high mention density.

Same release tag v0.2.5807 — version not bumped per request. Assets will be re-uploaded with --clobber after merge.

Fix

In rerankSignalScore (src/explore.zig), cap doc-language scores at 1.0 then halve:

if (isDocLanguage(detectLanguage(r.path))) {
    score = @min(score, 1.0) * 0.5;
}

For doc files, more mentions reflect prose density, not code relevance. Capping at 1.0 and then halving bounds every doc-language score at 0.5, so any code hit (score >= 1) outranks any markdown / json / yaml / unknown-language hit. This is symmetric with the existing path-prior penalty for tests/, examples/, vendor/, node_modules/, third_party/.
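To make the ordering guarantee concrete, here is a minimal, self-contained sketch of the cap-and-halve step. `isDocPath` and `penalise` are hypothetical names used only for illustration: the real code goes through `isDocLanguage(detectLanguage(r.path))`, whose internals are not shown in this PR, and this stand-in does not model the unknown-language case. Treating the raw score as a per-line mention count is also an assumption.

```zig
const std = @import("std");

/// Hypothetical stand-in for isDocLanguage(detectLanguage(path)).
/// It only checks a few extensions; the real classification in
/// src/explore.zig may differ (for example, it also covers unknown languages).
fn isDocPath(path: []const u8) bool {
    const doc_exts = [_][]const u8{ ".md", ".json", ".yaml", ".yml" };
    for (doc_exts) |ext| {
        if (std.mem.endsWith(u8, path, ext)) return true;
    }
    return false;
}

/// Cap a doc-language score at 1.0, then halve it.
/// Scores for non-doc paths pass through unchanged.
fn penalise(path: []const u8, raw_score: f32) f32 {
    var score = raw_score;
    if (isDocPath(path)) {
        score = @min(score, 1.0) * 0.5;
    }
    return score;
}

test "doc scores are bounded by 0.5, code scores are untouched" {
    // Six mentions on one markdown line collapse to the 0.5 ceiling.
    try std.testing.expectEqual(@as(f32, 0.5), penalise("CHANGELOG.md", 6.0));
    // A single-mention code hit keeps its score of 1.0 and therefore wins.
    try std.testing.expectEqual(@as(f32, 1.0), penalise("src/explore.zig", 1.0));
}
```

The cap is what makes the guarantee hold: halving alone would leave a 6-mention markdown line at 3.0, still above a 1-mention code hit.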

Test

test "issue-429-e: searchContent rerank penalises doc-language files so code beats markdown noise" — a 4-mention markdown line vs a 1-mention code call site. Pre-fix the markdown line ranks #1; post-fix the code call site ranks #1.

zig build test exits 0 (461 tests).
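The scenario the regression test describes can be sketched in isolation as follows. This is not the actual issue-429-e test: `Hit`, `hitScore`, and `topHit` are hypothetical, the `is_doc` flag stands in for `isDocLanguage(detectLanguage(path))`, and the raw score is assumed to equal the per-line mention count.

```zig
const std = @import("std");

/// Hypothetical hit record; the real result type in src/explore.zig
/// is not shown in this PR.
const Hit = struct {
    path: []const u8,
    mentions: f32, // assumed raw per-line frequency score
    is_doc: bool, // stand-in for isDocLanguage(detectLanguage(path))
};

/// Score a hit, optionally applying the cap-and-halve doc penalty.
fn hitScore(hit: Hit, apply_fix: bool) f32 {
    var s = hit.mentions;
    if (apply_fix and hit.is_doc) {
        s = @min(s, 1.0) * 0.5;
    }
    return s;
}

/// Index of the highest-scoring hit; earlier hits win ties.
fn topHit(hits: []const Hit, apply_fix: bool) usize {
    var best: usize = 0;
    var i: usize = 1;
    while (i < hits.len) : (i += 1) {
        if (hitScore(hits[i], apply_fix) > hitScore(hits[best], apply_fix)) best = i;
    }
    return best;
}

test "4-mention markdown line vs 1-mention code call site" {
    const hits = [_]Hit{
        .{ .path = "CHANGELOG.md", .mentions = 4.0, .is_doc = true },
        .{ .path = "src/explore.zig", .mentions = 1.0, .is_doc = false },
    };
    // Pre-fix: the markdown line wins on raw frequency (4.0 > 1.0).
    try std.testing.expectEqual(@as(usize, 0), topHit(&hits, false));
    // Post-fix: the markdown line drops to 0.5, so the code call site wins.
    try std.testing.expectEqual(@as(usize, 1), topHit(&hits, true));
}
```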

Test plan

  • zig build test
  • Reviewer spot-check: src/explore.zig rerankSignalScore — confirm the cap is applied with @min before the 0.5 multiplication, not by scaling alone
  • Re-run live binary on this repo: searchContent, handleCallers, pathHasSegment, BenchContext — markdown files should rank below code

🤖 Generated with Claude Code

…tfix)

Live-binary testing of v0.2.5807 showed CHANGELOG.md and benchmark *.md
files with 4-6 mentions of an identifier on one line outranking actual
code call sites under per-line frequency scoring. The Tier 0 code-first
ordering retrieves source files, but the rerank's per-line frequency
then re-promotes the high-density doc lines.

Cap doc-language scores at 1.0 then halve, so any code hit (score >= 1)
outranks any markdown / json / yaml / unknown-language hit. Symmetric
with the existing path-prior penalty for tests/examples/vendor.

Adds issue-429-e regression test demonstrating a 4-mention markdown line
no longer outranks a 1-mention code call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@justrach justrach merged commit f39d144 into main May 6, 2026
1 check passed
github-actions Bot commented May 6, 2026

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

| Tool | Base (ns) | Head (ns) | Delta | Abs Delta (ns) | Status |
|---|---|---|---|---|---|
| codedb_bundle | 535675 | 548449 | +2.38% | +12774 | OK |
| codedb_changes | 53178 | 56284 | +5.84% | +3106 | OK |
| codedb_deps | 10094 | 9742 | -3.49% | -352 | OK |
| codedb_edit | 6141 | 6133 | -0.13% | -8 | OK |
| codedb_find | 62896 | 61432 | -2.33% | -1464 | OK |
| codedb_hot | 99796 | 98929 | -0.87% | -867 | OK |
| codedb_outline | 302227 | 304358 | +0.71% | +2131 | OK |
| codedb_read | 94098 | 97226 | +3.32% | +3128 | OK |
| codedb_search | 198876 | 209369 | +5.28% | +10493 | OK |
| codedb_snapshot | 281552 | 283168 | +0.57% | +1616 | OK |
| codedb_status | 213253 | 215769 | +1.18% | +2516 | OK |
| codedb_symbol | 61312 | 60642 | -1.09% | -670 | OK |
| codedb_tree | 81939 | 80492 | -1.77% | -1447 | OK |
| codedb_word | 70815 | 71297 | +0.68% | +482 | OK |
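The OK / NOISE distinction in the report can be sketched as a two-threshold check. This is an illustration only, not the CI script: `classify` and `Status` are hypothetical names, and the rules that only slowdowns count and that a failure requires both thresholds to be exceeded are assumptions inferred from the NOISE description.

```zig
const std = @import("std");

const Status = enum { ok, noise, fail };

const pct_threshold: i64 = 10; // percent, from the report header
const abs_threshold_ns: i64 = 50_000; // nanoseconds, from the report header

/// Classify one benchmark row. Assumes base_ns > 0.
/// Assumed rule: FAIL only when both thresholds are exceeded;
/// NOISE when only the percentage threshold is exceeded.
fn classify(base_ns: i64, head_ns: i64) Status {
    const delta = head_ns - base_ns;
    // delta / base > 10%  <=>  delta * 100 > base * 10 (integer-only comparison)
    const pct_exceeded = delta * 100 > base_ns * pct_threshold;
    if (!pct_exceeded) return .ok;
    if (delta <= abs_threshold_ns) return .noise;
    return .fail;
}

test "rows from the report and two synthetic rows" {
    // codedb_search: +5.28%, under the percentage threshold.
    try std.testing.expectEqual(Status.ok, classify(198876, 209369));
    // Synthetic: +20% but only 200 ns slower, so too small to fail CI.
    try std.testing.expectEqual(Status.noise, classify(1_000, 1_200));
    // Synthetic: +20% and 200,000 ns slower, so both thresholds are exceeded.
    try std.testing.expectEqual(Status.fail, classify(1_000_000, 1_200_000));
}
```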
