Context
Spec 005's librarian agent (PR #110) added an LLM-based topical-relevance judge (src/llmxive/librarian/relevance_judge.py) as a post-verification filter — it decides whether each candidate citation belongs in a literature review for the user's research question (yes/no per candidate).
Problem
The judge is non-deterministic: the same research question + the same candidate paper can produce different yes/no verdicts across runs, because the LLM call runs at temperature > 0.
Concrete evidence (observed during fix-up #4 v1.5.0 evaluation, 2026-05-10):
- PROJ-261 ("How does the local density of syntactic code clones correlate with the perplexity and bug-detection accuracy of pre-trained LMs on Python code?")
  - Single-query probe of this question → 3 strict-pass citations, 0 marginal-fallback
  - flesh_out re-validation invocation on the same question → 0 strict-pass, 9 marginal-fallback (judge rejected all 22 candidates → marginal-fallback rule fired)
Both behaviors are individually defensible — the question genuinely sits at a real cross-literature junction — but the variance means:
- state/librarian-cache/ entries differ depending on which run populated them
- SC-012 (deterministic results across cache states) is technically violated on a cache-cold re-run
- A project's "## Search trail" subsection can flip between "5 strict citations" and "9 marginal-fallback citations" on successive flesh_out re-runs
Impact
- LOW for now — the librarian still returns useful citations either way; the marginal-fallback rule ensures it never goes silent
- The diagnostic report (notes/2026-05-07-spec-005-librarian-diagnostic.md § 6 P5-D12) documents this as a known lingering issue
Possible fixes (for whoever picks this up)
1. Temperature=0 for the judge call: pass temperature=0.0 to chat_with_fallback in relevance_judge.judge_one(). Simplest; may still have minor non-determinism on some backends but should be close to deterministic on Dartmouth Chat.
2. Deterministic fingerprint-based judge: replace the LLM judge with a deterministic scoring function (e.g., embedding-cosine + threshold, or a frozen rules table). Loses the LLM's nuance but is fully reproducible.
3. Cache the judge verdict, not just the result: store the per-candidate JudgeVerdict in the librarian cache keyed by (question_hash, candidate_pointer, prompt_version) so a cache-cold re-run replays the same verdicts. Doesn't fix the first run's variance but makes subsequent runs reproducible.
4. Majority vote over N judge calls: call the judge 3× and take the majority verdict. Reduces variance at 3× the LLM cost.
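The verdict-caching option above could look roughly like the sketch below. It is a minimal illustration, not the librarian's actual cache code: the helper names (verdict_key, cached_verdict), the PROMPT_VERSION constant, the one-JSON-file-per-verdict layout, and the judge callable's signature are all assumptions, and a plain boolean stands in for JudgeVerdict, whose fields are not shown here.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable

# Hypothetical constant: bump it whenever the judge prompt changes so that
# stale verdicts are not replayed against a new prompt.
PROMPT_VERSION = "v1"


def verdict_key(question: str, candidate_pointer: str, prompt_version: str) -> str:
    """Build a stable key from (question_hash, candidate_pointer, prompt_version)."""
    question_hash = hashlib.sha256(question.strip().encode("utf-8")).hexdigest()
    raw = json.dumps([question_hash, candidate_pointer, prompt_version])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def cached_verdict(
    cache_dir: Path,
    question: str,
    candidate_pointer: str,
    judge: Callable[[str, str], bool],
    prompt_version: str = PROMPT_VERSION,
) -> bool:
    """Replay a stored verdict if present; otherwise judge once and store it."""
    key = verdict_key(question, candidate_pointer, prompt_version)
    path = cache_dir / f"{key}.json"
    if path.exists():
        # Cache hit: the LLM is not called, so the verdict cannot change.
        return json.loads(path.read_text(encoding="utf-8"))["relevant"]
    relevant = bool(judge(question, candidate_pointer))
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"relevant": relevant}), encoding="utf-8")
    return relevant
```

Including prompt_version in the key means a prompt change invalidates old verdicts instead of silently mixing them with new ones. As the option itself notes, this only pins verdicts after the first run; whichever verdict the first call happens to produce is the one that gets replayed.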
Recommendation: start with #1 (temperature=0) — cheapest, and verify the variance actually drops. If it doesn't, escalate to #3 (cache the verdict).
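To check whether the variance actually drops, one option is to call the judge several times per candidate, before and after the change, and compare how many candidates receive non-unanimous verdicts. The sketch below assumes a judge callable of the form judge(question, candidate) -> bool; that signature and the helper names (verdict_counts, unstable_candidates, flip_rate) are illustrative and not taken from relevance_judge.py.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List


def verdict_counts(
    judge: Callable[[str, str], bool],
    question: str,
    candidates: Iterable[str],
    runs: int = 5,
) -> Dict[str, Counter]:
    """Call the judge `runs` times per candidate and tally yes/no verdicts."""
    return {
        candidate: Counter(bool(judge(question, candidate)) for _ in range(runs))
        for candidate in candidates
    }


def unstable_candidates(counts: Dict[str, Counter]) -> List[str]:
    """Candidates whose verdicts were not unanimous across runs."""
    return [c for c, tally in counts.items() if len(tally) > 1]


def flip_rate(counts: Dict[str, Counter]) -> float:
    """Fraction of candidates with at least one disagreeing verdict."""
    if not counts:
        return 0.0
    return len(unstable_candidates(counts)) / len(counts)
```

Running this on the PROJ-261 question and its candidates with the current settings, then again with temperature=0.0, would give two flip rates to compare. A rate that stays well above zero after the change would be the signal to escalate to verdict caching.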
Related
- Diagnostic report (notes/2026-05-07-spec-005-librarian-diagnostic.md)
- revalidation-results.yaml: the PROJ-261 record documents the strict/marginal flip
🤖 Generated with Claude Code