
Librarian relevance-judge is non-deterministic (LLM temperature noise) #112

@jeremymanning

Description

Context

Spec 005's librarian agent (PR #110) added an LLM-based topical-relevance judge (src/llmxive/librarian/relevance_judge.py) as a post-verification filter — it decides whether each candidate citation belongs in a literature review for the user's research question (yes/no per candidate).

Problem

The judge is non-deterministic: the same research question + the same candidate paper can produce different yes/no verdicts across runs, because the LLM call runs at temperature > 0.

Concrete evidence (observed during fix-up #4 v1.5.0 evaluation, 2026-05-10):

  • PROJ-261 ("How does the local density of syntactic code clones correlate with the perplexity and bug-detection accuracy of pre-trained LMs on Python code?")
    • Single-query probe of this question → 3 strict-pass citations, 0 marginal-fallback
    • flesh_out re-validation invocation on the same question → 0 strict-pass, 9 marginal-fallback (judge rejected all 22 candidates → marginal-fallback rule fired)

Both behaviors are individually defensible — the question genuinely sits at a real cross-literature junction — but the variance means:

  1. state/librarian-cache/ entries differ depending on which run populated them
  2. SC-012 (deterministic results across cache states) is technically violated on a cache-cold re-run
  3. A project's ## Search trail subsection can flip between "3 strict-pass citations" and "9 marginal-fallback citations" on successive flesh_out re-runs

Impact

  • LOW for now — the librarian still returns useful citations either way; the marginal-fallback rule ensures it never goes silent
  • The diagnostic report (notes/2026-05-07-spec-005-librarian-diagnostic.md § 6 P5-D12) documents this as a known lingering issue

Possible fixes (for whoever picks this up)

  1. Temperature=0 for the judge call — pass temperature=0.0 to chat_with_fallback in relevance_judge.judge_one(). Simplest; may still have minor non-determinism on some backends but should be close to deterministic on Dartmouth Chat.
  2. Deterministic fingerprint-based judge — replace the LLM judge with a deterministic scoring function (e.g., embedding-cosine + threshold, or a frozen rules table). Loses the LLM's nuance but is fully reproducible.
  3. Cache the judge verdict, not just the result: store the per-candidate JudgeVerdict in the librarian cache keyed by (question_hash, candidate_pointer, prompt_version), so a re-run whose result cache is cold still replays the stored verdicts. This doesn't fix the first run's variance but makes subsequent runs reproducible.
  4. Majority-vote over N judge calls — call the judge 3× and take the majority verdict. Reduces variance at 3× the LLM cost.
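
A minimal sketch of options 3 and 4, to make the trade-offs concrete. Every name here is hypothetical: `VerdictCache`, `question_hash`, and `majority_vote` are illustrative, the judge is modeled as a plain `(question, candidate_pointer) -> bool` callable, and the store is an in-memory dict standing in for state/librarian-cache/. The real `JudgeVerdict` type and cache layout may look different.

```python
import hashlib
from collections import Counter
from typing import Callable, Dict, Tuple

# Hypothetical types: the real judge signature and JudgeVerdict may differ.
Judge = Callable[[str, str], bool]  # (question, candidate_pointer) -> yes/no
VerdictKey = Tuple[str, str, str]   # (question_hash, candidate_pointer, prompt_version)


def question_hash(question: str) -> str:
    """Stable hash of the whitespace-normalized, lowercased research question."""
    normalized = " ".join(question.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def majority_vote(judge: Judge, question: str, candidate_pointer: str, n: int = 3) -> bool:
    """Option 4: call the judge n times (n odd) and return the majority verdict."""
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number")
    votes = Counter(judge(question, candidate_pointer) for _ in range(n))
    return votes[True] > votes[False]


class VerdictCache:
    """Option 3: remember each verdict so later runs replay it instead of re-judging."""

    def __init__(self) -> None:
        # In-memory stand-in for a persistent store.
        self._store: Dict[VerdictKey, bool] = {}

    def get_or_judge(self, judge: Judge, question: str,
                     candidate_pointer: str, prompt_version: str) -> bool:
        key = (question_hash(question), candidate_pointer, prompt_version)
        if key not in self._store:
            # Only the first call for a given key reaches the LLM.
            self._store[key] = judge(question, candidate_pointer)
        return self._store[key]
```

The two compose: passing a judge wrapped in `majority_vote` to `get_or_judge` would reduce the first run's variance and then freeze the result. Including `prompt_version` in the key means a prompt change invalidates old verdicts rather than silently replaying them.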

Recommendation: start with option 1 (temperature=0), the cheapest fix, and verify that the variance actually drops. If it doesn't, escalate to option 3 (cache the verdict).
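
One way to check whether the variance actually drops is to measure how often verdicts disagree across repeated calls, before and after the change. The sketch below is an assumption-laden illustration: `flip_rate` is a hypothetical helper, and the judge is again modeled as a `(question, candidate_pointer) -> bool` callable rather than the real `judge_one()` signature.

```python
from typing import Callable, Iterable


def flip_rate(judge: Callable[[str, str], bool], question: str,
              candidates: Iterable[str], runs: int = 5) -> float:
    """Fraction of candidates whose verdict is not unanimous across `runs` calls."""
    candidate_list = list(candidates)
    if not candidate_list:
        raise ValueError("need at least one candidate")
    if runs < 2:
        raise ValueError("need at least two runs to observe a flip")
    unstable = 0
    for candidate in candidate_list:
        # A set collapses repeated verdicts; more than one element means a flip.
        verdicts = {judge(question, candidate) for _ in range(runs)}
        if len(verdicts) > 1:
            unstable += 1
    return unstable / len(candidate_list)
```

Running this on the PROJ-261 question and its 22 candidates, once with the current settings and once with temperature=0.0, would give a direct before/after comparison. A rate of 0.0 over a handful of runs suggests stability but does not prove determinism.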

🤖 Generated with Claude Code
