Librarian relevance-judge is non-deterministic (LLM temperature noise)

## Context

Spec 005's librarian agent (PR #110) added an LLM-based topical-relevance judge (`src/llmxive/librarian/relevance_judge.py`) as a post-verification filter — it decides whether each candidate citation belongs in a literature review for the user's research question (yes/no per candidate).

## Problem

The judge is **non-deterministic**: the same research question + the same candidate paper can produce different yes/no verdicts across runs, because the LLM call runs at temperature > 0.
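
To illustrate where the variance comes from, here is a toy sketch, not the actual judge. It assumes the model's decision reduces to two hypothetical logits for "yes" and "no" and samples a verdict from the temperature-scaled softmax. With near-tied logits (a borderline candidate), temperature 1.0 produces both verdicts across seeds, while temperature 0 (treated here as greedy argmax) always returns the same one.

```python
import math
import random


def sample_verdict(logit_yes: float, logit_no: float, temperature: float,
                   rng: random.Random) -> str:
    """Sample "yes"/"no" from a temperature-scaled two-way softmax.

    Toy model only: the logits are hypothetical stand-ins for the LLM's
    preference, not values read from the real judge.
    """
    if temperature <= 0.0:
        # Temperature 0 is treated as greedy decoding (argmax).
        return "yes" if logit_yes >= logit_no else "no"
    # For two classes, softmax reduces to sigmoid((yes - no) / T).
    z = (logit_yes - logit_no) / temperature
    # Numerically stable sigmoid: never exponentiates a large positive value.
    p_yes = 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))
    return "yes" if rng.random() < p_yes else "no"


def verdict_counts(temperature: float, runs: int = 1000) -> dict[str, int]:
    """Count verdicts for one borderline candidate over independent runs."""
    counts = {"yes": 0, "no": 0}
    for seed in range(runs):
        rng = random.Random(seed)  # one fresh seed per simulated run
        counts[sample_verdict(0.2, 0.0, temperature, rng)] += 1
    return counts


if __name__ == "__main__":
    print("T=1.0:", verdict_counts(1.0))
    print("T=0.0:", verdict_counts(0.0))
```

The closer the two logits, the closer the sampled verdict is to a coin flip, which is consistent with the flips showing up on a question that sits between literatures.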

Concrete evidence (observed during fix-up #4 v1.5.0 evaluation, 2026-05-10):

- PROJ-261 ("How does the local density of syntactic code clones correlate with the perplexity and bug-detection accuracy of pre-trained LMs on Python code?")
  - **Single-query probe** of this question → **3 strict-pass citations, 0 marginal-fallback**
  - **flesh_out re-validation invocation** on the same question → **0 strict-pass, 9 marginal-fallback** (judge rejected all 22 candidates → marginal-fallback rule fired)

Both behaviors are individually defensible — the question genuinely sits at a real cross-literature junction — but the variance means:
1. `state/librarian-cache/` entries differ depending on which run populated them
2. SC-012 (deterministic results across cache states) is technically violated on a cache-cold re-run
3. A project's `## Search trail` subsection can flip between "3 strict citations" and "9 marginal-fallback citations" on successive flesh_out re-runs

## Impact

- LOW for now — the librarian still returns *useful* citations either way; the marginal-fallback rule ensures it never goes silent
- The diagnostic report (`notes/2026-05-07-spec-005-librarian-diagnostic.md` § 6 P5-D12) documents this as a known lingering issue

## Possible fixes (for whoever picks this up)

1. **Temperature=0 for the judge call**: pass `temperature=0.0` to `chat_with_fallback` in `relevance_judge.judge_one()`. Simplest option. Some backends remain slightly non-deterministic even at temperature 0, but results should be close to deterministic on Dartmouth Chat.
2. **Deterministic fingerprint-based judge**: replace the LLM judge with a deterministic scoring function (e.g., embedding-cosine + threshold, or a frozen rules table). Loses the LLM's nuance but is fully reproducible.
3. **Cache the judge verdict, not just the result**: store per-candidate `JudgeVerdict` in the librarian cache keyed by (question_hash, candidate_pointer, prompt_version), so a re-run that has to rebuild the result replays the stored verdicts instead of re-querying the LLM. Doesn't fix the *first* run's variance but makes subsequent runs reproducible.
4. **Majority-vote over N judge calls**: call the judge N times (e.g., N=3; use an odd N to avoid ties) and take the majority verdict. Reduces variance at N× the LLM cost.
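
Fixes 3 and 4 can both be sketched as wrappers around a judge callable. This is a minimal illustration under stated assumptions, not the repository's API: the `Judge` signature `(question, candidate_pointer) -> bool` is hypothetical (the real `judge_one()` may take other arguments and return a `JudgeVerdict`), the cache is an in-memory dict rather than anything under `state/librarian-cache/`, and SHA-256 is just one possible choice for `question_hash`.

```python
import hashlib
from collections import Counter
from typing import Callable

# Hypothetical judge signature, assumed for illustration only.
Judge = Callable[[str, str], bool]


def majority_judge(judge: Judge, n: int = 3) -> Judge:
    """Wrap a judge so each verdict is the majority of n calls (fix 4)."""
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number to avoid ties")

    def wrapped(question: str, candidate_pointer: str) -> bool:
        votes = Counter(judge(question, candidate_pointer) for _ in range(n))
        return votes[True] > votes[False]

    return wrapped


class VerdictCache:
    """Replay stored verdicts, keyed as proposed in fix 3.

    In-memory only; a real version would persist entries alongside the
    existing librarian cache in whatever format that cache uses.
    """

    def __init__(self, judge: Judge, prompt_version: str) -> None:
        self._judge = judge
        self._prompt_version = prompt_version
        self._store: dict[tuple[str, str, str], bool] = {}

    def __call__(self, question: str, candidate_pointer: str) -> bool:
        question_hash = hashlib.sha256(question.encode("utf-8")).hexdigest()
        key = (question_hash, candidate_pointer, self._prompt_version)
        if key not in self._store:
            # Only the first call for a key reaches the (noisy) judge.
            self._store[key] = self._judge(question, candidate_pointer)
        return self._store[key]
```

The two wrappers compose: `VerdictCache(majority_judge(judge, n=3), prompt_version="v1")` pays the N× cost once per key and replays the stored verdict afterwards. Including `prompt_version` in the key means a prompt change invalidates old verdicts rather than silently reusing them.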

Recommendation: start with #1 (temperature=0), the cheapest option, and verify that the variance actually drops. If it doesn't, escalate to #3 (cache the verdict).
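
One way to check whether the variance drops is to measure how often verdicts flip across repeated judge calls, before and after the temperature change. The sketch below assumes a hypothetical judge callable with signature `(question, candidate_pointer) -> bool` that bypasses any cache; the name `flip_rate` is illustrative and not an existing helper.

```python
from typing import Callable, Sequence

# Hypothetical judge signature, assumed for illustration only.
Judge = Callable[[str, str], bool]


def flip_rate(judge: Judge, question: str, candidates: Sequence[str],
              runs: int = 5) -> float:
    """Return the fraction of candidates whose verdict is not unanimous.

    Each candidate is judged `runs` times; a candidate counts as flipped
    if both True and False verdicts were observed.
    """
    if runs < 2:
        raise ValueError("need at least 2 runs to observe a flip")
    if not candidates:
        return 0.0
    flipped = 0
    for candidate in candidates:
        verdicts = {judge(question, candidate) for _ in range(runs)}
        if len(verdicts) > 1:
            flipped += 1
    return flipped / len(candidates)
```

A fully deterministic judge yields 0.0. Running this on the PROJ-261 question and its candidates with the current settings, then again with `temperature=0.0`, would show whether fix #1 is sufficient or whether #3 is needed.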

## Related

- PR #110 (spec 005 librarian agent)
- Diagnostic § 6 P5-D12 (`notes/2026-05-07-spec-005-librarian-diagnostic.md`)
- `revalidation-results.yaml` PROJ-261 record documents the strict/marginal flip

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Librarian relevance-judge is non-deterministic (LLM temperature noise) #112