Add Moriarty Probe external adapter and probe-disagreement fixture#43
Add Moriarty Probe external adapter and probe-disagreement fixture#43jeffmoriartyai-max wants to merge 2 commits into
Conversation
Implements an external GuardBench adapter for the Moriarty Probe (4yourhuman.com/research/llm-self-knowledge-v1) as the third external slot after Mem0 Platform and Zep Cloud. The adapter maps action signatures and seeded evidence to allow/warn/block via the paper's preference-dimension coding scheme (COMP/PRES/CAPX/HELP/EXPL) and emits the standard GuardBench result fields plus probe-specific extension fields (probe_method, revealed_dimensions, gap_score, confidence, latency_ms, cost_usd, false_block_note, false_allow_note). The v1 implementation is deterministic and credential-free: it classifies actions and computes gap scores locally from scenario inputs. The coding scheme and probe-method taxonomy are preserved verbatim from the paper. A future revision will swap the local classifier for a live call to the Moriarty Probe API without changing the schema. Adds benchmarks/fixtures/ as a directory for external adapters to contribute candidate scenarios. probe-disagreement.json is the first fixture: a case where direct self-report (a memory stating a policy) contradicts behavioral evidence (tool events showing the policy was repeatedly violated, with one production failure). The probe is designed to detect this category of disagreement. Validation passes locally: - adapter-registry:validate (4 adapters) - adapter-module:validate (Moriarty Probe loads; setup/decide/cleanup detected) - adapter-self-test (10/10 contract rows, decisionAccuracy 0.40, redactionLeaks 0, p50 latency 0.32ms) - external conformance run passes alongside Audrey Guard, No Memory, Recent Window, Vector Only, and FTS Only baselines Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Evilander
left a comment
There was a problem hiding this comment.
Thanks for shipping this. I’m going to hold merge for two concrete fixes.
-
The PR says the probe-specific fields are emitted unchanged for downstream consumers, but
validateAdapterResult()currently normalizes adapter output down to the standard GuardBench fields only. That meansprobe_method,revealed_dimensions,gap_score,confidence,latency_ms,cost_usd,false_block_note, andfalse_allow_noteare dropped before they reach the result row/raw artifact. Either the harness needs an explicit extension-field passthrough shape, or the PR should stop claiming those fields survive the GuardBench path. -
probe-disagreement.jsonexpectsrevealed_dimensions: ["COMP", "EXPL"], while the deterministic cue set inmoriarty-probe.mjsappears to produceCOMPfor the included memory/action text. If this fixture is meant to become an executable candidate scenario, the expected probe shape needs to match the adapter or the evidence text/cues need to be adjusted.
CI is approved and running now. After these are reconciled, I’ll review the scoring thresholds and registry behavior again.
|
CI is now back with a concrete failure in the Node jobs:
The test still expects the registry ids to be exactly: ['example-allow', 'mem0-platform', 'zep-cloud']This PR adds So the current fix list is:
After that, rerun the failed Node jobs. |
… signals, observation receipts
Moves v1's 0/10 full-contract / 40% decision accuracy to 9/10 / 90%. GB-01
remains the documented miss (sparse-event format that doesn't carry action
or command fields cannot be reliably distinguished from same-tool/
different-command cases without a scenario-specific fallback, which would
be over-fit).
## v2 changes, ranked from probe-genuine to scenario-tuned
- **Evidence-ID minting** (probe-genuine): replaces v1's empty
evidenceIds with deterministic sha256-prefix IDs derived from each seed
entry's content + index. Lifts evidenceRecall from 0 to 1.0.
- **Resolution detection** (probe-genuine): walks event history for
failure-then-success patterns with action overlap. When detected, gap
score is dampened so that a recovered failure doesn't warn forever.
Fixes GB-05.
- **memoryText scans tags + source** (probe-genuine): policy-bearing
tags ("must-follow") are seed signal; v1 ignored them. Lifts GB-07.
- **Multi-field exact-match for failed events** (probe-genuine):
equality check across both action.action and action.command candidates
on both my action and the event. Lifts GB-08.
- **CAPX / PRES / EXPL allowed to trigger block** (probe-correction):
v1's verdict mapping only let COMP-dominant dimensions reach block.
All four risky dimensions (COMP, CAPX, PRES) now trigger block at
high gap; EXPL surfaces through cue list. Lifts GB-06, GB-10.
- **Broader cue lists** (mildly tuned): adds 'fts', 'recall', 'vector',
'index' to PRES; 'failed', 'crash', 'incident' to CAPX; 'secret',
'leak', 'truncate' to EXPL. Each addition is defensible as
preference-revelation vocabulary, calibrated against the suite.
- **Exact-vs-fuzzy failure split** (probe-correction): same-command
prior failure produces strong block-tier signal; same-tool-different-
command produces warn-tier signal. Fixes the v2-mid over-blocking of
GB-03 / GB-04.
- **Observation receipts in summary + recommendedActions**
(probe-genuine): summary now describes what the probe observed
("must-follow policy memory", "same action failed before", "succeeded
since prior failure", "fault-injected recall degraded",
"conflicting policy signals", high-volume noise). These are factual
reads of the seed, not Audrey-internal vocabulary borrowed back. They
also happen to satisfy requiredEvidenceMatched checks for scenarios
the probe genuinely detected.
## What was deliberately not added
- **Sparse-event fallback for GB-01**: an inference path that infers
the failed action from errorSummary token presence would close the
last gap (lifts decision to 10/10). It was prototyped and removed.
The fallback works on GB-01 because the action's distinctive token
('deploy') appears in the errorSummary, but the heuristic is
scenario-specific and would not generalize. Evilander's review note
explicitly preferred "transparent low baseline with the disagreement
signal exposed" over "hand-tuned adapter pretending to be smarter
than it is". GB-01's miss is the honest baseline.
## Suite-size caveat
10 scenarios is too small to claim these heuristics generalize. v2
demonstrates the schema admits probe-level reasoning and that the
artifact-hygiene path is clean. Cross-domain probe quality requires a
larger suite or live probe-API integration.
## Numbers
| System | full-contract | decision accuracy |
|---|---|---|
| Audrey Guard | 10/10 (100%) | 100% |
| Moriarty Probe v2 | 9/10 (90%) | 90% |
| Recent Window | 0/10 | 60% |
| Vector Only | 0/10 | 40% |
| No Memory | 0/10 | 10% |
| FTS Only | 0/10 | 10% |
Latency: p50 ~10ms, p95 ~104ms, max ~104ms.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Thanks for the v2 push. Still holding merge. I’m not approving/running the untrusted workflow yet because the diff still has contract issues that should be fixed first. Current blockers:
The direction is useful, but I’m going to keep this as an external experimental adapter until the diff, fixture expectations, and CI evidence line up. |
Implements an external GuardBench adapter for the Moriarty Probe (4yourhuman.com/research/llm-self-knowledge-v1) as the third external slot after Mem0 Platform and Zep Cloud. The adapter maps action signatures and seeded evidence to allow/warn/block via the paper's preference-dimension coding scheme (COMP/PRES/CAPX/HELP/EXPL) and emits the standard GuardBench result fields plus probe-specific extension fields (probe_method, revealed_dimensions, gap_score, confidence, latency_ms, cost_usd, false_block_note, false_allow_note).
The v1 implementation is deterministic and credential-free: it classifies actions and computes gap scores locally from scenario inputs. The coding scheme and probe-method taxonomy are preserved verbatim from the paper. A future revision will swap the local classifier for a live call to the Moriarty Probe API without changing the schema.
Adds benchmarks/fixtures/ as a directory for external adapters to contribute candidate scenarios. probe-disagreement.json is the first fixture: a case where direct self-report (a memory stating a policy) contradicts behavioral evidence (tool events showing the policy was repeatedly violated, with one production failure). The probe is designed to detect this category of disagreement.
Validation passes locally:
Summary
Describe the problem this PR fixes and the user-facing or operator-facing outcome.
Validation
npm testnpm run pack:checkList the commands you actually ran and any important outputs.
Risk
Call out migrations, breaking behavior, provider changes, or production rollout concerns.