feat(research/systemic_risk): Protocol X-7 — CSD indicators, naive baselines, extended metrics #565
Score-level instrumentation extension. Status remains
HYPOTHESIS / SCORE-LEVEL INSTRUMENTATION EXTENSION ONLY;
end-to-end validation is still pending.
CSD INDICATORS (new module critical_slowing_down.py)
Variance + lag-1 autocorrelation + skewness over a trailing
rolling window. CSDConfig pre-registers window, min_periods,
ddof, lag, constant_policy ∈ {nan, zero, raise}. The
no-lookahead contract is enforced by a regression test that
mutates a future segment of the input and asserts every past
indicator value is bit-identical (the load-bearing rail of
the X-7 spec). Skewness implemented inline (no SciPy dep).
Constant-segment behaviour for autocorr/skewness defaults to
NaN — propagating undefinedness honestly rather than faking
a "calm" signal with a zero.
NAIVE BASELINES (new module baselines.py)
rolling_volatility_score — pure trailing-window σ, no phase /
coupling / graph. The "is the market just loud?" challenger.
edge_density_score — per-snapshot directed / undirected edge
density of an adjacency panel; one scalar per timestamp; no
dynamics. Defeats the candidate when the apparent signal is
topology densification.
Both baselines fail-closed on NaN/Inf, non-square / inconsistent-N
/ negative inputs, and refuse to operate without a valid
rolling-window contract.
EXTENDED METRICS (new module metrics.py)
ClassificationMetrics — TP/FP/TN/FN + precision + recall + FPR
+ FNR. Every undefined denominator emits NaN, never 0; the
absence of denominators must propagate.
LeadTimeConfig — pre-registered min/max lead window + optional
post-event exclusion buffer. Same-day signals excluded by
default (min_lead_days=1); post-event signals never count.
LeadTimeMetrics — aggregate over a labelled event set;
detected count, sorted lead-time tuple, median + min + max.
compute_lead_time_metrics uses the first valid pre-event alarm
per event.
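The two contracts above — NaN-not-zero on undefined denominators, and first-valid-pre-event-alarm selection — can be sketched as below. These helper names and signatures are hypothetical simplifications of `metrics.py`, which works from config objects rather than bare arguments.

```python
import math

def precision_recall_fpr(tp, fp, tn, fn):
    """Every undefined denominator emits NaN, never 0."""
    prec = tp / (tp + fp) if (tp + fp) > 0 else math.nan
    rec = tp / (tp + fn) if (tp + fn) > 0 else math.nan
    fpr = fp / (fp + tn) if (fp + tn) > 0 else math.nan
    return prec, rec, fpr

def first_lead_time(alarm_days, event_day, min_lead=1, max_lead=90):
    """First valid pre-event alarm wins; same-day excluded at min_lead=1.

    Returns the lead time (days before the event) of the earliest
    alarm inside [min_lead, max_lead], or None if undetected.
    """
    for d in sorted(alarm_days):
        lead = event_day - d
        if min_lead <= lead <= max_lead:
            return lead
    return None  # post-event or out-of-window alarms never count
```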
DOCS — 7 new audit artefacts
BASELINES.md — what defeats the candidate, why it matters
METRICS.md — AUC alone insufficient; NaN policy stated
NULL_MODELS.md — six surrogates + executable status
FAILURE_MODES.md — 10 disconfirming experiments to probe
REPRODUCIBILITY.md — manifest contract + per-artefact PENDING table
BOOTSTRAP_PROTOCOL.md — what is/isn't resampled, seed/convergence
CHANGELOG.md — protocol-grade change log; no validated claim
TESTS — +49, total 218 passing
CSD: rejects 2-D / empty / NaN / Inf; window/min_periods/lag
validation; output length contract; insufficient prefix is
NaN; valid_count grows; no-lookahead leakage regression;
constant-policy nan/zero/raise; zero-variance skew → NaN.
Baselines: rolling-volatility no leakage; constant series → 0;
window validation; density formulas (directed / undirected /
self-edges); panel-N consistency; NaN/negative rejection;
single-node density.
Metrics: normal case; zero-prediction → NaN precision;
zero-positive → NaN recall + NaN FNR; no-negatives → NaN FPR;
LeadTimeConfig invariants; pre-event alarm counted;
post-event ignored; same-day exclusion at min_lead=1;
same-day inclusion at min_lead=0; first valid signal
selected; no-signal → undetected.
Pre-existing test_governance.py::test_real_module_passes_overclaim_grep
passes against the live tree including the 7 new docs +
CHANGELOG (overclaim_hits = ()).
QUALITY
mypy --strict / ruff / black: clean on every new/modified file.
Pre-existing 5 jax_engine mypy errors persist on origin/main;
out of scope.
CLAIM TIER PRESERVED
C-SYSRISK-PHASE remains HYPOTHESIS in CLAIMS.md.
No measured / validated / confirmed / production claim.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d9a6765a3d
    if self.ddof < 0:
        raise ValueError(f"ddof must be >= 0, got {self.ddof}")
Reject invalid ddof values in CSDConfig
CSDConfig only checks ddof >= 0, so configurations like min_periods=5, ddof=8 are accepted even though early rolling windows have N <= ddof; seg.var(ddof=ddof) then emits inf/NaN with runtime warnings, which silently corrupts indicator series after warmup instead of failing closed. Because ddof is user-configurable and pre-registered, this can invalidate experiments without an explicit error unless you also enforce ddof < min_periods (or an equivalent bound tied to the smallest evaluated window).
    if not directed:
        denom = denom / 2.0
Fix undirected edge-density normalization
In undirected mode, the code keeps counting both A[i,j] and A[j,i] from a symmetric adjacency matrix but divides by N*(N-1)/2, so a fully connected undirected graph returns density 2.0 instead of 1.0. This makes the baseline scale inconsistent and can distort thresholding/comparisons whenever callers pass standard symmetric undirected adjacencies.
P0-1 — CSDConfig.ddof < min_periods invariant
ddof >= min_periods left the rolling variance with zero (or
negative) degrees of freedom on the smallest evaluated window
→ silent NaN/Inf after warmup. Now fails closed at config
construction. Tests:
- test_ddof_must_be_less_than_min_periods (rejects ddof=5,mp=5)
- test_ddof_less_than_min_periods_accepted
- test_ddof_zero_accepted
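The invariant can be enforced in `__post_init__` roughly as below. This is a sketch: the real CSDConfig pre-registers more fields (lag, constant_policy), and the exact error messages are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CSDConfig:
    window: int
    min_periods: int
    ddof: int = 1

    def __post_init__(self) -> None:
        if not (1 <= self.min_periods <= self.window):
            raise ValueError("need 1 <= min_periods <= window")
        if self.ddof < 0:
            raise ValueError(f"ddof must be >= 0, got {self.ddof}")
        if self.ddof >= self.min_periods:
            # The smallest evaluated window has N == min_periods;
            # N - ddof must stay positive or var() emits inf/NaN.
            raise ValueError(
                f"ddof ({self.ddof}) must be < min_periods "
                f"({self.min_periods})"
            )
```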
P0-2 — undirected edge density canonical formula
Old code summed the full symmetric matrix and produced density
= 2.0 for the complete undirected graph (out of [0, 1]). Fixed
to read the strict upper triangle (k=1, or k=0 with self-edges)
and divide by N*(N-1)/2 (or N*(N+1)/2). Symmetry is now enforced
fail-closed under directed=False — a transpose bug raises
rather than silently distorting the density scale. Tests:
- test_undirected_complete_graph_density_is_one (K3 → 1.0)
- test_undirected_requires_symmetric_matrix
- test_density_in_unit_interval_for_random_binary
(property sweep across N ∈ {3,5,10,20}, p ∈ {0.1,0.3,0.5,0.8})
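The canonical formula reads each undirected edge exactly once from the upper triangle instead of summing the full symmetric matrix. A minimal sketch (`undirected_density` is an illustrative name, not the module's API):

```python
import numpy as np

def undirected_density(adj, self_edges=False):
    """Undirected edge density in [0, 1], fail-closed on asymmetry."""
    a = np.asarray(adj, dtype=float)
    n = a.shape[0]
    if not np.array_equal(a, a.T):
        # A transpose bug raises instead of silently doubling density.
        raise ValueError("undirected density requires a symmetric matrix")
    k = 0 if self_edges else 1               # include the diagonal?
    edges = np.count_nonzero(np.triu(a, k=k))
    pairs = n * (n + 1) / 2 if self_edges else n * (n - 1) / 2
    return edges / pairs
```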
P1-1 — lead-time strict-increasing dates
Old code had no monotonicity check; an unsorted/duplicate dates
tuple would silently produce wrong leads. Now raises ValueError.
Test: test_dates_must_be_strictly_increasing.
P1-2 — lead-time finite threshold
Old code accepted threshold = ±Inf and produced NaN in the
comparison. Now raises ValueError on non-finite.
Test: test_threshold_must_be_finite.
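Both P1-1 and P1-2 reduce to a small fail-closed validator, sketched below (the helper name is illustrative; in the real module these checks live in the lead-time config/compute path):

```python
import math

def validate_lead_time_inputs(dates, threshold):
    """Fail closed: dates strictly increasing, threshold finite."""
    if any(b <= a for a, b in zip(dates, dates[1:])):
        raise ValueError("dates must be strictly increasing")
    if not math.isfinite(threshold):
        raise ValueError(f"threshold must be finite, got {threshold}")
```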
P1-3 — explicit score-NaN policy
Added allow_warmup_nan=True parameter. Default tolerates a
leading contiguous NaN block (rolling-window warmup) but
rejects any NaN/Inf past the first finite value. False mode
rejects every non-finite value. Tests:
- test_warmup_nan_allowed_by_default
- test_nan_past_warmup_rejected
- test_strict_finite_score_mode
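The policy amounts to a single scan: tolerate only a leading contiguous NaN block, then require every value to be finite. A sketch (helper name and signature are assumptions):

```python
import math

def validate_scores(scores, allow_warmup_nan=True):
    """Accept a leading NaN warmup block; reject everything else non-finite."""
    seen_finite = False
    for i, s in enumerate(scores):
        if math.isfinite(s):
            seen_finite = True
        elif not allow_warmup_nan or seen_finite or not math.isnan(s):
            # Inf is never tolerated, nor NaN past the first finite value.
            raise ValueError(f"non-finite score at index {i}")
```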
P1-4 — removal of dead event_exclusion_days_after API
The parameter existed in LeadTimeConfig but had no effect on
compute_lead_time_metrics. Removed entirely. Post-event
contamination is already prevented by the strict
pre-event-only window. A regression test ensures the
parameter cannot be silently re-introduced without
operationalising it. Test: test_event_exclusion_param_removed.
P1-5 — classification metrics input policy
Old code did np.asarray(..., dtype=bool) which silently
coerced -1, 2, 0.5 etc. to True. Now: bool arrays accepted
verbatim; integer arrays must contain only {0, 1}; everything
else raises ValueError. Tests:
- test_arbitrary_numeric_input_rejected (float input rejected)
- test_out_of_range_int_rejected (int 2 rejected)
- test_binary_int_input_accepted (int 0/1 works)
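The input policy can be sketched as a small gate in front of the metric computation (`as_binary` is an illustrative name):

```python
import numpy as np

def as_binary(arr):
    """bool arrays pass verbatim; int arrays must be {0, 1}; no coercion."""
    a = np.asarray(arr)
    if a.dtype == np.bool_:
        return a
    if np.issubdtype(a.dtype, np.integer):
        if np.isin(a, (0, 1)).all():
            return a.astype(bool)
        raise ValueError("integer labels must contain only 0 and 1")
    # Floats, objects, etc. are rejected: np.asarray(..., dtype=bool)
    # would silently map -1, 2, 0.5 to True.
    raise ValueError(f"unsupported dtype {a.dtype}; no silent bool coercion")
```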
Tests: 231 passing (+13 from 218).
Quality: mypy --strict / ruff / black clean on the diff.
Status preserved: HYPOTHESIS / SCORE-LEVEL INSTRUMENTATION
EXTENSION ONLY. End-to-end validation remains pending.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Score-level instrumentation extension. The status remains
HYPOTHESIS / SCORE-LEVEL INSTRUMENTATION EXTENSION ONLY;
end-to-end validation remains pending.

What changed

CSD indicators (critical_slowing_down.py)
- CSDConfig pre-registers window, min_periods, ddof, lag,
  constant_policy.
- compute_csd_indicators returns variance + lag-1 autocorr +
  skewness + valid_count.
- test_no_lookahead_leakage: mutating a future segment leaves
  every past indicator value bit-identical.

Naive baselines (baselines.py)
- rolling_volatility_score — pure trailing σ. The "is it just
  volatility?" challenger.
- edge_density_score — per-snapshot directed/undirected edge
  density. The "is it just topology densification?" challenger.

Extended metrics (metrics.py)
- ClassificationMetrics — precision/recall/FPR/FNR with
  NaN-not-zero on every undefined denominator.
- LeadTimeConfig — pre-registered min_lead_days / max_lead_days.
- LeadTimeMetrics — aggregate; first valid pre-event signal
  wins. Same-day signals excluded by default (min_lead_days=1);
  post-event signals never count.

Audit-grade docs (7 new)
- BASELINES.md, METRICS.md, NULL_MODELS.md, FAILURE_MODES.md,
  REPRODUCIBILITY.md, BOOTSTRAP_PROTOCOL.md, CHANGELOG.md.
  Every PENDING artefact named with its blocker.

Test plan
- pytest tests/research/systemic_risk/: 218 passed (+49 new).
- mypy --strict / ruff / black clean on the diff.
- test_no_lookahead_leakage — past indicators bit-identical
  under future-segment mutation.
- run_premerge_science_gate against the live tree →
  passed=True, overclaim_hits=().
- C-SYSRISK-PHASE remains HYPOTHESIS in CLAIMS.md.

Allowed final decision

MERGE AS HYPOTHESIS / SCORE-LEVEL INSTRUMENTATION EXTENSION ONLY.

🤖 Generated with Claude Code