Author: ZeD@UChicago <zed.uchicago.edu>
Description: Tools for ML statistics
Documentation: https://zeroknowledgediscovery.github.io/zedstat/
Example: https://github.com/zeroknowledgediscovery/zedstat/blob/master/examples/example1.ipynb
A minimal end-to-end example (see also the notebook linked above):

```python
import pandas as pd
from zedstat import zedstat

zt = zedstat.processRoc(
    df=pd.read_csv('roc.csv'),
    order=3,
    total_samples=100000,
    positive_samples=100,
    alpha=0.01,
    prevalence=0.002,
)
zt.smooth(STEP=0.001)
zt.allmeasures(interpolate=True)
zt.usample(precision=3)
zt.getBounds()
print(zt.auc())

# find the high-precision and high-sensitivity operating points
zt.operating_zone(LRminus=.65)
rf0, txt0, _ = zt.interpret(fpr=zt._operating_zone.fpr.values[0], number_of_positives=10)
rf1, txt1, _ = zt.interpret(fpr=zt._operating_zone.fpr.values[1], number_of_positives=10)
display(zt._operating_zone)
print('high precision operation:\n', '\n '.join(txt0))
print('high recall operation:\n', '\n '.join(txt1))
```
To export the full table of operating characteristics together with upper and lower confidence bounds:

```python
import pandas as pd
from zedstat import zedstat

roc_df = pd.read_csv("roc.csv")
zt = zedstat.processRoc(
    df=roc_df,
    order=3,
    total_samples=100000,
    positive_samples=100,
    alpha=0.01,
    prevalence=0.002,
)
zt.smooth(STEP=0.001)
zt.allmeasures(interpolate=True)
zt.usample(precision=3)
zt.getBounds()

# join point estimates with the upper ("U") and lower ("L") bound tables
out = zt.get().join(zt.df_lim["U"], rsuffix="_upper").join(zt.df_lim["L"], rsuffix="_lower")
out.to_csv("roc_operating_characteristics.csv")
```

To map raw scores to the threshold-level PPV implied by the processed ROC curve:

```python
example_scores = [0.10, 0.20, 0.30]
ppv_at_threshold = zt.score_to_threshold_ppv(
    example_scores,
    regen=True,
    STEP=0.001,
    precision=3,
    interpolate=True,
    convexify=False,
)
threshold_ppv_df = pd.DataFrame({
    "score": example_scores,
    "threshold_ppv": ppv_at_threshold,
})
display(threshold_ppv_df)
```

Held-out isotonic calibration with bootstrap confidence intervals (here `df` holds the raw scores and binary labels):

```python
from zedstat import calibration
res = calibration.heldout_isotonic_calibration_with_bootstrap(
    df,
    score_col="predicted_risk",
    label_col="target",
    test_size=0.25,
    random_state=4,
    lower_score_is_risk=False,
    target_prevalence=None,
    n_bins=100,
    n_boot=1000,
    calibration_df_path="calibration_df_SISA.csv",
    plot="calibration_SISA.pdf",
)
print(res["summary"])
display(res["calibration_table"])
```

To apply the fitted isotonic model to new scores:

```python
import numpy as np
import pandas as pd

# apply the fitted isotonic model to the example scores
example_scores = [0.10, 0.20, 0.30]
example_scores_arr = np.asarray(example_scores, dtype=float)
calibrated_probs = res["iso_model"].predict(example_scores_arr)
calibrated_df = pd.DataFrame({
    "score": example_scores,
    "calibrated_probability": np.asarray(calibrated_probs, dtype=float),
})
display(calibrated_df)
```

A small helper to render the bootstrap summary as a two-column table, folding each confidence interval into its point estimate:

```python
def format_calibration_summary_df(summary):
    """Render the calibration summary as a two-column table,
    folding each bootstrap confidence interval into its point estimate."""
    import numpy as np
    import pandas as pd

    summary = pd.Series(summary)
    # each point estimate and the CI columns that belong to it
    ci_map = {
        "auc_raw_test": ("auc_raw_ci_low", "auc_raw_ci_high"),
        "auc_calibrated_test": ("auc_calibrated_ci_low", "auc_calibrated_ci_high"),
        "brier_raw_test": ("brier_raw_ci_low", "brier_raw_ci_high"),
        "brier_calibrated_test": ("brier_calibrated_ci_low", "brier_calibrated_ci_high"),
        "calibration_intercept_test": ("calibration_intercept_ci_low", "calibration_intercept_ci_high"),
        "calibration_slope_test": ("calibration_slope_ci_low", "calibration_slope_ci_high"),
    }
    # CI columns are folded into their point estimates, so skip them as rows
    skip_keys = {key for pair in ci_map.values() for key in pair}
    rows = []
    for key, val in summary.items():
        if key in skip_keys:
            continue
        value_str = "" if pd.isna(val) else f"{float(val):.3f}"
        if key in ci_map:
            lo_key, hi_key = ci_map[key]
            lo = summary.get(lo_key, np.nan)
            hi = summary.get(hi_key, np.nan)
            if pd.notna(lo) and pd.notna(hi):
                value_str = f"{float(val):.3f} ({float(lo):.3f}, {float(hi):.3f})"
        rows.append({"variable": str(key), "value": value_str})
    return pd.DataFrame(rows, columns=["variable", "value"])

summary_df = format_calibration_summary_df(res["summary"])
display(summary_df)
```

processRoc is the main ROC post-processing class. It takes an empirical ROC curve (false positive rate versus true positive rate), augments it with operating metrics, and supports interpolation, confidence bounds, and interpretation at chosen operating points.
`smooth` regularizes the empirical ROC curve. If `convexify=True`, the upper convex hull of the ROC points is computed so that dominated operating points are removed. If `interpolate=True`, the curve is resampled on a uniform false positive rate grid. Let the ROC curve be represented as points

$$(x_i, y_i), \qquad i = 1, \dots, n,$$

where $x_i$ is the false positive rate and $y_i$ is the corresponding true positive rate, with $x_1 \le \dots \le x_n$.
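The effect of `convexify` can be illustrated with a generic upper-hull computation; this is a sketch of the idea, not zedstat's internal implementation:

```python
import numpy as np

def roc_upper_hull(fpr, tpr):
    """Upper convex hull of ROC points; removes dominated operating points.
    Generic illustration only; zedstat's smoothing may differ in detail."""
    pts = sorted(zip(fpr, tpr))
    hull = []
    for x3, y3 in pts:
        # drop the last hull point while it lies on or below the chord
        # from hull[-2] to the incoming point, i.e. while it is dominated
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (y3 - y1) - (y2 - y1) * (x3 - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append((x3, y3))
    return np.array(hull)

# (0.2, 0.55) lies below the chord from (0.1, 0.5) to (0.4, 0.8) and is removed
print(roc_upper_hull([0, .1, .2, .4, 1], [0, .5, .55, .8, 1]))
```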
`allmeasures` computes threshold-level operating measures from sensitivity, specificity, and prevalence. Let

$$\text{sens} = \mathrm{tpr}, \qquad \text{spec} = 1 - \mathrm{fpr},$$

and let prevalence be $\pi$. Then

$$\mathrm{PPV} = \frac{\pi \cdot \text{sens}}{\pi \cdot \text{sens} + (1-\pi)(1-\text{spec})}, \qquad
\mathrm{NPV} = \frac{(1-\pi) \cdot \text{spec}}{(1-\pi) \cdot \text{spec} + \pi (1-\text{sens})},$$

$$\text{accuracy} = \pi \cdot \text{sens} + (1-\pi) \cdot \text{spec}, \qquad
\mathrm{LR}^{+} = \frac{\text{sens}}{1-\text{spec}}, \qquad
\mathrm{LR}^{-} = \frac{1-\text{sens}}{\text{spec}}.$$
These are threshold-level decision quantities. They describe the performance of classifying everyone whose score crosses the threshold.
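These formulas transcribe directly into code; the sketch below is illustrative, not zedstat's internal implementation (`allmeasures` computes the same quantities across the whole curve):

```python
import numpy as np

def threshold_measures(tpr, fpr, prevalence):
    """Threshold-level measures from sensitivity, specificity, and prevalence,
    following the formulas above (illustrative sketch)."""
    sens = np.asarray(tpr, dtype=float)
    spec = 1.0 - np.asarray(fpr, dtype=float)
    pi = prevalence
    return {
        "ppv": pi * sens / (pi * sens + (1 - pi) * (1 - spec)),
        "npv": (1 - pi) * spec / ((1 - pi) * spec + pi * (1 - sens)),
        "accuracy": pi * sens + (1 - pi) * spec,
        "LR+": sens / (1 - spec),
        "LR-": (1 - sens) / spec,
    }

print(threshold_measures(tpr=[0.8], fpr=[0.05], prevalence=0.002))
```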
`usample` resamples the metric tables on a uniform false positive rate grid, typically for downstream lookup and plotting; the grid spacing is set by the decimal `precision` argument.
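Conceptually the resampling interpolates each measure onto a grid with spacing $10^{-\text{precision}}$; a minimal sketch, with `fpr` and `tpr` standing in as hypothetical sorted arrays:

```python
import numpy as np

precision = 3
step = 10.0 ** (-precision)
grid = np.round(np.arange(0.0, 1.0 + step, step), precision)

# hypothetical sorted ROC arrays standing in for zedstat's internal tables
fpr = np.array([0.0, 0.05, 0.2, 1.0])
tpr = np.array([0.0, 0.40, 0.7, 1.0])
tpr_on_grid = np.interp(grid, fpr, tpr)  # linear interpolation onto the grid
```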
`getBounds` computes pointwise confidence bounds for the operating measures using Wilson intervals for sensitivity and specificity, then propagates these to PPV, NPV, accuracy, and likelihood ratios.
If $\hat p$ is an estimated proportion (sensitivity or specificity) computed from $n$ samples, the Wilson interval at confidence level $1-\alpha$ is

$$\frac{1}{1 + z^2/n}\left(\hat p + \frac{z^2}{2n} \pm z \sqrt{\frac{\hat p (1 - \hat p)}{n} + \frac{z^2}{4n^2}}\right),$$

where $z = z_{1-\alpha/2}$ is the standard normal quantile. For sensitivity, $n$ is the number of positive samples; for specificity, the number of negative samples. The derived measures are then bounded by substituting lower and upper values of sensitivity and specificity into the formulas above.
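The interval transcribes directly into code (a sketch, using scipy only for the normal quantile):

```python
import numpy as np
from scipy.stats import norm

def wilson_interval(p_hat, n, alpha=0.01):
    """Wilson score interval for a proportion, as defined above."""
    z = norm.ppf(1 - alpha / 2)
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

# e.g. sensitivity estimated as 0.8 from 100 positive samples
print(wilson_interval(p_hat=0.8, n=100, alpha=0.01))
```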
The area under the ROC curve is

$$\mathrm{AUC} = \int_0^1 \mathrm{tpr}(x)\, dx,$$

where $x$ denotes the false positive rate and $\mathrm{tpr}(x)$ is the interpolated true positive rate.
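On a discrete grid the integral reduces to the trapezoidal rule, for example:

```python
import numpy as np

fpr = np.array([0.0, 0.1, 0.2, 0.5, 1.0])
tpr = np.array([0.0, 0.6, 0.7, 0.9, 1.0])

# trapezoidal rule: average segment heights times segment widths
auc = float(np.sum((tpr[1:] + tpr[:-1]) / 2 * np.diff(fpr)))
print(auc)  # 0.81
```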
`operating_zone` identifies practically useful threshold regions subject to likelihood-ratio constraints, for example high-precision or high-sensitivity operating points. Internally, the method searches the set of thresholds satisfying

$$\mathrm{LR}^{+} \ge \lambda_{+} \qquad \text{and} \qquad \mathrm{LR}^{-} \le \lambda_{-},$$

for user-specified constants $\lambda_{+}$ and $\lambda_{-}$ (the `LRminus` argument in the examples above sets $\lambda_{-}$).
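The selection can be sketched as a filter on a measures table; the column layout and cutoffs below are hypothetical, chosen only to illustrate the constraint:

```python
import pandas as pd

# hypothetical measures table for illustration
df = pd.DataFrame({
    "fpr": [0.001, 0.01, 0.05, 0.20],
    "tpr": [0.30, 0.55, 0.75, 0.90],
    "LR+": [300.0, 55.0, 15.0, 4.5],
    "LR-": [0.70, 0.45, 0.26, 0.125],
})
zone = df[(df["LR+"] >= 10) & (df["LR-"] <= 0.65)]
high_precision = zone.iloc[0]   # lowest-fpr end of the zone
high_recall = zone.iloc[-1]     # highest-sensitivity end of the zone
```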
`interpret` converts operating characteristics into expected counts for an interpretable hypothetical population. If the hypothetical cohort contains $N_{+}$ positive individuals (the `number_of_positives` argument), the expected number of detected cases is $N_{+} \cdot \text{sens}$ and the expected number of flags raised is $N_{+} \cdot \text{sens} / \mathrm{PPV}$. Given sensitivity and PPV at the selected operating point, the method estimates true positives, false positives, false negatives, total flags, and the number needed to screen.
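A back-of-envelope version of this calculation, a sketch under the definitions above rather than zedstat's `interpret`:

```python
def expected_counts(sens, ppv, n_positives):
    """Expected screening counts for a cohort containing n_positives true cases."""
    tp = n_positives * sens                        # cases detected
    flags = tp / ppv                               # total individuals flagged
    fp = flags - tp                                # false alarms among the flags
    fn = n_positives - tp                          # cases missed
    nns = flags / tp if tp > 0 else float("inf")   # flags per detected case
    return {"TP": tp, "FP": fp, "FN": fn, "flags": flags, "NNS": nns}

print(expected_counts(sens=0.75, ppv=0.30, n_positives=10))
```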
The calibration utilities are provided in zedstat.calibration.
Given a score $s$, isotonic calibration fits a monotone nondecreasing function $m$ mapping scores to event probabilities. The fitted function $m$ converts raw scores into calibrated probabilities. The calibrated probability at score $s$ is $m(s)$. For continuous scores this can be interpreted as

$$m(s) \approx \Pr(Y = 1 \mid S = s).$$

This is a local, score-level quantity.
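The core fit can be sketched with scikit-learn's `IsotonicRegression` (illustrative; `heldout_isotonic_calibration_with_bootstrap` adds the held-out split, bootstrap, and reporting):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
scores = rng.uniform(size=1000)
labels = (rng.uniform(size=1000) < scores**2).astype(float)  # synthetic, miscalibrated

iso = IsotonicRegression(out_of_bounds="clip")  # monotone map, clipped at the ends
iso.fit(scores, labels)
print(iso.predict([0.1, 0.2, 0.3]))  # calibrated probabilities near s**2
```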
These are different objects. Threshold PPV at threshold $t$ is

$$\mathrm{PPV}(t) = \Pr(Y = 1 \mid S \ge t)$$

when higher scores indicate higher risk. This is a tail-average quantity:

$$\mathrm{PPV}(t) = E\left[\, m(S) \mid S \ge t \,\right].$$

Thus, calibrated probability is local, while threshold PPV is cumulative over all subjects beyond the threshold.
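The distinction in two lines, continuing the synthetic isotonic example above:

```python
t = 0.3
local_prob = iso.predict([t])[0]                    # m(t): risk at score exactly t
tail_ppv = iso.predict(scores[scores >= t]).mean()  # E[m(S) | S >= t]: risk among all flagged
```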
The Brier score evaluates probability accuracy:

$$\mathrm{Brier} = \frac{1}{n} \sum_{i=1}^{n} \left(p_i - y_i\right)^2,$$

where $p_i$ is the predicted probability and $y_i \in \{0, 1\}$ is the observed outcome for subject $i$. A well-calibrated model should satisfy $E[\, Y \mid p \,] \approx p$ across the range of predicted probabilities.
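In code:

```python
import numpy as np

def brier_score(p, y):
    """Mean squared error between predicted probabilities and binary outcomes."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((p - y) ** 2))

print(brier_score([0.9, 0.2, 0.7], [1, 0, 1]))  # ~0.047
```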
A practical assessment regresses the observed outcome on the logit of the predicted probability:

$$\operatorname{logit} \Pr(Y = 1) = a + b \cdot \operatorname{logit}(p).$$

Here:

- $a$ is the calibration intercept; the ideal value is 0.
- $b$ is the calibration slope; the ideal value is 1.
An intercept above 0 indicates underprediction on average. An intercept below 0 indicates overprediction on average. A slope below 1 indicates overly extreme predictions; a slope above 1 indicates predictions compressed toward the center.
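A sketch of this regression using statsmodels (assumed available; zedstat reports the corresponding quantities as `calibration_intercept_test` and `calibration_slope_test` in the summary):

```python
import numpy as np
import statsmodels.api as sm

def calibration_slope_intercept(p, y, eps=1e-6):
    """Fit logit(Pr(Y=1)) = a + b * logit(p); returns (a, b)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    logit_p = np.log(p / (1 - p))
    X = sm.add_constant(logit_p)
    fit = sm.GLM(np.asarray(y, dtype=float), X, family=sm.families.Binomial()).fit()
    a, b = fit.params
    return float(a), float(b)
```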
Predicted probabilities are grouped into bins, and for each bin the observed event rate is estimated. If a bin contains $n_b$ subjects of whom $k_b$ experience the event, the observed rate is $\hat p_b = k_b / n_b$.
Wilson confidence intervals are computed for each bin and displayed as vertical error bars in the reliability diagram.
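The binned observed rates can be computed in a few lines (a sketch; the error bars reuse the Wilson interval defined earlier):

```python
import numpy as np

def binned_event_rates(p, y, n_bins=10):
    """Observed event rate per predicted-probability bin."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
    rates = np.array([y[idx == b].mean() if np.any(idx == b) else np.nan
                      for b in range(n_bins)])
    return edges, rates
```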
The sample size utilities in zedstat use AUC-based approximations. Let the target AUC be $\theta$. In the balanced-design approximation, the required sample size per class for resolving an AUC tolerance $\delta$ at confidence level $1 - \alpha$ is

$$n \approx \frac{z_{1-\alpha/2}^{2}\left(Q_1 + Q_2 - 2\theta^{2}\right)}{\delta^{2}},$$

where $Q_1 = \theta/(2-\theta)$ and $Q_2 = 2\theta^{2}/(1+\theta)$ are the Hanley-McNeil moments and $z_{1-\alpha/2}$ is the standard normal quantile.
When prevalence is specified, the code uses a prevalence-aware total sample size formula derived from the Hanley-McNeil variance approximation.
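A direct transcription of the balanced-design formula above (the prevalence-aware variant adjusts the per-class counts):

```python
import math
from scipy.stats import norm

def auc_sample_size_per_class(theta, delta, alpha=0.05):
    """Per-class sample size from the Hanley-McNeil approximation above."""
    z = norm.ppf(1 - alpha / 2)
    q1 = theta / (2 - theta)
    q2 = 2 * theta**2 / (1 + theta)
    return math.ceil(z**2 * (q1 + q2 - 2 * theta**2) / delta**2)

print(auc_sample_size_per_class(theta=0.85, delta=0.05, alpha=0.05))  # ~116
```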
zedstat separates two different but complementary notions of risk:
- threshold-level decision utility, such as PPV, NPV, and likelihood ratios, derived from the ROC curve and prevalence;
- score-level probability interpretation, obtained through calibration.
The first is useful for screening policy and operating-point selection. The second is useful when the score must be interpreted as an individual event probability.
