Metrics Module

Calibration and multicalibration evaluation metrics.

This module provides metrics for evaluating the calibration quality of probabilistic predictions, with a focus on multicalibration, that is, calibration across multiple subpopulations.

Key metric families include:

Calibration Error Metrics: Standard and adaptive calibration error measures using binning approaches.
ECCE (Estimated Cumulative Calibration Error) Metrics: Statistical tests and metrics based on the ECCE statistic (also known as Kuiper calibration).
Ranking Metrics: Evaluation metrics for ranked predictions (DCG, NDCG, etc.).
Classification Metrics: Standard precision, recall, and threshold-based metrics.

mcgrad.metrics.expected_calibration_error(labels, predicted_scores, sample_weight=None, num_bins=40, epsilon=1e-07, adjust_unjoined=False, **kwargs)

Calculate the Expected Calibration Error (ECE).

Parameters:

labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true labels.
predicted_scores (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted probability scores.
sample_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] | None) – Optional array of sample weights.
num_bins (int) – Number of bins to use for bucketing predictions.
epsilon (float) – Small value to avoid numerical issues at bin boundaries.
adjust_unjoined (bool) – Whether to adjust for unjoined data.

Return type:

Returns:

The expected calibration error.
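As a rough illustration of the binning idea, the sketch below computes an equal-width binned ECE in plain Python. It is not mcgrad's implementation: the function name is hypothetical, and sample weights, the epsilon boundary handling, and the unjoined-data adjustment are all left out.

```python
def expected_calibration_error_sketch(labels, scores, num_bins=10):
    """Equal-width ECE: count-weighted mean of |mean(score) - mean(label)| per bin."""
    n = len(labels)
    label_sum = [0.0] * num_bins
    score_sum = [0.0] * num_bins
    count = [0] * num_bins
    for y, p in zip(labels, scores):
        # Map p in [0, 1] to a bin index; p == 1.0 is placed in the last bin.
        b = min(int(p * num_bins), num_bins - 1)
        label_sum[b] += y
        score_sum[b] += p
        count[b] += 1
    ece = 0.0
    for b in range(num_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        gap = abs(score_sum[b] / count[b] - label_sum[b] / count[b])
        ece += (count[b] / n) * gap
    return ece


labels = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
ece = expected_calibration_error_sketch(labels, scores, num_bins=2)
print(round(ece, 4))
```

In this toy input the low-score bin is off by 0.1 and the high-score bin by 0.15, so the count-weighted average is about 0.133.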

mcgrad.metrics.proportional_expected_calibration_error(labels, predicted_scores, sample_weight=None, num_bins=40, epsilon=1e-07, adjust_unjoined=False, **kwargs)

Calculate the Proportional Expected Calibration Error.

Uses proportional error instead of absolute error for bin error calculation.

Parameters:

labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true labels.
predicted_scores (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted probability scores.
sample_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] | None) – Optional array of sample weights.
num_bins (int) – Number of bins to use for bucketing predictions.
epsilon (float) – Small value to avoid numerical issues at bin boundaries.
adjust_unjoined (bool) – Whether to adjust for unjoined data.

Return type:

Returns:

The proportional expected calibration error.

mcgrad.metrics.adaptive_calibration_error(labels, predicted_scores, sample_weight=None, num_bins=40, epsilon=1e-07, adjust_unjoined=False, **kwargs)

Calculate the Adaptive Calibration Error (ACE).

Unlike ECE which uses equispaced bins, ACE uses bins with equal numbers of samples.

Parameters:

labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true labels.
predicted_scores (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted probability scores.
sample_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] | None) – Optional array of sample weights.
num_bins (int) – Number of bins to use for bucketing predictions.
epsilon (float) – Small value to avoid numerical issues at bin boundaries.
adjust_unjoined (bool) – Whether to adjust for unjoined data.

Return type:

Returns:

The adaptive calibration error.
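The equal-count binning can be sketched as follows, assuming bins are formed by sorting on score and cutting the sorted samples into groups of near-equal size. The helper name is hypothetical, and sample weights and tie handling are ignored, so this should be read as an illustration rather than the library's code.

```python
def adaptive_calibration_error_sketch(labels, scores, num_bins=10):
    """Equal-count ACE: same per-bin gap as ECE, but bins hold equal numbers of samples."""
    n = len(labels)
    order = sorted(range(n), key=lambda i: scores[i])
    ace = 0.0
    for b in range(num_bins):
        # Integer boundaries that split the sorted samples into near-equal groups.
        start = b * n // num_bins
        end = (b + 1) * n // num_bins
        idx = order[start:end]
        if not idx:
            continue
        mean_score = sum(scores[i] for i in idx) / len(idx)
        mean_label = sum(labels[i] for i in idx) / len(idx)
        ace += (len(idx) / n) * abs(mean_score - mean_label)
    return ace


labels = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.05, 0.1, 0.15, 0.2, 0.6, 0.7, 0.8, 0.9]
ace = adaptive_calibration_error_sketch(labels, scores, num_bins=2)
print(ace)
```

With two bins of four samples each, only the lower bin is miscalibrated (mean score 0.125 against mean label 0.25), giving a value of about 0.0625.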

mcgrad.metrics.proportional_adaptive_calibration_error(labels, predicted_scores, sample_weight=None, num_bins=40, epsilon=1e-07, adjust_unjoined=False, **kwargs)

Calculate the Proportional Adaptive Calibration Error.

Combines adaptive binning with proportional error calculation.

Parameters:

labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true labels.
predicted_scores (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted probability scores.
sample_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] | None) – Optional array of sample weights.
num_bins (int) – Number of bins to use for bucketing predictions.
epsilon (float) – Small value to avoid numerical issues at bin boundaries.
adjust_unjoined (bool) – Whether to adjust for unjoined data.

Return type:

Returns:

The proportional adaptive calibration error.

mcgrad.metrics.calibration_ratio(labels, predicted_scores, sample_weight=None, adjust_unjoined=False, **kwargs)

Calculate the calibration ratio (sum of predictions / sum of labels).

A value of 1.0 indicates perfect calibration on aggregate.

Parameters:

labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true labels.
predicted_scores (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted probability scores.
sample_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] | None) – Optional array of sample weights.
adjust_unjoined (bool) – Whether to adjust for unjoined data.

Return type:

Returns:

The calibration ratio. Returns np.inf if labels sum to zero but predictions sum to a positive value. Returns np.nan if both labels and predictions sum to zero.
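The ratio and its two edge cases can be sketched in a few lines of plain Python. The function name is hypothetical, `math.inf` and `math.nan` stand in for `np.inf` and `np.nan`, and the unjoined-data adjustment is not modelled.

```python
import math


def calibration_ratio_sketch(labels, scores, sample_weight=None):
    """Weighted sum of predictions divided by weighted sum of labels."""
    weights = sample_weight if sample_weight is not None else [1.0] * len(labels)
    pred_sum = sum(w * p for w, p in zip(weights, scores))
    label_sum = sum(w * y for w, y in zip(weights, labels))
    if label_sum == 0:
        # No positives: infinite ratio if anything was predicted, undefined otherwise.
        return math.inf if pred_sum > 0 else math.nan
    return pred_sum / label_sum


balanced = calibration_ratio_sketch([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
over_predicting = calibration_ratio_sketch([1, 0, 0, 0], [0.5, 0.5, 0.5, 0.5])
no_positives = calibration_ratio_sketch([0, 0], [0.2, 0.3])
all_zero = calibration_ratio_sketch([0, 0], [0.0, 0.0])
print(balanced, over_predicting, no_positives, all_zero)
```

A ratio above 1.0 (here 2.0 for the second call) means the predictions sum to more than the observed positives, that is, the model over-predicts on aggregate.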

mcgrad.metrics.recall(labels, predicted_labels, sample_weight=None, **kwargs)

Calculate recall (true positive rate).

Parameters:

labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true binary labels.
predicted_labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted binary labels.
sample_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] | None) – Optional array of sample weights.

Return type:

Returns:

The recall score.

mcgrad.metrics.precision(labels, predicted_labels, precision_weight=None, **kwargs)

Calculate precision (positive predictive value).

Parameters:

labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true binary labels.
predicted_labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted binary labels.
precision_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] | None) – Optional array of sample weights for precision calculation.

Return type:

Returns:

The precision score.

mcgrad.metrics.youdens_j(labels, predicted_scores, sample_weight=None, **kwargs)

Calculate the continuous net detection rate (Youden’s J).

Computes E[scores | positive label] - E[scores | negative label]. When the scores are binary predictions this reduces to the classical TPR - FPR. The value ranges from -1 to 1, where 1 indicates perfect separation, 0 indicates no discriminative ability, and negative values indicate an inverted model.

Parameters:

labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true binary labels.
predicted_scores (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted scores (continuous or binary).
sample_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] | None) – Optional array of sample weights.

Return type:

Returns:

The continuous net detection rate.
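The difference of conditional means, and its reduction to TPR - FPR for binary predictions, can be checked with a small unweighted sketch. The function name is hypothetical and sample weights are omitted.

```python
def youdens_j_sketch(labels, scores):
    """Mean score among positives minus mean score among negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Binary predictions: TPR = 3/4 and FPR = 1/4, so J = 0.5.
binary_predictions = [1, 1, 1, 0, 1, 0, 0, 0]
j_binary = youdens_j_sketch(labels, binary_predictions)

# Continuous scores: mean 0.7 among positives, 0.3 among negatives.
continuous_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
j_continuous = youdens_j_sketch(labels, continuous_scores)
print(j_binary, round(j_continuous, 3))
```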

mcgrad.metrics.fpr(labels, predicted_labels, sample_weight=None, **kwargs)

Calculate the false positive rate (FPR).

Parameters:

labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true binary labels.
predicted_labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted binary labels.
sample_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] | None) – Optional array of sample weights.

Return type:

Returns:

The false positive rate.
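Recall, precision, and FPR are all ratios of weighted confusion-matrix counts. The sketch below computes the three together under that standard definition; the helper name is hypothetical, a single weight vector is used for all three (the library's precision takes a separate `precision_weight`), and zero denominators are not handled.

```python
def confusion_rates_sketch(labels, predicted_labels, sample_weight=None):
    """Weighted recall, precision, and FPR from binary labels and predictions."""
    weights = sample_weight if sample_weight is not None else [1.0] * len(labels)
    tp = fp = fn = tn = 0.0
    for y, y_hat, w in zip(labels, predicted_labels, weights):
        if y == 1 and y_hat == 1:
            tp += w
        elif y == 0 and y_hat == 1:
            fp += w
        elif y == 1 and y_hat == 0:
            fn += w
        else:
            tn += w
    return {
        "recall": tp / (tp + fn),      # true positive rate
        "precision": tp / (tp + fp),   # positive predictive value
        "fpr": fp / (fp + tn),         # false positive rate
    }


labels = [1, 1, 1, 0, 0, 0, 0, 0]
predicted_labels = [1, 1, 0, 1, 0, 0, 0, 0]
rates = confusion_rates_sketch(labels, predicted_labels)
print(rates)
```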

mcgrad.metrics.fpr_with_mask(y_true, y_pred, y_mask, sample_weight, denominator)

Calculate the false positive rate with a mask applied.

Only samples where y_mask is True are considered when counting false positives. This is useful for computing FPR within a specific segment or subpopulation while using a shared denominator across segments.

Parameters:

y_true (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true binary labels.
y_pred (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted binary labels.
y_mask (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Boolean mask array indicating which samples to include in the false positive count. Only samples where mask is True contribute to the numerator.
sample_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of sample weights.
denominator (float) – The denominator to use for FPR calculation (typically the weighted count of samples whose true label is negative, possibly computed over a broader population).

Return type:

float | None

Returns:

The false positive rate, or None if denominator is zero.
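A minimal sketch of the masked numerator with a shared denominator is shown below. The function name is hypothetical, and the choice of denominator (the weighted count of all negative-label samples) is an assumption made for the example. With that choice, the per-segment values of disjoint segments add up to the overall FPR.

```python
def fpr_with_mask_sketch(y_true, y_pred, y_mask, sample_weight, denominator):
    """Weighted false positives inside the mask, divided by a caller-supplied denominator."""
    if denominator == 0:
        return None
    false_positives = sum(
        w
        for y, y_hat, m, w in zip(y_true, y_pred, y_mask, sample_weight)
        if m and y == 0 and y_hat == 1
    )
    return false_positives / denominator


y_true = [0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1]
weights = [1.0] * 5
segment_a = [True, False, True, False, True]
segment_b = [not m for m in segment_a]

# Assumed shared denominator: weighted count of all negative-label samples.
negatives = sum(w for y, w in zip(y_true, weights) if y == 0)

fpr_a = fpr_with_mask_sketch(y_true, y_pred, segment_a, weights, negatives)
fpr_b = fpr_with_mask_sketch(y_true, y_pred, segment_b, weights, negatives)
print(fpr_a, fpr_b, fpr_a + fpr_b)
```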

mcgrad.metrics.dcg_score(labels, predicted_labels, rank_discount=<function rank_no_discount>, k=None)

Calculates the Discounted Cumulative Gain (DCG) score. See https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Discounted_Cumulative_Gain for the definition.

Parameters:

labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true labels.
predicted_labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted labels.
rank_discount (Callable[[int], ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]]) – Function that takes the number of samples as input and returns an array of size n_samples with the discount factor for each sample.
k (int | None) – If not None, the DCG score is calculated only based on the top k samples. k cannot be smaller than 1 and cannot be larger than the number of samples.

Return type:

float

Returns:

The DCG score as a float, or np.nan if the input arrays are empty.

mcgrad.metrics.ndcg_score(labels, predicted_labels, rank_discount=<function rank_no_discount>, k=None, sample_weight=None)

Calculates the Normalized Discounted Cumulative Gain (NDCG).

Parameters:

labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true labels.
predicted_labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted labels.
rank_discount (Callable[[int], ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]]) – Function that takes the number of samples as input and returns an array of size n_samples with the discount factor for each sample.
k (int | None) – If not None, the NDCG score is calculated only based on the top k samples. k cannot be smaller than 1 and cannot be larger than the number of samples.
sample_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] | None) – Optional array of sample weights. Currently unused but included for API consistency.

Return type:

float

Returns:

The NDCG score as a float in [0, 1], or np.nan if the input arrays are empty.
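To make the roles of rank_discount and k concrete, here is a small plain-Python sketch of DCG and NDCG. It assumes the discount array is applied multiplicatively to labels sorted by descending prediction; the function names, including `log2_discount` and `no_discount`, are hypothetical stand-ins and not the library's own helpers.

```python
import math


def no_discount(n):
    """Every rank counts equally."""
    return [1.0] * n


def log2_discount(n):
    """Classical DCG discount 1 / log2(rank + 1) for 1-based ranks."""
    return [1.0 / math.log2(rank + 2) for rank in range(n)]


def dcg_sketch(labels, predicted, rank_discount=no_discount, k=None):
    if len(labels) == 0:
        return math.nan
    order = sorted(range(len(labels)), key=lambda i: predicted[i], reverse=True)
    ranked_gains = [labels[i] for i in order][:k]  # keep the top k when k is set
    discounts = rank_discount(len(ranked_gains))
    return sum(g * d for g, d in zip(ranked_gains, discounts))


def ndcg_sketch(labels, predicted, rank_discount=no_discount, k=None):
    if len(labels) == 0:
        return math.nan
    ideal = dcg_sketch(labels, labels, rank_discount, k)  # best possible ordering
    return dcg_sketch(labels, predicted, rank_discount, k) / ideal if ideal > 0 else math.nan


labels = [3, 2, 0, 1]
predicted = [0.9, 0.2, 0.8, 0.4]
dcg_top2 = dcg_sketch(labels, predicted, k=2)
ndcg_top2 = ndcg_sketch(labels, predicted, k=2)
ndcg_log = ndcg_sketch(labels, predicted, rank_discount=log2_discount)
print(dcg_top2, ndcg_top2, round(ndcg_log, 4))
```

Note that with no discount and no k, any ordering yields the same sum, so a discount or a top-k cutoff is needed for the score to reflect ranking quality.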

mcgrad.metrics.recall_at_precision(y_true, y_scores, precision_target=0.95, sample_weight=None)

Calculate the maximum recall at a given precision threshold.

Parameters:

y_true (Union[_Buffer, _SupportsArray[dtype[Any]], _NestedSequence[_SupportsArray[dtype[Any]]], complex, bytes, str, _NestedSequence[complex | bytes | str]]) – Array of true binary labels.
y_scores (Union[_Buffer, _SupportsArray[dtype[Any]], _NestedSequence[_SupportsArray[dtype[Any]]], complex, bytes, str, _NestedSequence[complex | bytes | str]]) – Array of predicted probability scores.
precision_target (float) – Minimum precision threshold to achieve.
sample_weight (Union[_Buffer, _SupportsArray[dtype[Any]], _NestedSequence[_SupportsArray[dtype[Any]]], complex, bytes, str, _NestedSequence[complex | bytes | str], None]) – Optional array of sample weights.

Return type:

Returns:

Maximum recall achievable at the precision target, or 0 if unachievable.
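One way to picture this metric is as a scan over candidate thresholds, keeping the best recall among those that meet the precision target. The sketch below does this in plain Python without sample weights; the function name is hypothetical and the library may compute the result differently.

```python
def recall_at_precision_sketch(y_true, y_scores, precision_target=0.95):
    """Best recall over all score thresholds whose precision meets the target."""
    total_positives = sum(y_true)
    if total_positives == 0:
        return 0.0
    ranked = sorted(zip(y_scores, y_true), reverse=True)
    best_recall = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples sharing a score fall on the same side of the threshold.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if tp / (tp + fp) >= precision_target:
            best_recall = max(best_recall, tp / total_positives)
    return best_recall


y_true = [1, 1, 0, 1, 0, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
strict = recall_at_precision_sketch(y_true, y_scores, precision_target=0.95)
relaxed = recall_at_precision_sketch(y_true, y_scores, precision_target=0.75)
print(round(strict, 4), relaxed)
```

Lowering the precision target from 0.95 to 0.75 admits the threshold at 0.6, which captures all three positives.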

mcgrad.metrics.precision_at_predictive_prevalence(y_true, y_scores, predictive_prevalence_target=0.95, sample_weight=None)

Calculate precision at a given predictive prevalence threshold.

Predictive prevalence is the fraction of samples predicted as positive.

Parameters:

y_true (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true binary labels.
y_scores (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted probability scores.
predictive_prevalence_target (float) – Target fraction of samples to predict as positive.
sample_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] | None) – Optional array of sample weights.

Return type:

Returns:

Maximum precision at the target predictive prevalence, or np.nan if no threshold can achieve the target prevalence.

mcgrad.metrics.precision_at_recall(y_true, y_scores, recall_target=0.95, sample_weight=None)

Calculate the maximum precision at a given recall threshold.

Parameters:

y_true (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true binary labels.
y_scores (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted probability scores.
recall_target (float) – Minimum recall threshold to achieve.
sample_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] | None) – Optional array of sample weights.

Return type:

Returns:

Maximum precision at the recall target, or 0 if unachievable.

mcgrad.metrics.fpr_at_precision(y_true, y_scores, precision_target=0.95, sample_weight=None)

Calculate the false positive rate at a given precision threshold.

Parameters:

y_true (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true binary labels.
y_scores (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted probability scores.
precision_target (float) – Minimum precision threshold to achieve.
sample_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] | None) – Optional array of sample weights.

Return type:

Returns:

False positive rate at the precision target, or np.nan if unachievable.

mcgrad.metrics.multi_cg_score(labels, predictions, segments_df, metric=<function ndcg_score>, rank_discount=<function rank_no_discount>, k=None)

Calculates the metric score for each segment.

Parameters:

labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true labels.
predictions (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted scores.
segments_df (DataFrame) – DataFrame defining the segments over which the metric is calculated.
metric (_MulticalibrationRankErrorMetricsInterface) – The cumulative gain metric to use. Defaults to ndcg_score.
rank_discount (Callable[[int], ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]]) – Rank discount function of the metric. Defaults to no discount.
k (int | None) – If not None, the metric is calculated only based on the top k samples. k cannot be smaller than 1 and cannot be larger than the number of samples.

Return type:

Series

Returns:

A Series of size n_segments with the metric score for each segment.

mcgrad.metrics.ecce_pvalue_from_sigma(ecce_sigma)

Compute p-value from a sigma-scaled ECCE statistic.

Accepts both scalar and array inputs.

Parameters:

ecce_sigma (float | ndarray[tuple[Any, ...], dtype[Any]]) – The ECCE statistic normalized by standard deviation. Can be a scalar or array.

Return type:

float | ndarray[tuple[Any, ...], dtype[float64]]

Returns:

The p-value(s) from the ECCE test. Returns a scalar when the input is scalar, an array otherwise.

mcgrad.metrics.ecce(labels, predicted_scores, sample_weight=None)

Calculate the Estimated Cumulative Calibration Error (ECCE) [1].

ECCE measures the maximum deviation between the cumulative distribution of predicted probabilities for positive and negative examples. It is equivalent to the unnormalized Kuiper calibration statistic.

[1]: Arrieta-Ibarra, I., Gujral, P., Tannen, J., Tygert, M., & Xu, C. (2022). Metrics of calibration for probabilistic predictions. Journal of Machine Learning Research, 23(351), 1-54. (https://tygert.com/ece.pdf)

Parameters:

labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true binary labels (0 or 1).
predicted_scores (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted probabilities.
sample_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] | None) – Optional array of sample weights.

Return type:

Returns:

The ECCE value.
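As a rough illustration, the Kuiper-style statistic described in reference [1] can be read as the range (maximum minus minimum) of the cumulative sum of label-minus-score differences, taken over samples sorted by score. The sketch below follows that reading. It is an assumption about the statistic, not mcgrad's code; the function name is hypothetical, and sample weights and tied scores are not handled.

```python
def ecce_sketch(labels, scores):
    """Range of the cumulative (label - score) / n sums over samples sorted by score."""
    n = len(labels)
    order = sorted(range(n), key=lambda i: scores[i])
    cumulative = 0.0
    highest = lowest = 0.0  # the cumulative sum starts at zero
    for i in order:
        cumulative += (labels[i] - scores[i]) / n
        highest = max(highest, cumulative)
        lowest = min(lowest, cumulative)
    return highest - lowest


scores = [0.2, 0.4, 0.6, 0.8]
aligned = ecce_sketch([0, 0, 1, 1], scores)   # positives receive the high scores
inverted = ecce_sketch([1, 1, 0, 0], scores)  # positives receive the low scores
print(round(aligned, 4), round(inverted, 4))
```

On four samples both values are dominated by noise, but the inverted labelling still produces the larger range (about 0.35 against 0.15).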

mcgrad.metrics.ecce_sigma(labels, predicted_scores, sample_weight=None)

Calculate the ECCE normalized by standard deviation.

This returns the ECCE statistic normalized by the standard deviation of the calibration error under the null hypothesis of perfect calibration.

Parameters:

labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true binary labels (0 or 1).
predicted_scores (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted probabilities.
sample_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] | None) – Optional array of sample weights.

Return type:

Returns:

The normalized ECCE value.
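The normalization can be sketched under one assumption: if each label is an independent Bernoulli draw with success probability equal to its score, the sum of (label - score) / n over all n samples has variance equal to the sum of p(1 - p) divided by n squared. The helper name and the example ECCE value below are hypothetical, and the library's estimator (including its treatment of sample weights) may differ.

```python
import math


def ecce_null_std_sketch(scores):
    """Std. dev. of sum((label - score) / n) when labels ~ Bernoulli(score)."""
    n = len(scores)
    return math.sqrt(sum(p * (1.0 - p) for p in scores)) / n


scores = [0.2, 0.4, 0.6, 0.8]
ecce_value = 0.35  # hypothetical ECCE already computed for these scores
sigma_scaled = ecce_value / ecce_null_std_sketch(scores)
print(round(sigma_scaled, 3))
```

Dividing by this standard deviation expresses the calibration error in units of the noise expected from a perfectly calibrated model on the same scores.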

mcgrad.metrics.ecce_pvalue(labels, predicted_scores, sample_weight=None)

Calculate the p-value for the ECCE statistic.

Tests the null hypothesis that predictions are perfectly calibrated using the Kuiper test.

Parameters:

labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true binary labels (0 or 1).
predicted_scores (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted probabilities.
sample_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] | None) – Optional array of sample weights.

Return type:

Returns:

The p-value from the calibration test.

mcgrad.metrics.rank_calibration_error(labels, predicted_labels, num_bins=40)

Calculates the rank calibration error as proposed in https://arxiv.org/pdf/2404.03163

Parameters:

labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true labels.
predicted_labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted labels.
num_bins (int) – Number of bins to use for the rank calibration error calculation.

Return type:

float

Returns:

The rank calibration error as a float.

mcgrad.metrics.rank_multicalibration_error(labels, predicted_labels, segments_df, num_bins=40)

Calculates the rank calibration error for each segment and returns the weighted average across segments.

Parameters:

labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true labels.
predicted_labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted labels.
segments_df (DataFrame) – DataFrame defining the segments over which the error is calculated.
num_bins (int) – Number of bins to use for the rank calibration error calculation.

Return type:

float

Returns:

The weighted average of rank calibration errors across all segments, as a float.

mcgrad.metrics.normalized_entropy(labels, predicted_scores, sample_weight=None)

Calculates the normalized entropy, defined as the ratio between the prediction’s log loss (binary cross entropy) and the log loss obtained from fixed predictions equal to the test set prevalence.

Parameters:

labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Ground truth (correct) labels for n_samples samples.
predicted_scores (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Predicted probabilities, as returned by a classifier’s predict_proba method.
sample_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] | None) – Optional array of sample weights for each instance.

Return type:

Returns:

The normalized entropy.
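The ratio in the definition can be written out directly. The sketch below is unweighted plain Python with hypothetical function names, and it assumes all scores lie strictly between 0 and 1 (no clipping is applied) and that both classes are present.

```python
import math


def log_loss_sketch(labels, scores):
    """Mean binary cross-entropy."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(labels, scores)
    ) / len(labels)


def normalized_entropy_sketch(labels, scores):
    """Model log loss divided by the log loss of always predicting the prevalence."""
    prevalence = sum(labels) / len(labels)
    baseline = log_loss_sketch(labels, [prevalence] * len(labels))
    return log_loss_sketch(labels, scores) / baseline


labels = [1, 0, 1, 0]
ne_baseline = normalized_entropy_sketch(labels, [0.5, 0.5, 0.5, 0.5])
ne_model = normalized_entropy_sketch(labels, [0.9, 0.1, 0.8, 0.2])
print(ne_baseline, round(ne_model, 3))
```

A value of 1.0 means the model is no better than predicting the base rate; values below 1.0 indicate an improvement over that baseline.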

mcgrad.metrics.calibration_free_normalized_entropy(labels, predicted_scores, sample_weight=None, tolerance=1e-05, max_iter=10000)

Calculates the Calibration-Free normalized entropy.

Parameters:

labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Ground truth (correct) labels for n_samples samples.
predicted_scores (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Predicted probabilities, as returned by a classifier’s predict_proba method.
sample_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] | None) – Optional array of sample weights for each instance.
tolerance (float) – Convergence tolerance for the iterative calibration adjustment. Defaults to 1e-5.
max_iter (int) – Maximum number of iterations for the calibration adjustment. Defaults to 10000.

Return type:

Returns:

The calibration-free normalized entropy (NE).

class mcgrad.metrics.MulticalibrationError(df, label_column, score_column, weight_column=None, categorical_segment_columns=None, numerical_segment_columns=None, max_depth=3, max_values_per_segment_feature=3, min_samples_per_segment=10, max_n_segments=1000, chunk_size=50, precision_dtype='float32', outcome_type='classification')

Bases: object

Evaluates calibration quality across multiple subpopulations (segments).

Multicalibration error (MCE) (introduced in [1]) extends traditional calibration metrics by measuring calibration not just globally, but across many automatically-generated segments of the data. This helps identify subpopulations where a model may be poorly calibrated even when global calibration appears good.

Supports both classification (binary outcomes) and regression (continuous outcomes) via the outcome_type parameter. For classification, the variance estimator uses the Bernoulli variance p(1-p). For regression, a successive-differences estimator (Appendix C, equation 6 of [1]) is used instead, which does not require knowing the true conditional variance of the outcome.

The metric is based on ECCE (Estimated Cumulative Calibration Error) [2].

[1] Guy, I., Haimovich, D., Linder, F., Okati, N., Perini, L., Tax, N., & Tygert, M. (2025). Measuring multi-calibration. arXiv preprint arXiv:2506.11251. (https://arxiv.org/abs/2506.11251)

[2] Arrieta-Ibarra, I., Gujral, P., Tannen, J., Tygert, M., & Xu, C. (2022). Metrics of calibration for probabilistic predictions. Journal of Machine Learning Research, 23(351), 1-54. (https://www.jmlr.org/papers/volume23/22-0658/22-0658.pdf)

Key Concepts

Segments

Subpopulations defined by combinations of feature values. Segments are generated automatically from categorical and numerical features at various depths (single features, pairs, triplets, etc.).
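To give a feel for how the number of segments grows with depth, the sketch below enumerates all value combinations for up to max_depth features. It is only an illustration: the function name and the feature names are hypothetical, and the filters the class applies (quantile binning of numerical features, minimum segment size, and the cap on the number of segments) are not modelled.

```python
from itertools import combinations, product


def enumerate_segments_sketch(feature_values, max_depth=2):
    """feature_values maps a feature name to its list of distinct values."""
    segments = [{}]  # depth 0: the global segment, with no constraints
    for depth in range(1, max_depth + 1):
        for features in combinations(sorted(feature_values), depth):
            # One segment per combination of values of the chosen features.
            for values in product(*(feature_values[f] for f in features)):
                segments.append(dict(zip(features, values)))
    return segments


segments = enumerate_segments_sketch(
    {"country": ["US", "DE"], "device": ["ios", "android", "web"]},
    max_depth=2,
)
print(len(segments))
print(segments[1], segments[-1])
```

Two features with 2 and 3 values already give 1 global segment, 5 single-feature segments, and 6 pair segments, which is why limits on depth and segment count matter for larger feature sets.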

Scales

Results are available in four scales:

Absolute: Raw ECCE value (same units as predictions, typically 0-1)
Relative (%): Percentage of prevalence, easier to interpret across datasets
Sigma: Statistical significance, values > 2 suggest miscalibration
P-value: Probability of seeing this ECCE under perfect calibration

MCE vs Global ECCE

global_ecce: ECCE computed on the entire dataset (no segmentation)
mce: Largest ECCE across all segments

Interpreting Results

mce_pvalue < 0.05: Statistically significant miscalibration detected
mce_sigma > 2: Miscalibration exceeds 2 standard deviations
mce_relative: Miscalibration as percentage of prevalence (e.g., 5% means predictions are off by 5% of the base rate in the worst segment)
mde_relative: Approximation of the minimum detectable error; miscalibration smaller than this cannot be reliably detected given the sample size

__init__(df, label_column, score_column, weight_column=None, categorical_segment_columns=None, numerical_segment_columns=None, max_depth=3, max_values_per_segment_feature=3, min_samples_per_segment=10, max_n_segments=1000, chunk_size=50, precision_dtype='float32', outcome_type='classification')

Initialize MulticalibrationError with data and segmentation parameters.

Parameters:

df (DataFrame) – DataFrame containing predictions, labels, and segment features.
label_column (str) – Column name containing labels. For classification, binary labels (0 or 1). For regression, continuous outcome values.
score_column (str) – Column name containing predicted scores. For classification, predicted probabilities (0 to 1). For regression, continuous predictions.
weight_column (str | None) – Optional column containing sample weights. If None, all samples are weighted equally.
categorical_segment_columns (list[str] | None) – Columns with categorical values to use for segmentation (e.g., country, device_type). Each unique value becomes a potential segment boundary.
numerical_segment_columns (list[str] | None) – Columns with numerical values to use for segmentation. Values are automatically quantile-binned.
max_depth (int | None) – Maximum depth of segment combinations. Depth 0 = global only, depth 1 = single features, depth 2 = pairs of features, etc. Higher depths find more granular miscalibration but increase computation.
max_values_per_segment_feature (int) – Maximum unique values per feature to consider. Features with more unique values are binned.
min_samples_per_segment (int) – Minimum samples required for a segment to be included. Smaller segments are excluded to reduce noise.
max_n_segments (int | None) – Maximum total segments to evaluate. Limits computation time for large feature spaces. Set to None for no limit.
chunk_size (int) – Number of segments to process per batch. Larger values improve speed but increase memory usage.
precision_dtype (str) – Floating-point precision for computations. Options: ‘float16’ (fast, less precise), ‘float32’ (balanced), ‘float64’ (precise).
outcome_type (str) – Type of outcome variable. Options: ‘classification’ for binary outcomes with Bernoulli variance, ‘regression’ for continuous outcomes using successive-differences variance estimation.

property segments_ecce: ndarray[tuple[Any, ...], dtype[float16 | float32 | float64]]

ECCE values per segment on absolute scale.

Returns an array where each element is the Estimated Cumulative Calibration Error for one segment. The first element (index 0) is always the global segment (entire dataset).

Values are in the same units as predictions (typically 0-1). Larger values indicate worse calibration in that segment.

Returns:

Array of shape (n_segments,) with ECCE values.

property segments_ecce_relative: ndarray[tuple[Any, ...], dtype[float16 | float32 | float64]]

ECCE values per segment on relative (prevalence-adjusted percentage) scale.

Values are expressed as a percentage of the label prevalence, making them easier to interpret across datasets with different base rates. For example, a value of 10 means the calibration error is 10% of the prevalence.

Returns:

Array of shape (n_segments,) with relative ECCE values (%).

property global_ecce: float

Global ECCE on absolute scale.

ECCE computed on the entire dataset without segmentation. This is equivalent to segments_ecce[0] since the first segment is always the global segment.

Use this to assess overall model calibration before looking at segment-level miscalibration.

Returns:

Global ECCE value.

property global_ecce_relative: float

Global ECCE on relative (prevalence-adjusted percentage) scale.

ECCE computed on the entire dataset without segmentation. This is equivalent to segments_ecce_relative[0] since the first segment is always the global segment.

Use this to assess overall model calibration before looking at segment-level miscalibration.

Returns:

Global ECCE as percentage of prevalence.

property global_ecce_sigma: float

Global ECCE on sigma scale.

Indicates the statistical significance of the global ECCE value. Values above 5 indicate strong evidence of miscalibration.

Returns:

Global ECCE in standard deviations.

property global_ecce_pvalue: float

P-value for global ECCE calibration test.

The probability of observing this ECCE value (or larger) if the model were perfectly calibrated.

Returns:

p-value between 0 and 1.

property segments_ecce_sigma: ndarray[tuple[Any, ...], dtype[float16 | float32 | float64]]

ECCE values per segment on sigma scale.

Each value represents how many standard deviations the observed ECCE is from zero (perfect calibration). Values above 5 indicate strong evidence of miscalibration. You can plot the distribution of these values with plotting.plot_segment_calibration_errors.

Returns:

Array of shape (n_segments,).

property segments_ecce_pvalue: ndarray[tuple[Any, ...], dtype[float16 | float32 | float64]]

p-values per segment for ECCE calibration test.

Each value is the probability of observing the corresponding ECCE (or larger) if the model were perfectly calibrated in that segment.

Returns:

Array of shape (n_segments,) with p-values between 0 and 1.

property mce_sigma: float

Multicalibration error on sigma scale.

The largest ECCE-sigma across all segments. This identifies the segment with the most statistically significant miscalibration. Values above 5 indicate significant miscalibration.

Returns:

Maximum segment ECCE-sigma.

property mce: float

Multicalibration error on absolute scale.

The largest ECCE across all segments, converted to absolute scale. This represents the worst-case calibration error found in any segment.

Returns:

Maximum ECCE value.

property mce_relative: float

Multicalibration error on relative (prevalence-adjusted percentage) scale.

The MCE expressed as a percentage of label prevalence. For example, if prevalence is 10% and mce_relative is 20, the worst segment has predictions off by 2 percentage points (20% of 10%).

This is often the most interpretable metric for comparing calibration across different datasets or use cases.

Returns:

MCE as percentage of prevalence.
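The arithmetic in the example above (a prevalence of 10% and a relative MCE of 20) can be reproduced with a one-line conversion. The helper name is hypothetical and simply restates the "percentage of prevalence" definition.

```python
def relative_to_absolute_sketch(relative_percent, prevalence):
    """Convert an error given as a percentage of prevalence to probability units."""
    return prevalence * relative_percent / 100.0


absolute_gap = relative_to_absolute_sketch(relative_percent=20.0, prevalence=0.10)
print(round(absolute_gap, 4))  # 0.02, i.e. 2 percentage points
```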

property mce_pvalue: float

p-value for the multicalibration error.

The probability of observing this MCE (or larger) if the model were perfectly calibrated across all segments. This is the minimum p-value across all segments.

Note that this p-value is not adjusted for multiple testing, so the Type I error rate (concluding there is miscalibration when there is none) will be higher in practice. We expect any required adjustment to be small because the hypotheses are highly correlated (many segments overlap). We therefore did not apply common corrections such as Bonferroni, as they would be overly conservative and could substantially increase Type II errors (failing to detect miscalibration when it exists).

Returns:

p-value between 0 and 1.

property mde: float

Minimum detectable error on absolute (probability) scale.

The MDE represents the smallest calibration error that would be statistically detectable at approximately 5 sigma significance, given the sample size. Expressed as an absolute probability difference.

Returns:

Minimum detectable error as an absolute probability difference.

property mde_relative: float

Minimum detectable error on relative (prevalence-adjusted percentage) scale.

The smallest calibration error that can be reliably detected given the sample size and variance in the data. Miscalibration smaller than this value may not be statistically significant even if present.

Based on a 5-sigma detection threshold (very high confidence).

Returns:

MDE as percentage of prevalence.

mcgrad.metrics.soft_label_log_loss(y_true, y_pred, sample_weight=None)

Binary cross-entropy loss that supports soft labels in [0, 1].

Equivalent to sklearn’s log_loss for hard binary labels, but also supports continuous labels (e.g. confidence scores) in the [0, 1] interval.

Parameters:

y_true (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Ground truth labels, either binary {0, 1} or soft floats in [0, 1].
y_pred (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Predicted probabilities in (0, 1).
sample_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] | None) – Optional sample weights.

Return type:

Returns:

Weighted mean cross-entropy loss.
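The loss itself is the usual binary cross-entropy formula, which remains well defined when the label is a fraction rather than 0 or 1. A minimal plain-Python sketch follows; the function name is hypothetical, and no clipping of predictions is applied, so inputs must lie strictly inside (0, 1).

```python
import math


def soft_label_log_loss_sketch(y_true, y_pred, sample_weight=None):
    """Weighted mean of -(y * log(p) + (1 - y) * log(1 - p)) with y in [0, 1]."""
    weights = sample_weight if sample_weight is not None else [1.0] * len(y_true)
    weighted_loss = sum(
        -w * (y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
        for y, p, w in zip(y_true, y_pred, weights)
    )
    return weighted_loss / sum(weights)


hard = soft_label_log_loss_sketch([1, 0], [0.8, 0.3])
soft = soft_label_log_loss_sketch([0.5], [0.5])
print(round(hard, 4), round(soft, 4))
```

For a soft label of 0.5 predicted as 0.5 the loss is log(2), the minimum achievable for that label, which shows that soft labels have a nonzero loss floor.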

mcgrad.metrics.wrap_sklearn_metric_func(func)

Wrap an sklearn-style metric function for use with the evaluation framework.

Parameters:

func (Callable[..., float]) – A function with signature (y_true, y_pred, sample_weight=None) -> float.

Return type:

_ScoreFunctionInterface

Returns:

A ScoreFunctionInterface-compatible wrapper.

mcgrad.metrics.wrap_multicalibration_error_metric(categorical_segment_columns=None, numerical_segment_columns=None, max_depth=3, max_values_per_segment_feature=3, min_samples_per_segment=10, max_n_segments=1000, metric_version='mce_relative', outcome_type='classification')

Create a wrapped MulticalibrationError metric for use with the evaluation framework.

Parameters:

categorical_segment_columns (list[str] | None) – Columns to use for categorical segmentation.
numerical_segment_columns (list[str] | None) – Columns to use for numerical segmentation.
max_depth (int) – Maximum depth for segment generation.
max_values_per_segment_feature (int) – Max unique values per segment feature.
min_samples_per_segment (int) – Minimum samples required per segment.
max_n_segments (int | None) – Maximum number of segments to generate.
metric_version (str) – Which metric to return. Options:
- 'mce_relative': relative (prevalence-adjusted percentage) scale (default, classification only)
- 'mce': absolute scale
- 'mce_sigma': sigma (z-score) scale
- 'mce_pvalue': p-value
Legacy names are also supported but deprecated:
- 'mce_sigma_scale' -> 'mce_sigma'
- 'mce_absolute' -> 'mce'
- 'p_value' -> 'mce_pvalue'
outcome_type (str) – Type of outcome variable. Options: ‘classification’ (default) or ‘regression’. When ‘regression’, ‘mce_relative’ is not available.

Return type:

_ScoreFunctionInterface

Returns:

A ScoreFunctionInterface-compatible wrapper.