Metrics Module

Calibration and multicalibration evaluation metrics.

This module provides metrics for evaluating the calibration quality of probabilistic predictions, with a focus on multicalibration, that is, calibration across multiple subpopulations.

Key metric families include:

Calibration Error Metrics

Standard and adaptive calibration error measures using binning approaches.

ECCE (Estimated Cumulative Calibration Error) Metrics

Statistical tests and metrics based on the ECCE statistic (also known as Kuiper calibration).

Ranking Metrics

Evaluation metrics for ranked predictions (DCG, NDCG, etc.).

Classification Metrics

Standard precision, recall, and threshold-based metrics.

mcgrad.metrics.expected_calibration_error(labels, predicted_scores, sample_weight=None, num_bins=40, epsilon=1e-07, adjust_unjoined=False, **kwargs)[source]

Calculate the Expected Calibration Error (ECE).

Parameters:
Return type:

float

Returns:

The expected calibration error.

mcgrad.metrics.proportional_expected_calibration_error(labels, predicted_scores, sample_weight=None, num_bins=40, epsilon=1e-07, adjust_unjoined=False, **kwargs)[source]

Calculate the Proportional Expected Calibration Error.

Uses proportional error instead of absolute error for bin error calculation.

Parameters:
Return type:

float

Returns:

The proportional expected calibration error.

mcgrad.metrics.adaptive_calibration_error(labels, predicted_scores, sample_weight=None, num_bins=40, epsilon=1e-07, adjust_unjoined=False, **kwargs)[source]

Calculate the Adaptive Calibration Error (ACE).

Unlike ECE which uses equispaced bins, ACE uses bins with equal numbers of samples.

Parameters:
Return type:

float

Returns:

The adaptive calibration error.
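The difference between equispaced and equal-count binning can be sketched with NumPy. This is a simplified illustration, not the library's implementation: the helper `binned_calibration_error` is hypothetical, it ignores `sample_weight`, `epsilon`, and `adjust_unjoined`, and it assumes the per-bin error is the absolute gap between the mean score and the mean label.

```python
import numpy as np

def binned_calibration_error(labels, scores, num_bins=10, adaptive=False):
    """Bin-weighted mean of |mean score - mean label| (illustrative only)."""
    labels = np.asarray(labels, dtype=float)
    scores = np.asarray(scores, dtype=float)
    if adaptive:
        # ACE-style: bin edges at score quantiles, so bins hold similar counts.
        edges = np.quantile(scores, np.linspace(0.0, 1.0, num_bins + 1))
    else:
        # ECE-style: equispaced bin edges on [0, 1].
        edges = np.linspace(0.0, 1.0, num_bins + 1)
    # Map each score to a bin index; clip so the largest score lands in the last bin.
    bin_ids = np.clip(np.searchsorted(edges, scores, side="right") - 1, 0, num_bins - 1)
    error = 0.0
    for b in range(num_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(scores[in_bin].mean() - labels[in_bin].mean())
            error += in_bin.mean() * gap  # weight the gap by the bin's share of samples
    return error

rng = np.random.default_rng(0)
scores = rng.beta(1.0, 8.0, size=5000)   # scores concentrated near zero
labels = rng.binomial(1, scores)         # labels drawn from the scores, so calibrated
ece = binned_calibration_error(labels, scores, adaptive=False)
ace = binned_calibration_error(labels, scores, adaptive=True)
```

With skewed scores like these, equispaced bins leave the upper bins nearly empty, while quantile bins keep a comparable number of samples in every bin.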

mcgrad.metrics.proportional_adaptive_calibration_error(labels, predicted_scores, sample_weight=None, num_bins=40, epsilon=1e-07, adjust_unjoined=False, **kwargs)[source]

Calculate the Proportional Adaptive Calibration Error.

Combines adaptive binning with proportional error calculation.

Parameters:
Return type:

float

Returns:

The proportional adaptive calibration error.

mcgrad.metrics.calibration_ratio(labels, predicted_scores, sample_weight=None, adjust_unjoined=False, **kwargs)[source]

Calculate the calibration ratio (sum of predictions / sum of labels).

A value of 1.0 indicates perfect calibration on aggregate.

Parameters:
Return type:

float

Returns:

The calibration ratio. Returns np.inf if labels sum to zero but predictions sum to a positive value. Returns np.nan if both labels and predictions sum to zero.
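The ratio and its documented edge cases can be sketched as follows. The helper `calibration_ratio_sketch` is hypothetical; it assumes weights multiply both sums and it does not model `adjust_unjoined`.

```python
import numpy as np

def calibration_ratio_sketch(labels, scores, sample_weight=None):
    """Sum of (weighted) predictions divided by sum of (weighted) labels."""
    labels = np.asarray(labels, dtype=float)
    scores = np.asarray(scores, dtype=float)
    if sample_weight is None:
        weights = np.ones_like(labels)
    else:
        weights = np.asarray(sample_weight, dtype=float)
    total_pred = float(np.sum(weights * scores))
    total_label = float(np.sum(weights * labels))
    if total_label == 0.0:
        # Edge cases described above: inf for positive predictions, nan if both are zero.
        return float("inf") if total_pred > 0.0 else float("nan")
    return total_pred / total_label

well_calibrated = calibration_ratio_sketch([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
over_predicting = calibration_ratio_sketch([1, 0, 0, 0], [0.5, 0.5, 0.5, 0.5])
```

A ratio above 1.0 means the predictions sum to more than the observed labels (over-prediction); below 1.0 means under-prediction.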

mcgrad.metrics.recall(labels, predicted_labels, sample_weight=None, **kwargs)[source]

Calculate recall (true positive rate).

Parameters:
Return type:

float

Returns:

The recall score.

mcgrad.metrics.precision(labels, predicted_labels, precision_weight=None, **kwargs)[source]

Calculate precision (positive predictive value).

Parameters:
Return type:

float

Returns:

The precision score.

mcgrad.metrics.youdens_j(labels, predicted_scores, sample_weight=None, **kwargs)[source]

Calculate the continuous net detection rate (Youden’s J).

Computes E[scores | positive label] - E[scores | negative label]. When the scores are binary predictions this reduces to the classical TPR - FPR. The value ranges from -1 to 1, where 1 indicates perfect separation, 0 indicates no discriminative ability, and negative values indicate an inverted model.

Parameters:
Return type:

float

Returns:

The continuous net detection rate.
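The reduction to TPR - FPR for binary predictions can be checked numerically. The helper below is a hypothetical, unweighted sketch of the formula stated above, not the library function.

```python
import numpy as np

def youdens_j_sketch(labels, scores):
    """Mean score among positives minus mean score among negatives."""
    is_positive = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    return float(scores[is_positive].mean() - scores[~is_positive].mean())

labels = np.array([1, 1, 1, 0, 0, 0, 0])
binary_preds = np.array([1, 1, 0, 1, 0, 0, 0])

tpr = binary_preds[labels == 1].mean()  # 2 of 3 positives predicted positive
fpr = binary_preds[labels == 0].mean()  # 1 of 4 negatives predicted positive
j_binary = youdens_j_sketch(labels, binary_preds)
```

Here `j_binary` equals `tpr - fpr`, because the mean of a 0/1 prediction within a class is exactly the rate of positive predictions in that class.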

mcgrad.metrics.fpr(labels, predicted_labels, sample_weight=None, **kwargs)[source]

Calculate the false positive rate (FPR).

Parameters:
Return type:

float

Returns:

The false positive rate.

mcgrad.metrics.fpr_with_mask(y_true, y_pred, y_mask, sample_weight, denominator)[source]

Calculate the false positive rate with a mask applied.

Only samples where y_mask is True are considered when counting false positives. This is useful for computing FPR within a specific segment or subpopulation while using a shared denominator across segments.

Parameters:
Return type:

float | None

Returns:

The false positive rate, or None if denominator is zero.
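A minimal sketch of the masked computation, assuming the numerator is the weighted count of false positives inside the mask and the denominator is supplied by the caller (for example, the total weight of negatives across all segments). The helper name is hypothetical.

```python
import numpy as np

def fpr_with_mask_sketch(y_true, y_pred, y_mask, sample_weight, denominator):
    """Weighted false positives inside the mask, divided by a shared denominator."""
    if denominator == 0:
        return None
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    y_mask = np.asarray(y_mask).astype(bool)
    weights = np.asarray(sample_weight, dtype=float)
    # A false positive is a negative sample that was predicted positive.
    false_positives = np.sum(weights[y_mask & y_pred & ~y_true])
    return float(false_positives / denominator)

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 1, 0]
weights = [1.0] * 6
total_negatives = 4.0  # shared denominator: total weight of negative samples

fpr_a = fpr_with_mask_sketch(y_true, y_pred, [1, 1, 1, 0, 0, 0], weights, total_negatives)
fpr_b = fpr_with_mask_sketch(y_true, y_pred, [0, 0, 0, 1, 1, 1], weights, total_negatives)
```

Because both segments share one denominator, their contributions add up to the overall false positive rate (0.25 + 0.25 = 0.5 in this example).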

mcgrad.metrics.dcg_score(labels, predicted_labels, rank_discount=<function rank_no_discount>, k=None)[source]

Calculates the DCG score: https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Discounted_Cumulative_Gain.

Parameters:
Return type:

float

Returns:

The DCG score as a float, or np.nan if the input arrays are empty.

mcgrad.metrics.ndcg_score(labels, predicted_labels, rank_discount=<function rank_no_discount>, k=None, sample_weight=None)[source]

Calculates the Normalized Discounted Cumulative Gain (NDCG).

Parameters:
Return type:

float

Returns:

The NDCG score as a float in [0,1], or np.nan if the input arrays are empty.
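DCG and NDCG can be sketched as below. Several details are assumptions rather than the library's behavior: gains are taken to be the raw labels, items are ranked by descending prediction, the discount callable takes the number of ranks and returns one weight per rank, and a zero ideal DCG yields NaN. The names `no_discount`, `log2_discount`, `dcg_sketch`, and `ndcg_sketch` are hypothetical.

```python
import numpy as np

def no_discount(n):
    """Every rank gets weight 1."""
    return np.ones(n)

def log2_discount(n):
    """Classic 1 / log2(rank + 1) weights for ranks 1..n."""
    return 1.0 / np.log2(np.arange(2, n + 2))

def dcg_sketch(labels, predictions, rank_discount=log2_discount, k=None):
    labels = np.asarray(labels, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    if labels.size == 0:
        return float("nan")
    order = np.argsort(-predictions, kind="stable")  # best-scored item first
    gains = labels[order][:k]                        # k=None keeps all items
    return float(np.sum(gains * rank_discount(gains.size)))

def ndcg_sketch(labels, predictions, rank_discount=log2_discount, k=None):
    # The ideal ordering ranks items by their true labels.
    ideal = dcg_sketch(labels, labels, rank_discount, k)
    if not ideal > 0:  # covers both NaN (empty input) and zero gain
        return float("nan")
    return dcg_sketch(labels, predictions, rank_discount, k) / ideal

labels = [3, 2, 0, 1]
predictions = [0.9, 0.8, 0.7, 0.1]
ndcg = ndcg_sketch(labels, predictions)
undiscounted = dcg_sketch(labels, predictions, rank_discount=no_discount)
```

Without a discount and without `k`, DCG is simply the sum of the labels, so ranking quality only becomes visible once a discount or a cutoff is applied.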

mcgrad.metrics.recall_at_precision(y_true, y_scores, precision_target=0.95, sample_weight=None)[source]

Calculate the maximum recall at a given precision threshold.

Parameters:
Return type:

float

Returns:

Maximum recall achievable at the precision target, or 0 if unachievable.
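One way to compute this quantity is to sweep every score cutoff and keep the best recall among cutoffs that meet the precision target. The sketch below is a hypothetical, unweighted helper that assumes non-empty input with at least one positive label; the library may handle ties and weights differently.

```python
import numpy as np

def recall_at_precision_sketch(y_true, y_scores, precision_target=0.95):
    y_true = np.asarray(y_true, dtype=float)
    y_scores = np.asarray(y_scores, dtype=float)
    order = np.argsort(-y_scores, kind="stable")  # highest scores first
    sorted_scores = y_scores[order]
    true_positives = np.cumsum(y_true[order])
    predicted_positives = np.arange(1, y_true.size + 1)
    # A cutoff is only valid at the last item of a run of tied scores.
    is_cutoff = np.append(sorted_scores[1:] != sorted_scores[:-1], True)
    precision = true_positives[is_cutoff] / predicted_positives[is_cutoff]
    recall = true_positives[is_cutoff] / y_true.sum()
    feasible = precision >= precision_target
    # Fall back to 0 when no cutoff reaches the target, as documented above.
    return float(recall[feasible].max()) if feasible.any() else 0.0

y_true = [1, 1, 0, 1, 0, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
strict = recall_at_precision_sketch(y_true, y_scores, precision_target=0.95)
relaxed = recall_at_precision_sketch(y_true, y_scores, precision_target=0.75)
```

Relaxing the precision target lets the cutoff move further down the ranking, which can only keep or increase the achievable recall.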

mcgrad.metrics.precision_at_predictive_prevalence(y_true, y_scores, predictive_prevalence_target=0.95, sample_weight=None)[source]

Calculate precision at a given predictive prevalence threshold.

Predictive prevalence is the fraction of samples predicted as positive.

Parameters:
Return type:

float

Returns:

Maximum precision at the target predictive prevalence, or np.nan if no threshold can achieve the target prevalence.
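To make the definition of predictive prevalence concrete, the sketch below computes it together with precision for a single score threshold. It is a hypothetical helper that illustrates the two quantities only; it does not reproduce how the library searches for the threshold matching a target prevalence.

```python
import numpy as np

def precision_at_threshold(y_true, y_scores, threshold):
    """Return (predictive prevalence, precision) for one threshold."""
    y_true = np.asarray(y_true, dtype=float)
    flagged = np.asarray(y_scores, dtype=float) >= threshold
    predictive_prevalence = float(flagged.mean())  # fraction predicted positive
    precision = float(y_true[flagged].mean()) if flagged.any() else float("nan")
    return predictive_prevalence, precision

y_true = [1, 1, 0, 1, 0, 0, 0, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
prevalence, precision = precision_at_threshold(y_true, y_scores, threshold=0.6)
```

In this example half of the samples are flagged as positive, and three of those four flagged samples are true positives.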

mcgrad.metrics.precision_at_recall(y_true, y_scores, recall_target=0.95, sample_weight=None)[source]

Calculate the maximum precision at a given recall threshold.

Parameters:
Return type:

float

Returns:

Maximum precision at the recall target, or 0 if unachievable.

mcgrad.metrics.fpr_at_precision(y_true, y_scores, precision_target=0.95, sample_weight=None)[source]

Calculate the false positive rate at a given precision threshold.

Parameters:
Return type:

float

Returns:

False positive rate at the precision target, or np.nan if unachievable.

mcgrad.metrics.multi_cg_score(labels, predictions, segments_df, metric=<function ndcg_score>, rank_discount=<function rank_no_discount>, k=None)[source]

Calculates the metric score for each segment.

Parameters:
  • labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of true labels.

  • predictions (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Array of predicted scores.

  • segments_df (DataFrame) – DataFrame defining the segments over which the metric is calculated.

  • metric (_MulticalibrationRankErrorMetricsInterface) – The cumulative gain metric to use. Defaults to ndcg_score.

  • rank_discount (Callable[[int], ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]]) – Rank discount function of the metric. Defaults to no discount.

  • k (int | None) – If not None, the metric is calculated only based on the top k samples. k cannot be smaller than 1 and cannot be larger than the number of samples.

Return type:

Series

Returns:

A Series of size n_segments with the metric score for each segment.

mcgrad.metrics.ecce_pvalue_from_sigma(ecce_sigma)[source]

Compute p-value from a sigma-scaled ECCE statistic.

Accepts both scalar and array inputs.

Parameters:

ecce_sigma (float | ndarray[tuple[Any, ...], dtype[Any]]) – The ECCE statistic normalized by standard deviation. Can be a scalar or array.

Return type:

float | ndarray[tuple[Any, ...], dtype[float64]]

Returns:

The p-value(s) from the ECCE test. Returns a scalar when the input is scalar, an array otherwise.

mcgrad.metrics.ecce(labels, predicted_scores, sample_weight=None)[source]

Calculate the Estimated Cumulative Calibration Error (ECCE) [1].

ECCE measures the maximum deviation between the cumulative distribution of predicted probabilities for positive and negative examples. It is equivalent to the unnormalized Kuiper calibration statistic.

[1]: Arrieta-Ibarra, I., Gujral, P., Tannen, J., Tygert, M., & Xu, C. (2022). Metrics of calibration for probabilistic predictions. Journal of Machine Learning Research, 23(351), 1-54. (https://tygert.com/ece.pdf)

Parameters:
Return type:

float

Returns:

The ECCE value.

mcgrad.metrics.ecce_sigma(labels, predicted_scores, sample_weight=None)[source]

Calculate the ECCE normalized by standard deviation.

This returns the ECCE statistic normalized by the standard deviation of the calibration error under the null hypothesis of perfect calibration.

Parameters:
Return type:

float

Returns:

The normalized ECCE value.

mcgrad.metrics.ecce_pvalue(labels, predicted_scores, sample_weight=None)[source]

Calculate the p-value for the ECCE statistic.

Tests the null hypothesis that predictions are perfectly calibrated using the Kuiper test.

Parameters:
Return type:

float

Returns:

The p-value from the calibration test.

mcgrad.metrics.rank_calibration_error(labels, predicted_labels, num_bins=40)[source]

Calculates rank calibration error as proposed in: https://arxiv.org/pdf/2404.03163

Parameters:
Return type:

float

Returns:

The rank calibration error.

mcgrad.metrics.rank_multicalibration_error(labels, predicted_labels, segments_df, num_bins=40)[source]

Calculates the rank calibration error for each segment and returns their weighted average.

Parameters:
Return type:

float

Returns:

The weighted average of rank calibration errors across all segments.

mcgrad.metrics.normalized_entropy(labels, predicted_scores, sample_weight=None)[source]

Calculates the normalized entropy, defined as the ratio between the prediction’s log loss (binary cross entropy) and the log loss obtained from fixed predictions equal to the test set prevalence.

Parameters:
Return type:

float

Returns:

The normalized entropy.
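The ratio described above can be sketched directly. The helpers are hypothetical and unweighted, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np

def log_loss_sketch(labels, scores, eps=1e-15):
    """Mean binary cross-entropy with clipped probabilities."""
    p = np.clip(scores, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p)))

def normalized_entropy_sketch(labels, scores):
    labels = np.asarray(labels, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Baseline: predict the observed prevalence for every sample.
    baseline = np.full_like(scores, labels.mean())
    return log_loss_sketch(labels, scores) / log_loss_sketch(labels, baseline)

labels = [1, 0, 0, 0]
ne_model = normalized_entropy_sketch(labels, [0.7, 0.1, 0.2, 0.1])
ne_baseline = normalized_entropy_sketch(labels, [0.25, 0.25, 0.25, 0.25])
```

A value of 1.0 means the model is no better than always predicting the prevalence; values below 1.0 indicate an improvement over that baseline.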

mcgrad.metrics.calibration_free_normalized_entropy(labels, predicted_scores, sample_weight=None, tolerance=1e-05, max_iter=10000)[source]

Calculates the calibration-free normalized entropy (NE).

Parameters:
  • labels (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Ground truth (correct) labels for n_samples samples.

  • predicted_scores (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]) – Predicted probabilities, as returned by a classifier’s predict_proba method.

  • sample_weight (ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] | None) – Optional array of sample weights for each instance.

  • tolerance (float) – Convergence tolerance for the iterative calibration adjustment. Defaults to 1e-5.

  • max_iter (int) – Maximum number of iterations for the calibration adjustment. Defaults to 10000.

Return type:

float

Returns:

The calibration-free NE.

class mcgrad.metrics.MulticalibrationError(df, label_column, score_column, weight_column=None, categorical_segment_columns=None, numerical_segment_columns=None, max_depth=3, max_values_per_segment_feature=3, min_samples_per_segment=10, max_n_segments=1000, chunk_size=50, precision_dtype='float32', outcome_type='classification')[source]

Bases: object

Evaluates calibration quality across multiple subpopulations (segments).

Multicalibration error (MCE) (introduced in [1]) extends traditional calibration metrics by measuring calibration not just globally, but across many automatically-generated segments of the data. This helps identify subpopulations where a model may be poorly calibrated even when global calibration appears good.

Supports both classification (binary outcomes) and regression (continuous outcomes) via the outcome_type parameter. For classification, the variance estimator uses the Bernoulli variance p(1-p). For regression, a successive-differences estimator (Appendix C, equation 6 of [1]) is used instead, which does not require knowing the true conditional variance of the outcome.

The metric is based on ECCE (Estimated Cumulative Calibration Error) [2].

[1] Guy, I., Haimovich, D., Linder, F., Okati, N., Perini, L., Tax, N., & Tygert, M. (2025). Measuring multi-calibration. arXiv preprint arXiv:2506.11251. (https://arxiv.org/abs/2506.11251)

[2] Arrieta-Ibarra, I., Gujral, P., Tannen, J., Tygert, M., & Xu, C. (2022). Metrics of calibration for probabilistic predictions. Journal of Machine Learning Research, 23(351), 1-54. (https://www.jmlr.org/papers/volume23/22-0658/22-0658.pdf)

Key Concepts

Segments

Subpopulations defined by combinations of feature values. Segments are generated automatically from categorical and numerical features at various depths (single features, pairs, triplets, etc.).

Scales

Results are available in four scales:

  • Absolute: Raw ECCE value (same units as predictions, typically 0-1)

  • Relative (%): Percentage of prevalence, easier to interpret across datasets

  • Sigma: Statistical significance, values > 2 suggest miscalibration

  • P-value: Probability of seeing this ECCE under perfect calibration

MCE vs Global ECCE

  • global_ecce: ECCE computed on the entire dataset (no segmentation)

  • mce: Largest ECCE across all segments

Interpreting Results

  • mce_pvalue < 0.05: Statistically significant miscalibration detected

  • mce_sigma > 2: Miscalibration exceeds 2 standard deviations

  • mce_relative: Miscalibration as percentage of prevalence (e.g., 5% means predictions are off by 5% of the base rate in the worst segment)

  • mde_relative: Approximation of the minimum detectable error; miscalibration smaller than this cannot be reliably detected given the sample size

__init__(df, label_column, score_column, weight_column=None, categorical_segment_columns=None, numerical_segment_columns=None, max_depth=3, max_values_per_segment_feature=3, min_samples_per_segment=10, max_n_segments=1000, chunk_size=50, precision_dtype='float32', outcome_type='classification')[source]

Initialize MulticalibrationError with data and segmentation parameters.

Parameters:
  • df (DataFrame) – DataFrame containing predictions, labels, and segment features.

  • label_column (str) – Column name containing labels. For classification, binary labels (0 or 1). For regression, continuous outcome values.

  • score_column (str) – Column name containing predicted scores. For classification, predicted probabilities (0 to 1). For regression, continuous predictions.

  • weight_column (str | None) – Optional column containing sample weights. If None, all samples are weighted equally.

  • categorical_segment_columns (list[str] | None) – Columns with categorical values to use for segmentation (e.g., country, device_type). Each unique value becomes a potential segment boundary.

  • numerical_segment_columns (list[str] | None) – Columns with numerical values to use for segmentation. Values are automatically quantile-binned.

  • max_depth (int | None) – Maximum depth of segment combinations. Depth 0 = global only, depth 1 = single features, depth 2 = pairs of features, etc. Higher depths find more granular miscalibration but increase computation.

  • max_values_per_segment_feature (int) – Maximum unique values per feature to consider. Features with more unique values are binned.

  • min_samples_per_segment (int) – Minimum samples required for a segment to be included. Smaller segments are excluded to reduce noise.

  • max_n_segments (int | None) – Maximum total segments to evaluate. Limits computation time for large feature spaces. Set to None for no limit.

  • chunk_size (int) – Number of segments to process per batch. Larger values improve speed but increase memory usage.

  • precision_dtype (str) – Floating-point precision for computations. Options: ‘float16’ (fast, less precise), ‘float32’ (balanced), ‘float64’ (precise).

  • outcome_type (str) – Type of outcome variable. Options: ‘classification’ for binary outcomes with Bernoulli variance, ‘regression’ for continuous outcomes using successive-differences variance estimation.
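The meaning of max_depth can be illustrated by enumerating which feature combinations are considered at each depth. This is only a sketch of the idea: `feature_combinations` is a hypothetical helper, the column name `age_bucket` is made up, and the class's actual segment generation (value binning, min_samples_per_segment, max_n_segments) is not modeled.

```python
from itertools import combinations

def feature_combinations(features, max_depth):
    """List the feature tuples examined at depths 0..max_depth."""
    combos = [()]  # depth 0: the global segment, with no feature constraints
    for depth in range(1, max_depth + 1):
        combos.extend(combinations(features, depth))
    return combos

combos = feature_combinations(["country", "device_type", "age_bucket"], max_depth=2)
for combo in combos:
    print(len(combo), combo)
```

With three features and a depth of 2 this gives 1 + 3 + 3 = 7 feature combinations. Each combination then expands into one segment per combination of feature values, which is why the number of segments grows quickly with depth.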

property segments_ecce: ndarray[tuple[Any, ...], dtype[float16 | float32 | float64]]

ECCE values per segment on absolute scale.

Returns an array where each element is the Estimated Cumulative Calibration Error for one segment. The first element (index 0) is always the global segment (entire dataset).

Values are in the same units as predictions (typically 0-1). Larger values indicate worse calibration in that segment.

Returns:

Array of shape (n_segments,) with ECCE values.

property segments_ecce_relative: ndarray[tuple[Any, ...], dtype[float16 | float32 | float64]]

ECCE values per segment on relative (prevalence-adjusted percentage) scale.

Values are expressed as a percentage of the label prevalence, making them easier to interpret across datasets with different base rates. For example, a value of 10 means the calibration error is 10% of the prevalence.

Returns:

Array of shape (n_segments,) with relative ECCE values (%).

property global_ecce: float

Global ECCE on absolute scale.

ECCE computed on the entire dataset without segmentation. This is equivalent to segments_ecce[0] since the first segment is always the global segment.

Use this to assess overall model calibration before looking at segment-level miscalibration.

Returns:

Global ECCE value.

property global_ecce_relative: float

Global ECCE on relative (prevalence-adjusted percentage) scale.

ECCE computed on the entire dataset without segmentation. This is equivalent to segments_ecce_relative[0] since the first segment is always the global segment.

Use this to assess overall model calibration before looking at segment-level miscalibration.

Returns:

Global ECCE as percentage of prevalence.

property global_ecce_sigma: float

Global ECCE on sigma scale.

Indicates the statistical significance of the global ECCE value. Values above 5 indicate strong evidence of miscalibration.

Returns:

Global ECCE in standard deviations.

property global_ecce_pvalue: float

P-value for global ECCE calibration test.

The probability of observing this ECCE value (or larger) if the model were perfectly calibrated.

Returns:

p-value between 0 and 1.

property segments_ecce_sigma: ndarray[tuple[Any, ...], dtype[float16 | float32 | float64]]

ECCE values per segment on sigma scale.

Each value represents how many standard deviations the observed ECCE is from zero (perfect calibration). Values above 5 indicate strong evidence of miscalibration. You can plot the distribution of these values with plotting.plot_segment_calibration_errors.

Returns:

Array of shape (n_segments,).

property segments_ecce_pvalue: ndarray[tuple[Any, ...], dtype[float16 | float32 | float64]]

p-values per segment for ECCE calibration test.

Each value is the probability of observing the corresponding ECCE (or larger) if the model were perfectly calibrated in that segment.

Returns:

Array of shape (n_segments,) with p-values between 0 and 1.

property mce_sigma: float

Multicalibration error on sigma scale.

The largest ECCE-sigma across all segments. This identifies the segment with the most statistically significant miscalibration. Values above 5 indicate significant miscalibration.

Returns:

Maximum segment ECCE-sigma.

property mce: float

Multicalibration error on absolute scale.

The largest ECCE across all segments, converted to absolute scale. This represents the worst-case calibration error found in any segment.

Returns:

Maximum ECCE value.

property mce_relative: float

Multicalibration error on relative (prevalence-adjusted percentage) scale.

The MCE expressed as a percentage of label prevalence. For example, if prevalence is 10% and mce_relative is 20, the worst segment has predictions off by 2 percentage points (20% of 10%).

This is often the most interpretable metric for comparing calibration across different datasets or use cases.

Returns:

MCE as percentage of prevalence.
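The arithmetic in the example above can be written out as a pair of conversions. These helpers are hypothetical and assume that the relative scale is simply 100 times the absolute value divided by the prevalence, which is what the worked example implies.

```python
def relative_to_absolute(relative_pct, prevalence):
    """Convert a percentage of prevalence into an absolute probability difference."""
    return relative_pct / 100.0 * prevalence

def absolute_to_relative(absolute, prevalence):
    """Convert an absolute probability difference into a percentage of prevalence."""
    return 100.0 * absolute / prevalence

absolute_error = relative_to_absolute(20.0, prevalence=0.10)
round_trip = absolute_to_relative(absolute_error, prevalence=0.10)
```

Here `absolute_error` is 0.02, that is, 2 percentage points, matching the example in the text.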

property mce_pvalue: float

p-value for the multicalibration error.

The probability of observing this MCE (or larger) if the model were perfectly calibrated across all segments. This is the minimum p-value across all segments.

Note that this p-value is not adjusted for multiple testing, so the Type I error rate (concluding there is miscalibration when there is none) will be higher in practice. We expect any required adjustment to be small because the hypotheses are highly correlated (many segments overlap). We therefore did not apply common corrections such as Bonferroni, as they would be overly conservative and could substantially increase Type II errors (failing to detect miscalibration when it exists).

Returns:

p-value between 0 and 1.
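To show what the unadjusted minimum p-value looks like next to a Bonferroni correction, the sketch below uses made-up segment p-values. It is illustrative only: the class does not apply this correction, as explained above.

```python
import numpy as np

# Hypothetical per-segment p-values (not produced by the library).
segment_pvalues = np.array([0.20, 0.01, 0.50, 0.03])

# Unadjusted: the minimum p-value across segments.
unadjusted = float(segment_pvalues.min())

# Bonferroni: multiply by the number of tests and cap at 1.
bonferroni = float(min(1.0, segment_pvalues.size * unadjusted))
```

With four segments the adjusted value is four times larger (0.04 instead of 0.01). With hundreds of overlapping segments the same correction would push most results toward 1, which is the over-conservatism the note above refers to.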

property mde: float

Minimum detectable error on absolute (probability) scale.

The MDE represents the smallest calibration error that would be statistically detectable at approximately 5 sigma significance, given the sample size. Expressed as an absolute probability difference.

Returns:

Minimum detectable error as an absolute probability difference.

property mde_relative: float

Minimum detectable error on relative (prevalence-adjusted percentage) scale.

The smallest calibration error that can be reliably detected given the sample size and variance in the data. Miscalibration smaller than this value may not be statistically significant even if present.

Based on a 5-sigma detection threshold (very high confidence).

Returns:

MDE as percentage of prevalence.

mcgrad.metrics.soft_label_log_loss(y_true, y_pred, sample_weight=None)[source]

Binary cross-entropy loss that supports soft labels in [0, 1].

Equivalent to sklearn’s log_loss for hard binary labels, but also supports continuous labels (e.g. confidence scores) in the [0, 1] interval.

Parameters:
Return type:

float

Returns:

Weighted mean cross-entropy loss.
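A minimal sketch of cross-entropy with soft labels, assuming the standard formula and a clipping constant `eps` (an assumption, not a documented parameter). The helper name is hypothetical.

```python
import numpy as np

def soft_label_log_loss_sketch(y_true, y_pred, sample_weight=None, eps=1e-15):
    """Weighted mean of -[y log p + (1 - y) log(1 - p)] for labels y in [0, 1]."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    # np.average falls back to a plain mean when weights is None.
    return float(np.average(losses, weights=sample_weight))

hard = soft_label_log_loss_sketch([1, 0], [0.8, 0.2])  # hard binary labels
soft = soft_label_log_loss_sketch([0.5], [0.5])        # a soft label of 0.5
```

For hard labels only one of the two terms is active per sample, which recovers the usual log loss; for a soft label both terms contribute in proportion to the label.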

mcgrad.metrics.wrap_sklearn_metric_func(func)[source]

Wrap an sklearn-style metric function for use with the evaluation framework.

Parameters:

func (Callable[..., float]) – A function with signature (y_true, y_pred, sample_weight=None) -> float.

Return type:

_ScoreFunctionInterface

Returns:

A ScoreFunctionInterface-compatible wrapper.
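A function with the expected signature might look like the weighted Brier score below. The metric itself is a hypothetical example; the wrapping call is shown only as a comment because it follows the signature documented above and has not been run here.

```python
import numpy as np

def weighted_brier_score(y_true, y_pred, sample_weight=None):
    """Mean squared difference between predictions and labels."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.average((y_pred - y_true) ** 2, weights=sample_weight))

score = weighted_brier_score([1, 0], [0.8, 0.4])
weighted = weighted_brier_score([1, 0], [0.8, 0.4], sample_weight=[3.0, 1.0])

# Presumed usage, based on the documented signature:
# from mcgrad.metrics import wrap_sklearn_metric_func
# brier_metric = wrap_sklearn_metric_func(weighted_brier_score)
```

Any metric that accepts `(y_true, y_pred, sample_weight=None)` and returns a float should fit the same pattern.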

mcgrad.metrics.wrap_multicalibration_error_metric(categorical_segment_columns=None, numerical_segment_columns=None, max_depth=3, max_values_per_segment_feature=3, min_samples_per_segment=10, max_n_segments=1000, metric_version='mce_relative', outcome_type='classification')[source]

Create a wrapped MulticalibrationError metric for use with the evaluation framework.

Parameters:
  • categorical_segment_columns (list[str] | None) – Columns to use for categorical segmentation.

  • numerical_segment_columns (list[str] | None) – Columns to use for numerical segmentation.

  • max_depth (int) – Maximum depth for segment generation.

  • max_values_per_segment_feature (int) – Max unique values per segment feature.

  • min_samples_per_segment (int) – Minimum samples required per segment.

  • max_n_segments (int | None) – Maximum number of segments to generate.

  • metric_version (str) –

    Which metric to return. Options:

    • ‘mce_relative’: relative (prevalence-adjusted percentage) scale (default, classification only)

    • ‘mce’: absolute scale

    • ‘mce_sigma’: sigma (z-score) scale

    • ‘mce_pvalue’: p-value

    Legacy names are also supported but deprecated:

    • ‘mce_sigma_scale’ -> ‘mce_sigma’

    • ‘mce_absolute’ -> ‘mce’

    • ‘p_value’ -> ‘mce_pvalue’

  • outcome_type (str) – Type of outcome variable. Options: ‘classification’ (default) or ‘regression’. When ‘regression’, ‘mce_relative’ is not available.

Return type:

_ScoreFunctionInterface

Returns:

A ScoreFunctionInterface-compatible wrapper.