"Who experiences large model decay and why?" A Hierarchical Framework for Diagnosing Heterogeneous Performance Drift
TL;DR: The paper introduces a hierarchical hypothesis testing method that identifies and explains heterogeneous model performance decay across subgroups, enabling targeted corrective actions to mitigate performance degradation.
Abstract: Machine learning (ML) models frequently experience performance degradation when deployed in new contexts. Such degradation is rarely uniform: some subgroups may suffer large performance decay while others may not. Understanding where and how large differences in performance arise is critical for designing *targeted* corrective actions that mitigate decay for the most affected subgroups while minimizing any unintended effects. Current approaches do not provide such detailed insight, as they either (i) explain how *average* performance shifts arise or (ii) identify adversely affected subgroups without insight into how this occurred. To this end, we introduce a **S**ubgroup-scanning **H**ierarchical **I**nference **F**ramework for performance drif**T** (SHIFT). SHIFT first asks "Is there any subgroup with unacceptably large performance decay due to covariate/outcome shifts?" (*Where?*) and, if so, dives deeper to ask "Can we explain this using more detailed variable(subset)-specific shifts?" (*How?*). In real-world experiments, we find that SHIFT identifies interpretable subgroups affected by performance decay, and suggests targeted actions that effectively mitigate the decay.
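To make the two-stage logic concrete, below is a minimal, illustrative sketch in Python of the "Where?" and "How?" questions. It is not the authors' SHIFT procedure (which uses a hierarchical hypothesis-testing framework with formal error control); the names `model`, `X_src`, `y_src`, `X_tgt`, `y_tgt`, `subgroup_cols`, `delta`, and `alpha` are assumptions introduced for illustration, and the covariate-vs-outcome attribution here is a crude re-weighting heuristic rather than the paper's decomposition.

```python
# Illustrative sketch only, under the assumptions stated above.
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

def zero_one_loss(model, X, y):
    return (model.predict(X) != y).astype(float)

def scan_subgroups(model, X_src, y_src, X_tgt, y_tgt, subgroup_cols,
                   delta=0.05, alpha=0.05):
    """Stage 1 ("Where?"): flag subgroups whose loss increase exceeds delta
    (one-sided z-test, Bonferroni-corrected over candidate subgroups)."""
    flagged = []
    for j in subgroup_cols:
        in_src, in_tgt = X_src[:, j] == 1, X_tgt[:, j] == 1
        loss_src = zero_one_loss(model, X_src[in_src], y_src[in_src])
        loss_tgt = zero_one_loss(model, X_tgt[in_tgt], y_tgt[in_tgt])
        decay = loss_tgt.mean() - loss_src.mean()
        se = np.sqrt(loss_tgt.var(ddof=1) / len(loss_tgt)
                     + loss_src.var(ddof=1) / len(loss_src))
        pval = 1 - stats.norm.cdf((decay - delta) / se)
        if pval < alpha / len(subgroup_cols):
            flagged.append((j, decay, pval))
    return flagged

def attribute_decay(model, X_src, y_src, X_tgt, y_tgt, j):
    """Stage 2 ("How?", rough attribution): re-weight source losses toward the
    target covariate distribution within subgroup j; the re-weighted-minus-source
    gap is attributed to covariate shift, the remainder to outcome shift."""
    in_src, in_tgt = X_src[:, j] == 1, X_tgt[:, j] == 1
    Xs, ys, Xt, yt = X_src[in_src], y_src[in_src], X_tgt[in_tgt], y_tgt[in_tgt]
    # Density-ratio weights from a source-vs-target domain classifier.
    dom = LogisticRegression(max_iter=1000).fit(
        np.vstack([Xs, Xt]), np.r_[np.zeros(len(Xs)), np.ones(len(Xt))])
    p_tgt = dom.predict_proba(Xs)[:, 1]
    w = p_tgt / np.clip(1 - p_tgt, 1e-6, None)
    loss_src = zero_one_loss(model, Xs, ys)
    loss_tgt = zero_one_loss(model, Xt, yt)
    covariate_part = np.average(loss_src, weights=w) - loss_src.mean()
    outcome_part = loss_tgt.mean() - np.average(loss_src, weights=w)
    return covariate_part, outcome_part
```

The sketch only conveys the structure of the questions: first test whether any candidate subgroup decays by more than an acceptable margin, then, only for flagged subgroups, decompose the decay into shift-related components that suggest which targeted fix (e.g., re-weighting vs. relabeling) is appropriate.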
Lay Summary: When the performance of an ML algorithm drops in a new application context, which subgroups are adversely affected and why? Answering this question is critical for catching hidden failures of the algorithm and developing targeted fixes for the affected subgroups.
Our paper presents a framework that identifies subgroups experiencing overly large performance drops and explains the reasons behind them. We provide interpretable summaries of these subgroups, along with how uncertain we are given the available data, so that users can make informed decisions about how to mitigate the drops.
To help ML practitioners inspect their algorithms, we have released a free software tool called SHIFT, which requires only black-box access to algorithm outputs. It can suggest targeted fixes that improve an algorithm for the affected subgroups without sacrificing its performance elsewhere, helping ensure that algorithms are safe to transfer across application contexts.
Link To Code: https://github.com/jjfeng/shift
Primary Area: Social Aspects->Robustness
Keywords: Distribution shift, algorithmic fairness, interpretability, explanation, decomposition, mediation analysis
Submission Number: 7917