Stability-Aware Post-Training Cascade of Experts for Compute-Efficient Inference

ICLR 2026 Conference Submission 17196 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: post-training cascade of experts, compute-efficient inference, threshold optimization, tail-set reuse, decision stability, epsilon-Pareto selection
TL;DR: Stability-aware post-training expert cascades cut inference cost at a bounded accuracy drop via epsilon-Pareto selection and recursive threshold-grid search with tail-set reuse.
Abstract: State-of-the-art models achieve high accuracy at the cost of substantial inference compute, hindering deployment on edge devices and under strict latency budgets. To address this, we present a stability-aware post-training cascade of experts that operates over a heterogeneous pool of pre-trained models, balancing accuracy, inference cost, and decision stability. Specifically, we address three questions. (i) Which base models to select: from the heterogeneous pool, we retain $\epsilon$-competitive candidates via $\epsilon$-Pareto screening of the accuracy–compute trade-off, forming the cascade’s candidate set. (ii) How to optimize thresholds: we learn stage-wise accept/defer thresholds via a recursive threshold-grid search with optimal tail-set reuse, minimizing expected inference cost subject to a user-set accuracy tolerance. (iii) Which final cascade and execution order to deploy: we choose them by jointly considering expected inference cost and cross-validated decision stability. In experiments across text, vision, and audio, when the reference single model shows no substantial validation–test discrepancy, the framework delivers large compute reductions at a bounded accuracy drop.
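To make the two algorithmic ideas named in the abstract concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function names (`epsilon_pareto_screen`, `two_stage_threshold_search`), the confidence-based accept rule, and the restriction to a two-stage cascade are illustrative assumptions; the recursive multi-stage search with tail-set reuse and the cross-validated stability criterion are not reproduced here.

```python
import numpy as np

def epsilon_pareto_screen(accuracies, costs, eps=0.01):
    """Keep epsilon-competitive models from a heterogeneous pool.

    A model is dropped only if some other model is at least as cheap and
    more than `eps` more accurate (one illustrative reading of
    epsilon-Pareto screening of the accuracy-compute trade-off).
    Returns the indices of the retained candidates.
    """
    keep = []
    for i, (a_i, c_i) in enumerate(zip(accuracies, costs)):
        dominated = any(
            c_j <= c_i and a_j > a_i + eps
            for j, (a_j, c_j) in enumerate(zip(accuracies, costs))
            if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

def two_stage_threshold_search(conf_small, correct_small, correct_big,
                               cost_small, cost_big, acc_ref, tol,
                               grid=np.linspace(0.0, 1.0, 101)):
    """Grid-search one accept/defer threshold for a two-stage cascade.

    Accept the small model's prediction when its confidence >= tau,
    otherwise defer to the big model. Among thresholds whose cascade
    accuracy stays within `tol` of the reference accuracy `acc_ref`,
    return the one with minimum expected per-example cost.
    """
    best = None
    for tau in grid:
        accept = conf_small >= tau
        acc = np.where(accept, correct_small, correct_big).mean()
        # The small model always runs; the big model runs on deferred examples.
        cost = cost_small + (~accept).mean() * cost_big
        if acc >= acc_ref - tol and (best is None or cost < best[1]):
            best = (tau, cost, acc)
    return best  # (threshold, expected cost, cascade accuracy) or None

if __name__ == "__main__":
    # Synthetic validation data, purely for illustration.
    rng = np.random.default_rng(0)
    conf = rng.uniform(size=1000)
    small_ok = (rng.uniform(size=1000) < 0.80).astype(float)
    big_ok = (rng.uniform(size=1000) < 0.90).astype(float)
    print(epsilon_pareto_screen([0.80, 0.90, 0.89], [1.0, 10.0, 9.5], eps=0.01))
    print(two_stage_threshold_search(conf, small_ok, big_ok,
                                     cost_small=1.0, cost_big=10.0,
                                     acc_ref=big_ok.mean(), tol=0.01))
```

In the full method described by the abstract, this per-stage search would be applied recursively over longer expert orderings, with the optimal tail of the cascade reused across prefixes, and the final cascade chosen jointly by expected cost and cross-validated decision stability.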
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17196