Keywords: post-training cascade of experts, compute-efficient inference, threshold optimization, tail-set reuse, decision stability, epsilon-Pareto selection
TL;DR: Stability-aware post-training expert cascades cut inference cost at a bounded accuracy drop via epsilon-Pareto selection and recursive threshold-grid search with tail-set reuse.
Abstract: State-of-the-art models achieve high accuracy at the cost of substantial inference compute, hindering deployment on edge devices and under strict latency budgets.
To address this, we present a stability-aware, post-training cascade of experts that operates over a heterogeneous pool of pre-trained models, balancing accuracy, inference cost, and decision stability.
Specifically, we address three questions:
Which base models to select: from a heterogeneous pool, we retain $\epsilon$-competitive candidates by $\epsilon$-Pareto screening of the accuracy–compute trade-off, forming the cascade’s candidate set (see the first sketch after the abstract);
How to optimize thresholds: we learn stage-wise accept/defer thresholds via a recursive threshold-grid search with optimal tail-set reuse, minimizing expected inference cost subject to a user-set accuracy tolerance (see the second sketch after the abstract);
What final cascade and execution order to deploy: we choose them by jointly considering expected inference cost and cross-validated decision stability.
In experiments across text, vision, and audio, whenever the single reference model shows no substantial validation–test discrepancy, the framework delivers large compute reductions at a bounded accuracy drop.
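To make the $\epsilon$-Pareto screening step concrete, here is a minimal Python sketch rather than the authors' implementation; the candidate pool, the `accuracy`/`cost` fields, and the tolerance value are illustrative assumptions. A model is retained unless some other model is at least as cheap and more accurate by more than $\epsilon$.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    accuracy: float   # validation accuracy
    cost: float       # expected per-example inference cost (e.g., FLOPs or latency)

def epsilon_pareto_screen(pool, eps=0.01):
    """Keep epsilon-competitive candidates on the accuracy-compute trade-off:
    a model is dropped only if some other model is no more expensive and
    more accurate by more than eps."""
    kept = []
    for c in pool:
        dominated = any(
            other.cost <= c.cost and other.accuracy > c.accuracy + eps
            for other in pool if other is not c
        )
        if not dominated:
            kept.append(c)
    return kept

# Hypothetical pool of pre-trained models
pool = [
    Candidate("tiny",  0.82, 1.0),
    Candidate("base",  0.90, 2.0),
    Candidate("mid",   0.85, 6.0),   # dropped: "base" is cheaper and >eps more accurate
    Candidate("large", 0.905, 9.0),
]
print([c.name for c in epsilon_pareto_screen(pool, eps=0.01)])  # ['tiny', 'base', 'large']
```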
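Similarly, the recursive threshold-grid search with tail-set reuse can be sketched as below, assuming a fixed cascade order and per-stage NumPy arrays of validation confidences and correctness indicators; the function name and data layout are hypothetical, not the paper's code. Tail solutions are memoized on the set of deferred examples, so identical tail sets are never re-searched.

```python
from functools import lru_cache
import numpy as np

def optimize_thresholds(conf, correct, costs, grid, target_acc):
    """Stage-wise accept/defer threshold search for a fixed cascade order.

    conf[s]    : NumPy array, stage s confidence on each validation example
    correct[s] : NumPy array, 1 if stage s is correct on that example, else 0
    costs[s]   : per-example inference cost of stage s
    grid       : candidate thresholds, e.g. np.linspace(0.0, 1.0, 21)
    target_acc : minimum admissible validation accuracy of the cascade

    Each recursive call returns the Pareto front of (cost, #correct, thresholds)
    over the examples reaching that stage; fronts are memoized on
    (stage, deferred index set), which is the tail-set reuse.
    """
    n_stages, n = len(costs), len(correct[0])

    def pareto(options):
        # keep options not dominated in (lower cost, more correct)
        options.sort(key=lambda o: (o[0], -o[1]))
        front, best_corr = [], -1
        for cost, corr, ths in options:
            if corr > best_corr:
                front.append((cost, corr, ths))
                best_corr = corr
        return front

    @lru_cache(maxsize=None)
    def solve(stage, idx):
        idx = np.array(idx, dtype=int)
        pay = costs[stage] * len(idx)      # every example reaching this stage pays its cost
        if stage == n_stages - 1:          # last stage answers everything it receives
            return [(pay, int(correct[stage][idx].sum()), ())]
        options = []
        for t in grid:
            accept = conf[stage][idx] >= t
            head_corr = int(correct[stage][idx[accept]].sum())
            for tail_cost, tail_corr, tail_ths in solve(stage + 1, tuple(idx[~accept])):
                options.append((pay + tail_cost, head_corr + tail_corr, (t,) + tail_ths))
        return pareto(options)

    feasible = [o for o in solve(0, tuple(range(n))) if o[1] / n >= target_acc]
    return min(feasible, default=None)     # cheapest configuration meeting the accuracy bound
```

In this sketch the cheapest Pareto-front entry meeting the accuracy tolerance is selected at the root; the same routine would be rerun for each candidate ordering, and the surviving cascades compared on expected inference cost and cross-validated decision stability.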
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17196