Keywords: adaptive inference, efficient inference, model cascading, cascading for generative tasks
TL;DR: Semantic signals are reliable training-free deferral metrics for generative LLM cascades.
Abstract: Existing cascade systems struggle with open-ended text generation due to evaluation challenges where multiple valid outputs exist without ground truth references. We propose using semantic agreement between multiple model outputs as a training-free deferral signal and evaluate semantic similarity metrics against token-level confidence across translation, summarization, question answering, and reading comprehension tasks. We show that semantic signals provide a stronger indication of when deferral is appropriate than token-level methods and are resilient to heterogeneous model quality.
Submission Number: 124
Loading