Keywords: Third-Party Interruption, Voice Assistant, Spoken Language Model, Spoken Dialogue System
TL;DR: We define Third-Party Interruption (TPI) awareness and present a framework, dataset, and benchmark for building TPI-aware voice assistants.
Abstract: Although recent progress in Spoken Language Models (SLMs) has enabled increasingly natural voice-based interactions, these models remain vulnerable to third-party interruptions (TPI). To address this challenge, we present a holistic framework for building and evaluating TPI-aware voice assistants. We first introduce TPI-Train, a large-scale dataset of 80K instances spanning 26 realistic interruption scenarios. For evaluation, we introduce TPI-Bench, which includes TPI-Test for measuring response strategies under interruptions and Janus-Test for probing whether models can distinguish true multi-speaker utterances from acoustically single-speaker yet textually misleading speech. To ensure reproducible and interpretable assessments, we also design two complementary metrics: Response Strategy Following (RSF) and Overall Helpfulness (OH). Experiments demonstrate that models fine-tuned with our approach achieve robust performance on TPI-Bench while preserving general dialogue capabilities on VoiceBench, and that they do so without relying on textual shortcuts. Human evaluations further confirm that both our dataset and trained models align with human preferences, establishing the first comprehensive solution for TPI-aware voice assistants. Our dataset will be publicly available. Demo samples: https://tpi-va.github.io/.
Primary Area: datasets and benchmarks
Submission Number: 23833