A Benchmark for Controllable Speaking-Style Captioning in Audio-Language Models

ACL ARR 2026 January Submission 6922 Authors

06 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Audio-language models, speaking style captioning
Abstract: Speaking-style captioning (SSC) aims to generate natural language descriptions of how speech is delivered, capturing paralinguistic attributes such as vocal timbre, prosody, and expressivity. Many downstream applications, including conversational AI agents, controllable speech generation, and large-scale audio annotation, require dimension-specific style captions that describe targeted aspects of speech (e.g., speaker traits, emotion, or delivery style), rather than a single undifferentiated description. However, existing work lacks a unified task formulation that supports control over which stylistic dimensions are described. Moreover, SSC spans abstraction levels ranging from low-level acoustic traits to broad, context-dependent characterizations, which makes systematic comparison and evaluation difficult. We address this gap by formulating SSC as an instruction-following audio-language task in which explicit instructions specify the speaking-style dimensions to be described. Based on this formulation, we introduce StyleInstructCaps, the first standardized benchmark for controllable speaking-style captioning in audio-language models. StyleInstructCaps provides a task-specific dataset and an evaluation framework that measures metadata groundedness, hallucination, instruction-following ability, speaker-style consistency, and generalization to unseen instructions and audio datasets.
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: spoken language understanding, spoken language grounding, spoken dialog
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 6922