InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems

ACL ARR 2025 May Submission6431 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: In modern speech synthesis, paralinguistic information, such as a speaker's vocal timbre, emotional state, and dynamic prosody, plays a critical role in conveying nuance beyond mere semantics. Traditional Text-to-Speech (TTS) systems control these cues through fixed style labels or inserted speech prompts, which severely limits flexibility. Recent work employs natural-language instructions to modulate paralinguistic features, substantially improving the generalization of instruction-driven TTS models. Although many open-source and commercial systems now support customized synthesis via textual description, their actual ability to interpret and execute complex instructions remains largely unexplored. Moreover, there is still a shortage of high-quality benchmarks and automated evaluation metrics designed specifically for instruction-based TTS, which hinders accurate assessment and iterative optimization of these models. To address these limitations, we introduce InstructTTSEval, the first TTS benchmark for measuring the capability of complex natural-language style control. InstructTTSEval comprises three tasks, namely Acoustic-Parameter Specification, Descriptive-Style Directive, and Role-Play, with English and Chinese subsets of 1k test cases each (6k total), paired with reference audio. We leverage Google's Gemini as an automatic judge to assess instruction-following ability. Our evaluation of accessible instruction-following TTS systems reveals that even the best-performing model achieves only modest style-control accuracy, underscoring substantial room for improvement. We anticipate that InstructTTSEval will drive progress toward more powerful, flexible, and accurate instruction-following TTS models.
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: Text-to-Speech, benchmarking, speech technologies, speech and vision
Contribution Types: Data resources, Data analysis
Languages Studied: English, Chinese
Submission Number: 6431