Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese

ICLR 2026 Conference Submission1983 Authors

04 Sept 2025 (modified: 23 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0

Keywords: Text-to-Speech, Audio, Human-likeness Evaluation, Large Language Model, Turing Test

TL;DR: We introduce the Audio Turing Test (ATT), an evaluation framework that pairs a multi-dimensional Chinese corpus, ATT-Corpus, with an effective, Turing-Test-inspired evaluation protocol.

Abstract: Recent advances in large language models (LLMs) have significantly improved text-to-speech (TTS) systems, enhancing control over speech style, naturalness, and emotional expression and bringing TTS systems closer to human-level performance. Yet evaluation still relies largely on the Mean Opinion Score (MOS), whose subjectivity, environmental variability, and limited interpretability prevent it from faithfully capturing how human-like the synthesized audio is. Existing evaluation datasets also lack a multi-dimensional design, often neglecting factors such as speaking styles, context diversity, and trap utterances, a gap that is particularly evident in Chinese TTS evaluation. To address these challenges, we introduce the **A**udio **T**uring **T**est (ATT), which pairs a multi-dimensional Chinese corpus, ATT-Corpus, with a simple, Turing-Test-inspired evaluation protocol. Instead of relying on complex MOS scales or direct model comparisons, ATT asks evaluators to judge whether a voice sounds human. This simplification reduces rating bias and improves evaluation robustness. To further support rapid model development, we also fine-tune Qwen2-Audio-Instruct on human judgment data to obtain Auto-ATT, a model for automatic evaluation. Experimental results show that ATT's multi-dimensional design effectively differentiates models across specific capability dimensions. Auto-ATT also demonstrates strong alignment with human evaluations, confirming its value as a fast and reliable assessment tool.
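
As a rough illustration of the protocol described in the abstract, the sketch below aggregates "does this voice sound human?" judgments into a human-likeness rate per system and per corpus dimension. It is only a minimal sketch under stated assumptions: the binary vote, the function name `human_likeness_rates`, the system names, and the dimension labels are all hypothetical, and the scoring scheme actually used by ATT is not specified in this abstract.

```python
from collections import defaultdict


def human_likeness_rates(judgments):
    """Aggregate 'sounds human' votes into a rate per (system, dimension).

    Assumption: each judgment is a binary vote. The real ATT protocol may
    use a different label set or weighting.
    """
    votes = defaultdict(list)
    for system, dimension, is_human in judgments:
        votes[(system, dimension)].append(1 if is_human else 0)
    # Fraction of clips judged human for each (system, dimension) pair.
    return {key: sum(v) / len(v) for key, v in votes.items()}


# Hypothetical toy data: (system, dimension, evaluator judged the clip human?)
judgments = [
    ("tts_a", "style", True),
    ("tts_a", "style", False),
    ("tts_a", "context", True),
    ("tts_b", "style", False),
    ("tts_b", "style", False),
    ("tts_b", "context", True),
]

rates = human_likeness_rates(judgments)
for (system, dimension), rate in sorted(rates.items()):
    print(f"{system:6s} {dimension:8s} {rate:.2f}")
```

Reporting the rate per dimension rather than as a single pooled number mirrors the abstract's claim that the multi-dimensional design separates models on specific capabilities.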

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 1983
