Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese
Keywords: Text-to-Speech, Audio, Human-likeness Evaluation, Large Language Model, Turing Test
TL;DR: We introduce the Audio Turing Test (ATT), an evaluation framework that pairs a multi-dimensional Chinese corpus, ATT-Corpus, with a simple, Turing-Test-inspired evaluation protocol.
Abstract: Recent advances in large language models (LLMs) have significantly improved text-to-speech (TTS) systems, enhancing control over speech style, naturalness, and emotional expression and bringing TTS systems closer to human-level performance.
Yet evaluation still relies largely on the Mean Opinion Score (MOS), whose subjectivity, environmental variability, and limited interpretability prevent it from faithfully capturing how human-like the synthesized audio is.
Existing evaluation datasets also lack a multi-dimensional design, often neglecting factors such as speaking styles, context diversity, and trap utterances; this gap is particularly evident in Chinese TTS evaluation.
To address these challenges, we introduce the **A**udio **T**uring **T**est (ATT), a multi-dimensional Chinese corpus, ATT-Corpus, paired with a simple, Turing-Test-inspired evaluation protocol. Instead of relying on complex MOS scales or direct model comparisons, ATT asks evaluators to judge whether a voice sounds human. This simplification reduces rating bias and improves evaluation robustness.
To further support rapid model development, we also fine-tune Qwen2-Audio-Instruct on human judgment data to obtain Auto-ATT, an automatic evaluator.
Experimental results show that ATT's multi-dimensional design effectively differentiates models along specific capability dimensions.
Auto-ATT also demonstrates strong alignment with human evaluations, confirming its value as a fast and reliable assessment tool.
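
To make the protocol concrete, below is a minimal sketch of how binary "sounds human" judgments could be aggregated into a per-model score. The function name and the simple pass-rate aggregation are illustrative assumptions, not the paper's exact scoring rule.

```python
from collections import defaultdict

def att_pass_rate(judgments):
    """Aggregate binary Turing-Test judgments into a per-model score.

    `judgments` is an iterable of (model, clip_id, judged_human) triples,
    where judged_human is True when a listener labels the clip as human.
    Returns, per model, the fraction of judgments labeled human.
    (Illustrative aggregation; the paper may use a different scoring rule.)
    """
    human_votes = defaultdict(int)
    total_votes = defaultdict(int)
    for model, _clip_id, judged_human in judgments:
        total_votes[model] += 1
        human_votes[model] += int(judged_human)
    return {m: human_votes[m] / total_votes[m] for m in total_votes}

# Toy usage: two TTS systems, three listener judgments each.
judgments = [
    ("tts_a", "clip1", True), ("tts_a", "clip2", False), ("tts_a", "clip3", True),
    ("tts_b", "clip1", False), ("tts_b", "clip2", False), ("tts_b", "clip3", True),
]
print(att_pass_rate(judgments))  # {'tts_a': ~0.67, 'tts_b': ~0.33}
```

Because each judgment is a single binary label rather than a point on an opinion scale, no calibration across raters is required, which is the robustness benefit the simplified protocol aims for.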
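Since Auto-ATT is obtained by fine-tuning Qwen2-Audio-Instruct, querying it should follow the standard Hugging Face inference pattern for that base model. A minimal sketch, assuming a hypothetical local checkpoint path (`path/to/auto-att`) and an illustrative judgment prompt:

```python
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

# Placeholder path: substitute the released Auto-ATT checkpoint
# (fine-tuned from Qwen/Qwen2-Audio-7B-Instruct).
model_id = "path/to/auto-att"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# The prompt wording is an assumption mirroring the ATT question.
conversation = [{
    "role": "user",
    "content": [
        {"type": "audio", "audio_url": "sample.wav"},
        {"type": "text", "text": "Does this voice sound like a real human? Answer yes or no."},
    ],
}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Resample the clip to the rate the audio encoder expects.
audio, _ = librosa.load("sample.wav", sr=processor.feature_extractor.sampling_rate)
inputs = processor(text=prompt, audios=[audio], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=16)
# Decode only the newly generated tokens, not the prompt.
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```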
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 1983