Abstract: The rise of synthetic speech in audio-based NLP tasks has raised critical questions about its robustness, fidelity, and fairness. This study empirically examines the relationship between Text-to-Speech (TTS) and Speech-to-Text (STT) models using hate and non-hate speech data. Our evaluation focuses on three key dimensions: (1) STT robustness, assessing the accuracy and gender sensitivity of STT models when transcribing synthetic versus human audio; (2) TTS synthetic audio fidelity, examining human-likeness and model preference through annotator evaluations and processing-speed analysis; and (3) impact on hate speech classification, quantifying how STT and TTS combinations affect downstream toxicity predictions. Our findings show that synthetic audio, especially from Microsoft Edge TTS, outperforms human audio in both transcription accuracy and consistency. WhisperX-Align (an extension of OpenAI’s Whisper model) emerges as the most robust STT model across tasks, although some systems exhibit notable gender- and domain-specific biases. We recommend Microsoft Edge TTS as a high-fidelity benchmark and SpeechT5 as a human proxy for perceptual evaluation, while highlighting the need for bias-aware deployment in sensitive applications such as hate speech detection. The implementation code is publicly available at https://anonymous.4open.science/r/Can-AI-Replace-Human-Speech-D0EF/.
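To make the evaluated pipeline concrete, the sketch below strings the three dimensions together on a single utterance: Edge TTS synthesis, Whisper transcription scored by word error rate, and a toxicity prediction on the transcript. This is a minimal illustration, not the paper's implementation: the sample text, the `en-US-AriaNeural` voice, the `base` Whisper checkpoint (standing in for WhisperX-Align), and the `unitary/toxic-bert` classifier are all assumptions made here for the example.

```python
import asyncio

import edge_tts                     # Microsoft Edge TTS client
import whisper                      # OpenAI Whisper STT
from jiwer import wer               # word error rate metric
from transformers import pipeline   # downstream toxicity classifier

# Hypothetical sample; the paper's actual hate/non-hate data is not shown here.
TEXT = "This is a sample utterance from the evaluation set."
VOICE = "en-US-AriaNeural"          # assumed Edge TTS voice


async def synthesize(text: str, path: str) -> None:
    # Render synthetic speech with Microsoft Edge TTS.
    await edge_tts.Communicate(text, VOICE).save(path)


def main() -> None:
    asyncio.run(synthesize(TEXT, "synthetic.mp3"))

    # Transcribe the synthetic audio back to text with Whisper
    # (plain Whisper stands in for WhisperX-Align in this sketch).
    stt = whisper.load_model("base")
    hypothesis = stt.transcribe("synthetic.mp3")["text"]

    # STT robustness: word error rate against the reference text.
    print("WER:", wer(TEXT.lower(), hypothesis.lower().strip()))

    # Downstream impact: toxicity prediction on the transcript.
    # "unitary/toxic-bert" is an assumed stand-in classifier.
    clf = pipeline("text-classification", model="unitary/toxic-bert")
    print("Toxicity:", clf(hypothesis)[0])


if __name__ == "__main__":
    main()
```

Running the same loop over human recordings and over multiple TTS/STT pairs would yield the kind of transcription-accuracy and downstream-toxicity comparisons the abstract describes.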
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: automatic speech recognition, speech technologies, model bias/fairness evaluation
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 33