Quantifying the Modality Gap Between Speech and Text with Generative Perplexity Under Controlled Data and Objective Settings
Keywords: Speech LMs, modality gap.
Abstract: Text language models (LMs) demonstrate strong long-range coherence, whereas textless speech LMs often exhibit weaker global consistency. This performance gap is commonly attributed to differences in information density between text tokens and discretized speech units. However, speech and text models also differ substantially in training data scale, distribution, and objective design, making it unclear whether the gap arises from intrinsic modality properties or training conditions.
In this work, we quantify the modality gap using a unified evaluation framework based on generative perplexity (genPPL). Specifically, we first generate samples from a trained model and then evaluate the semantic quality of the generated content using an oracle text LM, allowing cross-modal comparison under a shared metric.
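The metric described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the oracle is represented by a hypothetical callable (`oracle_logprobs`) that returns one natural-log probability per token, and `uniform_oracle` is a toy stand-in for a real text LM. It also assumes that generated samples are already available as text tokens; how speech output is mapped to text before scoring is not specified in the abstract.

```python
import math
from typing import Callable, Sequence

# Hypothetical oracle interface (an assumption for illustration): given a
# token sequence, return one natural-log probability per token.
OracleFn = Callable[[Sequence[str]], Sequence[float]]


def generative_perplexity(
    samples: Sequence[Sequence[str]], oracle_logprobs: OracleFn
) -> float:
    """Corpus-level perplexity of generated samples under an oracle LM.

    Computes exp of the average negative log-likelihood per token,
    pooled over all samples.
    """
    total_nll = 0.0
    total_tokens = 0
    for tokens in samples:
        logprobs = oracle_logprobs(tokens)
        if len(logprobs) != len(tokens):
            raise ValueError("oracle must return one log-probability per token")
        total_nll -= sum(logprobs)
        total_tokens += len(tokens)
    if total_tokens == 0:
        raise ValueError("no tokens to score")
    return math.exp(total_nll / total_tokens)


def uniform_oracle(tokens: Sequence[str], vocab_size: int = 100) -> list:
    """Toy oracle: every token has probability 1 / vocab_size."""
    return [-math.log(vocab_size)] * len(tokens)


# Samples would come from the trained model under evaluation; these are
# placeholder token lists.
generated = [["the", "cat", "sat"], ["hello", "world"]]
gen_ppl = generative_perplexity(generated, uniform_oracle)
```

With the uniform toy oracle, the result equals the vocabulary size, which is a convenient sanity check. Lower values indicate that the oracle finds the generated content more predictable.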
We conduct a controlled study that varies two factors and holds a third fixed:
(1) data distribution: general web text synthesized to speech (C4-TTS), natural conversational speech (Emilia), and audiobook speech (MLS English);
(2) training modality: text-only, speech-only, and an inner-monologue speech–text objective;
(3) model capacity and evaluation protocol, which are kept consistent across all conditions.
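The study design listed above can be pictured as a grid of training conditions. The sketch below assumes a full cross of datasets and training modalities, which the abstract does not confirm; the dictionary keys and the `build_conditions` helper are hypothetical names used only for illustration.

```python
from itertools import product

# Values taken from the abstract; the full crossing is an assumption.
DATASETS = ["C4-TTS", "Emilia", "MLS English"]
MODALITIES = ["text-only", "speech-only", "inner-monologue speech-text"]


def build_conditions(datasets=DATASETS, modalities=MODALITIES):
    """Enumerate every (dataset, modality) pair as one training condition."""
    return [
        {"dataset": dataset, "modality": modality}
        for dataset, modality in product(datasets, modalities)
    ]


conditions = build_conditions()
```

Under this assumption there are nine conditions, each trained at the same model capacity and scored with the same evaluation protocol.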
Our results isolate how much of the modality gap can be explained by data distribution and training objective, rather than modality alone. These findings provide a clearer understanding of where speech language modeling falls short and suggest directions for closing the gap.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 65