Keywords: spoken language model, reasoning, chain-of-thought
Abstract: Spoken Language Models (SLMs) are designed to take speech inputs and produce
spoken responses. However, current SLMs lack the ability to perform an internal,
unspoken thinking process before responding. In contrast, humans typically engage
in complex mental reasoning internally, enabling them to communicate ideas clearly
and concisely. Thus, integrating an unspoken thought process into SLMs is highly
desirable. While naively generating a complete chain-of-thought (CoT) reasoning
before starting to talk can enable thinking for SLMs, this induces additional latency
for the speech response, as the CoT reasoning can be arbitrarily long. To solve
this issue, we propose STITCH, a novel generation method that alternates between
the generation of unspoken reasoning chunks and spoken response chunks. Since
the audio duration of a spoken response chunk is much longer than the time needed
to generate its tokens, we use the remaining free time
to generate the unspoken reasoning tokens. When a chunk of audio is played to the
user, the model continues to generate the next unspoken reasoning chunk, achieving
simultaneous thinking and talking. Remarkably, STITCH matches the latency
of baselines that cannot generate unspoken CoT by design while outperforming
those baselines by 15% on math reasoning datasets; STITCH also performs on par
with those baselines on non-reasoning datasets.
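To make the alternating scheme concrete, below is a minimal illustrative sketch of the kind of generation loop the abstract describes: a spoken chunk is decoded and handed to an audio player, and while that audio plays, the idle compute is used to decode the next unspoken reasoning chunk. The chunk sizes, token generator, and playback interface are hypothetical placeholders, not the paper's actual implementation.

```python
import queue
import threading
import time

REASON_CHUNK = 64   # unspoken reasoning tokens per chunk (assumed size)
SPEECH_CHUNK = 32   # spoken response tokens per chunk (assumed size)


def generate_tokens(context, n, unspoken):
    """Stand-in for autoregressively decoding n tokens given the context."""
    tag = "r" if unspoken else "s"
    return [f"{tag}{i}" for i in range(len(context), len(context) + n)]


def audio_duration(chunk):
    """Assumption from the abstract: playing a speech chunk takes far longer
    than decoding its tokens, so the playback window hides reasoning cost."""
    return 0.5  # seconds per chunk (placeholder)


def play_audio(audio_queue):
    while True:
        chunk = audio_queue.get()
        if chunk is None:
            break
        time.sleep(audio_duration(chunk))  # simulate synthesizing/playing audio


def interleaved_generation(num_chunks):
    context, audio_queue = [], queue.Queue()
    player = threading.Thread(target=play_audio, args=(audio_queue,))
    player.start()
    for _ in range(num_chunks):
        # 1) Decode a spoken response chunk and hand it to the audio player.
        speech = generate_tokens(context, SPEECH_CHUNK, unspoken=False)
        context += speech
        audio_queue.put(speech)
        # 2) While that chunk is being played to the user, decode the next
        #    unspoken reasoning chunk ("simultaneous thinking and talking").
        reasoning = generate_tokens(context, REASON_CHUNK, unspoken=True)
        context += reasoning  # reasoning conditions later chunks but is never spoken
    audio_queue.put(None)
    player.join()
```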
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5766