Keywords: Speaker diarization; speech language modeling; large language modeling
Abstract: Recent advances in Speech Language Models (SpeechLMs), which integrate large language models with speech foundation models, have enabled unified sequence modeling of speech processing tasks. However, most LLM-based approaches to speaker diarization are tightly coupled with automatic speech recognition (ASR) and evaluated using word-level metrics, making it difficult to assess diarization behavior independently of ASR accuracy. In this work, we investigate SpeechLMs as standalone sequence modeling backbones for speaker diarization, formulating diarization as a sequence prediction problem conditioned on acoustic input. We systematically compare two output representations: an event-based representation that explicitly models speaker-turn onset and offset timestamps, and a frame-based representation that predicts frame-level speaker activity. To provide structured conversational cues, we further incorporate auxiliary tasks, including speech activity detection, overlapped speech detection, and speaker turn counting, within the output sequence. Across multiple conversational datasets, we find that event-based representations yield more robust and consistent diarization behavior than frame-based alternatives, particularly for long-form recordings. Our results offer insight into how LLM-style architectures internalize speaker-related structure from acoustic signals, and our models achieve competitive diarization performance on multiple long-form datasets.
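To make the contrast between the two output representations concrete, here is a minimal sketch of how the same diarization reference might be serialized either as explicit speaker-turn onset/offset events or as frame-level speaker activity. The token format, function names, and frame resolution are illustrative assumptions, not the paper's actual serialization.

```python
# Hypothetical sketch of the two diarization output representations.
# Input: reference speaker turns as (speaker, start_sec, end_sec) tuples.

def event_based(segments):
    """Serialize speaker turns as explicit onset/offset timestamp tokens
    (assumed token format: <speaker> start end, sorted by onset)."""
    tokens = []
    for spk, start, end in sorted(segments, key=lambda s: (s[1], s[0])):
        tokens += [f"<{spk}>", f"{start:.2f}", f"{end:.2f}"]
    return " ".join(tokens)

def frame_based(segments, frame=0.1, duration=None):
    """Discretize speaker activity into one multi-hot label per frame
    (assumed 0.1 s frames); overlapped speech yields multiple 1s."""
    speakers = sorted({s[0] for s in segments})
    if duration is None:
        duration = max(end for _, _, end in segments)
    n_frames = int(round(duration / frame))
    labels = []
    for i in range(n_frames):
        t = i * frame  # frame onset time
        active = tuple(
            int(any(spk == s and st <= t < en for s, st, en in segments))
            for spk in speakers
        )
        labels.append(active)
    return speakers, labels

# Two speakers with a 0.5 s overlap (an overlapped-speech region).
segments = [("spk1", 0.0, 1.5), ("spk2", 1.0, 2.0)]
event_seq = event_based(segments)
spks, frames = frame_based(segments)
```

The event-based form scales with the number of turns, while the frame-based form scales with recording length, which is one plausible reason the abstract reports event-based outputs behaving more robustly on long-form recordings.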
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: speech technologies, spoken language understanding
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English, Chinese
Submission Number: 6122