Keywords: Speaker diarization; speech language modeling; large language modeling
Abstract: Recent advances in Speech Language Models (SpeechLMs), which integrate large language models with speech foundation models, have enabled unified sequence modeling of speech processing tasks. However, most LLM-based approaches to speaker diarization are tightly coupled with automatic speech recognition (ASR) and evaluated using word-level metrics, making it difficult to assess diarization behavior independently of ASR accuracy. In this work, we investigate SpeechLMs as standalone sequence modeling backbones for speaker diarization, formulating diarization as a sequence prediction problem conditioned on acoustic input. We systematically compare two output representations: an event-based representation that explicitly models speaker-turn onset and offset timestamps, and a frame-based representation that predicts frame-level speaker activity. To provide structured conversational cues, we further incorporate auxiliary tasks, including speech activity detection, overlapped speech detection, and speaker turn counting, within the output sequence. Across multiple conversational datasets, we find that event-based representations yield more robust and consistent diarization behavior than frame-based alternatives, particularly for long-form recordings. Our results offer insight into how LLM-style architectures internalize speaker-related structure from acoustic signals, and our models achieve competitive diarization performance on multiple long-form datasets.
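To make the contrast between the two output representations concrete, here is a minimal sketch of how the same diarization reference might be serialized either as explicit speaker-turn onset/offset events or as frame-level speaker activity. The token format, function names, and frame resolution are illustrative assumptions, not the paper's actual serialization.

```python
# Hypothetical sketch of the two diarization output representations.
# Input: reference speaker turns as (speaker, start_sec, end_sec) tuples.

def event_based(segments):
    """Serialize speaker turns as explicit onset/offset timestamp tokens
    (assumed token format: <speaker> start end, sorted by onset)."""
    tokens = []
    for spk, start, end in sorted(segments, key=lambda s: (s[1], s[0])):
        tokens += [f"<{spk}>", f"{start:.2f}", f"{end:.2f}"]
    return " ".join(tokens)

def frame_based(segments, frame=0.1, duration=None):
    """Discretize speaker activity into one multi-hot label per frame
    (assumed 0.1 s frames); overlapped speech yields multiple 1s."""
    speakers = sorted({s[0] for s in segments})
    if duration is None:
        duration = max(end for _, _, end in segments)
    n_frames = int(round(duration / frame))
    labels = []
    for i in range(n_frames):
        t = i * frame  # frame onset time
        active = tuple(
            int(any(spk == s and st <= t < en for s, st, en in segments))
            for spk in speakers
        )
        labels.append(active)
    return speakers, labels

# Two speakers with a 0.5 s overlap (an overlapped-speech region).
segments = [("spk1", 0.0, 1.5), ("spk2", 1.0, 2.0)]
event_seq = event_based(segments)
spks, frames = frame_based(segments)
```

The event-based form scales with the number of turns, while the frame-based form scales with recording length, which is one plausible reason the abstract reports event-based outputs behaving more robustly on long-form recordings.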
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: speech technologies, spoken language understanding
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English, Chinese
Submission Number: 6122