VoxKrikri: Unifying Speech and Language through Continuous Fusion

Dimitrios Damianos; Leon Voukoutis; Georgios Paraskevopoulos; Vassilis Katsouros

Published: 02 Jun 2026, Last Modified: 21 Jun 2026

Venue: Greeks in AI 2026

License: CC BY 4.0

Keywords: Speech LLMs, modality fusion, continuous latent space, causal masking, ASR

Domains: Language and Learning

TL;DR: VoxKrikri is the first Greek speech LLM. It fuses Whisper's decoder hidden states with a text LLM via cross-modal attention in a continuous space, supports both streaming and offline modes, and achieves state-of-the-art Greek ASR with an average relative improvement of about 20% across benchmarks.

External Link: https://arxiv.org/abs/2509.15667

Abstract: We present a multimodal fusion framework that bridges pre-trained decoder-based large language models (LLMs) and acoustic encoder-decoder architectures such as Whisper, with the aim of building speech-enabled LLMs. Instead of directly using audio embeddings, we explore an intermediate audio-conditioned text space as a more effective mechanism for alignment. Our method operates fully in continuous text representation spaces, fusing Whisper's hidden decoder states with those of an LLM through cross-modal attention, and supports both offline and streaming modes. We introduce VoxKrikri, the first Greek speech LLM, and show through analysis that our approach effectively aligns representations across modalities. VoxKrikri achieves state-of-the-art results for Automatic Speech Recognition in Greek, with an average relative improvement of about 20% across benchmarks. These results highlight continuous space fusion as a promising path for multilingual and low-resource speech LLMs.
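
To make the fusion idea concrete, the following is a minimal sketch of cross-modal attention in which LLM hidden states act as queries over speech-decoder hidden states. It is an illustration under stated assumptions, not the authors' implementation: the function names, the absence of learned projections, the residual addition, and the `visible` argument used to mimic streaming are all hypothetical choices made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(llm_states, speech_states, visible=None):
    """Fuse LLM hidden states with speech-decoder hidden states.

    llm_states:    list of T vectors, used as queries (hypothetical layout).
    speech_states: list of S vectors, used as both keys and values.
    visible:       optional list of T ints; position t may attend only to
                   speech_states[:visible[t]]. This is an assumed stand-in
                   for a streaming mask, not the paper's masking scheme.
    """
    dim = len(llm_states[0])
    fused = []
    for t, query in enumerate(llm_states):
        limit = len(speech_states) if visible is None else visible[t]
        context = speech_states[:limit]
        if not context:
            # No speech available yet: pass the LLM state through unchanged.
            fused.append(list(query))
            continue
        # Scaled dot-product scores between the query and each speech state.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in context
        ]
        weights = softmax(scores)
        # Attention-weighted mix of the speech states.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, context))
            for i in range(dim)
        ]
        # Residual connection keeps the original LLM representation.
        fused.append([h + c for h, c in zip(query, mixed)])
    return fused


if __name__ == "__main__":
    llm = [[1.0, 0.0], [0.0, 1.0]]
    speech = [[2.0, 0.0], [0.0, 2.0]]
    print(cross_attend(llm, speech))                  # offline: full context
    print(cross_attend(llm, speech, visible=[1, 2]))  # streaming-style mask
```

In the offline call every position attends to all speech states, while the masked call restricts each position to the speech states assumed to be available at that step. A real system would add learned query, key, and value projections and multiple heads.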

Submission Number: 193
