Abstract: The recently proposed Large Concept Model (LCM) generates text by predicting a sequence of sentence-level embeddings and training with either mean squared error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer that thinks in the same continuous SONAR embedding space yet is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal. Across model sizes from 100M to 900M parameters, SONAR-LLM attains competitive generation quality. We report scaling trends, ablations, and benchmark results, and release the complete training code and all pretrained checkpoints to foster reproducibility and future research.
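To illustrate the hybrid objective described in the abstract, below is a minimal PyTorch sketch of how a token-level cross-entropy loss can be propagated through a frozen sentence decoder back into a predicted sentence embedding. The `FrozenSonarDecoder` class, its dimensions, and the loss helper are hypothetical stand-ins, not the actual SONAR API or the authors' implementation; they only show the gradient-flow pattern the abstract describes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical toy dimensions; the real SONAR embedding size and vocabulary differ.
EMB_DIM, VOCAB, MAX_TOK = 256, 1000, 16


class FrozenSonarDecoder(nn.Module):
    """Stand-in for a frozen sentence decoder: maps a sentence embedding to
    per-position token logits. In the paper's setting this role is played by
    the pretrained SONAR decoder with its weights kept fixed."""

    def __init__(self) -> None:
        super().__init__()
        self.proj = nn.Linear(EMB_DIM, MAX_TOK * VOCAB)
        for p in self.parameters():
            p.requires_grad_(False)  # decoder stays frozen

    def forward(self, sent_emb: torch.Tensor) -> torch.Tensor:
        # (B, EMB_DIM) -> (B, MAX_TOK, VOCAB)
        return self.proj(sent_emb).view(-1, MAX_TOK, VOCAB)


def embedding_ce_loss(predicted_emb, frozen_decoder, target_tokens, pad_id=0):
    """Token-level cross-entropy propagated through the frozen decoder:
    gradients reach predicted_emb (and hence the upstream LLM) even though
    the decoder parameters receive no updates."""
    logits = frozen_decoder(predicted_emb)  # (B, T, V)
    return F.cross_entropy(
        logits.reshape(-1, VOCAB),
        target_tokens.reshape(-1),
        ignore_index=pad_id,
    )


# Toy usage: pretend the LLM produced one next-sentence embedding per item.
llm_output = torch.randn(2, EMB_DIM, requires_grad=True)
decoder = FrozenSonarDecoder()
tokens = torch.randint(1, VOCAB, (2, MAX_TOK))

loss = embedding_ce_loss(llm_output, decoder, tokens)
loss.backward()  # gradients flow into llm_output only; decoder stays untouched
```

In this sketch the decoder is a single linear layer for brevity; the point is only that the likelihood-based signal (cross-entropy over tokens) supervises a model whose outputs live in the continuous embedding space, replacing the MSE or diffusion objectives used by LCM.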
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: pre-training, scaling, efficient models
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 6484