TL;DR: Orthus is a unified multimodal model for interleaved image-text understanding and generation. It generates content across modalities by routing the outputs from its shared transformer backbone to modality-specific heads.
Abstract: We introduce Orthus, a unified multimodal model that excels at generating interleaved images and text from mixed-modality inputs by simultaneously handling discrete text tokens and continuous image features under the autoregressive (AR) modeling principle. Treating visual signals as continuous features minimizes information loss, while the fully AR formulation makes it straightforward to characterize the correlation between modalities. Orthus leverages these advantages through its modality-specific heads: a regular language modeling (LM) head predicts discrete text tokens, and a diffusion head generates continuous image features. We devise an efficient strategy for building Orthus: by substituting the vector quantization (VQ) operation in an existing unified AR model with a soft alternative, introducing a diffusion head, and tuning the added modules to reconstruct images, we can create an Orthus-base model with little effort (e.g., within 72 A100 GPU hours). Orthus-base can further undergo post-training to craft lengthy interleaved image-text content, reflecting its potential for handling intricate real-world tasks. For visual understanding and generation, Orthus achieves a GenEval score of 0.58 and an MME-P score of 1265.8 using 7B parameters, outperforming competing baselines including Show-o and Chameleon.
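To make the dual-head design concrete, here is a minimal PyTorch sketch of the routing idea described above: a shared causal transformer processes an interleaved sequence in which text positions carry token embeddings and image positions carry linearly projected continuous features (a stand-in for the "soft" replacement of a hard VQ lookup), and the resulting hidden states are routed either to an LM head for text logits or to a small MLP standing in for the diffusion head. All module names, dimensions, and the simplified denoising head (no timestep conditioning or noise schedule) are illustrative assumptions, not the authors' implementation.

```python
# Conceptual sketch of Orthus-style dual-head routing (not the authors' code).
# All module names, sizes, and the simplified diffusion head are assumptions.
import torch
import torch.nn as nn


class DualHeadBackbone(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, image_feat_dim=16,
                 n_layers=4, n_heads=8):
        super().__init__()
        # Shared causal transformer backbone over a mixed-modality sequence.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Continuous image features enter through a linear map, standing in
        # for the "soft" replacement of a hard VQ codebook lookup.
        self.image_in = nn.Linear(image_feat_dim, d_model)
        # Modality-specific heads.
        self.lm_head = nn.Linear(d_model, vocab_size)       # discrete text logits
        self.diffusion_head = nn.Sequential(                # toy stand-in for a
            nn.Linear(d_model + image_feat_dim, d_model),   # per-patch denoiser
            nn.GELU(),                                      # conditioned on the
            nn.Linear(d_model, image_feat_dim),             # backbone state
        )

    def forward(self, text_ids, image_feats, is_image):
        # Interleaved input: text positions use token embeddings, image
        # positions use projected continuous features.
        x = torch.where(
            is_image.unsqueeze(-1),
            self.image_in(image_feats),
            self.text_embed(text_ids),
        )
        causal_mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.backbone(x, mask=causal_mask)

    def predict_text(self, hidden):
        # LM head: next-token logits for text positions.
        return self.lm_head(hidden)

    def denoise_image(self, hidden, noisy_feats):
        # Diffusion head: predict clean image features from the hidden state
        # and a noised feature (timestep conditioning omitted for brevity).
        return self.diffusion_head(torch.cat([hidden, noisy_feats], dim=-1))


if __name__ == "__main__":
    B, T, F = 2, 8, 16
    model = DualHeadBackbone(image_feat_dim=F)
    text_ids = torch.randint(0, 32000, (B, T))
    image_feats = torch.randn(B, T, F)
    is_image = torch.zeros(B, T, dtype=torch.bool)
    is_image[:, 4:] = True                      # last positions hold image patches
    hidden = model(text_ids, image_feats, is_image)
    logits = model.predict_text(hidden)         # (B, T, vocab_size)
    denoised = model.denoise_image(hidden, image_feats + torch.randn_like(image_feats))
    print(logits.shape, denoised.shape)
```

In the actual model the diffusion head would be trained with a denoising objective conditioned on the backbone's hidden states while the LM head uses standard next-token cross-entropy; the sketch only illustrates how a single backbone can feed both modality-specific heads.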
Lay Summary: Today’s AI models often need to understand both pictures and words — like describing an image, generating a picture from a sentence, or even combining the two in a creative way. But making one model that handles both smoothly is hard because images and text are fundamentally different kinds of data. Existing solutions either lose information when processing images or struggle to connect the two types of data well.
We introduce a unified multimodal model called Orthus that learns to handle both text and images together, without forcing images into lossy representations. It uses one regular language modeling head to deal with discrete text and another diffusion head to generate continuous images, but both parts work together within a single transformer.
Our model can understand and create mixed content like storybooks with interleaved images and text. We hope Orthus will make future multimodal models more creative, flexible, and useful for real-world tasks.
Primary Area: Deep Learning->Foundation Models
Keywords: multimodal foundation model, interleaved image-text generation, diffusion
Submission Number: 10719