Keywords: Tool Use of LLMs, Reinforcement Learning, Interleaved Generation
TL;DR: A flexible and dynamic framework that reframes interleaved image-text generation as a tool-use problem
Abstract: We propose LLM-Interleaved (**LLM-I**), a flexible and dynamic framework that reframes interleaved image-text generation as a tool-use problem. LLM-I is designed to overcome the "one-tool" bottleneck of current unified models, which are limited to synthetic imagery and struggle with tasks requiring factual grounding or programmatic precision. Our framework empowers a central LLM or MLLM agent to intelligently orchestrate a diverse toolkit of specialized visual tools, including online image search, diffusion-based generation, code execution, and image editing. The agent is trained to select and apply these tools proficiently via a Reinforcement Learning (RL) framework that features a hybrid reward system combining rule-based logic with judgments from LLM and MLLM evaluators. Trained on a diverse new dataset using four different model backbones, LLM-I demonstrates state-of-the-art performance, outperforming existing methods by a large margin across four benchmarks. We also introduce a novel test-time scaling strategy that provides further performance gains.
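To make the orchestration idea concrete, here is a minimal sketch of an agent dispatching to the four tool types the abstract names and scoring its output with a hybrid reward. The `<tool:argument>` marker syntax, the `ToolCall` structure, the tool stubs, and the reward weights are all hypothetical illustrations of the described design (rule-based checks combined with LLM/MLLM judge scores), not the authors' implementation.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical tool registry covering the four tool types in the abstract.
# Each stub maps a text argument to an image placeholder string.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda q: f"<image: web result for '{q}'>",
    "diffusion": lambda p: f"<image: generated from prompt '{p}'>",
    "code": lambda src: f"<image: chart rendered by executing '{src}'>",
    "edit": lambda inst: f"<image: edited per '{inst}'>",
}

@dataclass
class ToolCall:
    tool: str
    argument: str

def parse_tool_calls(text: str) -> List[ToolCall]:
    """Extract <tool:argument> markers from the agent's draft.

    The marker syntax is an assumption for illustration; the actual
    interleaving format used by LLM-I is defined in the paper.
    """
    return [ToolCall(t, a) for t, a in re.findall(r"<(\w+):([^>]+)>", text)]

def render(draft: str) -> str:
    """Replace each tool-call marker with the corresponding tool's output."""
    for call in parse_tool_calls(draft):
        if call.tool in TOOLS:
            draft = draft.replace(
                f"<{call.tool}:{call.argument}>",
                TOOLS[call.tool](call.argument),
                1,
            )
    return draft

def hybrid_reward(draft: str, judge_score: float) -> float:
    """Combine a rule-based check with a judge score (weights are made up)."""
    rule_ok = 1.0 if parse_tool_calls(draft) else 0.0  # used at least one tool
    return 0.3 * rule_ok + 0.7 * judge_score

if __name__ == "__main__":
    draft = ("The Eiffel Tower at night: <search:Eiffel Tower night> "
             "and a stylized version: <diffusion:impressionist Eiffel Tower>.")
    print(render(draft))
    print("reward:", hybrid_reward(draft, judge_score=0.9))
```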
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3755