TL;DR: Text-only LLMs can perform multimodal tasks (e.g., captioning, text-to-image, stylization) without any training! The LLM generates candidates that are scored by an off-the-shelf model and fed back iteratively, eventually producing the final output.
Abstract: We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbue multimodal capabilities into your favorite LLM. Leveraging their innate ability to perform multi-step reasoning, MILS prompts the LLM to generate candidate outputs, each of which is scored and fed back iteratively, eventually generating a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even editing prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
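For illustration, here is a minimal sketch of the propose–score–feed-back loop described above. The names `llm_propose` and `scorer` are hypothetical placeholders for an off-the-shelf LLM and a pretrained multimodal scorer (e.g., a CLIP-style similarity model); they are not the actual functions from the MILS repository, whose real implementation is at the link below.

```python
# Minimal sketch of a MILS-style iterative generate-score-refine loop.
# `llm_propose` and `scorer` are assumed, user-supplied callables, not MILS code.

def mils_loop(task_prompt, target_input, llm_propose, scorer,
              num_candidates=32, num_steps=10, keep_top=8):
    """Iteratively refine text candidates for a multimodal target (e.g., an image)."""
    feedback = []  # (score, candidate) pairs fed back to the LLM each step
    for _ in range(num_steps):
        # 1. The LLM generates candidate outputs, conditioned on prior scored candidates.
        candidates = llm_propose(task_prompt, feedback, n=num_candidates)
        # 2. An off-the-shelf model scores each candidate against the target input.
        scored = [(scorer(c, target_input), c) for c in candidates]
        # 3. Keep the highest-scoring candidates and feed them back next iteration.
        feedback = sorted(scored, key=lambda x: x[0], reverse=True)[:keep_top]
    # Return the best candidate found as the solution to the task.
    return feedback[0][1]
```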
Lay Summary: This work introduces MILS, a method that allows large language models (LLMs) to interpret images, videos, and audio—without any additional training or task-specific data. MILS connects an LLM with a scoring model that evaluates how well each proposed caption or description matches a given input, such as an image or audio clip. The LLM generates multiple candidate responses, receives feedback from the scorer, and refines its outputs iteratively.
MILS demonstrates strong zero-shot performance across a wide range of tasks: captioning visual and audio inputs, enhancing text-to-image generation, performing style transfer, and even combining information across modalities. It does all this using only pre-trained models and test-time reasoning, avoiding any fine-tuning or supervised training.
By leveraging the native reasoning ability of LLMs and the representational power of multimodal models, MILS shows that powerful multimodal understanding and generation can emerge without explicit supervision. Its simplicity and flexibility open new possibilities for building general-purpose AI systems that operate across modalities.
Link To Code: https://github.com/facebookresearch/MILS/
Primary Area: Deep Learning->Large Language Models
Keywords: large language models, LLMs, multimodal, reasoning, captioning, image generation, test-time optimization, gradient-free optimization
Submission Number: 7422