AnyCap: Omni-Modal Captioning with Instruction Alignment

ICLR 2026 Conference Submission 17988 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Instruction Alignment, Caption, Omni-Modal
TL;DR: We introduce a unified solution for instruction-aligned omni-modal captioning—including a plug-and-play model (AnyCap), a large-scale dataset (AnyCapData), and a rigorous benchmark (AnyCapEval).
Abstract: We present AnyCap, a plug-and-play framework that brings instruction alignment to omni-modal captioning. Captions offer a unified language interface for multimodal learning, and users increasingly expect instruction-driven control over their content and style. Current caption models lack explicit instruction supervision and are weak at instruction following, while directly tuning them can degrade general language ability. Achieving instruction alignment in an omni-modal setting is harder still, as each modality calls for separate models and custom designs. To address these challenges, AnyCap leverages a residual-correction paradigm that refines uncontrolled captions from existing models into instruction-aligned ones, without retraining the base models. By processing multi-modality features in a unified framework, it enables one model to serve images, videos, and audio. To address the lack of instruction-based data, we construct AnyCapData, a large-scale, high-quality corpus spanning three modalities with 28 well-designed instruction types. For evaluation, we address the limitations of current metrics for instruction-oriented captioning by designing AnyCapEval. Its key insight is to decouple evaluation into content and style for fine-grained assessment. Extensive experiments show that on AnyCapEval and diverse public benchmarks, AnyCap consistently improves both caption quality and instruction adherence for both open-source and API-based models. Notably, AnyCap-8B boosts GPT-4o's content scores by 46% and style scores by 12%. Our code and models will be made publicly available.
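The sketch below illustrates the residual-correction idea described in the abstract: a frozen base captioner produces an uncontrolled caption, a lightweight corrector conditions on the modality features, the base caption, and the user instruction to emit an instruction-aligned caption, and evaluation is decoupled into content and style scores. All names (BaseCaptioner, AnyCapCorrector, evaluate) and the scoring heuristics are hypothetical placeholders, not the released AnyCap or AnyCapEval API.

```python
# Hypothetical sketch of the plug-and-play residual-correction pipeline.
# The classes and scoring logic are illustrative stand-ins only.

from dataclasses import dataclass


@dataclass
class CaptionRequest:
    modality: str          # "image", "video", or "audio"
    features: list[float]  # pre-extracted modality features (placeholder)
    instruction: str       # user instruction, e.g. "one sentence, focus on actions"


class BaseCaptioner:
    """Stands in for any frozen captioning model (open-source or API-based)."""

    def caption(self, req: CaptionRequest) -> str:
        # In practice this would call the underlying model; here we return a stub.
        return f"A generic, uncontrolled {req.modality} caption."


class AnyCapCorrector:
    """Refines the base caption toward the instruction without touching
    the base model's weights (residual correction)."""

    def refine(self, req: CaptionRequest, base_caption: str) -> str:
        # The corrector conditions on (features, base caption, instruction)
        # and outputs an instruction-aligned caption.
        return f"[aligned to '{req.instruction}'] {base_caption}"


def evaluate(caption: str, reference: str, instruction: str) -> dict[str, float]:
    """Toy AnyCapEval-style decoupled scoring: content vs. style adherence."""
    ref_words = set(reference.lower().split())
    cap_words = set(caption.lower().split())
    content = len(cap_words & ref_words) / max(len(ref_words), 1)
    style = float(instruction.lower() in caption.lower())  # crude adherence proxy
    return {"content": content, "style": style}


if __name__ == "__main__":
    req = CaptionRequest("image", [0.0] * 8, "one sentence, focus on actions")
    base = BaseCaptioner().caption(req)
    refined = AnyCapCorrector().refine(req, base)
    print(refined)
    print(evaluate(refined, reference=base, instruction=req.instruction))
```

Because the corrector only post-processes outputs, the same pipeline can wrap open-source or API-based captioners across image, video, and audio inputs without retraining them.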
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17988