Keywords: Omni model, Multimodal large language model, Detailed captioning, Audio understanding, Video understanding, Benchmark, Evaluation
TL;DR: In this work, we present a complete framework for advancing omni detailed perception from three perspectives: Data, Model, and Benchmark.
Abstract: Fine-grained perception of multimodal information is critical for advancing human–AI interaction. With recent progress in audio–visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and accurately describe fine-grained details remains underexplored. In this work, we present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We first identify an inherent ``co-growth'' between the level of detail and the degree of hallucination in current OLMs. To address this, we propose \textbf{Omni-Detective}, an agentic data generation pipeline that integrates tool calling to autonomously produce highly detailed yet minimally hallucinated multimodal data. Based on the data generated with Omni-Detective, we train two captioning models: \textbf{Audio-Captioner} for audio-only detailed perception, and \textbf{Omni-Captioner} for audio–visual detailed perception. Under the cascade evaluation protocol, Audio-Captioner achieves the best performance on MMAU and MMAR among all open-source models, surpassing Gemini 2.5 Flash and delivering performance comparable to Gemini 2.5 Pro. On existing detailed captioning benchmarks, Omni-Captioner sets a new state of the art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 test set. Given the absence of a dedicated benchmark for omni detailed perception, we design \textbf{Omni-Cloze}, a novel cloze-style evaluation for detailed audio, visual, and audio-visual captioning that ensures stable, efficient, and reliable assessment.
Experimental results and analysis demonstrate the effectiveness of Omni-Detective in generating high-quality detailed captions, as well as the superiority of Omni-Cloze and its alignment with human preferences in evaluating such captions. The data pipeline, models, and benchmark are all open-sourced to facilitate further research on omni detailed perception.\footnote{\url{https://github.com/ddlBoJack/Omni-Captioner}}
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10454