V2A-CoT: A Training-Free Video-to-Audio Method via Expert Chain-of-Thought

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Chain-of-Thought, video-to-audio, multimodal large language models
TL;DR: We propose V2A-CoT, a training-free video-to-audio framework based on video understanding via expert CoT.
Abstract: The rapid development of multimodal models such as Sora has spurred progress in video generation. To make generated videos more realistic, sound should be generated alongside them. Most existing video-to-audio (V2A) methods, however, rely on training, which incurs substantial computational cost and limits their applicability to general scenes. In this paper, we propose V2A-CoT, a zero-shot Chain-of-Thought (CoT) framework based on multimodal large language models (MLLMs). V2A-CoT understands the video from both macro and micro perspectives and generates the corresponding audio. Specifically, V2A-CoT guides the MLLM, through a carefully designed expert CoT, to capture object information from the video and predict each object's sound attributes from a macro perspective spanning the entire video. Referencing this information, the MLLM then enriches the sound details from multiple perspectives. To predict dynamic changes in the sound, V2A-CoT then captures the relative position of the sound source from a micro perspective and finally generates the audio with a pre-trained model. Compared to state-of-the-art V2A methods: (i) V2A-CoT is the first training-free V2A method and enhances video understanding from both macro and micro perspectives; (ii) comprehensive evaluations show that V2A-CoT strengthens the MLLM's audiovisual understanding, improving video understanding accuracy by up to 14% and nearly doubling the audiovisual consistency of the generated audio; (iii) V2A-CoT excels at finely controlled sound tasks and adapts seamlessly to any MLLM in a plug-and-play manner.
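The abstract describes a two-stage pipeline (macro CoT over the whole video, micro CoT over source motion, then synthesis with a frozen generator). The sketch below is a minimal, hypothetical illustration of that flow, assuming the MLLM and text-to-audio model are exposed as simple callables; the function names, prompt wording, and interfaces are assumptions for illustration, not the authors' actual prompts or implementation.

```python
# Illustrative sketch of a training-free V2A-CoT-style pipeline.
# All names, prompts, and interfaces here are assumptions, not the paper's code.

from typing import Callable

MLLM = Callable[[str, str], str]             # (video_path, prompt) -> text answer
TextToAudio = Callable[[str, float], bytes]  # (caption, duration_s) -> waveform bytes

# Stage 1 (macro perspective): expert CoT prompt asking the MLLM to list
# sounding objects across the entire video and predict their sound attributes.
MACRO_PROMPT = (
    "Step 1: List every object in the video that can produce sound.\n"
    "Step 2: For each object, describe its sound attributes "
    "(timbre, loudness, pitch, duration).\n"
    "Answer with one line per object."
)

# Stage 2 (micro perspective): prompt asking how each sound source's relative
# position changes over time, to capture dynamic changes in the sound.
MICRO_PROMPT_TEMPLATE = (
    "The video contains these sound sources:\n{objects}\n"
    "For each source, describe how its position relative to the camera "
    "changes over time (e.g. approaches, recedes, passes left to right)."
)

def v2a_cot(video_path: str, mllm: MLLM, tta: TextToAudio,
            duration_s: float = 10.0) -> bytes:
    """Training-free video-to-audio: two CoT queries, then audio synthesis."""
    # Macro: what sounds, and how it sounds, across the whole video.
    objects = mllm(video_path, MACRO_PROMPT)
    # Micro: how the sound sources move, driving dynamic changes in the audio.
    dynamics = mllm(video_path, MICRO_PROMPT_TEMPLATE.format(objects=objects))
    # Compose a caption and hand it to a frozen, pre-trained text-to-audio model.
    caption = f"{objects}\nMotion: {dynamics}"
    return tta(caption, duration_s)
```

Because the MLLM and the text-to-audio model enter only as callables, any MLLM can be swapped in without retraining, which is the plug-and-play property the abstract claims.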
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12018