TAGMO: Temporal Control Audio Generation for Multiple Visual Objects Without Training

Published: 01 Jan 2025 · Last Modified: 13 Nov 2025 · ICASSP 2025 · CC BY-SA 4.0
Abstract: With the rise of video generation models such as Sora, video-based audio generation has become increasingly important. Although numerous video-to-audio generation models have emerged, they frequently suffer from semantic mismatches and synchronization problems, especially in scenes containing multiple objects. To address these difficulties, we introduce TAGMO, a novel training-free audio generation method that offers precise temporal control for multi-object video scenarios. Our approach first employs object detection to obtain the class label and temporal label of each object; these labels are then structured and used as control conditions within a latent diffusion model (LDM) to generate multi-object audio. In addition, we design a time mask derived from the temporal labels and integrate it into the denoising process of the pre-trained audio generation model to achieve accurate temporal control. Experimental results demonstrate that our method improves both temporal alignment accuracy and semantic consistency. Audio demonstrations are available at https://coco-create.github.io/.
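To make the time-mask idea concrete, below is a minimal sketch of how per-object temporal labels could be turned into a binary mask over latent time frames and applied during denoising, so that each object's condition only influences the noise prediction inside its active window. This is an illustration under stated assumptions, not the paper's implementation: the function names `build_time_mask` and `masked_denoise_step` are hypothetical, the latent layout (batch, channels, time frames, frequency bins) and the diffusers-style `unet(z_t, t, encoder_hidden_states=...)` call returning a noise prediction are assumptions.

```python
import torch

def build_time_mask(temporal_labels, total_secs, num_frames):
    """Map per-object (start_sec, end_sec) labels onto latent time frames.

    temporal_labels: list of (start_sec, end_sec) tuples, one per object.
    Returns a binary tensor of shape (num_objects, num_frames).
    (Hypothetical helper; not from the paper.)
    """
    mask = torch.zeros(len(temporal_labels), num_frames)
    for i, (start, end) in enumerate(temporal_labels):
        s = int(start / total_secs * num_frames)
        e = int(end / total_secs * num_frames)
        mask[i, s:e] = 1.0  # object i is audible only in [s, e)
    return mask

@torch.no_grad()
def masked_denoise_step(unet, z_t, t, cond_embeds, null_embed, mask):
    """One denoising step with time-masked conditioning.

    Starts from the unconditional noise prediction and, for each object,
    blends in its conditioned prediction only where its mask is active.
    z_t: latent of shape (B, C, T_frames, F_bins); mask: (num_objects, T_frames).
    (Assumes a unet callable that returns the noise prediction tensor.)
    """
    eps = unet(z_t, t, encoder_hidden_states=null_embed)
    for cond, m in zip(cond_embeds, mask):
        eps_c = unet(z_t, t, encoder_hidden_states=cond)
        w = m.view(1, 1, -1, 1)         # broadcast over batch, channels, freq
        eps = eps + w * (eps_c - eps)   # conditioned inside the window only
    return eps
```

Because the mask acts on the noise prediction rather than on model weights, a sketch like this stays training-free: it only reweights the outputs of a frozen pre-trained LDM at inference time, which matches the paper's stated setting.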