Text-Guided Video Amodal Completion

02 Apr 2025 (modified: 15 Jul 2025), Rejected by TMLR, CC BY 4.0

Abstract: Amodal perception enables humans to perceive entire objects even when parts are occluded, a remarkable cognitive skill that artificial intelligence struggles to replicate. While substantial advances have been made in image amodal completion, video amodal completion remains underexplored despite its high potential for real-world applications in video editing and analysis. In response, we propose a video amodal completion framework to explore this direction. Our contributions include (i) a synthetic dataset for video amodal completion with a text description of the object of interest; the dataset captures a variety of object types, textures, motions, and scenarios to support zero-shot transfer to natural videos; (ii) a diffusion-based, text-guided video amodal completion framework enhanced with a motion continuity module to ensure temporal consistency across frames; and (iii) zero-shot inference for long videos, inspired by temporal diffusion techniques, to effectively manage long video sequences while improving inference accuracy and maintaining coherent amodal completions. Experimental results show the efficacy of our approach in handling video amodal completion, opening up potential capabilities for advanced video editing and analysis with amodal completion.

Submission Length: Long submission (more than 12 pages of main content)

Changes Since Last Submission: Here, we summarize the major changes made in the revised manuscript as follows:
- We have added the following statistics to Table 1: average object size, average occlusion rate, and average occlusion-coverage rate, to address the comments raised by reviewer FHRB.
- We have revised Figure 4 (Page 9) and Figure 5 (Page 10) to include clips in which the target objects exhibit more dynamic motion, to address the comments raised by reviewers FHRB and JU1P.
- We have added an additional ablation study to Table 3 (Page 12) covering our TGVAC setting with text-prompt + motion training, to address the comments raised by reviewer JU1P.
- We have added Section 4.5, "Comparison with video amodal completion methods" (Pages 12-13), to address the comments raised by reviewer FHRB.
- We have added Section 4.6, "Comparison with text-to-video diffusion pretrained", to address the comments raised by reviewer DB2i.
- We have uploaded several videos of TGVAC applied to natural videos as supplementary materials, to address the comments raised by reviewer DB2i.
- We have updated the revised manuscript to use \citep, to address the comments raised by reviewer DB2i.

Assigned Action Editor: ~Sungwoong_Kim2

Submission Number: 4606
