TL;DR: A new method and dataset utilizing inflectional flow to address the Audible Action Temporal Localization problem.
Abstract: We introduce the task of Audible Action Temporal Localization, which aims to identify the spatio-temporal coordinates of audible movements. Unlike conventional tasks such as action recognition and temporal action localization, which broadly analyze video content, our task focuses on the distinct kinematic dynamics of audible actions. It is based on the premise that key actions are driven by inflectional movements; for example, collisions that produce sound often involve abrupt changes in motion. To capture this, we propose $TA^{2}Net$, a novel architecture that estimates inflectional flow using the second derivative of motion to determine collision timings without relying on audio input. $TA^{2}Net$ also integrates a self-supervised spatial localization strategy during training, combining contrastive learning with spatial analysis. This dual design improves temporal localization accuracy and simultaneously identifies sound sources within video frames. To support this task, we introduce a new benchmark dataset, $Audible623$, derived from Kinetics and UCF101 by removing subsets dominated by non-essential vocalizations. Extensive experiments confirm the effectiveness of our approach on $Audible623$ and show strong generalizability to other domains, such as repetitive action counting and sound source localization. Code and dataset are available at https://github.com/WenlongWan01/Audible623.
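For intuition, the core premise can be sketched in a few lines: treat a per-frame motion signal (e.g., mean optical-flow magnitude) as a 1-D sequence and mark frames where its second derivative spikes as candidate collision timings. This is a minimal illustrative sketch under those assumptions, not the $TA^{2}Net$ implementation; the function name, threshold, and peak test below are hypothetical choices for the example.

```python
import numpy as np

def inflectional_flow_peaks(motion_magnitude, threshold=0.5):
    """Toy sketch: locate candidate collision frames from a 1-D
    sequence of per-frame motion magnitudes.

    Audible actions are assumed to coincide with abrupt motion
    changes, which show up as large second derivatives. The
    `threshold` and peak test are illustrative, not the paper's
    actual criterion.
    """
    m = np.asarray(motion_magnitude, dtype=float)
    # Second derivative via repeated finite differences:
    # a proxy for "inflectional flow" along the time axis.
    accel = np.gradient(np.gradient(m))
    score = np.abs(accel)
    # Keep local maxima of |d^2 m / dt^2| above a relative threshold.
    return [
        t for t in range(1, len(score) - 1)
        if score[t] > score[t - 1]
        and score[t] >= score[t + 1]
        and score[t] > threshold * score.max()
    ]

# Example: a motion trace that rises steadily, then stops abruptly,
# as a bouncing object would on impact.
trace = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 0.1, 0.1, 0.1]
print(inflectional_flow_peaks(trace))  # frames with abrupt motion change
```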
Lay Summary: This work focuses on accurately identifying the exact moments in a video when a visible action is likely to produce a sound, such as an object hitting the ground, without relying on audio. The task is called Audible Action Temporal Localization. We introduce a model named $TA^{2}Net$ that captures sudden visual changes often linked to sound. To support this task, we create a new dataset called $Audible623$ by removing non-essential vocal parts from existing videos. The approach reduces manual effort in video dubbing and also generalizes well to tasks like counting repeated actions.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/WenlongWan01/Audible623
Primary Area: Applications->Computer Vision
Keywords: Video Understanding, Temporal Localization
Submission Number: 154