Ex Pede Herculem: Predicting Global Actionness Curve from Local Clips

Xu Chen, Yang Li, Yahong Han, Jialie Shen

Published: 26 Oct 2025, Last Modified: 06 May 2026 | OpenReview Archive Direct Upload | Everyone | CC BY 4.0

Abstract: Dense multi-label action detection in untrimmed long videos is a formidable task. End-to-end training is particularly challenging because of computational constraints, so existing methods typically split the pipeline into separate stages: off-the-shelf feature extraction followed by global modeling for action prediction. As a result, they cannot optimize all modules jointly for better performance. We introduce FreETAD, a Frequency-based End-to-end Temporal Action Detection approach that shifts the focus from predicting local actionness scores to estimating frequency components. Using the short-time Fourier transform (STFT), FreETAD reconstructs the global actionness curve from these components. A DETR-like decoder with frequency-encoded query vectors enhances multi-scale time-frequency interactions. FreETAD makes effective use of end-to-end training, improving mAP by 1.5% on Charades and 2.7% on MultiTHUMOS.
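
The idea of recovering a global curve from per-clip frequency components can be illustrated with a minimal sketch. This is not the FreETAD implementation: the function names (`clip_frequencies`, `reconstruct_curve`), the Hann window, the 50% overlap, and the toy curve are all assumptions made for illustration, and the frequency components are computed here with a plain DFT rather than predicted by a network. The sketch only shows that overlapping local spectra are enough to rebuild the full curve by inverse transform and overlap-add.

```python
import cmath
import math


def hann(width):
    """Periodic Hann window; copies shifted by width // 2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / width) for n in range(width)]


def dft(frame):
    """Naive discrete Fourier transform of one windowed clip."""
    size = len(frame)
    return [
        sum(frame[n] * cmath.exp(-2j * math.pi * k * n / size) for n in range(size))
        for k in range(size)
    ]


def idft(coeffs):
    """Inverse of dft(); keeps the real part of each sample."""
    size = len(coeffs)
    return [
        (sum(coeffs[k] * cmath.exp(2j * math.pi * k * n / size)
             for k in range(size)) / size).real
        for n in range(size)
    ]


def clip_frequencies(curve, width):
    """Hypothetical stand-in for the per-clip frequency estimates.

    Assumes width is even and len(curve) is a multiple of width // 2.
    Zero-pads by half a window on each side so every sample of the
    curve is covered by two overlapping clips.
    """
    hop = width // 2
    padded = [0.0] * hop + list(curve) + [0.0] * hop
    window = hann(width)
    return [
        dft([window[n] * padded[start + n] for n in range(width)])
        for start in range(0, len(padded) - width + 1, hop)
    ]


def reconstruct_curve(frequencies, width, length):
    """Inverse-transform each clip and overlap-add into one global curve."""
    hop = width // 2
    padded = [0.0] * (hop * (len(frequencies) - 1) + width)
    for index, coeffs in enumerate(frequencies):
        for n, value in enumerate(idft(coeffs)):
            padded[index * hop + n] += value
    # Drop the half-window of padding added in clip_frequencies().
    return padded[hop:hop + length]


if __name__ == "__main__":
    # Toy actionness curve: one action spanning frames 8 to 19.
    curve = [1.0 if 8 <= t < 20 else 0.0 for t in range(32)]
    local = clip_frequencies(curve, width=8)
    rebuilt = reconstruct_curve(local, width=8, length=len(curve))
    print(max(abs(a - b) for a, b in zip(curve, rebuilt)))
```

Because the shifted Hann windows sum to one at 50% overlap, the overlap-add step needs no extra normalization in this setting. A different window or hop size would require dividing by the accumulated window weights.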