Directly Locating Actions in Video with Single Frame Annotation

Published: 01 Jan 2024 · Last Modified: 11 Apr 2025 · ICMR 2024 · CC BY-SA 4.0
Abstract: We propose a novel method for point-supervised action localization. Unlike the common practice of locating actions by first categorizing each video frame, our method directly predicts the positions and lengths of actions. Specifically, point-supervised action localization is achieved through a series of fully supervised action localization steps performed iteratively. In each iteration, the video clips are used as input tokens and fed into a transformer, where the encoder extracts the global context of the clips and the decoder generates queries containing information for action localization. Three MLP heads are built on each query to obtain the probability, the center, and the length of each action instance, respectively. Experiments on three popular datasets demonstrate the potential of our method.
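To make the described prediction pipeline concrete, below is a minimal sketch of the query-based localization head in PyTorch. It is not the authors' implementation: the module names, feature and model dimensions, number of queries, and head structure are all illustrative assumptions; only the overall shape (transformer encoder over clip tokens, decoder queries, and three MLP heads yielding probability, center, and length per query) follows the abstract.

```python
# Minimal sketch of the architecture described in the abstract, assuming a
# PyTorch implementation. All dimensions, the number of queries, and module
# names are assumptions for illustration, not details from the paper.
import torch
import torch.nn as nn

class PointSupervisedLocalizer(nn.Module):
    def __init__(self, feat_dim=2048, d_model=256, num_queries=40,
                 num_layers=4, nhead=8):
        super().__init__()
        # Project pre-extracted per-clip video features to the model dimension.
        self.input_proj = nn.Linear(feat_dim, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        # Learned queries, one per candidate action instance.
        self.queries = nn.Embedding(num_queries, d_model)
        # Three MLP heads per query: action probability, center, and length
        # (center and length normalized to [0, 1] over the video duration).
        def mlp_head():
            return nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, 1), nn.Sigmoid())
        self.prob_head = mlp_head()
        self.center_head = mlp_head()
        self.length_head = mlp_head()

    def forward(self, clip_feats):
        # clip_feats: (batch, num_clips, feat_dim) clip-level features.
        src = self.input_proj(clip_feats)
        tgt = self.queries.weight.unsqueeze(0).expand(src.size(0), -1, -1)
        # Encoder builds global context over clips; decoder refines queries.
        hs = self.transformer(src, tgt)
        # Each query yields one candidate action instance.
        return self.prob_head(hs), self.center_head(hs), self.length_head(hs)

model = PointSupervisedLocalizer()
feats = torch.randn(2, 100, 2048)              # 2 videos, 100 clips each
prob, center, length = model(feats)
print(prob.shape, center.shape, length.shape)  # each (2, 40, 1)
```

Under this reading, each decoder query acts as a learned slot for one action instance, and the iterative point-supervised procedure would refit such a fully supervised model on pseudo-labels derived from the single-frame annotations at each round.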