Abstract: Action detection is a significant and challenging task, especially in densely-labelled datasets of untrimmed videos.
Such data consist of complex temporal relations including
composite or co-occurring actions. To detect actions in
these complex settings, it is critical to capture both short-term and long-term temporal information efficiently. To
this end, we propose a novel ‘ConvTransformer’ network
for action detection: MS-TCT. This network comprises
three main components: (1) a Temporal Encoder module which explores global and local temporal relations at
multiple temporal resolutions, (2) a Temporal Scale Mixer
module which effectively fuses multi-scale features, creating a unified feature representation, and (3) a Classification
module which learns a center-relative position of each action instance in time, and predicts frame-level classification
scores. Our experimental results on three challenging
datasets, Charades, TSU and MultiTHUMOS, validate the effectiveness of the proposed method, which outperforms the state-of-the-art methods on all three datasets.