Abstract: Temporal action localization in untrimmed long videos is an important yet challenging problem. The temporal ambiguity and the intra-class variations of temporal structure of actions make existing methods far from being satisfactory. In this paper, we propose a novel framework which firstly models each action clip based on its temporal evolution, and then adopts a deep multiple instance learning (MIL) network for jointly classifying action clips and refining their temporal boundaries. The proposed network utilizes a MIL scheme to make clip-level decisions based on temporal-instance-level decisions. Besides, a temporal smoothness constraint is introduced into the multi-task loss. We evaluate our framework on THUMOS Challenge 2014 benchmark and the experimental results show that it achieves considerable improvements as compared to the state-of-the-art methods. The performance gain is especially remarkable under precise localization with high tIoU thresholds, e.g. mAP@tIoU=0.5 is improved from 31.0% to 35.0%.
0 Replies
Loading