Abstract: Highlights
• Language features tend to be sparser than video features.
• Verbs carry discriminative information for distinguishing different videos.
• Salient actions and the temporal order of actions are two key factors.
• A Prompt Exploration module is designed to reduce feature sparsity.
• An Action Temporal Prediction module is introduced to enhance temporal awareness.