Abstract: An event typically consists of several actions, each of which is atomic. Inspired by this insight, we propose a novel framework, the Atomic-action-based Contrastive Network (ACN), for the weakly supervised temporal language grounding task, which aims to localize the query-related event moment in an untrimmed video without access to any temporal annotations. Specifically, ACN first determines the accurate moment boundary of each action in a query-agnostic way. This adequately exploits homogeneous visual cues while preventing the heterogeneity of the query from harming the atomicity of each visual action, i.e., its action boundary. To effectively localize the query-related event, we identify the discriminative words in the given query and introduce a composite-grained contrastive module that retrieves the corresponding atomic actions in a common latent space across modalities. This improves the feature discrimination of the visual event segment and removes irrelevant action segments. Experiments on two popular datasets show the efficacy of our model.
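To make the retrieval step more concrete, the following is a minimal illustrative sketch, not the ACN implementation. It assumes that action segments and query words have already been embedded into a shared latent space, scores them with cosine similarity, keeps the actions that match at least one word, and merges the kept actions into a single moment. The function names (`retrieve_actions`, `merge_to_moment`), the `threshold` parameter, and the toy features are all hypothetical and do not come from the paper.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def retrieve_actions(action_feats, word_feats, threshold=0.5):
    """Select action segments that are similar to at least one query word.

    action_feats: (num_actions, dim) array, one embedding per atomic action.
    word_feats:   (num_words, dim) array, one embedding per discriminative word.
    threshold:    hypothetical cosine-similarity cutoff (an assumption).
    Returns the indices of the kept actions and the full similarity matrix.
    """
    a = l2_normalize(np.asarray(action_feats, dtype=float))
    w = l2_normalize(np.asarray(word_feats, dtype=float))
    sim = a @ w.T                  # cosine similarity, (num_actions, num_words)
    best = sim.max(axis=1)         # best-matching word for each action
    return np.flatnonzero(best >= threshold), sim


def merge_to_moment(boundaries, selected):
    """Merge the selected action boundaries into one (start, end) moment."""
    if len(selected) == 0:
        return None
    starts = [boundaries[i][0] for i in selected]
    ends = [boundaries[i][1] for i in selected]
    return min(starts), max(ends)


# Toy example: three query-agnostic action segments and one query word.
boundaries = [(0.0, 2.0), (2.0, 5.0), (5.0, 9.0)]   # seconds, made up
action_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
word_feats = [[1.0, 0.0]]

selected, sim = retrieve_actions(action_feats, word_feats, threshold=0.5)
moment = merge_to_moment(boundaries, selected)
```

In this toy setting the first two actions align with the query word and the third does not, so the retrieved moment spans only the first two segments. A trained model would learn the embeddings with a contrastive objective rather than rely on a fixed threshold.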