Who, What and Where: Composite-semantic Instance Search for Story Videos

Jiahao Guo, Chao Liang, Zhongyuan Wang

Published: 2023, Last Modified: 05 Nov 2023, ICME 2023, Readers: Everyone

Abstract: This paper studies the Who-What-Where (3W) composite-semantic video instance search (INS) problem, which aims to find a specific person performing a queried action in a particular place. Mainstream approaches adopt a complete decomposition strategy, which divides a composite-semantic query into multiple single-semantic queries. However, because they do not analyze the correlations among the constituent semantics, these methods cannot always generate identity-matching and semantics-consistent 3W INS results. To address these challenges, we propose a partial decomposition scheme with action as the link. Specifically, we selectively split 3W INS into person-action INS and action-location INS. The former ensures that the retrieved person and action belong to the same identity by modeling their relative spatial positions at the frame level, while the latter improves the semantic consistency between action and location with a cross-semantic attention mechanism at the shot level. In addition, we build a large-scale 3W INS dataset, containing over 470k video shots, on the basis of the NIST TRECVID 2016-2021 INS tasks, and verify the effectiveness of the proposed method with both quantitative and qualitative experiments.
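The partial decomposition idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of box IoU as a stand-in for "relative spatial positions", and the product fusion in place of the cross-semantic attention mechanism are all assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical format
Detection = Tuple[Box, float]            # (bounding box, confidence score)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def person_action_score(persons: List[Detection],
                        actions: List[Detection]) -> float:
    """Frame-level person-action score (assumed form).

    A person detection and an action detection only count together when
    their boxes overlap, so an action performed by a different person in
    the same frame contributes nothing.
    """
    best = 0.0
    for p_box, p_score in persons:
        for a_box, a_score in actions:
            best = max(best, p_score * a_score * iou(p_box, a_box))
    return best


def rank_shots(person_action: Dict[str, float],
               action_location: Dict[str, float]) -> List[str]:
    """Shot-level fusion (assumed form): multiply the two partial scores
    and return shot IDs sorted from most to least relevant."""
    fused = {shot: score * action_location.get(shot, 0.0)
             for shot, score in person_action.items()}
    return sorted(fused, key=fused.get, reverse=True)
```

In this sketch, a frame where the queried person and the queried action appear in disjoint regions scores zero on the person-action branch, which is the kind of identity mismatch that a complete decomposition into independent single-semantic queries cannot detect.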
