Abstract: The Who-What-Where (3W) composite-semantics video Instance Search (INS) task aims to find video shots of a person performing an action in a location. State-of-the-art (SOTA) methods decompose 3W INS into three 2W INS subtasks, i.e., who-what, what-where and where-who semantic correlation modeling, and directly multiply the three 2W INS results to produce the final 3W INS result. These 2Ws clearly overlap semantically, e.g., who-what and what-where share the action component. This semantic overlap indicates that the 2Ws are interdependent rather than independent. According to probability theory, the joint probability of interdependent variables cannot be obtained accurately by directly multiplying them, so such a direct product yields a suboptimal outcome. This interdependence influences the 3W INS results in diverse ways. For instance, when fusing the two 2W INS results "Dr. Kelleher-provide medical guidance" and "provide medical guidance-in the hospital", "provide medical guidance" serves as a pivotal connection that positively enhances the rationality of both the person and the location. Conversely, while both "Ross-lifts heavy objects" and "lift heavy objects-Ross" are individually coherent, combining them by overlapping the shared element "Ross" creates a conflict between the hazardous setting and strenuous labor, ultimately undermining the overall plausibility. Inspired by quantum interference theory, we propose a Quantum Interference Partial Decomposition (QIPD) method to model the diverse influences of semantic overlap from 2W to 3W INS. Specifically, QIPD incorporates two core modules, i.e., semantic interference and temporal interference. The former derives the 3W amplitude by converting 2W samples into amplitudes and phases and performing interference, while the latter sets the current shot's phase as the baseline, amplifying the influence of adjacent shots while attenuating that of distant shots.
Extensive evaluations on three large-scale 3W INS datasets demonstrate that QIPD outperforms SOTA baselines.
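
To make the interference idea concrete, the following is a minimal sketch of textbook two-wave interference applied to a pair of scores. It is not the QIPD implementation: the function name `interfere`, the choice of the square root of each score as its amplitude, and the single `phase_diff` parameter are all assumptions made for illustration. The combined intensity is |a1 + a2·e^(iΔ)|² = a1² + a2² + 2·a1·a2·cos(Δ), so the cross term can strengthen the result (constructive, Δ near 0) or weaken it (destructive, Δ near π), which a plain product of the two scores cannot express.

```python
import cmath
import math


def interfere(p1: float, p2: float, phase_diff: float) -> float:
    """Combine two non-negative scores as interfering waves.

    Assumption for illustration: each score is treated as an intensity,
    so its amplitude is the square root of the score. `phase_diff` is the
    relative phase between the two waves, in radians.
    """
    a1 = math.sqrt(p1)
    a2 = math.sqrt(p2)
    # Superpose the two complex amplitudes, then take the squared magnitude.
    combined = a1 + a2 * cmath.exp(1j * phase_diff)
    return abs(combined) ** 2


# Same two scores, three different relative phases.
constructive = interfere(0.25, 0.25, 0.0)        # cross term adds
neutral = interfere(0.25, 0.25, math.pi / 2)     # cross term vanishes
destructive = interfere(0.25, 0.25, math.pi)     # cross term cancels
```

With identical input scores, the three phase settings give different combined values, which mirrors the abstract's point that shared semantics can either reinforce or undermine the fused result.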
External IDs:doi:10.1145/3746027.3755325