Abstract: Dialogue-to-video retrieval is an interesting yet challenging task in which an AI agent retrieves the video that matches and aligns with the conversational context between users. In particular, given a history of dialogue exchanges, the agent is expected to identify the video content that best complements the ongoing conversation. The computational cost of processing videos with deep neural networks motivates us to adapt CLIP, a cutting-edge large multimodal model, to the dialogue-video domain. On this basis, we propose a multi-grained attention network (MGAT) that integrates query-scoring, dual-softmax, and query-bank normalization techniques. We design a multi-grained attention module to remedy the inadequate modeling of conversation semantics in existing pre-trained models, dynamically assign weights during conversation feature extraction, and introduce conversation context features into multimodal alignment. Most importantly, a fine-grained similarity based on per-round query-frame scoring and a coarse-grained similarity based on the high-level semantics of all dialogue rounds are computed for each video, and the two are fused through the multi-grained attention mechanism. This approach effectively transfers CLIP's text-image multimodal knowledge to the dialogue-video retrieval setting, alleviating the need for resource-intensive and costly dialogue-video fine-tuning. Extensive experiments on multiple datasets demonstrate that our method outperforms state-of-the-art approaches.
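To make the multi-grained idea concrete, the sketch below shows one plausible way to combine a fine-grained per-round query-frame similarity with a coarse-grained dialogue-level similarity and apply dual-softmax normalization. This is not the authors' implementation: the tensor shapes, the pooling choices, and the fixed fusion weight `alpha` (standing in for the learned attention fusion) are all assumptions made for illustration.

```python
# Minimal sketch (assumed shapes and pooling, not the paper's code) of
# multi-grained dialogue-video similarity with dual-softmax normalization.
import torch
import torch.nn.functional as F

def multi_grained_similarity(query_emb, frame_emb, alpha=0.5):
    """
    query_emb: (B, R, D) per-round dialogue-query embeddings (R rounds)
    frame_emb: (B, T, D) per-frame video embeddings (T frames)
    Returns a (B, B) dialogue-video similarity matrix.
    """
    query_emb = F.normalize(query_emb, dim=-1)
    frame_emb = F.normalize(frame_emb, dim=-1)

    # Fine-grained: score every dialogue round against every frame of every
    # candidate video, then max-pool over frames and average over rounds.
    fine = torch.einsum("brd,ctd->brct", query_emb, frame_emb)  # (B, R, B, T)
    fine = fine.max(dim=-1).values.mean(dim=1)                  # (B, B)

    # Coarse-grained: pool all rounds / all frames into single vectors and
    # compare their high-level semantics directly.
    dlg = F.normalize(query_emb.mean(dim=1), dim=-1)            # (B, D)
    vid = F.normalize(frame_emb.mean(dim=1), dim=-1)            # (B, D)
    coarse = dlg @ vid.t()                                      # (B, B)

    # Fuse the two granularities; the paper learns this via attention,
    # here a fixed weight stands in for that mechanism.
    sim = alpha * fine + (1 - alpha) * coarse

    # Dual-softmax normalization over both retrieval directions.
    return F.softmax(sim, dim=0) * F.softmax(sim, dim=1)
```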