Abstract: With the rapid expansion of e-commerce, more consumers have become accustomed to making purchases via livestreaming. Accurately identifying the products being sold by salespeople, i.e., livestreaming product retrieval (LPR), is a fundamental yet challenging problem. In real-world scenarios, the LPR task involves three primary difficulties: 1) distinguishing intended products from distractor products in the cluttered background; 2) video-image heterogeneity, whereby the appearance of products showcased in live streams often deviates substantially from their standardized images in the store; and 3) the presence of numerous confusable products in the store that differ only in subtle visual details. To tackle these challenges, we propose the Spatiotemporal Graphing Multi-modal Network (SGMN). First, we employ a text-guided attention mechanism that leverages the salespeople's spoken content to direct the model toward the intended products, emphasizing their salience over cluttered background products. Second, we design a long-range spatiotemporal graph network that achieves both instance-level interaction and frame-level matching, resolving the misalignment caused by video-image heterogeneity. Third, we propose a multi-modal hard example mining strategy that helps the model distinguish highly similar products using fine-grained features across the video, image, and text modalities. Extensive quantitative and qualitative experiments demonstrate the superior performance of the proposed SGMN, which surpasses state-of-the-art methods by a substantial margin. The code and models will be released publicly soon.
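To make the text-guided attention idea concrete, the following minimal PyTorch sketch (our own illustration, not the authors' released code; the module name, feature dimensions, and the use of `nn.MultiheadAttention` are assumptions) shows how a pooled text embedding of the salesperson's speech can serve as the query over per-frame region features, so that regions relevant to the spoken content receive higher weights than background clutter:

```python
import torch
import torch.nn as nn

class TextGuidedAttention(nn.Module):
    """Illustrative sketch of text-guided attention: a pooled text embedding
    (e.g., from ASR transcripts or product titles) queries per-frame region
    features, up-weighting regions related to the spoken content."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_emb: torch.Tensor, region_feats: torch.Tensor):
        # text_emb:     (B, D)      pooled text embedding
        # region_feats: (B, T*R, D) region features flattened over T frames, R regions
        query = text_emb.unsqueeze(1)                          # (B, 1, D)
        attended, weights = self.attn(query, region_feats, region_feats)
        return attended.squeeze(1), weights                    # text-conditioned visual feature

# Toy usage with random tensors (shapes are hypothetical).
B, T, R, D = 2, 8, 10, 256
video_feat, attn_w = TextGuidedAttention(D)(torch.randn(B, D), torch.randn(B, T * R, D))
print(video_feat.shape)  # torch.Size([2, 256])
```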
Primary Subject Area: [Engagement] Multimedia Search and Recommendation
Secondary Subject Area: [Content] Multimodal Fusion, [Experience] Multimedia Applications, [Content] Vision and Language
Relevance To Conference: This work contributes to multimedia and multimodal processing by addressing the challenging task of multi-modal Livestreaming Product Retrieval (LPR) through a novel one-stage spatiotemporal graphing network. By integrating modality-level video-image-text embeddings, instance-level similarity mining, and frame-level graph learning, the proposed method enhances fine-grained discrimination in identifying products showcased in livestreaming videos. It effectively mitigates the heterogeneity between videos and images, a common challenge in multimodal processing. Furthermore, the model tracks spatial deformations and accurately locates intended products, even in complex livestreaming environments. Leveraging textual information from live ASR transcripts and product titles, it overcomes interference from cluttered backgrounds, demonstrating the synergy between visual and textual modalities. Extensive experiments show that the model delivers both fine-grained attention and global spatiotemporal awareness, advancing multimodal processing for real-world applications. All models and code will be open-sourced soon.
Supplementary Material: zip
Submission Number: 97