Abstract: Complaints are pivotal expressions within e-commerce communication, yet the intricate nuances of human interaction present formidable challenges for AI agents to grasp comprehensively. While recent attention has been drawn to analyzing complaints within a multimodal context, relying solely on text and images is insufficient for organizations. The true value lies in the ability to pinpoint complaints within the intricate structures of discourse, scrutinizing them at a granular aspect level. Our research delves into the discourse structure of e-commerce video-based product reviews, pioneering a novel task we term Aspect-Level Complaint Detection from Discourse (ACDD). Embedded in a multimodal framework, this task entails identifying aspect categories and assigning complaint/non-complaint labels at a nuanced aspect level. To facilitate this endeavour, we have curated a unique multimodal product review dataset, meticulously annotated at the utterance level with aspect categories and associated complaint labels.
To support this undertaking, we introduce a Multimodal Aspect-Aware Complaint Analysis (MAACA) model that incorporates a novel pre-training strategy and a global feature fusion technique across the three modalities. Additionally, the proposed framework leverages a moment retrieval step to identify the relevant portion of the clip, crucial for accurately detecting the fine-grained aspect categories and conducting aspect-level complaint detection. Extensive experiments conducted on the proposed dataset showcase that our framework outperforms unimodal and bimodal baselines, offering valuable insights into the application of video-audio-text representation learning frameworks for downstream tasks.
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: Our work on multimodal aspect-level complaint detection in e-commerce videos contributes significantly to the field of multimedia/multimodal processing, which is highly relevant to this conference. By introducing a manually annotated video dataset with aspect categories and associated complaint/non-complaint labels, we provide a valuable resource for researchers and practitioners interested in understanding and addressing consumer grievances in the digital realm. Additionally, our proposed framework is capable of handling the video, text, and audio modalities simultaneously, representing a novel advancement in multimodal processing. The moment retrieval step integrated into our model enhances its efficiency in identifying relevant clips for accurately classifying aspect categories and complaint/non-complaint labels. This interdisciplinary approach not only extends the boundaries of multimedia research but also holds practical implications for improving customer satisfaction and experience in e-commerce settings.
Supplementary Material: zip
Submission Number: 4906