An efficient utterance-level context-aware fusion architecture for large-scale audio-text sentiment analysis
Abstract: Next-generation Intelligent Information Systems (IIS), such as large-scale customer service platforms, depend critically on understanding human sentiment in massive, unstructured multimedia data streams. Processing this data in real time poses a significant computational challenge and calls for architectures that are both accurate and scalable: training such deep models on extensive datasets inherently requires high-performance computing (HPC) resources, while deploying them for low-latency inference at scale relies on parallel processing. To address this need, we propose UL-CAFNet (Utterance-Level Context-Aware Fusion Network). Our framework introduces an Utterance-Level Contextual Attention module that integrates attention mechanisms with convolutional layers, modeling global context and local structural patterns simultaneously. We further develop a deep cross-modal fusion mechanism with multi-layer iterative refinement, designed to align audio and text representations robustly. Experiments on the CMU-MOSI and CMU-MOSEI datasets show that our framework achieves new state-of-the-art performance, and comprehensive ablation studies and robustness analyses further validate the effectiveness of our design.
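The abstract does not give implementation details of the proposed modules, so the following is only a toy, weight-free sketch in plain Python of the two ideas it names: a context block that combines global self-attention with a local convolution over utterance vectors, and an iterative cross-modal refinement loop between audio and text streams. All function names, the fixed 3-tap kernel, and the averaging-based fusion rule are illustrative assumptions, not the paper's actual method.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(seq):
    # Global context: each utterance vector becomes a softmax-weighted
    # mix of all utterance vectors (dot-product attention, no learned weights).
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in seq]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, seq)) for j in range(d)])
    return out

def local_conv(seq, kernel=(0.25, 0.5, 0.25)):
    # Local structure: a fixed 3-tap convolution over neighbouring
    # utterances, with zero padding at the sequence ends.
    d = len(seq[0])
    pad = [[0.0] * d] + [list(v) for v in seq] + [[0.0] * d]
    return [[sum(kernel[k] * pad[i + k][j] for k in range(3)) for j in range(d)]
            for i in range(len(seq))]

def context_block(seq):
    # Attention and convolution combined by elementwise sum; a stand-in
    # for the Utterance-Level Contextual Attention module (assumed design).
    attn, conv = self_attention(seq), local_conv(seq)
    return [[a + c for a, c in zip(av, cv)] for av, cv in zip(attn, conv)]

def fuse(audio, text, iters=2):
    # Multi-layer iterative refinement (illustrative): each modality moves
    # halfway toward the shared cross-modal average, then passes through
    # the context block again; repeated for a fixed number of rounds.
    for _ in range(iters):
        mixed = [[0.5 * (a + t) for a, t in zip(av, tv)]
                 for av, tv in zip(audio, text)]
        audio = context_block([[0.5 * (a + m) for a, m in zip(av, mv)]
                               for av, mv in zip(audio, mixed)])
        text = context_block([[0.5 * (t + m) for t, m in zip(tv, mv)]
                              for tv, mv in zip(text, mixed)])
    return [[0.5 * (a + t) for a, t in zip(av, tv)]
            for av, tv in zip(audio, text)]
```

In practice the attention, convolution, and fusion steps would use learned projection weights and run on HPC hardware as the abstract describes; this sketch only shows how global (attention) and local (convolution) utterance context can be computed and iteratively shared across modalities.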
External IDs: dblp:journals/tjs/WeiC25