Keywords: Concept Representation, Multimodal Reasoning, Multimodal Intent Recognition
TL;DR: We introduce ConMR, a novel framework that advances multimodal intent recognition through concept-level reasoning, significantly enhancing both discriminative performance and interpretability.
Abstract: Multimodal intent recognition is a fundamental task in understanding human communication: it aims to infer intent from heterogeneous modalities and underpins the development of human-centric systems. However, existing methods face two key challenges. First, they rely on entangled, modality-specific features, which hinder the derivation of interpretable representations across modalities. Second, they lack explicit reasoning mechanisms, making it difficult to capture high-level semantic dependencies and to systematically link multimodal evidence to complex intents. To address these issues, we propose ConMR, a novel method that performs concept-level multimodal reasoning by jointly learning semantic concept representations and modeling concept relations. Specifically, we first use a large language model (LLM) to generate high-quality intent-related concepts, which provide explicit semantic anchors beyond shallow features. Used to supervise multimodal feature mapping through activation alignment, these concepts yield interpretable and discriminative representations. Building on this foundation, the concept-level multimodal reasoning module models concept-to-intent relations through LLM-guided relevance scores and infers inter-concept relations from activation patterns. By jointly exploiting these relations, it produces transparent reasoning paths from concepts to intents, thereby improving both accuracy and interpretability. Extensive experiments on two challenging datasets show that ConMR outperforms state-of-the-art methods while offering better robustness and interpretability, establishing a new paradigm for multimodal intent recognition.
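The concept-to-intent pipeline described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the abstract does not specify the activation function, how inter-concept relations are inferred, or how the relations are combined. The sketch assumes cosine similarity for concept activations, a row-normalized co-activation matrix for inter-concept relations, and a fixed relevance matrix standing in for the LLM-guided concept-to-intent scores. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def concept_activations(feature, concept_embs):
    """Activation of each concept for one fused multimodal feature.

    Assumption: activation = non-negative cosine similarity between the
    feature and each concept embedding.
    """
    return [max(0.0, cosine(feature, c)) for c in concept_embs]


def concept_relations(batch_acts):
    """Inter-concept relations inferred from activation patterns.

    Assumption: row-normalized co-activation counts over a batch.
    """
    k = len(batch_acts[0])
    co = [[sum(a[i] * a[j] for a in batch_acts) for j in range(k)]
          for i in range(k)]
    rel = []
    for row in co:
        total = sum(row)
        rel.append([x / total if total else 0.0 for x in row])
    return rel


def intent_scores(acts, rel, relevance):
    """Combine both relation types into intent scores.

    `relevance[m][i]` is the (assumed given) relevance of concept i to
    intent m. Returns the scores and the per-concept contributions, which
    serve as a readable concept-to-intent reasoning path.
    """
    k = len(acts)
    # One propagation step over the inter-concept relation matrix.
    refined = [sum(rel[i][j] * acts[j] for j in range(k)) for i in range(k)]
    paths = [[w * r for w, r in zip(row, refined)] for row in relevance]
    scores = [sum(p) for p in paths]
    return scores, paths
```

Given a feature vector, a list of concept embeddings, and a relevance matrix, the predicted intent is the index of the largest score, and the matching row of `paths` shows how much each concept contributed to it.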
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8394