Feature Contributions to Multimodal Interpretation of Common Ground

Published: 01 Jan 2025, Last Modified: 18 Jul 2025, HCI (6) 2025, CC BY-SA 4.0
Abstract: Large Language Models are excellent at processing and extracting semantic information from text. However, understanding the meaning of a real-world interaction often requires integrating additional modalities, such as gestures, body language, and other non-verbal cues. Here we explore the difficulties that arise when integrating real-time multimodal processing into AI systems, and we highlight the disparity between human communication, which seamlessly incorporates multiple modalities, and the current limitations of AI. In this paper, we examine existing work, identify its weaknesses, and propose novel methods that aim to enhance the real-time integration of multimodal data. Our results indicate that improving AI systems' ability to process multimodal information leads to clear advances in their comprehension of dynamic and situated environments.
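The abstract does not specify an architecture, so as one illustration of what "integrating additional modalities" with text can mean in practice, below is a minimal late-fusion sketch in Python/PyTorch. All names and dimensions here (LateFusionClassifier, gesture_feat, text_dim=768, etc.) are hypothetical assumptions for the example, not details taken from the paper.

```python
import torch
import torch.nn as nn


class LateFusionClassifier(nn.Module):
    """Toy late-fusion model: concatenates a text embedding with a
    gesture feature vector, then classifies the fused representation.
    Purely illustrative; the paper does not describe this model."""

    def __init__(self, text_dim: int, gesture_dim: int,
                 hidden: int, n_classes: int) -> None:
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + gesture_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, text_emb: torch.Tensor,
                gesture_feat: torch.Tensor) -> torch.Tensor:
        # Fuse by simple concatenation along the feature dimension.
        return self.fuse(torch.cat([text_emb, gesture_feat], dim=-1))


# Stand-in batch of 8 utterances with random features (hypothetical sizes):
# text embeddings as an LLM might produce, plus per-utterance pose/gesture
# descriptors extracted from video.
model = LateFusionClassifier(text_dim=768, gesture_dim=32,
                             hidden=128, n_classes=4)
text_emb = torch.randn(8, 768)
gesture_feat = torch.randn(8, 32)
logits = model(text_emb, gesture_feat)
print(logits.shape)  # torch.Size([8, 4])
```

Concatenation-based late fusion is only the simplest option; the same interface could hide cross-modal attention or time-aligned streaming inputs, which is closer to the real-time setting the abstract describes.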