Leveraging Text Representation and Face-Head Tracking for Long-Form Multimodal Semantic Relation Understanding
Abstract: In the intricate problem of understanding long-form multimodal inputs, a few key aspects of scene understanding and dialogue-and-discourse analysis are often overlooked. In this paper, we investigate two such aspects for better semantic and relational understanding: (i) head tracking in addition to the usual face tracking, and (ii) fusing scene-to-text representations with an external commonsense knowledge base for effective mapping to the sub-tasks of interest. Head tracking in particular helps enrich sparse entity mappings for inter-entity conversational interactions. These methods are guided by natural-language supervision on visual models and perform well on interaction- and sentiment-understanding tasks.
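To make the head-tracking idea concrete, below is a minimal, self-contained Python sketch of how head boxes could backfill frames where face detection fails, densifying an entity's track. The box format, the IoU threshold, and the merge heuristic are illustrative assumptions on our part, not the paper's actual implementation.

```python
# Illustrative sketch (not the authors' code): head boxes fill in frames where
# face detection fails, keeping an entity's track dense for downstream
# interaction mapping. Boxes are assumed to be (x1, y1, x2, y2) tuples.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def merge_tracks(face_dets, head_dets, iou_thresh=0.3):
    """Per frame, prefer the face box; when no face is detected (e.g., the
    subject looks away), fall back to the head box if it overlaps the last
    known box, so the entity track stays dense. Threshold is an assumption."""
    merged = dict(face_dets)
    last = None
    for t in sorted(set(face_dets) | set(head_dets)):
        if t in merged:
            last = merged[t]
        elif t in head_dets and (last is None or iou(head_dets[t], last) >= iou_thresh):
            merged[t] = head_dets[t]  # head box backfills the face gap
            last = merged[t]
    return merged

# Toy example: the face is missed at frame 2, but the head box covers the gap.
faces = {0: (10, 10, 50, 50), 1: (12, 10, 52, 50)}
heads = {1: (10, 5, 55, 55), 2: (14, 6, 58, 56)}
print(merge_tracks(faces, heads))  # frames 0, 1, and 2 all carry a box
```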