Leveraging Text Representation and Face-Head Tracking for Long-Form Multimodal Semantic Relation Understanding
Abstract: In the intricate problem of understanding long-form multimodal inputs, a few key aspects of scene understanding and dialogue-and-discourse analysis are often overlooked. In this paper, we investigate two such aspects for better semantic and relational understanding: (i) head tracking in addition to the usual face tracking, and (ii) fusing scene-to-text representations with an external commonsense knowledge base for effective mapping to the sub-tasks of interest. Head tracking in particular helps enrich sparse entity mappings for inter-entity conversational interactions. These methods are guided by natural-language supervision on visual models and perform well on interaction- and sentiment-understanding tasks.
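To make the head-tracking idea concrete, below is a minimal, self-contained Python sketch of how head boxes could backfill frames where face detection fails, densifying an entity's track. The box format, the IoU threshold, and the merge heuristic are illustrative assumptions on our part, not the paper's actual implementation.

```python
# Illustrative sketch (not the authors' code): head boxes fill in frames where
# face detection fails, keeping an entity's track dense for downstream
# interaction mapping. Boxes are assumed to be (x1, y1, x2, y2) tuples.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def merge_tracks(face_dets, head_dets, iou_thresh=0.3):
    """Per frame, prefer the face box; when no face is detected (e.g., the
    subject looks away), fall back to the head box if it overlaps the last
    known box, so the entity track stays dense. Threshold is an assumption."""
    merged = dict(face_dets)
    last = None
    for t in sorted(set(face_dets) | set(head_dets)):
        if t in merged:
            last = merged[t]
        elif t in head_dets and (last is None or iou(head_dets[t], last) >= iou_thresh):
            merged[t] = head_dets[t]  # head box backfills the face gap
            last = merged[t]
    return merged

# Toy example: the face is missed at frame 2, but the head box covers the gap.
faces = {0: (10, 10, 50, 50), 1: (12, 10, 52, 50)}
heads = {1: (10, 5, 55, 55), 2: (14, 6, 58, 56)}
print(merge_tracks(faces, heads))  # frames 0, 1, and 2 all carry a box
```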