Towards context and domain-aware algorithms for scene analysis

TMLR Paper2507 Authors

11 Apr 2024 (modified: 29 Oct 2024)Under review for TMLREveryoneRevisionsBibTeXCC BY 4.0
Abstract: Interpersonal interactions and social situations in multimedia content encompass a rich blend of visual, textual, audio and contextual cues as well. However, contextual data integration in multimodal scene analysis research has often been overlooked, leading to incomplete interpretations. For instance, recognizing that two combatants in a video are positioned within a designated ring with a dedicated referee drastically alters the perception from a simple scuffle to a structured martial arts contest. This paper presents an innovative approach to scene analysis in video content, which not only incorporates contextual data but also emphasizes the most significant features during training. Additionally, we introduce a methodology for integrating domain knowledge into our framework. We evaluate our proposed methodology using two comprehensive datasets, demonstrating promising results compared to a baseline study using one of the datasets. These findings underscore the importance of integrating contextual data into multimodal video analysis, while also recognizing the challenges associated with their utilization.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Authors are registered on OpenReview.
Assigned Action Editor: ~Ozan_Sener1
Submission Number: 2507
Loading