Abstract: Multi-modal multi-party dialogue understanding is an understudied yet important research topic, as it closely matches real-world scenarios and therefore has the potential for widely applicable systems. In this paper, we focus on an important prerequisite for understanding multi-modal multi-party dialogues: knowing who is speaking. To this end, we propose a new task, Multi-modal Multi-party Speaker Identification (MMSI), in which a system must identify the speaker of each utterance given the dialogue content and the corresponding visual context within a session. We construct Friends-MMSI, the first dataset for MMSI, containing over 24,000 unique utterances collected from the TV series Friends, annotated with speakers and with faces in the corresponding frames. We also propose a simple yet effective baseline method for MMSI; the results indicate that the proposed task and benchmark remain challenging, and our analysis provides insights for better understanding the task. The code and dataset will be publicly available.
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.