MMLSCU: A Dataset for Multi-modal Multi-domain Live Streaming Comment Understanding

Zixiang Meng; Qiang Gao; Di Guo; Yunlong Li; Bobo Li; Hao Fei; Shengqiong Wu; Fei Li; Chong Teng; Donghong Ji

Published: 23 Jan 2024, Last Modified: 23 May 2024, TheWebConf24 Oral

Keywords: Live Streaming, Multi-modal, Dataset, Comment Understanding, Chain of Thought

TL;DR: We introduce the MMLSCU dataset of 50,129 annotated multi-modal comments from live streaming. Using this dataset and a CoT framework, we improve intent detection and provide insightful recommendations for streamers.

Abstract: With the increasing popularity of live streaming, viewer interactions during a live stream can provide specific and constructive feedback for both the streamer and the platform. In this scenario, comments are the audience's primary and most direct feedback channel. Mining these comments to uncover the intentions behind them, and in turn helping streamers improve the quality of their streams, is therefore important for the healthy development of the live streaming ecosystem. To this end, we introduce the MMLSCU dataset, which contains 50,129 intention-annotated comments across multiple modalities (text, images, video, audio) from eight streaming categories. Using a multi-modal pretrained large model and drawing inspiration from the Chain of Thought (CoT) concept, we implement an end-to-end model that performs the following tasks in sequence: viewer comment intent detection > intent cause mining > viewer comment explanation > streamer policy suggestion. We employ separate branches to process the video and audio modalities. After obtaining the video and audio representations, we fuse them with the comment. The fused representation is then fed into the large language model, which is trained on the four tasks under the CoT framework. Experimental results indicate that our intent detection accuracy can reach 73%, and the other tasks also yield effective results. Compared with models that use only the text modality, our multi-modal model achieves superior results. Moreover, incorporating CoT allows our model to improve comment interpretation and give more precise suggestions to streamers. Our proposed dataset and model will bring new research attention to multi-modal live streaming comment understanding.
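
The sequential task order described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`build_prompt`, `run_chain`, `echo_model`) are hypothetical, the `context` string is only an assumed textual stand-in for the fused video and audio representation, and the model is a placeholder callable rather than a real multi-modal language model. The sketch shows only the chaining idea, in which each task sees the outputs of the tasks before it.

```python
from typing import Callable, Dict, List

# Task order as stated in the abstract.
TASKS: List[str] = [
    "intent detection",
    "intent cause mining",
    "comment explanation",
    "streamer policy suggestion",
]


def build_prompt(task: str, comment: str, context: str,
                 history: Dict[str, str]) -> str:
    """Compose a prompt for one task, including all earlier task outputs."""
    lines = [f"Stream context: {context}", f"Viewer comment: {comment}"]
    for prev_task, answer in history.items():
        lines.append(f"{prev_task}: {answer}")
    lines.append(f"Task: {task}")
    return "\n".join(lines)


def run_chain(comment: str, context: str,
              model: Callable[[str], str]) -> Dict[str, str]:
    """Run the four tasks in order, feeding each answer into later prompts."""
    history: Dict[str, str] = {}
    for task in TASKS:
        prompt = build_prompt(task, comment, context, history)
        history[task] = model(prompt)
    return history


def echo_model(prompt: str) -> str:
    """Placeholder model (assumption): reports which task it was asked to do."""
    task = prompt.rsplit("Task: ", 1)[1]
    return f"<answer for {task}>"


if __name__ == "__main__":
    result = run_chain("The audio is too quiet", "gaming stream", echo_model)
    for task, answer in result.items():
        print(f"{task}: {answer}")
```

Replacing `echo_model` with a call to an actual model would leave the chaining logic unchanged; how the real system encodes and fuses video and audio is not specified here beyond what the abstract states.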

Track: Web Mining and Content Analysis

Submission Guidelines Scope: Yes

Submission Guidelines Blind: Yes

Submission Guidelines Format: Yes

Submission Guidelines Limit: Yes

Submission Guidelines Authorship: Yes

Student Author: Yes

Submission Number: 2166
