MMLSCU: A Dataset for Multi-modal Multi-domain Live Streaming Comment Understanding

Published: 13 May 2024 · Last Modified: 15 Aug 2024 · OpenReview Archive Direct Upload · CC BY 4.0
Abstract: With the increasing popularity of live streaming, interactions from viewers during a live stream can provide specific and constructive feedback for both the streamer and the platform. In this scenario, the primary and most direct form of audience feedback is comments. Mining these live streaming comments to uncover the intentions behind them, and in turn helping streamers improve their live streaming quality, is therefore important for the healthy development of the live streaming ecosystem. To this end, we introduce the MMLSCU dataset, containing 50,129 intention-annotated comments across multiple modalities (text, images, videos, audio) from eight streaming categories. Using a multi-modal pretrained large model and drawing inspiration from the Chain of Thought (CoT) concept, we implement an end-to-end model that sequentially performs the following tasks: viewer comment intent detection > intent cause mining > viewer comment explanation > streamer policy suggestion. We employ distinct branches for video and audio to process their respective modalities. After obtaining the video and audio representations, we perform multimodal fusion with the comment. The integrated representation is then fed into the large language model for training across the four tasks, leveraging the CoT framework. Experimental results show that our intent detection accuracy reaches 73%, and the other tasks also achieve effective results. Compared with models using only the text modality, our model employing multi-modal data yields superior outcomes. Moreover, incorporating CoT enables our model to produce better comment interpretations and more precise suggestions for streamers. Our proposed dataset and model will bring new research attention to multi-modal live streaming comment understanding.
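To make the described architecture concrete, the sketch below illustrates one way the pipeline could be wired up: separate video and audio branches, fusion with the comment representation, and a projected output intended as input to the language model for the four chained tasks. This is a minimal illustration under assumed module choices and dimensions (the encoders, feature sizes, and attention-based fusion are placeholders, not the authors' exact implementation).

```python
import torch
import torch.nn as nn

class MultiModalCommentModel(nn.Module):
    """Illustrative sketch: distinct video/audio branches, fusion with the
    comment, and a projection whose output would be passed to an LLM that is
    prompted sequentially for intent detection > cause mining > explanation >
    policy suggestion (CoT-style). All names and sizes are assumptions."""

    def __init__(self, d_model: int = 768):
        super().__init__()
        # Per-modality branches (stand-ins for pretrained encoders).
        self.video_branch = nn.Linear(1024, d_model)  # frame features -> d_model
        self.audio_branch = nn.Linear(512, d_model)   # audio features -> d_model
        self.text_branch = nn.Linear(768, d_model)    # comment embedding -> d_model
        # Simple cross-attention fusion: comment tokens attend over video+audio.
        self.fusion = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, video_feats, audio_feats, comment_feats):
        v = self.video_branch(video_feats)    # (B, Tv, d)
        a = self.audio_branch(audio_feats)    # (B, Ta, d)
        t = self.text_branch(comment_feats)   # (B, Tt, d)
        context = torch.cat([v, a], dim=1)    # concatenated video+audio tokens
        fused, _ = self.fusion(query=t, key=context, value=context)
        # The fused tokens would serve as soft prompts prepended to the LLM
        # input for the four chained tasks.
        return self.proj(fused)

# Example forward pass with random features for a batch of 2 comments.
model = MultiModalCommentModel()
out = model(torch.randn(2, 16, 1024), torch.randn(2, 32, 512), torch.randn(2, 20, 768))
print(out.shape)  # torch.Size([2, 20, 768])
```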