Understanding Human Preferences: Towards More Personalized Video to Text Generation

Yihan Wu; Ruihua Song; Xu Chen; Hao Jiang; Zhao Cao; Jin Yu

Understanding Human Preferences: Towards More Personalized Video to Text Generation

Yihan Wu, Ruihua Song, Xu Chen, Hao Jiang, Zhao Cao, Jin Yu

Published: 23 Jan 2024, Last Modified: 23 May 2024TheWebConf24EveryoneRevisionsBibTeX

Keywords: video to text generation, user preference modeling, video comments dataset, personalized content generation, multimodal interaction

Abstract: While previous video to text models have achieved remarkable successes, they mostly focus on how to understand the video contents in a general sense, but fail to capture the human personalized preferences, which is highly demanded for an engaging multimodal chatbots. Different from user modeling in collaborative filtering, there is no other user behaviors in inference as a real-time video stream is coming. In this paper, we formally define the task of personalized video commenting task and design an end-to-end personalized framework for solving this task. In specific, we argue that the personalization for video comment generation can be reflected in two aspects, that is, (1) for the same video, different users may comment on different clips, and (2) for the same clip, different people may also express various opinions with diverse commentary styles. Motivated by these considerations, we design our framework based on two components. The first one is a clip selector, which is responsible for predicting the clips that the user may comment in the video. The second one is a text generator, which aims to produce the comment based on the above predicted clips and the user's preference. In our framework, these two components are optimized in an end-to-end manner to mutually enhance each other, where we design confidence-aware scheduled sampling and iterative inference strategies to solve the problem that the ground truth clips are absent in the inference phase. As the absence of personalized video to text dataset, we collect and release a new dataset for studying this problem. We conduct extensive experiments to demonstrate the effectiveness of our model.

Track: User Modeling and Recommendation

Submission Guidelines Scope: Yes

Submission Guidelines Blind: Yes

Submission Guidelines Format: Yes

Submission Guidelines Limit: Yes

Submission Guidelines Authorship: Yes

Student Author: Yes

Submission Number: 2492

Loading