Calibrating Video Watch-time Predictions with Credible Prototype Alignment

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Accurately predicting user watch-time is crucial for enhancing user stickiness and retention in video recommendation systems. Existing watch-time prediction approaches typically transform watch-time labels for prediction and then reverse the transformation, ignoring both the natural distribution properties of the labels and the "instance representation confusion" that leads to inaccurate predictions. In this paper, we propose ProWTP, a two-stage method combining prototype learning and optimal transport for watch-time regression, suitable for any deep recommendation model. Specifically, we observe that the watch-ratio (the ratio of watch-time to video duration) within the same duration bucket exhibits a multimodal distribution. To facilitate incorporation into models, we use a hierarchical vector quantised variational autoencoder (HVQ-VAE) to convert the continuous label distribution into a high-dimensional discrete distribution, which serves as a set of credible prototypes for calibration. Building on this, ProWTP casts the alignment between prototypes and instance representations as a Semi-relaxed Unbalanced Optimal Transport (SUOT) problem, in which the marginal constraints on the prototypes are relaxed. The corresponding optimization problem is then reformulated as a weighted Lasso problem and solved accordingly. Moreover, ProWTP introduces assignment and compactness losses that encourage instances to cluster closely around their respective prototypes, thereby enhancing prototype-level distinguishability. Finally, extensive offline experiments on two industrial datasets demonstrate consistent improvements in real-world applications.
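The watch-ratio labels and the prototype-level assignment and compactness terms described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the plain codebook standing in for the HVQ-VAE prototypes, and the hard nearest-prototype assignment are all simplifying assumptions for illustration.

```python
import numpy as np

# Hypothetical sketch: `prototypes` plays the role of the HVQ-VAE codebook;
# all names and the hard-assignment rule are illustrative, not from the paper.

def watch_ratio(watch_time, duration):
    """Watch-ratio label: watch-time normalized by video duration."""
    return watch_time / np.maximum(duration, 1e-8)

def assign_and_compactness(z, prototypes):
    """Nearest-prototype assignment plus a compactness term that pulls
    each instance embedding toward its assigned prototype."""
    # Pairwise distances between N instance embeddings and K prototypes.
    d = np.linalg.norm(z[:, None, :] - prototypes[None, :, :], axis=-1)  # (N, K)
    idx = d.argmin(axis=1)                                # hard assignment
    compactness = (d[np.arange(len(z)), idx] ** 2).mean() # mean squared gap
    return idx, compactness

rng = np.random.default_rng(0)
z = rng.normal(size=(5, 4))        # 5 instance embeddings, dim 4
protos = rng.normal(size=(3, 4))   # 3 prototypes
idx, loss = assign_and_compactness(z, protos)
```

Minimizing the compactness term tightens each cluster around its prototype, which is what gives the prototypes their discriminative, calibration-ready structure.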
Lay Summary: Video platforms like TikTok or YouTube try to predict how long you’ll watch a video, so they can recommend content you’re more likely to enjoy. But this is harder than it seems. People behave differently depending on video length, and existing prediction systems often miss these patterns or become confused when similar behaviors look different in the data. We introduce a new method, called ProWTP, that helps recommendation models better understand and organize viewing behavior. First, we summarize how people watch videos of different lengths into representative behavior patterns — kind of like identifying key “viewing styles.” Then we teach the model to align its internal understanding with these patterns using a technique from mathematics that’s often used to match two sets of things fairly. Our method works with any existing recommendation model and leads to more accurate watch-time predictions. This could help platforms make smarter content suggestions, saving users time and helping creators reach the right audience more effectively.
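The "matching two sets of things fairly" mentioned above is optimal transport. The paper solves its semi-relaxed unbalanced variant via a weighted Lasso reformulation; as a simpler stand-in, the sketch below uses entropic (Sinkhorn-style) scaling to show the same semi-relaxed idea, where the instance marginal is enforced exactly and the prototype marginal is only softly pulled toward its target. All names and parameter values are ours.

```python
import numpy as np

# Illustrative stand-in for SUOT: entropic scaling instead of the paper's
# weighted-Lasso solver. Rows (instances) satisfy their marginal exactly;
# columns (prototypes) are relaxed via a KL penalty with strength `tau`.

def semi_relaxed_ot(C, a, b, eps=0.05, tau=1.0, iters=200):
    """Return a transport plan T >= 0 with T.sum(axis=1) == a exactly
    and T.sum(axis=0) only pulled toward b."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(iters):
        v = (b / (K.T @ u)) ** (tau / (tau + eps))  # soft column marginal
        u = a / (K @ v)                             # hard row marginal
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
C = rng.random((5, 3))        # instance-to-prototype alignment costs
a = np.full(5, 1 / 5)         # instance mass (fixed constraint)
b = np.full(3, 1 / 3)         # prototype mass (relaxed constraint)
T = semi_relaxed_ot(C, a, b)
```

Relaxing the prototype marginal is what lets popular viewing patterns absorb more instances than a balanced matching would allow, which matches the multimodal watch-ratio distributions the paper observes.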
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Applications
Keywords: Prototype learning, optimal transport, recommendation
Submission Number: 13964