Understanding Virality: A Rubric-Based Vision-Language Model Framework for Short-Form Edutainment Evaluation
Keywords: Vision–Language Modeling, Contrastive Supervision, Engagement Modeling, Multimodal Representation Learning, Importance-Weighted Clustering, Explainable Evaluation Metrics, Short-Form Video Understanding, Semantic Rubric Discovery
Abstract: Evaluating short-form video content requires moving beyond surface-level quality metrics toward human-aligned, multimodal reasoning. While existing frameworks assess visual and semantic fidelity, they often fail to capture the specific audiovisual “hooks” that drive real-world audience engagement. In this work, we propose a discovery-driven evaluation framework that identifies the underlying logic of virality in edutainment content. Our approach leverages Vision-Language Models (VLMs) to extract granular semantic evidence, which is then analyzed using a contrastive supervision signal comparing high-performing and low-performing content. By applying importance-weighted k-means clustering, we discover latent, human-interpretable rubric items without predefined categories. We further introduce a saturation-based aggregation function to score videos, preventing “metric gaming.” Experiments demonstrate that our framework achieves a strong Spearman rank correlation (ρ = 0.74) with actual engagement, significantly outperforming traditional quality baselines and providing a scalable, explainable path toward robust video understanding.
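The abstract's two core mechanisms, importance-weighted clustering of rubric evidence and a saturation-based aggregation that curbs "metric gaming," can be illustrated with a minimal sketch. This is not the paper's implementation; `weighted_kmeans`, `saturating_score`, and the saturation rate `alpha` are hypothetical names, and the weighted Lloyd's iteration here stands in for whatever importance-weighted k-means variant the authors use.

```python
import numpy as np

def weighted_kmeans(X, w, k, iters=50, seed=0):
    """Lloyd's k-means in which each point's contribution to the
    centroid update is scaled by an importance weight w[i] (e.g. a
    contrastive high- vs. low-performing signal)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each point to its nearest centre
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # importance-weighted centroid update
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[j] = np.average(X[mask], axis=0, weights=w[mask])
    return labels, centers

def saturating_score(item_scores, alpha=1.0):
    """Diminishing-returns aggregation: each rubric item contributes
    1 - exp(-alpha * s), so inflating any single item cannot dominate
    the total ("metric gaming" is bounded per item)."""
    s = np.asarray(item_scores, dtype=float)
    return float(np.sum(1.0 - np.exp(-alpha * s)))
```

Under this sketch, two rubric items scored (1, 1) outscore one item scored 2, which is the intended anti-gaming behavior of a concave per-item aggregator.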
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality, Vision–Language Modeling, Machine Learning for NLP, Interpretability, Evaluation, Representation Learning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 10008