IPS: In-Prompt Process Supervision for Short Video Content Moderation

Mingchao Liu; Yu Sun; Ruixiao Sun; Xin Dong; Xiang Shen; Hongwei Wang; Hongyu Xiong; Yang Song

Published: 18 Apr 2026, Last Modified: 24 Apr 2026

Venue: ACL 2026 Industry Track Poster

Readers: Everyone

License: CC BY 4.0

Keywords: Video Recommendation, Content Understanding, LLM, Process Supervision, Noise-aware Learning, LoRA, Multi Modality

Abstract: Multimodal large language models (MLLMs) are effective at capturing the semantics of short video content; however, they often fail to attend to the policy-specific details required for reliable content moderation. To address this limitation, we introduce IPS, a novel framework that integrates In-Prompt Process Supervision into MLLMs by having the model reason sequentially over ancillary questions during fine-tuning. IPS consistently outperforms baseline MLLMs on public and proprietary benchmarks. Moreover, replacing human-annotated ancillary labels with MLLM-generated ones results in only marginal performance degradation, demonstrating robustness to noisy supervision and strong scalability with model-generated annotations. These findings establish IPS as a scalable and effective solution for complex multimodal classification in large-scale industrial settings.
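
The abstract does not give the prompt format, so the following is only a minimal, hypothetical sketch of what "sequential reasoning over ancillary questions" could look like as a fine-tuning example. The names `AncillaryQA` and `build_example`, the `Q1`/`A1`/`Final` layout, and the sample policy and questions are all assumptions made for illustration and are not taken from the paper. The idea shown is that the target sequence answers each ancillary question in order before emitting the final moderation label, so the intermediate answers are supervised alongside the outcome.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AncillaryQA:
    """One ancillary question and its label (human- or model-generated)."""
    question: str
    answer: str


def build_example(video_ref: str, policy: str,
                  ancillary: List[AncillaryQA],
                  final_label: str) -> Dict[str, str]:
    """Assemble a prompt/target pair (hypothetical format).

    The prompt lists the ancillary questions in order; the target answers
    them in the same order and ends with the final decision.
    """
    prompt_lines = [
        f"Video: {video_ref}",
        f"Policy: {policy}",
        "Answer each question in order, then give the final decision.",
    ]
    for i, qa in enumerate(ancillary, start=1):
        prompt_lines.append(f"Q{i}: {qa.question}")

    # Intermediate answers come first, the final label last.
    target_lines = [f"A{i}: {qa.answer}"
                    for i, qa in enumerate(ancillary, start=1)]
    target_lines.append(f"Final: {final_label}")

    return {"prompt": "\n".join(prompt_lines),
            "target": "\n".join(target_lines)}


example = build_example(
    video_ref="<video>",                      # placeholder for video input
    policy="No depiction of weapons",         # illustrative policy only
    ancillary=[
        AncillaryQA("Is a weapon visible?", "no"),
        AncillaryQA("Is violence depicted?", "no"),
    ],
    final_label="allow",
)
print(example["prompt"])
print(example["target"])
```

Under this assumed layout, swapping human-annotated ancillary labels for MLLM-generated ones only changes where the `answer` strings come from; the example structure stays the same.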

Submission Type: Deployed

Copyright Form: pdf

Submission Number: 302
