Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling

ACL ARR 2026 January Submission8462 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Multi-modal Language Models, Self-evolution, Unsupervised Reinforcement Learning
Abstract: In the unsupervised self-evolution of Multimodal Large Language Models (MLLMs), the quality of feedback signals during post-training is pivotal for stable and effective learning. However, existing self-evolution methods predominantly rely on majority voting to select the most frequent output as the pseudo-golden answer, which may merely reflect the model’s inherent preferences rather than the objective correctness of the reasoning paths. To counteract this degradation, we propose $\textbf{C}$ontinuous $\textbf{S}$oftened $\textbf{R}$etracing re$\textbf{S}$ampling ($\textbf{CSRS}$) for MLLM self-evolution. Specifically, we introduce a Retracing Resampling Mechanism ($\textbf{RRM}$) that re-infers from anchor points to expand the exploration of long-tail reasoning paths. Simultaneously, we propose the Softened Frequency Reward ($\textbf{SFR}$), which replaces binary rewards with continuous signals, calibrating the reward based on each answer's frequency across sampled reasoning sets. Furthermore, combined with Visual Semantic Perturbation ($\textbf{VSP}$), CSRS ensures the model prioritizes mathematical logic over visual superficiality. Experimental results demonstrate that CSRS significantly enhances the reasoning performance of Qwen2.5-VL-7B on benchmarks such as MathVision. We achieve state-of-the-art (SOTA) results in unsupervised self-evolution.
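The core idea of the Softened Frequency Reward can be illustrated with a minimal sketch. This is a hypothetical reconstruction based only on the abstract's description (the paper's actual reward formula, normalization, and any temperature/calibration terms are not given here): instead of a binary majority-vote reward, each sampled answer receives a continuous reward proportional to its frequency in the sampled reasoning set.

```python
from collections import Counter

def softened_frequency_reward(sampled_answers):
    """Sketch of a frequency-calibrated continuous reward.

    Given the final answers from a set of sampled reasoning paths,
    return a reward for each distinct answer equal to its relative
    frequency, rather than assigning 1 to the majority answer and
    0 to all others (the binary majority-vote baseline).
    """
    counts = Counter(sampled_answers)
    n = len(sampled_answers)
    return {answer: count / n for answer, count in counts.items()}

# Five sampled reasoning paths; "4" is the majority answer.
rewards = softened_frequency_reward(["4", "4", "5", "4", "6"])
# Majority vote would give "4" -> 1.0 and the rest -> 0.0;
# the softened reward gives "4" -> 0.6, "5" -> 0.2, "6" -> 0.2.
```

Under this sketch, low-frequency (long-tail) answers still receive a nonzero learning signal, which is consistent with the abstract's motivation of avoiding collapse onto the model's inherent preferences.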
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: multi-modal dialogue systems, commonsense reasoning
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 8462