VENIS: Vision-centric Enhancement via Noise-Injection and Self-distillation for Multimodal Instruction Tuning
Keywords: MLLM, Multimodal Instruction Tuning, Vision-Centric, Noise-Injection, Self-Distillation
Abstract: Multimodal Large Language Models (MLLMs) have shown great promise, but their instruction tuning faces a critical challenge: models attend insufficiently to visual information and instead prioritize learning textual content. This vision-deficient tendency limits performance gains, weakens generalization across scenarios, and frequently produces hallucinated content that contradicts the visual facts. Existing remedies, such as expanding datasets or scaling architectures, incur high costs with diminishing returns. This work introduces VENIS (Vision-centric Enhancement via Noise-Injection and Self-distillation), a lightweight framework that weakens textual priors by injecting random noise into instruction-response embeddings, forcing the model to ground its answers in visual information; self-distillation then strengthens visual understanding while recovering textual knowledge. Experiments on LLaVA-v1.5-7B and InternVL3-8B demonstrate consistent improvements across benchmarks. For LLaVA-v1.5-7B, improvements include MMBench (+2.3%), MMVP (+7.4%), MMMU (+1.7%), and OCRBench (+1.6%). For InternVL3-8B, gains cover MMBench (+1.0%), MMMU (+3.1%), OCRBench (+4.6%), and HallusionBench (+1.7%). VENIS requires no additional data, annotations, or model modifications, offering an efficient approach to advancing multimodal instruction tuning.
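To make the two mechanisms concrete, the following is a minimal PyTorch-style sketch of how noise injection and self-distillation could be combined in one training step. It is an illustration under stated assumptions, not the authors' released implementation: the Hugging Face-style `inputs_embeds` interface and all names and hyperparameters (`inject_noise`, `venis_step`, `sigma`, `tau`, `alpha`) are hypothetical.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a VENIS-style training step (assumptions labeled below).

def inject_noise(text_embeds: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Add zero-mean Gaussian noise to instruction-response token embeddings,
    weakening textual priors so answers must be grounded in the image."""
    return text_embeds + sigma * torch.randn_like(text_embeds)

def venis_step(model, image_embeds, text_embeds, labels,
               sigma: float = 0.1, tau: float = 2.0, alpha: float = 0.5):
    # Student pass: perturb only the textual embeddings; visual ones stay intact.
    noisy_text = inject_noise(text_embeds, sigma)
    student_logits = model(
        inputs_embeds=torch.cat([image_embeds, noisy_text], dim=1)
    ).logits

    # Teacher pass: the same model on clean inputs, no gradient (self-distillation).
    with torch.no_grad():
        teacher_logits = model(
            inputs_embeds=torch.cat([image_embeds, text_embeds], dim=1)
        ).logits

    # Next-token loss on the noisy (vision-grounded) pass. Labels are assumed
    # already shifted/aligned, with image positions masked out via -100.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # Temperature-scaled KL distillation from the clean pass recovers the
    # textual knowledge that the noise would otherwise degrade.
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2

    return ce + alpha * kd
```

Under these assumptions, no extra data or model changes are needed: the teacher is the model itself on clean inputs, and the whole step reduces to one added noise draw and one distillation term.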
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15410