VENIS: Vision-centric Enhancement via Noise-Injection and Self-distillation for Multimodal Instruction Tuning
Keywords: MLLM, Multimodal Instruction Tuning, Vision-Centric, Noise-Injection, Self-Distillation
Abstract: Multimodal Large Language Models (MLLMs) have shown great promise, but their instruction tuning faces a critical challenge: models attend insufficiently to visual information and instead prioritize learning textual content. This vision-deficient tendency limits performance gains, weakens generalization across scenarios, and frequently produces hallucinated content that contradicts the visual facts. Existing remedies, such as expanding datasets or scaling architectures, incur high costs with diminishing returns. This work introduces VENIS (Vision-centric Enhancement via Noise-Injection and Self-distillation), a lightweight framework that weakens textual priors by injecting random noise into instruction-response embeddings, forcing the model to ground its answers in visual information; self-distillation then strengthens visual understanding while recovering textual knowledge. Experiments on LLaVA-v1.5-7B and InternVL3-8B demonstrate consistent improvements across benchmarks. For LLaVA-v1.5-7B, improvements include MMBench (+2.3%), MMVP (+7.4%), MMMU (+1.7%), and OCRBench (+1.6%). For InternVL3-8B, gains cover MMBench (+1.0%), MMMU (+3.1%), OCRBench (+4.6%), and HallusionBench (+1.7%). VENIS requires no additional data, annotations, or model modifications, offering an efficient approach to advancing multimodal instruction tuning.
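To make the two mechanisms concrete, the following is a minimal PyTorch-style sketch of how noise injection and self-distillation could be combined in one training step. It is an illustration under stated assumptions, not the authors' released implementation: the Hugging Face-style `inputs_embeds` interface and all names and hyperparameters (`inject_noise`, `venis_step`, `sigma`, `tau`, `alpha`) are hypothetical.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a VENIS-style training step (assumptions labeled below).

def inject_noise(text_embeds: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Add zero-mean Gaussian noise to instruction-response token embeddings,
    weakening textual priors so answers must be grounded in the image."""
    return text_embeds + sigma * torch.randn_like(text_embeds)

def venis_step(model, image_embeds, text_embeds, labels,
               sigma: float = 0.1, tau: float = 2.0, alpha: float = 0.5):
    # Student pass: perturb only the textual embeddings; visual ones stay intact.
    noisy_text = inject_noise(text_embeds, sigma)
    student_logits = model(
        inputs_embeds=torch.cat([image_embeds, noisy_text], dim=1)
    ).logits

    # Teacher pass: the same model on clean inputs, no gradient (self-distillation).
    with torch.no_grad():
        teacher_logits = model(
            inputs_embeds=torch.cat([image_embeds, text_embeds], dim=1)
        ).logits

    # Next-token loss on the noisy (vision-grounded) pass. Labels are assumed
    # already shifted/aligned, with image positions masked out via -100.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # Temperature-scaled KL distillation from the clean pass recovers the
    # textual knowledge that the noise would otherwise degrade.
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2

    return ce + alpha * kd
```

Under these assumptions, no extra data or model changes are needed: the teacher is the model itself on clean inputs, and the whole step reduces to one added noise draw and one distillation term.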
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15410