GROVE: Geometry-Aware Optimization for Robust Vision-Language Models

Published: 01 Jun 2026, Last Modified: 01 Jun 2026 | CVPR 2026 Workshop WiCV Proceedings Track Poster | Everyone | CC BY 4.0
Keywords: Vision–Language Models (VLMs), Computer Vision, Foundation Models, Geometry-Aware Optimization, Robust Training, Embedding Stability, Curvature-Sensitive Gradients, Multimodal Learning, Contrastive Learning, Self-Supervised Learning, Cross-Modal Alignment
Abstract: Vision-language foundation models integrate heterogeneous signals from images and text to support predictive tasks ranging from captioning and retrieval to representation learning. Despite their strong performance, training these models remains brittle under modality imbalance, noisy supervision, and distribution shifts, often producing unstable embeddings and highly curved optimization landscapes. We propose GROVE (Geometry-Aware Robust Optimization for Vision-Language Models), a principled optimization framework that explicitly accounts for the underlying geometry of vision-language embedding manifolds. GROVE introduces curvature-sensitive gradient updates that guide training toward flatter, more stable minima, improving robustness against inherent noise, cross-modal misalignment, and data perturbations. We instantiate GROVE across vision-language encoders and recommendation models, demonstrating consistent gains in generalization, embedding stability, and downstream performance over standard training. Beyond task-specific improvements, GROVE establishes a foundation-level optimization principle for robust vision-language training, broadly applicable across contrastive, autoregressive, and self-supervised learning paradigms.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 25