Keywords: Visual Language Models, Bayesian Invariant Risk Minimization, Out-of-Distribution Generalization, Task Invariance
TL;DR: We introduce Dynamic Bayesian IRM for visual language models, which adaptively adjusts regularization strength during training to learn task-invariant features, achieving a +52% improvement on out-of-distribution OCR tasks.
Abstract: While Visual Language Models (VLMs) excel on multimodal tasks, they suffer from performance degradation under distribution shift, particularly when facing out-of-distribution (OOD) tasks not seen during training. Traditional Empirical Risk Minimization (ERM) often fails to learn task-invariant features, relying instead on spurious correlations. Although Invariant Risk Minimization (IRM) offers a solution, its application to generative multimodal settings remains unexplored, and it suffers from regularization decay in deep networks.
We bridge this gap by adapting Bayesian IRM (BIRM) for generative VLMs. We formalize "environments" in multimodal data through task types (e.g., VQA, Captioning, OCR), treating distribution shift as a shift between tasks. To address regularization decay, we propose Dynamic BIRM, an algorithm that adaptively adjusts the invariance penalty strength throughout training to balance empirical and invariant risk.
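A minimal sketch of the adaptive-penalty idea in PyTorch, assuming task types as environments and the standard IRMv1 dummy-classifier penalty; the function names, the ratio-based adaptation rule, and hyperparameters such as `target_ratio` and `adapt_rate` are illustrative assumptions, not the paper's exact algorithm:

```python
import torch

def irmv1_penalty(logits, labels, loss_fn):
    # Squared norm of the gradient of the risk w.r.t. a fixed dummy
    # classifier scale w = 1.0 (IRMv1-style invariance penalty).
    scale = torch.ones(1, requires_grad=True, device=logits.device)
    risk = loss_fn(logits * scale, labels)
    grad = torch.autograd.grad(risk, scale, create_graph=True)[0]
    return (grad ** 2).sum()

def dynamic_birm_step(model, env_batches, loss_fn, optimizer, lam_state,
                      target_ratio=0.1, adapt_rate=0.05):
    """One training step with a dynamically adapted invariance penalty.
    `env_batches` maps a task type ('vqa', 'caption', 'ocr') to a batch;
    `lam_state` is a mutable dict holding the current penalty weight."""
    emp_risk, penalty = 0.0, 0.0
    for env, (inputs, labels) in env_batches.items():
        logits = model(inputs)
        emp_risk = emp_risk + loss_fn(logits, labels)
        penalty = penalty + irmv1_penalty(logits, labels, loss_fn)
    emp_risk = emp_risk / len(env_batches)
    penalty = penalty / len(env_batches)

    # Nudge the penalty weight so the invariance term keeps a target share
    # of the objective, counteracting regularization decay during training.
    with torch.no_grad():
        ratio = (lam_state["lam"] * penalty / (emp_risk + 1e-8)).item()
        lam_state["lam"] *= (1.0 + adapt_rate) if ratio < target_ratio else (1.0 - adapt_rate)

    loss = emp_risk + lam_state["lam"] * penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The adaptation rule shown here simply increases the penalty weight when the invariance term's share of the total loss falls below a target, which is one way to keep the penalty from vanishing relative to the empirical risk as training progresses.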
Our experiments on the LLaVA-OneVision dataset with SmolVLM-2B demonstrate that Dynamic BIRM significantly outperforms ERM and static BIRM baselines, achieving a +33.8% absolute improvement in CEE score (a comprehensive LLM-based evaluation metric) on challenging OOD OCR tasks while maintaining or improving in-domain performance. Our analysis reveals that adaptive mitigation of regularization decay is key to learning truly task-invariant features, leading to substantial robustness against task-based distribution shifts. Code and models will be released.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24353