Keywords: vision language models, vlms, learning dynamics, generalization
TL;DR: We develop a methodology to evaluate VLM training dynamics with train/validation curves, revealing how different visual skills generalize versus memorize during the single-epoch training process.
Abstract: Vision language models (VLMs) are trained on massive amounts of data to perform many visual tasks simultaneously.
Accordingly, many VLM benchmarks have been recently created to properly evaluate the models' capabilities.
However, relatively little has been done to understand how and when the model acquires particular skills during training.
We evaluate checkpoints throughout a one-epoch VLM training on recently seen and unseen datapoints to capture the generalization dynamics during model learning.
We categorize the training data into five broad visual reasoning groups (Bounding, Complex, Object, OCR, and Semantic questions) and observe when these skills are learned.
We note, for example, that despite not being explicitly trained to do OCR, VLMs can quickly learn to perform OCR tasks better than object recognition tasks.
Digging deeper, we perform a case study on how VLMs use visual cues to solve OCR questions, revealing a form of shortcut that is not captured by standard VLM benchmarks.
In contrast to OCR questions, which are quickly learned, bounding capabilities are learned inefficiently due to the complexity of the bounding box format -- despite the fact that bounding box questions comprise the majority of the training data.
Our work provides a glimpse into the underlying learning process of VLMs on the LLaVA dataset.
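The core methodology (evaluating checkpoints on recently seen versus held-out datapoints, per skill category) can be sketched as a generalization-gap computation. This is a minimal illustration only: the skill names, loss values, and the `generalization_gap` helper are hypothetical and not taken from the paper.

```python
# Hedged sketch: given per-checkpoint losses on recently-seen vs held-out
# examples for each skill category, compute the generalization gap
# (held-out loss minus recently-seen loss) at each checkpoint.
# All numbers below are illustrative, not the paper's results.

def generalization_gap(seen_losses, unseen_losses):
    """Gap between held-out and recently-seen loss at each checkpoint."""
    return [u - s for s, u in zip(seen_losses, unseen_losses)]

# Illustrative curves: OCR's gap stays small (it generalizes quickly),
# while Bounding's gap widens (inefficient learning of the box format).
curves = {
    "OCR":      {"seen": [2.0, 1.0, 0.6], "unseen": [2.1, 1.1, 0.7]},
    "Bounding": {"seen": [3.0, 1.5, 0.8], "unseen": [3.2, 2.6, 2.3]},
}

gaps = {skill: generalization_gap(c["seen"], c["unseen"])
        for skill, c in curves.items()}
```

In this toy setup, the final-checkpoint gap for "OCR" is near zero while "Bounding" remains large, mirroring the qualitative pattern the abstract describes.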
Submission Type: Short Research Paper (< 4 Pages)
Submission Number: 23