Keywords: language-image pre-training, vision-language model, part-to-whole recognition, visual grounding
TL;DR: We propose a part-to-whole alignment objective for vision-language pre-training to achieve comprehensive scene understanding.
Abstract: Large vision-language models such as CLIP align images and captions as wholes but falter on long, detailed descriptions. Fine-grained understanding demands capturing hierarchical semantics, seeing both the forest and the trees,
within and across domains. Yet syntactic and semantic structure seldom mirrors visual organization, and vision alone tends to produce spurious fragments unless it is anchored and unified by text.
We propose F-CAST, a hierarchical image-text representation learning framework that discovers aligned, spatially grounded text and visual hierarchies directly from image and long-caption corpora, without region-sentence labels.
It uses a CAST visual encoder for fine-to-coarse scene parsing and a hierarchical transformer text encoder that first encodes each sentence and then fuses the sentence representations into a whole-caption representation.
A two-level alignment loss, extending FLAIR, aligns whole images with whole texts while biasing image-sentence matches so that coarse concepts emerge from fine-grained evidence rather than ignoring it.
Trained on 30M image-text pairs, F-CAST exhibits strong scaling and achieves state-of-the-art performance on six long-text benchmarks.
Experiments show that hierarchical alignment of vision and language enables F-CAST to acquire fine-grained, visually grounded text understanding without supervision.
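To make the two-level objective concrete, below is a minimal sketch of how a whole-image/whole-caption term can be combined with an image-sentence term that pools sentence evidence toward the coarse match. This assumes a CLIP-style symmetric InfoNCE setup; the function names, the attention-style sentence pooling, and the weight `lambda_sentence` are illustrative assumptions, not the paper's actual loss or released code.

```python
# Hypothetical sketch of a two-level (image-caption + image-sentence) alignment
# loss of the kind the abstract describes. Names and the pooling scheme are
# assumptions for illustration, not F-CAST's implementation.
import torch
import torch.nn.functional as F


def clip_style_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of L2-normalized embeddings."""
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def two_level_alignment_loss(
    image_emb: torch.Tensor,       # (B, D) whole-image embeddings
    caption_emb: torch.Tensor,     # (B, D) whole-caption embeddings
    sentence_emb: torch.Tensor,    # (B, S, D) per-sentence embeddings
    sentence_mask: torch.Tensor,   # (B, S) 1 for real sentences, 0 for padding
    lambda_sentence: float = 0.5,  # assumed weight for the fine-grained term
) -> torch.Tensor:
    """Coarse image-caption alignment plus an image-sentence term that biases
    the coarse match toward sentence-level (fine-grained) evidence."""
    image_emb = F.normalize(image_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    sentence_emb = F.normalize(sentence_emb, dim=-1)

    # Coarse level: align whole images with whole captions.
    coarse = clip_style_loss(image_emb, caption_emb)

    # Fine level: pool each caption's sentences by how well they match their
    # own image, so the caption-level match is grounded in sentence evidence.
    sim = torch.einsum("bd,bsd->bs", image_emb, sentence_emb)          # (B, S)
    sim = sim.masked_fill(sentence_mask == 0, float("-inf"))
    weights = sim.softmax(dim=-1).unsqueeze(-1)                        # (B, S, 1)
    pooled_sentences = F.normalize((weights * sentence_emb).sum(dim=1), dim=-1)
    fine = clip_style_loss(image_emb, pooled_sentences)

    return coarse + lambda_sentence * fine
```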
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23982