Keywords: vision-language, physical scale, dynamic, disaster, adversarial, weather, caption, VQA, enrichment
Abstract: Vision-Language Models (VLMs) such as CLIP, trained on image-text pairs, have boosted image-based Deep Learning (DL). With the help of language models pre-trained on text alone, unseen images can be handled by transferring semantic knowledge from seen classes. Reasoning about two-dimensional spatial relationships and higher-level semantics has also been demonstrated. Moreover, Visual Question Answering (VQA) tools and open-vocabulary semantic segmentation provide more detailed scene descriptions, i.e., qualitative text, in captions. However, the capability of VLMs still falls far short of human perception. This paper proposes PanopticCAP, which combines multiple DL and VLM models to produce refined and enriched qualitative and quantitative captions that come closer to what humans recognize. In particular, captions are enriched with physical scales and object surface properties by integrating object counts, visibility distance, and road conditions. Fine-tuned VLMs are also employed, together with an iteratively refined captioning model trained with a new physics-based contrastive loss function. Experimental results on images of adverse weather conditions, i.e., rain, snow, and fog, disasters, i.e., landslides and flooding, and traffic events, i.e., accidents, show that the proposed approach outperforms state-of-the-art DL and VLM methods. A higher semantic level in captions for real-world scene descriptions is demonstrated.
Submission Number: 36
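As an illustrative sketch only, and not the authors' implementation: a physics-based contrastive loss could combine a CLIP-style symmetric InfoNCE term with a penalty on the mismatch between predicted and reference physical quantities (e.g., object counts, visibility distances). The function name physics_contrastive_loss, the tensor shapes, and the weight lambda_phys below are assumptions; the paper's actual formulation may differ.

```python
import torch
import torch.nn.functional as F

def physics_contrastive_loss(img_emb, txt_emb, phys_pred, phys_ref,
                             temperature=0.07, lambda_phys=0.1):
    """Hypothetical sketch: CLIP-style InfoNCE plus a physical-consistency term.

    img_emb, txt_emb   : (B, D) image / caption embeddings.
    phys_pred, phys_ref: (B, K) predicted vs. reference physical quantities
                         (e.g., object count, visibility distance), assumed
                         to be normalized to comparable ranges.
    """
    # Standard symmetric contrastive (InfoNCE) term over the batch.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_clip = 0.5 * (F.cross_entropy(logits, targets) +
                       F.cross_entropy(logits.t(), targets))

    # Physical-consistency term: penalize captions whose quantitative
    # attributes disagree with the reference measurements.
    loss_phys = F.smooth_l1_loss(phys_pred, phys_ref)

    return loss_clip + lambda_phys * loss_phys
```

The smooth L1 penalty is just one possible way to encode physical consistency; any differentiable discrepancy between predicted and reference scales would fit the same scheme.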