Keywords: vision-language, physical scale, dynamic, disaster, adversarial, weather, caption, VQA, enrichment
Abstract: Vision-Language Models (VLMs) trained on image-text pairs, such as CLIP, have boosted image-based Deep Learning (DL).
With the help of language models pre-trained only on text, they can handle unseen images by transferring semantic knowledge
from seen classes, and they capture two-dimensional spatial relationships at a higher semantic level. Moreover,
Visual Question Answering (VQA) tools and open-vocabulary semantic segmentation provide more detailed scene descriptions,
i.e., qualitative text, in captions.
However, the capability of VLMs is still far from that of human perception. This paper proposes PanopticCAP,
which combines multiple DLs and VLMs to produce refined and enriched qualitative and quantitative captions that are closer
to what a human recognizes. In particular, captions are augmented with physical scales and objects' surface properties,
integrating water level, object counts, depth maps, visibility distance, and road conditions. Fine-tuned VLM models
are also used, together with an iteratively refined caption model trained with a new physics-based contrastive loss function. Experimental
results on images with adverse weather conditions, i.e., rain, snow, and fog, disasters, i.e., landslides and flooding, and traffic
events, i.e., accidents, outperform state-of-the-art DLs and VLMs. A higher semantic level in captions for real-world
scene descriptions is demonstrated.
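The abstract does not spell out the physics-based contrastive loss, so the following is a minimal sketch, assuming an InfoNCE-style image-text objective regularized by an L1 consistency term between physical quantities decoded from the caption (e.g., water level, visibility distance) and the estimates from the auxiliary DL models; the function name `physics_contrastive_loss` and the `physics_weight` hyperparameter are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: a CLIP-style contrastive loss plus a physics-consistency
# penalty. Names and the weighting scheme are assumptions for illustration.
import torch
import torch.nn.functional as F

def physics_contrastive_loss(image_emb, text_emb, pred_phys, est_phys,
                             temperature=0.07, physics_weight=0.1):
    """image_emb, text_emb: (B, D) joint-space embeddings.
    pred_phys: (B, K) physical scalars decoded from the caption
               (e.g., water level, visibility distance, object count).
    est_phys:  (B, K) the same scalars estimated by auxiliary DL models
               (depth, counting, road-condition networks)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Symmetric InfoNCE over image-text pairs in the batch.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))

    # Physics term: captions whose quantities disagree with the auxiliary
    # estimates are penalized, pulling refined captions toward physical scale.
    physics = F.l1_loss(pred_phys, est_phys)

    return contrastive + physics_weight * physics
```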
Submission Number: 2