Refined and Enriched Captions With Physical Scale For Dynamic Disaster Scene

CVPR 2023 Workshop NFVLR Submission 2 Authors

Published: 12 Jun 2023, Last Modified: 15 Jun 2023
CVPR 2023 Workshop NFVLR Withdrawn Submission
Keywords: vision-language, physical scale, dynamic, disaster, adversarial, weather, caption, VQA, enrichment
Abstract: Vision-Language Models (VLMs), e.g., CLIP trained on image-text pairs, have boosted image-based Deep Learning (DL). Unseen images can be handled by transferring semantic knowledge from seen classes with the help of language models pre-trained on text alone. Two-dimensional spatial relationships and higher-level semantics have been captured. Moreover, Visual Question Answering (VQA) tools and open-vocabulary semantic segmentation provide more detailed scene descriptions, i.e., qualitative text, in captions. However, the capability of VLMs still falls far short of human perception. This paper proposes PanopticCAP, which combines multiple DL and VLM models to produce refined and enriched qualitative and quantitative captions closer to what humans recognize. In particular, captions are enriched with physical scales and object surface properties: water level, object counts, depth maps, visibility distance, and road conditions. Fine-tuned VLMs and an iteratively refined caption model with a new physics-based contrastive loss function are also employed. Experimental results on images of adverse weather conditions (rain, snow, fog), disasters (landslide, flooding), and traffic events (accidents) outperform state-of-the-art DL and VLM baselines. A higher semantic level in captions for real-world scene description is demonstrated.
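The abstract gives no implementation details for the physics-based contrastive loss, so the sketch below is only one plausible reading, not the authors' method. It assumes a CLIP-style symmetric InfoNCE objective plus a consistency penalty between a physical quantity regressed from the image (e.g., water level) and the value parsed from the caption; the function name, the `lam` weight, and the smooth-L1 penalty are all hypothetical choices.

```python
import torch
import torch.nn.functional as F

def physics_contrastive_loss(img_emb, txt_emb, pred_scale, cap_scale,
                             temperature=0.07, lam=0.5):
    """CLIP-style InfoNCE loss plus a physical-scale consistency term.

    img_emb, txt_emb : (B, D) image / caption embeddings
    pred_scale       : (B,) physical quantity regressed from the image
                       (e.g., water level in meters) -- assumed head
    cap_scale        : (B,) the same quantity parsed from the caption
    """
    # Normalize embeddings and compute pairwise cosine similarities.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature

    # Symmetric InfoNCE: matched image-caption pairs are positives.
    labels = torch.arange(img.size(0), device=img.device)
    nce = 0.5 * (F.cross_entropy(logits, labels) +
                 F.cross_entropy(logits.t(), labels))

    # Physics term: penalize captions whose stated scale disagrees
    # with the quantity estimated from the image.
    phys = F.smooth_l1_loss(pred_scale, cap_scale)

    return nce + lam * phys
```

In such a formulation, the physics term would push the iteratively refined captions toward quantitative statements (water level, visibility distance) that agree with image-derived measurements, while the InfoNCE term preserves image-text alignment.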
Submission Number: 2