Abstract: Vision-language models (VLMs) have achieved remarkable success in a range of vision-language tasks, such as image captioning and visual question answering. However, these models often lack physical common sense, frequently failing to identify visually evident violations of basic physical principles. Evaluating VLMs' understanding of physical common sense is therefore essential, yet it has not been systematically explored in existing research. To fill this gap, we introduce PhyVIB (Physical Common Sense Violation Image Benchmark), a novel benchmark of 16,000 images across eight categories that systematically assesses VLMs' ability to detect violations of physical common sense in images. Our evaluations show that even state-of-the-art VLMs perform poorly on PhyVIB, highlighting a significant area for improvement. In response, we propose PhyDetector, a two-stage fine-tuning framework that enhances VLMs' ability to detect such violations. The first stage uses supervised fine-tuning to equip the VLM with essential concepts related to visual physical anomalies; the second stage applies group relative policy optimization (GRPO) to strengthen the VLM's multimodal reasoning about physical plausibility. Experimental results show that a model fine-tuned with PhyDetector significantly outperforms state-of-the-art VLMs in physical common sense understanding. Our artifacts are available at https://github.com/ZitongWang018/PhyVIB.
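The abstract names group relative policy optimization as PhyDetector's second stage but does not detail it. As a minimal, hedged sketch of the general GRPO idea only (not the paper's implementation), the snippet below computes group-relative advantages by normalizing each sampled response's reward against its group; the function name, reward values, and task framing are illustrative assumptions.

```python
# Illustrative sketch of the group-relative advantage at the core of GRPO.
# This is NOT PhyDetector's actual code; names and rewards are hypothetical.
from statistics import mean, stdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each response's reward within its sampling group.

    GRPO samples a group of responses per prompt, scores each with a
    reward (e.g., 1.0 if the model's physical-violation verdict matches
    the label, else 0.0), and uses the group-normalized reward as the
    advantage, avoiding a separate value model.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

if __name__ == "__main__":
    # Four sampled answers to "Does this image violate physics?",
    # three judged correct and one incorrect.
    print(group_relative_advantages([1.0, 0.0, 1.0, 1.0]))
```

Responses scoring above the group mean receive positive advantages and are reinforced; below-mean responses are penalized, which is how the RL stage can sharpen reasoning about physical plausibility beyond what supervised fine-tuning provides.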
External IDs: doi:10.1145/3746027.3755124