Abstract: Visual inspection robots used in factories and outdoor environments require the ability to accurately recognize visual differences between similar objects and to verbalize the recognition results so that the differences can be presented to humans. Despite the application of Large Language Models (LLMs) and multimodal LLMs across various domains, our research highlights their insufficiency in verbalizing nuanced differences between images. To address this, we leveraged LLMs and image generation AI to develop a dataset for assessing difference recognition capabilities. Using this dataset, we introduced two novel tasks, selecting images based on their visual differences and conditional difference captioning, and evaluated existing Vision-Language Models (VLMs) on both. Our findings reveal that advanced models such as GPT-4V can describe subtle differences with comparative expressions, yet they fall short of human performance across all attributes. This gap between model and human recognition, especially on easily discernible differences, suggests that most current models cannot directly compare image pairs to detect differences. Consequently, we propose a new model that incorporates an image-text similarity approach into the difference recognition task and shows superior performance over existing models, including GPT-4V. Our dataset and findings will contribute to advances in recognizing differences between objects and improve robotic applications in visual inspection and object picking. The dataset is available at the DICTA challenge page.
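
As a rough illustration of the image-text similarity idea mentioned in the abstract (not the paper's actual implementation), the minimal sketch below uses an off-the-shelf CLIP model to select which candidate image best matches a verbalized difference. The checkpoint name, file names, and the helper function `select_image_by_difference` are assumptions introduced here for illustration only.

```python
# Sketch: image selection via CLIP image-text similarity (assumed setup,
# not the method proposed in the paper).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_image_by_difference(image_paths, difference_text):
    """Return the index of the candidate image most similar to a
    verbalized difference (e.g., "the cup with the longer handle")."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[difference_text], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image holds one similarity score per candidate image.
    scores = outputs.logits_per_image.squeeze(-1)
    return int(scores.argmax())

# Hypothetical usage:
# idx = select_image_by_difference(["left.png", "right.png"],
#                                  "the bottle with the slightly taller cap")
```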