Judo: A Juxtaposed Domain-oriented Multimodal Reasoner for Industrial Anomaly QA

ICLR 2026 Conference Submission16985 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Industrial Anomaly QA, Large Multimodal models
TL;DR: We introduce JUDO, a novel framework that enhances industrial anomaly detection by juxtaposing defective and normal images for domain-oriented visual reasoning while injecting domain knowledge through reinforcement learning.
Abstract: Industrial anomaly detection has been significantly advanced by large multimodal models (LMMs), enabling diverse human instructions beyond detection, particularly through visually grounded reasoning for better image understanding. However, LMMs' lack of domain-specific knowledge limits accurate response generation in complex industrial scenarios. In this work, we present JUDO, a Juxtaposed Domain-Oriented Multimodal Reasoner, a framework that efficiently incorporates domain knowledge and context into visual and textual reasoning. The visual reasoning provides detailed inspection by segmenting the defect region in the query image and juxtaposing it with a normal image as visual domain context, enabling fine-grained visual comparative analysis. Furthermore, we inject domain knowledge through supervised fine-tuning (SFT) to enhance context understanding and subsequently guide domain reasoning through reinforcement learning (GRPO) with three tailored rewards. Experimental results demonstrate that JUDO achieves superior performance on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. These results highlight the importance of enhancing domain knowledge and context for effective reasoning in anomaly understanding. The implementation of JUDO can be found at https://anonymous.4open.science/r/JUDO-9C8B.
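The abstract's reinforcement-learning stage, GRPO with three tailored rewards, can be illustrated with a minimal sketch. Note this is an assumption-laden toy: the reward functions, their weights, and the sample strings below are hypothetical placeholders, not JUDO's actual reward design; only the group-relative advantage normalization follows standard GRPO.

```python
# Toy sketch: combine several reward signals and compute GRPO-style
# group-relative advantages. Reward definitions are hypothetical.
from typing import Callable, List
import statistics

def combined_reward(response: str,
                    rewards: List[Callable[[str], float]],
                    weights: List[float]) -> float:
    """Weighted sum of the individual reward signals."""
    return sum(w * r(response) for w, r in zip(weights, rewards))

def grpo_advantages(group_rewards: List[float]) -> List[float]:
    """GRPO normalizes each sampled response's reward against the
    mean and standard deviation of its sampling group."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # avoid divide-by-zero
    return [(r - mean) / std for r in group_rewards]

# Three toy rewards (format, answer correctness, grounding) --
# stand-ins for whatever the paper's three tailored rewards are.
rewards = [
    lambda s: 1.0 if s.startswith("<think>") else 0.0,  # format reward
    lambda s: 1.0 if "scratch" in s else 0.0,           # answer reward
    lambda s: 0.5 if "region" in s else 0.0,            # grounding reward
]
group = ["<think>defect: scratch in region A</think>", "no defect found"]
scores = [combined_reward(g, rewards, [1.0, 1.0, 1.0]) for g in group]
advs = grpo_advantages(scores)  # → [1.0, -1.0]
```

The group-relative baseline is what distinguishes GRPO from PPO-style value baselines: advantages are computed purely from sibling samples, so no learned critic is needed.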
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16985