DCR$^2$-AD: Dynamic Context Routing and Reasoning Multi-Modal Large Language Model for Anomaly Detection
Keywords: Anomaly Detection, Multi-modal Large Language Model, Reasoning, Chain-of-Thought
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have shifted the anomaly detection paradigm from traditional classification-based approaches toward a diagnostic framework built on MLLM-driven question answering. In contrast to conventional architectures characterized by “single-scenario, single-purpose” designs, these models leverage pretraining to attain robust generalization and deliver expert-level diagnostic performance. However, current MLLM-based anomaly detection methods rely predominantly on internalized knowledge of visual defects, which limits their effectiveness in open-domain settings where anomalies exhibit significant cross-scenario ambiguity. For example, logical anomalies differ fundamentally from common visual defects and therefore cannot be identified by conventional visual-defect rules. To overcome this limitation, we propose a Dynamic Context Routing and Reasoning model (DCR$^2$-AD), which integrates knowledge-routed reasoning trajectory synthesis (KR-RTS) and knowledge-routed direct preference optimization (KR-DPO) to improve the model’s ability to use appropriate external knowledge during reasoning. We first construct an object-agnostic knowledge base covering extensive defect-related knowledge. We then synthesize erroneous reasoning trajectories by substituting the knowledge in correct trajectories with knowledge drawn from incorrect ones. Furthermore, we introduce the KR-DPO algorithm, which conditions on the selectively routed knowledge to promote correct reasoning trajectories and suppress incorrect ones, thereby refining the model’s ability to identify optimal reasoning pathways. In extensive experiments, our approach achieves state-of-the-art performance, attaining 83.36\% on the comprehensive MMAD benchmark, surpassing the base model by 6.00\%, outperforming ordinary humans by 4.67\%, and exceeding the previous best method by 1.41\%.
These significant gains substantiate the efficacy of our proposed framework. Our code and data will be made publicly available upon publication of the paper.
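The abstract does not specify the exact form of the KR-DPO objective. As a point of reference, a minimal sketch of the standard DPO loss that such a method would presumably build on is shown below: the model's log-probabilities for the chosen (correct) and rejected (erroneous) reasoning trajectories, each conditioned on the prompt together with the routed knowledge, are compared against a frozen reference model. All names and the `beta` value are illustrative assumptions, not details from the paper.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair (Rafailov et al., 2023).

    Each log-probability is assumed to be the policy's (or frozen
    reference's) log-likelihood of a full reasoning trajectory given the
    prompt; in a knowledge-routed setting the routed knowledge would be
    part of that conditioning context.
    """
    # Implicit reward of each trajectory: log-ratio against the reference.
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(beta * margin)): small when chosen beats rejected.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

With a zero margin the loss is log 2; raising the chosen trajectory's log-probability relative to the reference drives the loss toward zero, which is the mechanism by which correct trajectories are promoted and incorrect ones suppressed.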
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15049