Think-to-Detect: Rationale-Driven Vision–Language Anomaly Detection

Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Mohamed Mahmoud, Mostafa Farouk Senussi, Abdelrahman Abdallah, Hyun-Soo Kang

Published: 08 Dec 2025 · Last Modified: 07 Jan 2026 · Mathematics · CC BY-SA 4.0
Abstract: Large vision–language models (VLMs) can describe images fluently, yet their anomaly decisions often rely on opaque heuristics and manually tuned thresholds. We present ThinkAnomaly, a rationale-first vision–language framework for industrial anomaly detection. The model first generates a concise, structured rationale and then issues a calibrated yes/no decision, eliminating per-class thresholds. To supervise this reasoning, we construct chain-of-thought annotations for MVTec-AD and VisA through synthesis, automatic filtering, and human validation. We fine-tune Llama-3.2-Vision with a two-stage objective and a rationale–label consistency loss, achieving state-of-the-art classification accuracy while maintaining competitive detection AUC: on MVTec-AD, 93.9% accuracy and 93.8% Image-AUC; on VisA, 90.3% accuracy and 85.0% Image-AUC. This improves classification accuracy over AnomalyGPT by +7.8 percentage points on MVTec-AD and +12.9 on VisA. The explicit reasoning and calibrated decisions make ThinkAnomaly transparent and deployment-ready for industrial inspection.
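
To make the training objective concrete, below is a minimal sketch of how a rationale loss and a label-supervised decision term could be combined. The function name, the weighting scheme, and the exact form of the decision/consistency term are assumptions for illustration; the paper's actual two-stage objective and consistency loss may differ.

```python
# Hypothetical sketch of combining rationale supervision with a yes/no decision term.
# Names and the loss form are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F


def two_stage_loss(rationale_logits, rationale_targets,
                   decision_logits, decision_label, lam=0.5):
    """Combine rationale supervision with a decision term tied to the gold label.

    rationale_logits : (T, V) token logits over the generated rationale
    rationale_targets: (T,)   gold rationale token ids (from the CoT annotations)
    decision_logits  : (2,)   logits over {no, yes} for the final decision token
    decision_label   : ()     0 = normal, 1 = anomalous
    lam              : weight of the decision/consistency term (assumed)
    """
    # Supervise the structured rationale with a standard language-modeling loss.
    rationale_loss = F.cross_entropy(rationale_logits, rationale_targets)

    # Supervise the final yes/no decision directly with the image label, so
    # inference can use the argmax over {no, yes} without per-class thresholds.
    decision_loss = F.cross_entropy(decision_logits.unsqueeze(0),
                                    decision_label.unsqueeze(0))

    return rationale_loss + lam * decision_loss


if __name__ == "__main__":
    T, V = 12, 32000                       # toy rationale length / vocab size
    logits = torch.randn(T, V)
    targets = torch.randint(0, V, (T,))
    dec_logits = torch.tensor([0.2, 1.3])  # logits for {no, yes}
    label = torch.tensor(1)                # anomalous
    print(two_stage_loss(logits, targets, dec_logits, label).item())
```

The point of supervising the decision token directly is that the model's own calibrated probability over "yes"/"no" replaces a per-class anomaly-score threshold at inference time, which is the threshold-free behavior the abstract describes.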