Keywords: Large Vision-Language Models; Industrial Defect Detection; Parameter-Efficient Fine-tuning
Abstract: Large vision-language models (LVLMs) have demonstrated remarkable capabilities in aligning textual and visual modalities across diverse natural image datasets. Despite these advances, deploying them directly for industrial defect detection remains challenging, primarily because of the large domain gap. Industrial images typically exhibit distinct visual characteristics, such as complex textures, low contrast, metallic reflections, and subtle localized anomalies, that differ fundamentally from natural scenes. Furthermore, the fine-grained semantic alignment between domain-specific textual prompts and their corresponding visual regions remains underexplored, which limits the precise localization and recognition of defects. Compounding these issues, industrial datasets often contain few annotated samples per defect category, making full-model fine-tuning impractical and prone to overfitting. To overcome these challenges, we propose a novel fine-tuning framework that combines low-rank adaptation, applied selectively to the attention modules of the Grounding DINO architecture, with a carefully designed prompt engineering strategy tailored for industrial defects. Our approach pairs lightweight, parameter-efficient updates with semantically rich, domain-specific prompts, enabling effective adaptation of pretrained LVLMs using minimal labeled data. For rigorous evaluation, we construct a comprehensive dataset of approximately 30,000 high-resolution industrial images spanning a wide range of defect categories. Extensive experiments demonstrate that our method consistently outperforms competitive baselines across diverse industrial scenarios, achieving superior detection accuracy as measured by both mAP@0.5 and AR across all defect sizes, while requiring only a fraction of the trainable parameters.
Our work presents a scalable, annotation-efficient, and semantically aware solution for real-world industrial visual inspection, effectively harnessing the power of LVLMs.
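As a rough illustration of why low-rank adaptation trains only a fraction of the parameters, the sketch below counts trainable weights for one attention module under full fine-tuning versus a low-rank update. The hidden size, rank, and number of adapted projections are hypothetical values chosen for the arithmetic; they are not taken from the paper or from the Grounding DINO configuration.

```python
def full_trainable_params(d_out: int, d_in: int) -> int:
    """Weights updated when a d_out x d_in projection is fine-tuned directly."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Weights updated by a low-rank adapter on the same projection.

    The pretrained matrix W stays frozen; only the factors of the update
    delta_W = B @ A are trained, with B of shape (d_out, rank) and A of
    shape (rank, d_in).
    """
    return rank * (d_out + d_in)


# Hypothetical sizes, assumed for illustration only.
hidden = 256   # assumed width of each attention projection
rank = 8       # assumed adapter rank
n_proj = 4     # assumed adapted projections: query, key, value, output

full = n_proj * full_trainable_params(hidden, hidden)
lora = n_proj * lora_trainable_params(hidden, hidden, rank)
ratio = lora / full

print(f"full fine-tuning: {full} trainable weights")
print(f"low-rank update:  {lora} trainable weights ({ratio:.2%} of full)")
```

The ratio simplifies to rank * (d_out + d_in) / (d_out * d_in), so for a fixed rank the savings grow with the width of the adapted projections.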
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18014