EviInspect: Evidence-Grounded Annotation and Evaluation for Safety-Critical Industrial Inspection

Nikolaos Marios Militsis; Achilleas Toumpas; Ilias Koulalis; Konstantinos Ioannidis; Stefanos Vrochidis

Published: 02 Jun 2026, Last Modified: 02 Jun 2026

Venue: Greeks in AI 2026

License: CC BY 4.0

Keywords: Vision language models, Industrial inspection, Evidence-grounded reasoning

Domains: Vision and Learning, Other

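TL;DR: An evidence-grounded annotation framework for vision-based industrial inspection, demonstrated on hazard and size labeling of Foreign Object Debris (FOD).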

External Link: https://openaccess.thecvf.com/content/CVPR2026W/VISION26/html/Militsis_EviInspect_Evidence-Grounded_Annotation_and_Evaluation_for_Safety-Critical_Industrial_Inspection_CVPRW_2026_paper.html

Abstract: Vision-based inspection systems increasingly support safety-critical decision-making in industrial settings. Their reliability depends not only on predictive accuracy, but also on whether visual evidence is interpreted in a manner consistent with domain-specific rules. In many inspection tasks, labels such as hazard level are not intrinsic visual properties; instead, they are defined through external domain references, including standards and empirical studies, that connect observable evidence to downstream operational consequences. However, many existing vision and multimodal reasoning benchmarks implicitly treat such labels as directly observable visual ground truth. As Vision Language Models (VLMs) are increasingly used in inspection pipelines, datasets are needed to evaluate not only prediction accuracy, but also whether decisions are supported by sufficient visual evidence under domain-consistent rules. To address this need, we introduce EviInspect, an evidence-guided annotation framework in which inspection labels are derived by explicitly linking visual evidence to external domain references. EviInspect combines AI-assisted evidence extraction with human verification to assess whether the available evidence is sufficient to support an assigned label and whether the decision is reproducible given the same evidence. We demonstrate the framework in a Foreign Object Debris (FOD) inspection use case and release FOD-A-H, an evidence-grounded extension of the public FOD-A dataset, with hazard and size annotations derived from Federal Aviation Administration (FAA) reference material. Using FOD-A-H, we evaluate state-of-the-art VLMs on their ability to derive inspection labels under explicit evidence constraints.
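
Illustrative sketch (not from the paper): the idea of deriving a label from explicit evidence and an external rule table, rather than reading it directly off the image, can be pictured as follows. This is a minimal sketch under stated assumptions. The `Evidence` fields, the `HYPOTHETICAL_RULES` table, its thresholds, and the label names are all invented for illustration and are not taken from FAA reference material or from the EviInspect implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    """Visual evidence extracted for one object (hypothetical fields)."""
    material: Optional[str]    # e.g. "metal"; None if it cannot be determined
    size_cm: Optional[float]   # estimated longest dimension; None if unknown


# Placeholder rule table, checked top to bottom.
# These values are made up and do NOT come from FAA material.
HYPOTHETICAL_RULES = [
    # (material, minimum size in cm, hazard label)
    ("metal", 5.0, "high"),
    ("metal", 0.0, "medium"),
    ("plastic", 0.0, "low"),
]


def derive_label(evidence: Evidence) -> str:
    """Return a hazard label only when the evidence supports one."""
    # Evidence sufficiency: refuse to label if a required field is missing.
    if evidence.material is None or evidence.size_cm is None:
        return "insufficient_evidence"
    # Rule lookup: the first matching rule determines the label.
    for material, min_size_cm, label in HYPOTHETICAL_RULES:
        if evidence.material == material and evidence.size_cm >= min_size_cm:
            return label
    # Evidence is complete, but the reference table has no matching rule.
    return "no_applicable_rule"


if __name__ == "__main__":
    print(derive_label(Evidence(material="metal", size_cm=7.0)))
    print(derive_label(Evidence(material=None, size_cm=3.0)))
```

Because `derive_label` is a pure function of the evidence record, the same evidence always yields the same label, which is one simple way to picture the reproducibility property the abstract describes.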

Submission Number: 41
