Keywords: zero-shot, out-of-distribution detection
TL;DR: This paper introduces ELCM, an entropy-based weighting method that improves zero-shot OOD detection by emphasizing reliable patches and reducing false positives without additional training.
Abstract: Zero-shot out-of-distribution detection with vision-language models faces a fundamental challenge: how to reliably aggregate patch-level information without being misled by spurious activations from noisy or ambiguous image regions. Existing approaches such as GL-MCM use simple max-pooling over local patch confidences, treating all patches equally and leaving systems vulnerable to false alarms from misleading alignments on background elements or partial out-of-distribution content. We introduce Entropy-Weighted Local Concept Matching (ELCM), a principled information-theoretic framework that addresses this limitation by automatically assessing patch reliability through uncertainty quantification. For each spatial patch, ELCM computes a probability distribution over in-distribution classes, measures its Shannon entropy to quantify prediction uncertainty, and applies exponential weighting that emphasizes confident patches while suppressing ambiguous ones. This entropy-driven aggregation replaces heuristic max-pooling with theoretically grounded patch importance assignment, requiring no additional training while maintaining strict zero-shot constraints. Extensive evaluation demonstrates substantial improvements in detection reliability: overall AUROC increases from 0.9129 to 0.9188, with a 15 percent reduction in the false positive rate (FPR95: 0.3495 to 0.2975). Notably, ELCM achieves a 19 percent FPR95 reduction on iNaturalist and a 23 percent reduction on SUN, with consistent improvements across diverse visual domains including natural scenes, architectural environments, and texture patterns. The method addresses a fundamental gap in vision-language OOD detection and establishes entropy-based aggregation as an effective paradigm for robust patch-level reasoning in complex visual environments.
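A minimal sketch of the entropy-weighted aggregation the abstract describes, written under stated assumptions rather than from the authors' implementation: the weighting form exp(-entropy / tau), the temperature tau, and the names patch_logits and elcm_score are illustrative, and how this local score is combined with any global image-level score is not specified here.

```python
# Hypothetical sketch of entropy-weighted patch aggregation (not the authors' code).
# Assumes patch_logits: (num_patches, num_id_classes) similarity logits from a
# CLIP-like model between local patch features and in-distribution class prompts.
import numpy as np

def elcm_score(patch_logits: np.ndarray, tau: float = 1.0) -> float:
    # Per-patch probability distribution over in-distribution class prompts (softmax).
    probs = np.exp(patch_logits - patch_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)

    # Shannon entropy of each patch's distribution (higher = more ambiguous patch).
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)

    # Exponential weighting: confident (low-entropy) patches dominate,
    # ambiguous patches are suppressed.
    weights = np.exp(-entropy / tau)
    weights /= weights.sum()

    # Each patch's confidence is its top class probability; the in-distribution
    # score is the entropy-weighted aggregate, replacing max-pooling over patches.
    patch_conf = probs.max(axis=1)
    return float((weights * patch_conf).sum())
```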
Supplementary Material: zip
Submission Number: 91