Abstract: We propose BridgeCLIP, a framework that applies vision-language models to image-based bridge inspection. BridgeCLIP is a CLIP-based multi-label classifier that identifies multiple types of damage in a single bridge image. Pre-trained vision-language models learn relationships between general objects from millions of text-image pairs, whose descriptions are not precise enough for domain-specific problems. Following the idea that humans typically learn the visual appearance of bridge damage by reading a manual, we introduce a novel Description Attention Module (DAM) that incorporates domain-specific knowledge extracted from the professional descriptions in bridge inspection manuals. By combining the general knowledge of pre-trained CLIP with professional knowledge of bridge inspection, BridgeCLIP comprehensively learns the inter-class relationships among different damage types. Experimental results on bridge inspection datasets show that BridgeCLIP outperforms state-of-the-art multi-label classifiers.
External IDs: doi:10.1007/978-3-031-78447-7_5