BridgeCLIP: Automatic Bridge Inspection by Utilizing Vision-Language Model

Powei Liao, Gaku Nakano

Published: 01 Jan 2025, Last Modified: 17 Mar 2026, License: CC BY-SA 4.0

Abstract: We propose BridgeCLIP, a framework that applies vision-language models to image-based bridge inspection. BridgeCLIP is a CLIP-based multi-label classifier that detects multiple types of damage in a single bridge image. Pre-trained vision-language models learn the relationships between general objects from millions of text-image pairs, but the descriptions in those pairs are not precise enough for domain-specific problems. Following the idea that humans typically learn the visual appearance of bridge damage by reading a manual, we introduce a novel Description Attention Module (DAM) that incorporates domain-specific knowledge extracted from the professional descriptions in bridge inspection manuals. By combining the general knowledge of the pre-trained CLIP with professional knowledge of bridge inspection, BridgeCLIP comprehensively learns the inter-class relationships among different damage types. Experimental results on bridge inspection datasets show that BridgeCLIP outperforms state-of-the-art multi-label classifiers.
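
The abstract does not specify how the DAM is built, so the following is only a rough sketch of the general idea it describes: scoring several damage labels independently for one image, with each label represented by attention-pooled description sentences. Everything here is an assumption for illustration: the function name `score_labels`, the temperature value, the 0.5 threshold, and the random vectors standing in for CLIP image and text embeddings are hypothetical and are not taken from the paper.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def score_labels(image_emb, class_descriptions, temperature=0.1):
    """Return one independent probability per damage class.

    image_emb: embedding of one image (stand-in for a CLIP image feature).
    class_descriptions: for each class, a list of sentence embeddings
        (stand-ins for encoded manual descriptions).
    NOTE: hypothetical illustration, not the paper's DAM.
    """
    image = normalize(image_emb)
    probs = []
    for sentences in class_descriptions:
        sentences = [normalize(s) for s in sentences]
        # The image acts as the query: sentences that match it get more weight.
        weights = softmax([dot(image, s) / temperature for s in sentences])
        # Attention-pooled class embedding (weighted sum of sentence embeddings).
        pooled = [
            sum(w * s[i] for w, s in zip(weights, sentences))
            for i in range(len(image))
        ]
        # Scaled cosine similarity, then a sigmoid so labels are not exclusive.
        logit = dot(image, normalize(pooled)) / temperature
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs


random.seed(0)
dim = 8
image_emb = [random.gauss(0, 1) for _ in range(dim)]
# 4 hypothetical damage classes, each with 3 description sentences.
class_descriptions = [
    [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
    for _ in range(4)
]
probs = score_labels(image_emb, class_descriptions)
predicted = [p >= 0.5 for p in probs]  # several labels may be True at once
print(probs, predicted)
```

The sigmoid is applied per class rather than a softmax across classes, which is what lets a single image carry several damage labels at once. In a real system the embeddings would come from the CLIP encoders and the attention would have learned parameters.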

External IDs: doi:10.1007/978-3-031-78447-7_5