Abstract: Baggage screening is a cornerstone of airport security. Advances in computer vision in recent years have led to several automated systems for identifying security threats in baggage scans. However, existing methods struggle to adapt to new threat categories when data samples are scarce and new threats emerge rapidly. In this paper, we propose a novel CLIP-driven few-shot framework (CLIFS) that explores the potential of multi-modality, using text-image fusion through contrastive learning to learn relevant contextual features for recognizing security threats from limited samples. By integrating features from GPT-4-generated captions with image features, CLIFS leverages both visual and textual data to significantly improve threat classification performance in a few-shot learning context. CLIFS was rigorously tested on the publicly available SIXray baggage X-ray dataset, where it outperformed the state of the art by 31.3% in accuracy and 28.40% in F1-score in the challenging 5-shot scenario, demonstrating its robustness and effectiveness in classifying threats from limited data samples.
External IDs: dblp:conf/icip/AhmedVERBHDBW24