Keywords: Backdoor Detection, Large Vision Language Models, Semantic Collapsing
Abstract: Stealthy backdoor attacks on large vision–language models (LVLMs) are difficult to detect because the attacker can suppress responses to generic probes, breaking the usual similarity-to-target/distance-to-target detection logic. In this work, we propose a relative semantic distance (RSD)-based framework for detecting stealthy backdoors. We observe a consistent phenomenon: when a shared probing trigger is optimized, backdoored vision encoders drive embeddings from multiple semantic manifolds to collapse toward a common latent attractor, whereas clean encoders exhibit only weak or unstable trajectories. To quantify this coordinated drift, RSD measures the relative semantic shift between each image's triggered embedding and its original clean embedding. We track the mean RSD across iterations; because this trend is stable under cross-manifold semantic collapsing, our detection scheme converges in about 10 trigger-optimization rounds. Extensive experiments on various stealthily backdoored LVLMs and datasets show that the proposed scheme achieves over 0.99 Accuracy/Precision/Recall/F1 and identifies the backdoor target with over 0.99 accuracy among the Top-5 candidates.
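To make the detection signal described in the abstract concrete, the sketch below illustrates one plausible reading of RSD and its mean trend across trigger-optimization rounds. This is a minimal illustration, not the paper's implementation: the exact RSD formula is not given in the abstract, so the relative-L2-shift formulation, the `rsd` and `mean_rsd_trend` names, and the list-based embedding representation are all assumptions.

```python
from math import sqrt

def _l2(v):
    # Euclidean norm of an embedding vector (plain Python lists here).
    return sqrt(sum(x * x for x in v))

def rsd(clean_emb, trig_emb):
    # Hypothetical RSD formulation: the L2 shift of the triggered
    # embedding relative to the clean embedding, normalized by the
    # clean embedding's norm so the shift is scale-relative.
    diff = [t - c for t, c in zip(trig_emb, clean_emb)]
    return _l2(diff) / _l2(clean_emb)

def mean_rsd_trend(clean_embs, triggered_per_round):
    # Mean RSD over a probe set, one value per trigger-optimization
    # round. Per the abstract's observation, a backdoored encoder
    # would show a stable, coordinated rise within ~10 rounds, while
    # a clean encoder would stay weak or unstable.
    return [
        sum(rsd(c, t) for c, t in zip(clean_embs, triggered)) / len(clean_embs)
        for triggered in triggered_per_round
    ]
```

A detector under this reading would threshold on how consistently the trend rises across rounds, rather than on any single-round distance to a guessed target.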
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: security and privacy, multimodality
Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 8266