Robust CLIP-Guided Deep Thinking: A Two-Stage Optimization Strategy for Enhancing Adversarial Robustness and Reliability in LVLMs
Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across a wide range of vision-language tasks as an efficient input/output system. However, the lack of adversarial robustness on the input side and the widespread hallucination phenomenon on the output side significantly undermine user trust in them. Current solutions to the former tend to sacrifice the general performance of LVLMs, while solving the latter incurs substantial engineering cost. To address these challenges, we propose a two-stage optimization strategy called RCDT (Robust CLIP-guided Deep Thinking), which aims to enhance the adversarial robustness of LVLMs with minimal loss of general performance while also reducing hallucinations. First, we introduce a constrained adversarial fine-tuning approach for CLIP that limits the general performance loss incurred while enhancing robustness. Second, the resulting robust CLIP is used to guide deep thinking over the output process of LVLMs to reduce hallucinations. Experiments show that, compared to the baselines, RCDT not only reduces general performance loss by more than half while maintaining adversarial robustness, but also demonstrates good performance in mitigating hallucinations.
External IDs: dblp:conf/icassp/SuiH0ZR025