Abstract: With the surge in research on LLMs, various methods for evaluating and attacking the robustness of LLMs have emerged and attracted increasing attention. Traditional adversarial attack methods depend heavily on the victim model, so the generated adversarial samples transfer poorly: they are effective only against the white-box model used to produce them and rarely succeed against other black-box models. In the LLM setting, the low attack success rates and slow attack speeds of these traditional methods become even more pronounced. Through an analysis of traditional text adversarial attack methods, we propose a method that produces attack samples with better transferability while also raising attack success rates and greatly improving attack speed.
Paper Type: long
Research Area: Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability, Approaches low compute settings-efficiency
Languages Studied: English