CLAD: Continual Learning for Robust Adversarial Text Detection and Repair in Resource-Constrained Scenarios

15 Sept 2025 (modified: 25 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: adversarial detection, adversarial defense, continual learning, adversarial attack
TL;DR: CLAD, a continual learning framework that detects and repairs textual adversarial attacks with high robustness and transferability in low-resource settings.
Abstract: Textual adversarial attacks pose a critical threat to NLP systems by subtly altering inputs to deceive models, necessitating robust detection and defense mechanisms. Traditional methods, however, suffer from high computational costs, poor generalization to unseen attacks, and vulnerability to distribution shifts, particularly in resource-constrained scenarios where adversarial examples are scarce and expensive to obtain. To address these challenges, we propose CLAD, a continual learning-based framework for adversarial detection and repair, designed to enhance robustness and transferability in low-resource environments. By leveraging continual learning, CLAD mitigates catastrophic forgetting of learned adversarial patterns and incrementally improves generalization as new attack types are introduced. CLAD integrates two adversarial repair methods that preserve semantic fidelity while neutralizing perturbations. Across four text classification datasets and three primary attacks (BAE, PWWS, TextFooler), CLAD improves with larger memory sizes (MS ∈ {0, 1, 10, 100}) and exhibits reduced forgetting. The best detection accuracy reaches 82.20% (Amazon, in-domain, MS=100), while on the same dataset the defense reaches up to 99.65% defense accuracy (D.A.) and 84.73% recovery accuracy (R.A.) against TextFooler via PD_LLM.
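The abstract does not spell out how the memory buffer is used during incremental training, so the following is only a minimal sketch of replay-based continual learning for an adversarial-text detector, assuming a bounded buffer of MS past examples that is mixed into each new attack's training batches. The `Detector` architecture, the feature representation, and the buffer-update rule are hypothetical stand-ins, and the repair stage (e.g. PD_LLM) is not modeled here.

```python
# Minimal sketch (not the authors' implementation): experience replay to mitigate
# catastrophic forgetting when a detector is trained on attack types sequentially.
import random
import torch
import torch.nn as nn

class Detector(nn.Module):
    """Binary classifier: clean (0) vs. adversarial (1), over fixed-size text features."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, x):
        return self.net(x)

def train_on_attack(detector, optimizer, task_batches, buffer, ms: int = 100):
    """Train on one attack type (e.g. BAE, PWWS, TextFooler) while replaying a small
    memory buffer of examples from previously seen attacks."""
    loss_fn = nn.CrossEntropyLoss()
    for feats, labels in task_batches:  # (B, feat_dim) float tensor, (B,) long tensor
        # Mix in replayed examples from earlier attack types, if any are stored.
        replay = random.sample(buffer, min(len(buffer), len(labels))) if buffer else []
        if replay:
            feats = torch.cat([feats, torch.stack([f for f, _ in replay])])
            labels = torch.cat([labels, torch.tensor([y for _, y in replay])])
        optimizer.zero_grad()
        loss_fn(detector(feats), labels).backward()
        optimizer.step()
        # Bounded memory buffer of size MS: fill it first, then overwrite at random.
        for f, y in zip(feats, labels):
            if len(buffer) < ms:
                buffer.append((f.detach(), int(y)))
            elif random.random() < 0.1:
                buffer[random.randrange(ms)] = (f.detach(), int(y))
```

A typical use would call `train_on_attack` once per attack type in sequence, reusing the same `buffer` list across calls; setting `ms=0` recovers naive sequential fine-tuning, which is the baseline the abstract's MS ∈ {0, 1, 10, 100} comparison measures forgetting against.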
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 6181