Keywords: Claim Verification, Universal Adversarial Triggers, Model Robustness, Bias Diagnostics
Abstract: Despite their widespread adoption, the robustness of fact-checking models to adversarial perturbations remains underexplored.
Existing approaches are typically model-specific and require gradient access or dataset-dependent optimization, which limits their generalization, efficiency, and semantic validity.
We introduce FactFlip, a framework for analyzing robustness in claim verification models via universal adversarial triggers. FactFlip identifies highly perturbative trigger words through a lightweight, model-only analysis of classification logits, without relying on training data or gradient access. FactFlip decouples trigger discovery from claim perturbation and adopts an LLM-based perturb-and-verify pipeline to integrate them while preserving semantic validity.
Experimental results show that FactFlip effectively exposes model vulnerabilities, achieving competitive attack success rates with greater stability and cross-model robustness than fully supervised baselines. Moreover, we show that the identified triggers are highly discriminative and exhibit compositional effects, providing evidence of systematic biases arising from both pre-training and fine-tuning.
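The gradient-free, logit-only trigger discovery described in the abstract can be sketched as follows. This is a hypothetical illustration, not the FactFlip implementation: the `logits` function is a toy stand-in for a black-box claim-verification model, and `trigger_score` is an assumed helper that ranks candidate words by how much prepending them shifts the model's logits toward a target label.

```python
# Hedged sketch: score candidate trigger words using only a model's
# classification logits (no gradients, no training data).
# `logits` below is a toy stand-in, NOT a real verification model.

def logits(claim):
    """Return toy [SUPPORTED, REFUTED] logits for a claim.

    In this stand-in, the word "never" pushes the model toward REFUTED,
    mimicking a lexical bias a real classifier might have."""
    refuted = 2.0 if "never" in claim.split() else -1.0
    return [-refuted, refuted]

def trigger_score(trigger, claims, target=1):
    """Average logit gain toward `target` when `trigger` is prepended.

    A high score means the word alone systematically shifts
    predictions, flagging it as a candidate universal trigger."""
    gain = 0.0
    for claim in claims:
        base = logits(claim)[target]
        perturbed = logits(trigger + " " + claim)[target]
        gain += perturbed - base
    return gain / len(claims)

claims = ["the earth orbits the sun", "water boils at 100 C"]
candidates = ["never", "blue", "maybe"]
scores = {w: trigger_score(w, claims) for w in candidates}
best = max(scores, key=scores.get)  # "never" wins under this toy model
```

In the actual framework, the ranked triggers would then be handed to the LLM-based perturb-and-verify pipeline, which rewrites each claim to include a trigger while checking that the claim's meaning is preserved.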
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: adversarial attacks/examples/training, probing, robustness
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 6492