Keywords: Claim Verification, Universal Adversarial Triggers, Model Robustness, Bias Diagnostics
Abstract: Despite their widespread adoption, the robustness of fact-checking models to adversarial perturbations remains underexplored.
Existing approaches are typically model-specific and require gradient access or dataset-dependent optimization, which limits their generalization, efficiency, and semantic validity.
We introduce FactFlip, a framework for analyzing robustness in claim verification models via universal adversarial triggers. FactFlip identifies highly perturbative trigger words through a lightweight, model-only analysis of classification logits, without relying on training data or gradient access. FactFlip decouples trigger discovery from claim perturbation and adopts an LLM-based perturb-and-verify pipeline to integrate them while preserving semantic validity.
Experimental results show that FactFlip effectively exposes model vulnerabilities, achieving competitive attack success rates with greater stability and cross-model robustness than fully supervised baselines. Moreover, we show that the identified triggers are highly discriminative and exhibit compositional effects, providing evidence of systematic biases arising from both pre-training and fine-tuning.
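The gradient-free, logit-only trigger discovery described in the abstract can be sketched as follows. This is a hypothetical illustration, not the FactFlip implementation: the `logits` function is a toy stand-in for a black-box claim-verification model, and `trigger_score` is an assumed helper that ranks candidate words by how much prepending them shifts the model's logits toward a target label.

```python
# Hedged sketch: score candidate trigger words using only a model's
# classification logits (no gradients, no training data).
# `logits` below is a toy stand-in, NOT a real verification model.

def logits(claim):
    """Return toy [SUPPORTED, REFUTED] logits for a claim.

    In this stand-in, the word "never" pushes the model toward REFUTED,
    mimicking a lexical bias a real classifier might have."""
    refuted = 2.0 if "never" in claim.split() else -1.0
    return [-refuted, refuted]

def trigger_score(trigger, claims, target=1):
    """Average logit gain toward `target` when `trigger` is prepended.

    A high score means the word alone systematically shifts
    predictions, flagging it as a candidate universal trigger."""
    gain = 0.0
    for claim in claims:
        base = logits(claim)[target]
        perturbed = logits(trigger + " " + claim)[target]
        gain += perturbed - base
    return gain / len(claims)

claims = ["the earth orbits the sun", "water boils at 100 C"]
candidates = ["never", "blue", "maybe"]
scores = {w: trigger_score(w, claims) for w in candidates}
best = max(scores, key=scores.get)  # "never" wins under this toy model
```

In the actual framework, the ranked triggers would then be handed to the LLM-based perturb-and-verify pipeline, which rewrites each claim to include a trigger while checking that the claim's meaning is preserved.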
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: adversarial attacks/examples/training, probing, robustness
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 6492