Evaluation-Conditioned Trojan Attack

Published: 02 Mar 2026 · Last Modified: 09 Mar 2026 · ICLR 2026 Workshop ICBINB · License: CC BY 4.0
Keywords: Adversarial Robustness, Machine Unlearning, Benchmarking, Evaluation, Large Language Models
Abstract: Benchmarking is widely considered the iron rule of machine learning research, and its ability to measure progress depends on several prerequisites, including transparency and fair comparison across models. In this work, we examine the reliability of evaluating open-weight models on static benchmarks and investigate how upstream interventions can violate these assumptions and thus undermine evaluation validity. Using representative benchmarks, our simulation study examines how language model benchmark performance can be deliberately degraded and how standard fine-tuning enables recovery from this degradation. We propose several simple implementations of this Evaluation-Conditioned Trojan Attack threat model: (1) degrading instruction-following capability via SFT on instruction data with shuffled instruction–response pairs; (2) applying unlearning algorithms such as negative preference optimization and gradient ascent. We further study a universal setting in which a single trojan attack degrades many benchmarks, either via unlearning on one proxy dataset or via bi-level reweighting that upweights the least-recoverable data sources. Across 15 models and 6 benchmarks, we find that brief recovery fine-tuning restores 64–142% of the original performance from trojanized checkpoints.
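The first attack variant, degrading instruction following by fine-tuning on mismatched instruction–response pairs, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, data format, and seed handling are assumptions.

```python
import random

def shuffle_pairs(dataset, seed=0):
    """Mismatch instructions and responses so that SFT on the result
    degrades instruction-following (the trojanization step).

    dataset: list of {"instruction": str, "response": str} dicts
    (assumed format). Returns a new list with responses permuted
    across examples while instructions keep their original order.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    responses = [ex["response"] for ex in dataset]
    rng.shuffle(responses)  # permute responses independently of instructions
    return [
        {"instruction": ex["instruction"], "response": r}
        for ex, r in zip(dataset, responses)
    ]
```

Standard SFT on the shuffled pairs then teaches the model to produce responses uncorrelated with the instruction, while the underlying capabilities remain largely intact and recoverable by brief fine-tuning on clean pairs.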
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 92