Hacking LM Arena via LLM Identification with Interpolated Preference Learning

20 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: LLM identification
TL;DR: We present I-PREF, a new model-driven approach for identifying the source of LLM responses, which leverages triplet-based training with adaptive and iterative curriculum learning, supported by synthetic negatives generated via model interpolation.
Abstract: Voting-based leaderboards, such as LM Arena, have become the predominant method for evaluating large language models (LLMs) on open-ended tasks, and their fairness fundamentally depends on the anonymity of model responses. While prior work has shown that simple statistical features can be used for LLM identification, these methods can be easily defended against and lack the power to distinguish between stylistically similar models. To investigate this risk more rigorously, we introduce a model-driven LLM identification framework via learning from Interpolated preference data (I-PREF). Our approach trains the detector model with a triplet ranking loss, augmented with synthetic hard negatives generated via copy-model fine-tuning and model interpolation. This strategy enables the detector to learn deep relational patterns beyond superficial statistics. We further enhance performance and stabilize training through adaptive and iterative curriculum learning. Experimental results show that I-PREF significantly outperforms existing baselines, achieving improvements of about 30% in accuracy and 24% in AUROC.
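The two core ingredients named in the abstract, a triplet ranking loss and synthetic hard negatives via model interpolation, can be sketched as follows. This is a minimal illustration assuming a Euclidean embedding space and simple linear weight interpolation; the function names, margin value, and interpolation scheme are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def triplet_ranking_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: the anchor embedding should be closer
    to the positive (same source model) than to the negative (different
    source model) by at least `margin`. Margin of 1.0 is an assumption."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def interpolate_models(target_weights, copy_weights, alpha):
    """Synthetic hard negative via model interpolation (illustrative):
    blend the target model's weights with a fine-tuned copy's weights.
    Responses from the blended model sit near the target's style, giving
    the detector harder negatives than unrelated models would."""
    return alpha * target_weights + (1.0 - alpha) * copy_weights
```

In training, embeddings of responses from the interpolated model would serve as the `negative` argument, forcing the detector to separate the true target from near-duplicates rather than from obviously different models.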
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24505