Keywords: LLM identification
TL;DR: We present I-PREF, a model-driven approach to identifying the source model of LLM responses, which leverages triplet-based training with adaptive and iterative curriculum learning, supported by synthetic hard negatives generated via model interpolation.
Abstract: Voting-based leaderboards, such as LM Arena, have become the predominant
method for evaluating large language models (LLMs) on open-ended tasks, and
their fairness fundamentally depends on the anonymity of model responses.
While prior work has shown that simple statistical features can be used for LLM
identification, such methods are easily defended against and lack the power to
distinguish stylistically similar models. To investigate this risk more
thoroughly, we introduce a model-driven LLM identification framework that
learns from Interpolated preference data (I-PREF). Our approach trains the
detector model with a triplet ranking loss, augmented with synthetic hard
negatives generated via copy-model fine-tuning and model interpolation. This
strategy enables the detector to learn deep relational patterns beyond
superficial statistics. We further improve performance and stabilize training
through adaptive and iterative curriculum learning. Experimental results show
that I-PREF significantly outperforms existing baselines, achieving improvements
of about 30% in accuracy and 24% in AUROC.
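The two core ingredients named in the abstract, a triplet ranking loss over response embeddings and synthetic hard negatives produced by interpolating model weights, can be sketched as follows. This is a minimal illustration under assumptions, not the paper's actual implementation: the embedding dimensionality, margin value, and function names (`triplet_loss`, `interpolate_weights`) are hypothetical, and real training would operate on learned detector embeddings rather than raw vectors.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet ranking loss: pull the anchor toward the
    positive (same source model) and push it from the negative
    (different source model) by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to same-model response
    d_neg = np.linalg.norm(anchor - negative)  # distance to other-model response
    return max(0.0, d_pos - d_neg + margin)

def interpolate_weights(w_a, w_b, alpha=0.5):
    """Linearly interpolate two models' (flattened) weights to create a
    synthetic 'copy' model whose outputs serve as hard negatives:
    stylistically close to model A but not identical to it."""
    return alpha * w_a + (1.0 - alpha) * w_b

# Toy usage: an easy negative yields zero loss; a hard one does not.
anchor   = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])   # near the anchor (same model)
easy_neg = np.array([5.0, 0.0])   # far away: loss is clipped to 0
hard_neg = np.array([0.3, 0.0])   # close by: produces a nonzero loss

print(triplet_loss(anchor, positive, easy_neg))  # 0.0
print(triplet_loss(anchor, positive, hard_neg) > 0.0)  # True
```

The point of interpolation-based negatives is precisely to manufacture the `hard_neg` case above: responses close enough to the target model that superficial statistics cannot separate them, forcing the detector to learn deeper relational structure.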
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24505