When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection

ACL ARR 2026 January Submission2857 Authors

03 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: personalization, machine-generated text detection, interpretability, large language model
Abstract: As large language models (LLMs) increasingly imitate personal writing styles, personalization has become a key challenge for machine-generated text (MGT) detection. Yet personalized MGT detection remains largely underexplored. In this work, we introduce \dataset, the first benchmark for evaluating detector robustness under personalization, built from literary and blog texts paired with their LLM-generated imitations. Experiments across diverse detectors show pronounced performance instability under personalization, with frequent inversions relative to general-domain behavior. To better understand this limitation, we conduct an in-depth analysis and attribute it to a \textit{feature-inversion trap}: features that reliably separate human-written text (HWT) from MGT in general domains flip their effect in personalized contexts, ultimately misleading detectors. Motivated by this, we propose \method, a diagnostic framework for predicting detector robustness under personalization. \method identifies the inverted features and quantifies detector dependence on them using texts perturbed to make these features more pronounced. In our experiments, \method predicts both the direction and magnitude of cross-domain performance shifts with an 85\% correlation to actual outcomes. We hope this work raises awareness of the structural risks introduced by personalization and motivates more robust approaches to personalized MGT detection. \faGithub~\href{https://anonymous.4open.science/r/Personalized_MGT_Detect-8678}{GitHub.}\footnote{Code is also available in the supplementary materials.}
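To make the feature-inversion idea concrete, below is a minimal sketch (not the paper's actual \method implementation) of how one might flag inverted features: for each candidate feature, compare the sign of the HWT-vs-MGT mean gap in a general-domain corpus against the sign in a personalized corpus, and report features whose direction flips. All function names and the assumption that features are per-document scalar values are hypothetical here.

```python
import numpy as np

def separation_direction(hwt_vals, mgt_vals):
    """Sign of the mean gap between MGT and HWT for one scalar feature."""
    return np.sign(np.mean(mgt_vals) - np.mean(hwt_vals))

def find_inverted_features(general, personalized):
    """Flag features whose HWT-vs-MGT direction flips across domains.

    `general` and `personalized` each map a feature name to a pair
    (hwt_values, mgt_values) of 1-D arrays of per-document feature values.
    """
    inverted = []
    for name, (g_hwt, g_mgt) in general.items():
        p_hwt, p_mgt = personalized[name]
        # A negative product means the separation direction reversed,
        # i.e., the feature falls into the feature-inversion trap.
        if separation_direction(g_hwt, g_mgt) * separation_direction(p_hwt, p_mgt) < 0:
            inverted.append(name)
    return inverted
```

A diagnostic along the lines of \method would then perturb texts so these flagged features become more pronounced and measure how strongly a detector's scores move in response, treating that sensitivity as a predictor of its cross-domain performance shift.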
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation, statistical testing for evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 2857