Abstract: Detectors of LLM-generated text generalize poorly under domain shift. Yet reliable text detection methods in the wild are of paramount importance for plagiarism detection, the integrity of public discourse, and AI safety. Linguistic and domain confounders introduce spurious correlations, leading to poor out-of-distribution (OOD) performance.
In this work we introduce the concept of confounding neurons: individual neurons within transformer-based detectors that encode dataset-specific biases rather than task-specific signals.
Leveraging confounding neurons, we propose a novel post-hoc, neuron-level intervention framework that disentangles AI-generated text detection signals from data-specific biases.
Through extensive experiments we show that it effectively reduces topic-specific biases, enhancing the model's ability to generalize across domains.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: generalization, model editing, topic modeling, domain adaptation, text classification
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 6835