Abstract: Detectors of LLM-generated text generalize poorly under domain shift. Yet reliable text detection methods in the wild are of paramount importance for plagiarism detection, the integrity of public discourse, and AI safety. Linguistic and domain confounders introduce spurious correlations, leading to poor out-of-distribution (OOD) performance.
In this work we introduce the concept of confounding neurons: individual neurons within transformer-based detectors that encode dataset-specific biases rather than task-specific signals.
Leveraging confounding neurons, we propose a novel post-hoc, neuron-level intervention framework that disentangles AI-generated text detection signals from data-specific biases.
Through extensive experiments we show that it effectively reduces topic-specific biases, enhancing the model's ability to generalize across domains.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: generalization, model editing, topic modeling, domain adaptation, text classification
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 6835