SciContrib-Bench: Mapping the Autonomy Landscape of AI Scientists Through Stage-Dependent Detectability
Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: AI-generated text detection, AI scientists, scientific writing, stylometry, benchmark, stage-dependent detection, RoBERTa, paraphrase robustness, cross-generator transfer, Binoculars, attribution
TL;DR: A 1,632-segment cross-domain benchmark showing that the detectability of AI-generated versus human-written scientific text depends jointly on pipeline stage, feature type, and evaluation condition, with stylometric and distributional methods producing opposite stage orderings
Abstract: We introduce SciContrib-Bench, a benchmark for evaluating how the distinguishability of AI-generated scientific contributions from human-written ones varies across four research pipeline stages. Using 1,632 matched segments from 300 papers in three domains, we find that detection difficulty depends strongly on both the pipeline stage and the feature type used for detection. An L1-regularized logistic regression on 16 stylometric features achieves 84.5% balanced accuracy (BA; AUROC = 0.923), with a robust 19-percentage-point spread across stages (interpretation 93.5% vs. abstract 74.5%; cluster-robust permutation p = 0.001, mixed-effects model p < 0.001). A fine-tuned RoBERTa-large classifier achieves near-perfect in-domain detection (BA = 1.000) but is fragile under distribution shift: back-translation paraphrasing drops BA to 0.518 while AUROC remains at 0.989, showing that the model's ranking ability survives the shift but its hard decision boundary does not transfer. Cross-generator evaluation (train on DeepSeek-R1, test on Qwen-2.5-32B) yields BA = 0.810 with a 29-percentage-point stage range, indicating that stage-dependent detection emerges in RoBERTa under distribution shift. A same-family Binoculars evaluation (Qwen-2.5-0.5B/1.5B) achieves BA = 0.715 (AUROC = 0.800) and is consistent with the reversed stage ordering seen with individual perplexity baselines. Negative controls (shuffled labels: BA = 0.500; human-vs-human: FP = 0.000) confirm that the detection signal is genuine. These findings establish that the choice of detection approach and the evaluation condition jointly determine whether scientific AI contributions exhibit stage-dependent or stage-uniform detectability, with direct implications for attribution policy design.
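The gap between BA = 0.518 and AUROC = 0.989 under paraphrasing can look contradictory at first. The sketch below is a toy illustration of how the two metrics can diverge. It uses synthetic scores (not benchmark data), an assumed fixed decision threshold of 0.5, and hypothetical helper names (`auroc`, `balanced_accuracy`). It assumes the shift lowers every AI-text score by a constant, so the ranking is preserved while most AI scores fall below the threshold.

```python
import random


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def balanced_accuracy(pos_scores, neg_scores, threshold=0.5):
    """Mean of true-positive rate and true-negative rate at a fixed threshold."""
    tpr = sum(s >= threshold for s in pos_scores) / len(pos_scores)
    tnr = sum(s < threshold for s in neg_scores) / len(neg_scores)
    return 0.5 * (tpr + tnr)


rng = random.Random(0)

# Synthetic detector scores: AI text clusters near 0.9, human text near 0.1.
ai_in_domain = [rng.gauss(0.9, 0.03) for _ in range(200)]
human = [rng.gauss(0.1, 0.03) for _ in range(200)]

# Assumed effect of the shift: every AI score drops by 0.5,
# so AI text still ranks above human text but sits below the 0.5 threshold.
ai_shifted = [s - 0.5 for s in ai_in_domain]

ba_in = balanced_accuracy(ai_in_domain, human)
ba_shifted = balanced_accuracy(ai_shifted, human)
auroc_shifted = auroc(ai_shifted, human)

# Re-tuning the threshold on shifted data recovers the hard decisions.
ba_recalibrated = balanced_accuracy(ai_shifted, human, threshold=0.25)

print(f"in-domain BA:           {ba_in:.3f}")
print(f"shifted BA (thr=0.5):   {ba_shifted:.3f}")
print(f"shifted AUROC:          {auroc_shifted:.3f}")
print(f"shifted BA (thr=0.25):  {ba_recalibrated:.3f}")
```

In this toy setting, a threshold re-tuned on shifted data restores balanced accuracy, which is one reading of the statement that ranking ability is robust while the decision boundary is not. Whether recalibration helps on the actual benchmark is not something this sketch can show.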
Submission Number: 351