DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code

Published: 22 Sept 2025, Last Modified: 25 Nov 2025
DL4C @ NeurIPS 2025 Poster
License: CC BY 4.0
Keywords: AI detection, machine-generated text, multilingual, code generation, text classification, content authenticity, synthetic text detection, natural language processing, AI-generated content detection, multilingual NLP
TL;DR: Fine-tuned small encoder models (RoBERTa, CodeBERTa) trained on custom datasets surpass few-shot LLMs at detecting machine-generated multilingual text and code, delivering higher accuracy with a fraction of the compute.
Abstract: The prevalence of Large Language Models (LLMs) for generating multilingual text and source code has heightened the need for machine-generated content detectors that are accurate and efficient across domains. Current detectors, which predominantly rely on zero-shot methods such as Fast-DetectGPT or GPTZero, either incur high computational cost or lack sufficient accuracy, often trading one for the other, leaving room for improvement. To address these gaps, we propose fine-tuning encoder-only Small Language Models (SLMs), specifically the pre-trained RoBERTa and CodeBERTa models, on specialized source-code and natural-language datasets, showing that for this binary classification task SLMs outperform LLMs by a wide margin while using a fraction of the compute. Our encoders achieve AUROC of 0.97-0.99 and macro-F1 of 0.89-0.94 while reducing latency by 8-12× and peak VRAM by 3-5× at 512-token inputs. Under cross-generator shifts and adversarial transformations (paraphrasing and back-translation for text; formatting changes and identifier renaming for code), performance retains ≥92% of clean AUROC. We release training and evaluation scripts with seeds and configs, along with a reproducibility checklist.
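As a rough illustration of the recipe the abstract describes (not the authors' released scripts), the sketch below fine-tunes roberta-base for binary human-vs-machine classification with Hugging Face transformers and reports AUROC and macro-F1 via scikit-learn. The CSV file names, the hyperparameters, and the transformers>=4.46 API (eval_strategy, processing_class) are assumptions; swapping MODEL_NAME for huggingface/CodeBERTa-small-v1 would give the code-domain variant.

```python
# Illustrative sketch only: dataset files and hyperparameters below are
# hypothetical placeholders, not the paper's released configuration.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score, roc_auc_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "roberta-base"  # or "huggingface/CodeBERTa-small-v1" for code

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Hypothetical CSVs with a "text" column and a binary "label" column
# (0 = human-written, 1 = machine-generated).
data = load_dataset(
    "csv",
    data_files={"train": "detector_train.csv", "validation": "detector_val.csv"},
)

def tokenize(batch):
    # Truncate to 512 tokens, the input length quoted in the abstract.
    return tokenizer(batch["text"], truncation=True, max_length=512)

data = data.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Numerically stable softmax; AUROC uses the probability of class 1.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    preds = logits.argmax(axis=1)
    return {
        "auroc": roc_auc_score(labels, probs[:, 1]),
        "macro_f1": f1_score(labels, preds, average="macro"),
    }

args = TrainingArguments(
    output_dir="detector-roberta",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
    eval_strategy="epoch",  # "evaluation_strategy" on transformers < 4.41
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    processing_class=tokenizer,  # "tokenizer=" on transformers < 4.46
    compute_metrics=compute_metrics,
)
trainer.train()
```

Robustness figures like the "≥92% of clean AUROC" claim could then be obtained by re-running trainer.evaluate() on paraphrased, back-translated, or reformatted copies of the validation split and comparing the resulting AUROC against the clean value.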
Submission Number: 83