LLMTrace: A Corpus for Classification and Fine-Grained Localization of AI-Written Text

Irina Tolstykh; Aleksandra Tsybina; Sergey Yakubson; Maksim Kuprashevich

18 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: AI-generated text detection, Dataset, Mixed-authorship detection, Natural Language Processing, AI Safety, Large Language Models

TL;DR: We introduce a large-scale, bilingual dataset for both classifying and localizing AI-generated text, the first to provide precise character-level annotations for mixed-authorship scenarios.

Abstract: The widespread use of human-like text generated by Large Language Models (LLMs) calls for robust detection systems. However, progress is limited by a critical lack of suitable training data: existing datasets are often generated with outdated models, are predominantly in English, and rarely address the increasingly common scenario of mixed human-AI authorship. Crucially, among the few datasets that do address mixed authorship, none provide the character-level annotations required for the precise localization of AI-generated segments within a text. To address these gaps, we introduce LLMTrace, a new large-scale, bilingual (English and Russian) corpus for AI-generated text detection. Constructed using a diverse range of modern proprietary and open-source LLMs, our dataset is designed to support two key tasks: traditional full-text binary classification (human vs. AI) and the novel task of AI-generated interval detection, facilitated by character-level annotations. We believe LLMTrace will serve as a vital resource for training and evaluating the next generation of more nuanced and practical AI detection models.
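
To make the interval detection task concrete, the following is a minimal Python sketch of how character-level annotations for a mixed-authorship text might be represented and manipulated. It is illustrative only: the half-open `(start, end)` offset convention, the example text, and the helper names `intervals_to_mask` and `mask_to_intervals` are assumptions made for this sketch, not the actual LLMTrace schema or tooling.

```python
from typing import List, Tuple

# Assumed convention (not taken from the dataset): an AI-written span is a
# half-open pair of character offsets [start, end) into the text.
Interval = Tuple[int, int]


def intervals_to_mask(text_len: int, intervals: List[Interval]) -> List[int]:
    """Turn AI-span intervals into a per-character 0/1 mask (1 = AI-written)."""
    mask = [0] * text_len
    for start, end in intervals:
        if not 0 <= start <= end <= text_len:
            raise ValueError(f"interval {(start, end)} is outside the text")
        for i in range(start, end):
            mask[i] = 1
    return mask


def mask_to_intervals(mask: List[int]) -> List[Interval]:
    """Collapse a 0/1 mask back into sorted, merged half-open intervals."""
    intervals: List[Interval] = []
    start = None
    for i, flag in enumerate(mask):
        if flag and start is None:
            start = i  # a new AI-written run begins here
        elif not flag and start is not None:
            intervals.append((start, i))  # the run ended just before i
            start = None
    if start is not None:
        intervals.append((start, len(mask)))  # run extends to the end of the text
    return intervals


# Hypothetical mixed-authorship example: a human sentence followed by an AI one.
text = "I wrote this. The model wrote this."
ai_spans = [(14, 35)]

mask = intervals_to_mask(len(text), ai_spans)
recovered = mask_to_intervals(mask)
ai_segments = [text[s:e] for s, e in recovered]
```

Under these assumptions, full-text binary classification corresponds to the special cases where the mask is all zeros (human) or all ones (AI), while interval detection asks a model to recover the spans themselves.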

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 10503
