![code](./assets/code.png)

# DivEye

This repository is the official implementation of DivEye, from the paper *Diversity Boosts AI-Generated Text Detection* (accepted to TMLR '26). The code is also available on GitHub: `https://github.com/IBM/diveye`.

Detecting AI-generated text is increasingly necessary to combat misuse of LLMs in domains such as education, business compliance, journalism, and social media, where synthetic fluency can mask misinformation or deception. Existing detectors often rely on likelihood-based heuristics or black-box classifiers, which struggle against high-quality generations and lack interpretability. In this work, we propose DivEye, a novel detection framework that captures how unpredictability fluctuates across a text using surprisal-based features. Motivated by the observation that human-authored text exhibits richer variability in lexical and structural unpredictability than LLM outputs, DivEye captures this signal through a set of interpretable statistical features. Our method outperforms existing zero-shot detectors by up to 33.2% and achieves competitive performance with fine-tuned baselines across multiple benchmarks. DivEye is robust to paraphrasing and adversarial attacks, generalizes well across domains and models, and improves the performance of existing detectors by up to 18.7% when used as an auxiliary signal. Beyond detection, DivEye provides interpretable insights into why a text is flagged, pointing to rhythmic unpredictability as a powerful and underexplored signal for LLM detection.
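To make the idea of surprisal-based features concrete, here is a minimal sketch in Python. It is not the feature set implemented in `diveye.py`; the function names, the chosen statistics, and the hard-coded token probabilities are all assumptions made for illustration. In practice, the per-token probabilities would come from the language model passed via `--model`.

```python
import math
from statistics import mean, pstdev


def surprisal(token_probs):
    """Per-token surprisal in nats: -log p(token | context)."""
    return [-math.log(p) for p in token_probs]


def surprisal_features(token_probs):
    """Summarize how surprisal varies across a text (illustrative only)."""
    s = surprisal(token_probs)
    # First differences: how much surprisal changes from one token to the next.
    deltas = [b - a for a, b in zip(s, s[1:])]
    return {
        "mean": mean(s),
        "std": pstdev(s),
        "max": max(s),
        "mean_abs_delta": mean([abs(d) for d in deltas]),
    }


# Hypothetical token probabilities, not taken from any real model.
flat_probs = [0.5, 0.5, 0.5, 0.5]      # uniformly predictable text
varied_probs = [0.9, 0.1, 0.8, 0.05]   # predictability fluctuates

flat = surprisal_features(flat_probs)
varied = surprisal_features(varied_probs)
print(flat)
print(varied)
```

Under these assumptions, the sequence with fluctuating probabilities yields a larger standard deviation and a larger mean absolute change in surprisal than the flat one, which is the kind of variability signal the abstract describes.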

✨ We provide a small dataset of 21 AI-generated arXiv abstracts sourced from the BiScope benchmark, generated using Claude-3.5-Sonnet, available in `./dataset/`. Additionally, we include 21 paraphrased versions of each abstract using three commercial paraphrasing tools: GPTinf, GPTZero, and Quillbot.

---
## 1. Installation
- Clone this repository to your local machine.
- Install conda and run the following commands to set up your environment.
```bash
conda create -n diveye python=3.11
conda activate diveye
pip install transformers scikit-learn tqdm numpy pandas xgboost scipy
```

## 2. Execution
You can run the code using the following command:
```bash
python3 diveye.py --model={model} --train_dataset={train_dataset.csv} --test_dataset={test_dataset.csv}
```

For more details about the arguments, refer to the table below:

| **Argument**      | **Default / Choices**                                              | **Explanation**                                        |
|-------------------|--------------------------------------------------------------------|--------------------------------------------------------|
| `--model`         | **Required** <br> Must be a model available on Hugging Face.       | Specifies the model for feature extraction in DivEye.  |
| `--train_dataset` | **Required** <br> *Format:* `{name}.csv`                           | Indicates the training dataset.                        |
| `--test_dataset`  | **Required** <br> *Format:* `{name}.csv`                           | Indicates the testing dataset.                         |
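
As a rough illustration of the interface described in the table, the three required arguments could be declared with `argparse` as below. This is a sketch under assumptions, not the parser used in `diveye.py`: the function name `build_parser`, the help strings, and the file names `train.csv` and `test.csv` are hypothetical, and `gpt2` is used only as an example of a model ID hosted on Hugging Face.

```python
import argparse


def build_parser():
    """Declare the three required arguments listed in the table (illustrative)."""
    parser = argparse.ArgumentParser(description="DivEye command-line sketch")
    parser.add_argument("--model", required=True,
                        help="Hugging Face model used for feature extraction")
    parser.add_argument("--train_dataset", required=True,
                        help="training dataset, e.g. {name}.csv")
    parser.add_argument("--test_dataset", required=True,
                        help="testing dataset, e.g. {name}.csv")
    return parser


# Hypothetical invocation, using the same --flag=value form as the command above.
args = build_parser().parse_args(
    ["--model=gpt2", "--train_dataset=train.csv", "--test_dataset=test.csv"]
)
print(args.model, args.train_dataset, args.test_dataset)
```

Because every argument is marked `required=True`, omitting any of them makes `argparse` print a usage message and exit instead of running with a silent default.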

## 3. License
Our source code is released under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.

