# Training-free LLM-generated Text Detection by Mining Token Probability Sequences

This repository provides the core code for the two main algorithms presented in our paper: **Lastde** and **Lastde++**.

We follow the standard testing procedures outlined in [Fast-DetectGPT](https://github.com/baoguangsheng/fast-detect-gpt/tree/main) to evaluate each detection algorithm.


<p align="center">
<img src="resources/Lastde_framework.png" width="100%"> <br>
</p>

## Contents

- [Project Structure](#project-structure)
- [Environment](#environment)
- [Source Model and Proxy Model](#source-model-and-proxy-model)
- [Dataset](#dataset)
- [Detection](#detection)

## Project Structure

```text
Lastde/
├── datasets/
│   ├── human_llm_data_for_experiment/
│   └── human_original_data/
├── experiments_results/
│   ├── fast_detectgpt_detection_results/
│   ├── lastde_doubleplus_detection_results/
│   └── statistic_detection_results/
├── pretrain_models/
│   ├── gpt-j-6b/
│   └── Llama-3-8B/
├── py_scripts/
│   ├── baselines/
│   │   ├── scoring_methods/
│   │   ├── untils/
│   │   ├── lastde_doubleplus.py
│   │   ├── fast_detect_gpt.py
│   │   └── statistic_detect.py
│   └── data_generations/
│       └── data_generation_opensource.py
└── shell_scripts/
    ├── detection_black_box.sh
    └── detection_white_box.sh
```

## Environment

- Python 3.8
- PyTorch 2.0.0
- Other dependencies:
  ```shell
  pip install -r requirements.txt
  ```
  (Note: Our experiments were conducted on two RTX 3090 GPUs, each with 24 GB of memory.)

## Source Model and Proxy Model

The `pretrain_models` directory stores open-source models, which serve either as proxy models for detection or as source models for generating LLM text. We use `gpt-j-6b` and `Llama-3-8B` as examples; their weights can be downloaded from:
- [gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6b/tree/main)
- [Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B/tree/main)

## Dataset

The dataset is divided into two parts:
- The `human_original_data` directory contains raw human-written text in JSON format, with the XSum dataset (i.e., **xsum.json**) as an example.
- The `human_llm_data_for_experiment` directory stores the complete data used in the experiments, with **xsum_llama3_8b.raw_data.json** as an example. This data is produced by running
    ```shell
    python py_scripts/data_generations/data_generation_opensource.py
    ```
    (Note: The data is already provided here, so you do not need to run the script.)

    Each complete data entry contains two parts: 'original' (human-written text) and 'sampled' (LLM-generated text), and the two texts in an entry correspond to each other. The 'sampled' text is generated by feeding the first 30 tokens of the corresponding 'original' text to the source model (here, Llama-3-8B) as a prompt for continuation. All entries are truncated to the same length.
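
The entry layout described above can be inspected with a few lines of Python. This is a minimal sketch, not code from this repository: it assumes the JSON file is an object with two parallel lists under the keys `original` and `sampled`, and the helper names `pair_entries`, `load_pairs`, and `prompt_prefix` are hypothetical. `prompt_prefix` only illustrates the 30-token prompt idea, using whitespace splitting as a stand-in for the source model's tokenizer.

```python
import json
from typing import Dict, List, Tuple


def pair_entries(data: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Pair each human-written text with its LLM-generated counterpart.

    Assumption: `data` holds two parallel lists under the keys
    "original" and "sampled", where index i of both lists belongs together.
    """
    originals, sampled = data["original"], data["sampled"]
    if len(originals) != len(sampled):
        raise ValueError("'original' and 'sampled' must have the same length")
    return list(zip(originals, sampled))


def load_pairs(path: str) -> List[Tuple[str, str]]:
    """Read a raw_data JSON file and return its (human, LLM) text pairs."""
    with open(path, "r", encoding="utf-8") as f:
        return pair_entries(json.load(f))


def prompt_prefix(text: str, n_tokens: int = 30) -> str:
    """Return the first `n_tokens` whitespace-separated tokens of `text`.

    Stand-in only: whitespace splitting replaces the source model's tokenizer.
    """
    return " ".join(text.split()[:n_tokens])
```

From the repository root, `load_pairs("datasets/human_llm_data_for_experiment/xsum_llama3_8b.raw_data.json")` should then return one `(human, LLM)` tuple per entry, provided the file matches the assumed layout.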

## Detection

Running **detection_white_box.sh** or **detection_black_box.sh** in `shell_scripts` triggers white-box or black-box detection, respectively, on **xsum_llama3_8b.raw_data.json**:
```shell
cd shell_scripts

# white-box setting
./detection_white_box.sh 

# black-box setting
./detection_black_box.sh
```

The detection methods include:
- Likelihood, LogRank, Entropy, LRR, **Lastde (ours)**. Results will be saved in `experiment_results/statistic_detection_results`.
- Fast-DetectGPT. Results will be saved in `experiment_results/fast_detectgpt_detection_results`.
- **Lastde++ (ours)**. Results will be saved in `experiment_results/lastde_doubleplus_detection_results`.

The code for the above detection methods is encapsulated in `py_scripts/baselines`.
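
For readers unfamiliar with the simpler statistics in this list, the sketch below shows how Likelihood, LogRank, and Entropy are commonly defined over a sequence of per-position logits. It is an illustration only and not the code in `py_scripts/baselines`: the function names, the plain-list input format, and the unsigned averaging are assumptions, and LRR, Lastde, and Lastde++ are not shown.

```python
import math
from typing import Dict, List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def baseline_scores(
    logits: Sequence[Sequence[float]], token_ids: Sequence[int]
) -> Dict[str, float]:
    """Average three token-level statistics over a sequence.

    logits[t] is the model's score vector at position t and
    token_ids[t] is the token actually observed at that position.
    """
    if len(token_ids) == 0 or len(logits) != len(token_ids):
        raise ValueError("need one logit row per observed token")
    log_liks, log_ranks, entropies = [], [], []
    for row, tok in zip(logits, token_ids):
        lp = log_softmax(row)
        log_liks.append(lp[tok])  # log-probability of the observed token
        rank = 1 + sum(1 for x in row if x > row[tok])  # 1 = most likely token
        log_ranks.append(math.log(rank))
        entropies.append(-sum(math.exp(v) * v for v in lp))  # predictive entropy
    n = len(token_ids)
    return {
        "likelihood": sum(log_liks) / n,
        "log_rank": sum(log_ranks) / n,
        "entropy": sum(entropies) / n,
    }
```

A detector would additionally fix a sign for each statistic so that higher scores consistently point to one class; that orientation step is left out here.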
