# dgMARK: Decoding-Guided Watermarking for Diffusion Language Models


## Setup

**Install dependencies**:
   ```bash
   # Install PyTorch with CUDA support (example)
   pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126

   # Install other dependencies
   pip install -r requirements.txt
   ```


## Dataset Setup

This project uses the C4 validation dataset from HuggingFace. You have two options for dataset setup:

### Option 1: Automatic Download

The dataset will be downloaded automatically on first run:

```bash
python scripts/generate.py --method original --num_samples 10 --dataset_url https://huggingface.co/datasets/allenai/c4/resolve/main/realnewslike/c4-validation.00000-of-00001.json.gz
```

This will download `c4-validation.00000-of-00001.json.gz` (~364MB) to the current directory.

### Option 2: Manual Download

If you prefer to download manually or want to specify a custom location:

1. **Download the dataset**:
   ```bash
   wget https://huggingface.co/datasets/allenai/c4/resolve/main/realnewslike/c4-validation.00000-of-00001.json.gz
   ```

2. **Use custom dataset path**:
   ```bash
   python scripts/generate.py --dataset_path /path/c4-validation.00000-of-00001.json.gz --method original
   ```
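
To sanity-check a downloaded file before running generation, the sketch below reads the first few records. It assumes the archive is gzip-compressed JSON Lines with a `text` field per record, which is how the C4 files on HuggingFace are laid out; `read_c4_texts` is a hypothetical helper and is not part of `scripts/generate.py`.

```python
import gzip
import json
from itertools import islice


def read_c4_texts(path, limit=10):
    """Return the `text` field of the first `limit` records in a JSON Lines .json.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        # Each non-empty line is one JSON object; C4 records carry a "text" field.
        records = (json.loads(line) for line in fh if line.strip())
        return [rec["text"] for rec in islice(records, limit)]
```

For example, `read_c4_texts("c4-validation.00000-of-00001.json.gz", limit=3)` should return three document strings if the download completed.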

## Quick Start

```bash
python scripts/generate.py --method original --num_samples 10
python scripts/generate.py --method watermark --num_samples 10 --sampling_strategy multinomial
python scripts/detect.py --watermarked generated_results_watermark.csv --original generated_results_original.csv --plot detection_results.png
```

### 1. Generate Baseline (Non-Watermarked) Text

```bash
python scripts/generate.py --method original --num_samples 50 --output_prefix baseline
```


### 2. Generate Watermarked Text

**With private key (adds randomness to parity mapping):**
```bash
python scripts/generate.py --method watermark --num_samples 50 --sampling_strategy multinomial --private_key 12345
```
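
To illustrate what a keyed parity mapping can look like, here is a minimal sketch. It is an assumption for illustration only, not the mapping used by `scripts/generate.py`: the function name `token_parity`, the use of SHA-256, and the unkeyed fallback to `token_id % 2` are all hypothetical.

```python
import hashlib


def token_parity(token_id, private_key=None):
    """Map a token id to a parity bit (0 or 1). Hypothetical illustration."""
    if private_key is None:
        # Unkeyed baseline (assumed): the parity of the token id itself.
        return token_id % 2
    # Keyed variant (assumed): hash key and token id together, keep one bit.
    digest = hashlib.sha256(f"{private_key}:{token_id}".encode("utf-8")).digest()
    return digest[0] & 1
```

Under this sketch, the same key always reproduces the same mapping, while a different key yields an unrelated one. That is why detection needs the key used at generation time.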

### 3. Run Detection Analysis

**With private key (must match generation key):**
```bash
python scripts/detect.py \
  --watermarked generated_results_watermark.csv \
  --original generated_results_original.csv \
  --private_key 12345 \
  --plot detection_results.png
```

## Generation Methods

The project supports five generation methods:

### 1. Non-Watermark (Greedy Sampling)
Basic LLaDA generation with argmax token selection:
```bash
python scripts/generate.py --method original --sampling_strategy greedy
```

### 2. Non-Watermark (Multinomial Sampling)
```bash
python scripts/generate.py --method original --sampling_strategy multinomial --top_k 3
```
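
The difference between the two sampling strategies can be sketched in plain Python. This is a generic illustration of argmax selection versus top-k multinomial sampling over one position's logits. It makes no claim about how `scripts/generate.py` implements `--sampling_strategy` or `--top_k`, and `select_token` is a hypothetical name.

```python
import math
import random


def select_token(logits, strategy="greedy", top_k=None, rng=random):
    """Pick a token index from a list of logits."""
    if strategy == "greedy":
        # Argmax: always the highest-scoring token.
        return max(range(len(logits)), key=logits.__getitem__)
    # Multinomial: optionally keep only the top_k highest logits.
    order = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
    candidates = order[:top_k] if top_k else order
    # Softmax weights over the candidates, shifted by the max for stability.
    peak = logits[candidates[0]]
    weights = [math.exp(logits[i] - peak) for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With `top_k=3`, only the three highest-scoring tokens can ever be drawn, so a smaller `top_k` moves multinomial sampling closer to greedy selection.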

### 3. Watermark (Greedy Sampling)
**With private key:**
```bash
python scripts/generate.py --method watermark --sampling_strategy greedy --private_key 54321
```

### 4. Watermark (Multinomial Sampling)
**With private key:**
```bash
python scripts/generate.py --method watermark --sampling_strategy multinomial --top_k 3 --private_key 54321
```

### 5. Beam Search

**With greedy sampling:**
```bash
python scripts/generate.py --method beam --beam_size 3 --sampling_strategy greedy
```

**With multinomial sampling:**
```bash
python scripts/generate.py --method beam --beam_size 3 --sampling_strategy multinomial --top_k 3
```

**With private key (applies to both sampling strategies):**
```bash
python scripts/generate.py --method beam --beam_size 3 --sampling_strategy greedy --private_key 98765
```

## Detection Analysis

### Statistical Detection

The detection system uses z-score analysis to distinguish watermarked from non-watermarked text:

```bash
python scripts/detect.py \
  --watermarked generated_results_watermark.csv \
  --original generated_results_original.csv \
  --threshold_z 4.0 \
  --min_length 200 \
  --plot detection_plot.png
```

### Key Parameters

- `--threshold_z`: Z-score threshold for detection (default: 4.0)
- `--min_length`: Minimum sequence length to include (default: 200)
- `--plot`: Save detection visualization plot
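
The role of `--threshold_z` and `--min_length` can be sketched with a standard one-proportion z-test. The statistic below is an assumption: it supposes the detector counts tokens consistent with the watermark (`hits`) out of `total` scored tokens, and that unwatermarked text matches with probability 0.5. The function names, and whether the comparison is strict, are hypothetical and not taken from `scripts/detect.py`.

```python
import math


def z_score(hits, total, p=0.5):
    """One-proportion z-score for `hits` matches out of `total` scored tokens."""
    return (hits - p * total) / math.sqrt(total * p * (1.0 - p))


def is_watermarked(hits, total, threshold_z=4.0, min_length=200):
    """Return None for sequences shorter than min_length, else the decision."""
    if total < min_length:
        return None  # too short to score
    return z_score(hits, total) > threshold_z
```

Under these assumptions, 150 matches out of 200 tokens gives a z-score of about 7.07, well above the default threshold of 4.0, while 105 out of 200 gives about 0.71 and is not flagged.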


## Robust Detection Workflow


**1. Analyze with robust detection:**
```bash
# Analyze original text
python scripts/robust_detection.py --mode analyze \
  --input_csv generated_results_original.csv \
  --output_scores ./original_z_scores.txt \
  --window_size 8

# Analyze watermarked text
python scripts/robust_detection.py --mode analyze \
  --input_csv generated_results_watermark.csv \
  --output_scores ./watermarked_z_scores.txt \
  --window_size 8
```

**2. Compute AUC value:**
```bash
python scripts/robust_detection.py --mode auc \
  --original original_z_scores.txt \
  --watermark watermarked_z_scores.txt
```
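
For reference, AUC over two sets of z-scores can be computed as the probability that a randomly chosen watermarked score exceeds a randomly chosen original score, with ties counted as one half. The sketch below shows that definition. It assumes the score files hold one float per line, which may not match the format written by `scripts/robust_detection.py`; `auc` and `load_scores` are hypothetical helpers.

```python
def auc(original_scores, watermark_scores):
    """Fraction of (watermark, original) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for w in watermark_scores:
        for o in original_scores:
            if w > o:
                wins += 1.0
            elif w == o:
                wins += 0.5
    return wins / (len(watermark_scores) * len(original_scores))


def load_scores(path):
    """Read one float per non-empty line (assumed file format)."""
    with open(path, encoding="utf-8") as fh:
        return [float(line) for line in fh if line.strip()]
```

An AUC of 1.0 means every watermarked sample scores above every original sample, and 0.5 means the scores do not separate the two sets.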

**With private key (must match generation key), add `--private_key` to the analyze step:**
```bash
python scripts/robust_detection.py --mode analyze \
  --input_csv generated_results_watermark.csv \
  --output_scores watermarked_z_scores.txt \
  --window_size 8 \
  --private_key 12345
```
