# BADI - BLACK-BOX & ANYTIME-VALID DATASET IDENTIFICATION FOR LARGE LANGUAGE MODELS

This repository contains the reference implementation of the BADI methodology (Black-box & Anytime-valid Dataset Identification), currently under review as a conference paper at ICLR 2026.

## ABSTRACT

Large language models (LLMs) are trained on massive, uncurated internet datasets
that often include copyrighted material, making training data identification essential
for intellectual property protection. Dataset inference (DI) addresses this challenge
by extracting diverse training membership features for a suspect set, aggregating
them, and applying statistical tests to assess if that suspect set contributed to the
model’s training. However, current DI methods face two major limitations that
hinder their practical deployment. First, they require gray-box access to token
probabilities, while state-of-the-art LLM APIs usually return only generated tokens.
We address this issue by approximating per-token probabilities from label-only
outputs, making black-box DI feasible. Second, existing DI methods rely on p-values for
statistical tests that necessitate a fixed suspect set and a predetermined significance
level. This either leads to high computational costs for large suspect sets, especially
in the black-box setup, or yields inconclusive results for smaller sets, since adding
new suspect data points post-hoc might be necessary to provide strong enough
evidence, but it invalidates statistical guarantees based on p-values. To overcome
this limitation, we introduce a black-box DI framework based on e-values and
sequential testing. The e-values offer anytime-valid guarantees and support optional
continuation, enabling safe accumulation of evidence and reducing inconclusive
outcomes and compute costs. Through these two fundamental advances, our Black-box
and Anytime-valid Dataset Identification (BADI) method enables practical
data auditing for LLMs, supporting their trustworthy deployment.
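
As a rough illustration of the sequential-testing idea described above, the sketch below runs a generic test by betting. It is not the BADI implementation: the function name `sequential_betting_test`, the fixed betting fraction `lam`, and the assumption that every payoff lies in [-1, 1] with non-positive conditional mean under the null hypothesis are choices made for this example only.

```python
from typing import Iterable, List, Tuple


def sequential_betting_test(
    payoffs: Iterable[float], alpha: float = 0.05, lam: float = 0.5
) -> Tuple[bool, int, List[float]]:
    """Generic test by betting (illustrative sketch, not the BADI code).

    Assumption: each payoff g_t lies in [-1, 1] and has conditional mean <= 0
    under the null. The wealth W_t = prod_s (1 + lam * g_s) is then a
    non-negative supermartingale, so by Ville's inequality the probability
    that it ever reaches 1/alpha is at most alpha.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must be in [0, 1) so that wealth stays positive")

    wealth = 1.0
    trajectory: List[float] = []
    for step, payoff in enumerate(payoffs, start=1):
        if not -1.0 <= payoff <= 1.0:
            raise ValueError("payoffs must lie in [-1, 1]")
        wealth *= 1.0 + lam * payoff  # bet a fixed fraction of current wealth
        trajectory.append(wealth)
        if wealth >= 1.0 / alpha:  # enough evidence: stop and reject the null
            return True, step, trajectory
    # Data exhausted without rejection; the final wealth is still valid evidence.
    return False, len(trajectory), trajectory


if __name__ == "__main__":
    rejected, steps, _ = sequential_betting_test([0.5] * 50, alpha=0.05, lam=0.5)
    print(f"rejected={rejected} after {steps} observations")
```

Because the guarantee holds at every step, a run that ends without rejection can be continued later by multiplying further factors onto the last wealth value, which is the optional-continuation property the abstract refers to.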


## Key Files (in order)

1) `./main/raw_values.py`
   - Purpose: generates raw per-token similarity from the dataset, together with `log_probability` (used only for reference models and plots).
   - Key function: `raw_values_batch`
     - Computes token-level similarity (per token) in batches.
     - Also collects per-token log-probabilities.
     - Outputs raw, per-token features for downstream use.

2) `./main/di.py`
   - After similarity is generated, this file builds model features ("mfeatures").
   - Consumes per-token similarities and prepares features for metric computation.

3) `./main/metrics.py`
   - Generates metrics.
   - Important functions:
     - `get_losses_from_dict`: estimates loss from token probabilities.
     - `aggregate_metrics`: given a list of estimated losses, computes all metrics (based on per-token similarity).
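
The loss-estimation and aggregation steps listed above can be pictured with the following minimal sketch. It assumes that the per-token similarity has already been mapped to a probability-like score in (0, 1]. The helper names `estimate_loss` and `summarize_losses`, the clipping constant `EPS`, and the chosen summary statistics are hypothetical and do not mirror the signatures or outputs of `get_losses_from_dict` or `aggregate_metrics`.

```python
import math
from statistics import mean
from typing import Dict, List, Sequence

EPS = 1e-6  # assumed lower clip so that log() stays finite


def estimate_loss(token_probs: Sequence[float]) -> List[float]:
    """Per-token negative log-likelihood from probability-like scores."""
    return [-math.log(min(max(p, EPS), 1.0)) for p in token_probs]


def summarize_losses(losses: Sequence[float], k_frac: float = 0.2) -> Dict[str, float]:
    """Aggregate per-token losses into a few sequence-level statistics."""
    if not losses:
        raise ValueError("losses must not be empty")
    avg = mean(losses)
    # Mean over the k highest-loss (least likely) tokens, at least one token.
    k = max(1, int(len(losses) * k_frac))
    hardest = sorted(losses, reverse=True)[:k]
    return {
        "mean_loss": avg,
        "perplexity": math.exp(avg),
        "max_k_loss": mean(hardest),
    }


if __name__ == "__main__":
    per_token = estimate_loss([1.0, 0.5, 0.25])
    print(summarize_losses(per_token))
```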

## BERT-score-based workflow
- `./core/get_predic.py`: generates model responses for the original prefix and for the perturbed prefix.
- `./main/baseline.py`: computes BERT-based metrics.
 
## Scoring by betting (e-value)
- `./main/linear_di_e_vals.py`: runs a sequential ablation framework that applies online Kernel MMD hypothesis testing to precomputed language model metrics in order to detect distributional differences across data splits. It applies normalization and outlier handling, trains simple classifiers, and writes detailed CSV reports with statistical metrics and performance traces.
- `./main/aggregate_evalue_results.py`: aggregates sequential testing results by summarizing performance metrics and generating wealth trajectory plots (with confidence intervals and significance thresholds) for each dataset, saving outputs per dataset along with a shared legend at the root.
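
To give a feel for how wealth trajectories from repeated trials might be summarized against a significance threshold, here is a small sketch. It is an assumption-laden illustration rather than the logic of `aggregate_evalue_results.py`: the helpers `first_crossing` and `summarize_trials` are hypothetical, and a trial is counted as a rejection when its wealth first reaches `1 / alpha`.

```python
from statistics import median
from typing import Dict, List, Optional, Sequence


def first_crossing(trajectory: Sequence[float], alpha: float) -> Optional[int]:
    """Return the 1-based step at which wealth first reaches 1/alpha, else None."""
    threshold = 1.0 / alpha
    for step, wealth in enumerate(trajectory, start=1):
        if wealth >= threshold:
            return step
    return None


def summarize_trials(
    trajectories: Sequence[Sequence[float]], alpha: float
) -> Dict[str, Optional[float]]:
    """Rejection rate and median stopping time over independent trials."""
    if not trajectories:
        raise ValueError("at least one trajectory is required")
    crossings: List[Optional[int]] = [first_crossing(t, alpha) for t in trajectories]
    hits = [c for c in crossings if c is not None]
    return {
        "rejection_rate": len(hits) / len(trajectories),
        "median_stopping_time": float(median(hits)) if hits else None,
    }


if __name__ == "__main__":
    trials = [[1.0, 5.0, 25.0, 120.0], [1.0, 2.0, 3.0, 4.0], [30.0, 40.0, 50.0, 60.0]]
    for alpha in (0.05, 0.01):
        print(alpha, summarize_trials(trials, alpha))
```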

## Metric Types
- Token-level black-box features: computed via `./main/metrics.py`, `./main/di.py`, and `./main/raw_values.py` (per-token similarity based).
- Sequence-level black-box features: computed via `./core/get_predic.py` and `./main/baseline.py` (BERT-based).

## Pipeline

First, create a Python environment:

```bash
python3 -m venv .badi  # Create a local virtual environment.
source .badi/bin/activate  # Activate it (repeat in any new shell before running steps below).
pip install -r requirements.txt  # Install project dependencies.
python -m spacy download en_core_web_sm # Download and install the small English core model for spaCy
```

Process the dataset:

```bash
python3 ./core/download_dataset.py ./data/data # first download the dataset 

python3 ./core/split_dataset.py --input-dir ./data/data --output-dir ./data/full_split_data \
   --tokenizer "EleutherAI/pythia-410m-deduped" --min-tokens 164 --max-val-train

python3 ./core/split_text_prefix_suffix.py \
    --input-dir ./data/full_split_data \
    --output-dir ./data/full_prefix_suffix_data \
    --model-name "EleutherAI/pythia-410m-deduped" \
    --suff-tokens 64 \
    --max-val-train

```
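
To make the prefix/suffix split more concrete, the sketch below shows one way such a split could work. It rests on assumptions that are not confirmed by the commands above: that the suffix is the last `--suff-tokens` tokens of a sample, and that samples shorter than `--min-tokens` are skipped. A whitespace tokenizer stands in for the Pythia tokenizer, and `split_prefix_suffix` is a hypothetical helper, not a function from this repository.

```python
from typing import List, Optional, Tuple


def split_prefix_suffix(
    tokens: List[str], suff_tokens: int = 64, min_tokens: int = 164
) -> Optional[Tuple[List[str], List[str]]]:
    """Split a token list into (prefix, suffix); return None if it is too short.

    Assumption: the suffix is the final `suff_tokens` tokens and everything
    before it is the prefix.
    """
    if suff_tokens <= 0 or min_tokens <= suff_tokens:
        raise ValueError("need 0 < suff_tokens < min_tokens")
    if len(tokens) < min_tokens:
        return None  # sample too short to yield a useful prefix
    return tokens[:-suff_tokens], tokens[-suff_tokens:]


if __name__ == "__main__":
    # Whitespace tokenization is only a stand-in for a real tokenizer.
    sample = " ".join(f"w{i}" for i in range(200)).split()
    result = split_prefix_suffix(sample)
    if result is not None:
        prefix, suffix = result
        print(len(prefix), len(suffix))
```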

Process the dataset with perturbations:

Note that torch 2.3 is required for this step; otherwise torchtext will not be compatible with PyTorch. After running `split_text_prefix_trans_suffix.py`, reinstall the PyTorch version pinned in `requirements.txt`.

```bash
python3 ./core/split_text_prefix_trans_suffix.py \
    --input-dir ./data/full_split_data/ \
    --output-dir ./data/full_prefix_trans_suffix_data \
    --model-name "EleutherAI/pythia-410m-deduped" \
    --suff-tokens 64 \
    --max-val-train
```

For sequence-level metrics:

```bash
# generate the model predictions
python3 ./core/get_predic.py \
   --data_dir ./data/full_prefix_trans_suffix_data \
   --n_suff 64 \
   --model_name "EleutherAI/pythia-410m-deduped" \
   --batch_size 18 \
   --max-val-train \
   --use-transformations

python3 ./core/get_predic.py \
   --data_dir ./data/full_prefix_trans_suffix_data \
   --n_suff 64 \
   --model_name "EleutherAI/pythia-410m-deduped" \
   --batch_size 18 \
   --max-val-train

# run BERT scores for transformations
python3 ./main/baseline.py \
   --model_name EleutherAI/pythia-410m-deduped \
   --data_dir ./data/full_prefix_trans_suffix_data/processed_predic/ \
   --result_output ./data/results/full_baseline \
   --n_suff 64 \
   --metrics_folder ./data/metrics/bert \
   --max-val-train \
   --between-predictions-bert-score

# run BERT score for the standard test
python3 ./main/baseline.py \
    --model_name EleutherAI/pythia-410m-deduped \
    --data_dir ./data/full_prefix_suffix_data/processed_predic/ \
    --result_output ./data/results/full_baseline \
    --n_suff 64 \
    --metrics_folder ./data/metrics/bert \
    --max-val-train

```

For token-level metrics:

```bash
# 1. step - generate raw_values
python3 ./main/batch_raw_values.py \
        --data_dir ./data/full_split_data \
        --model_name EleutherAI/pythia-410m-deduped \
        --result_output ./data/raw_values \
        --cache_dir ~/.cache \
        --max-val-train

# 2. step - generate metrics
python3 ./main/batch_di.py \
        --data_dir ./data \
        --model_name EleutherAI/pythia-410m-deduped \
        --result_output ./data/metrics/final_metrics/metrics_sigmoid \
        --cache_dir ~/.cache \
        --max-val-train \
        --loss_estimation_method "sigmoid" \
        --reference_model_names ""

# 2.1 merge token-level metrics with sequence-level metrics
python3 "./utils/merge_metrics.py" ./data/metrics/bert ./data/metrics/final_metrics/metrics_sigmoid EleutherAI/pythia-410m-deduped

# 3a. step - for e-values
python3 ./main/linear_di_e_vals.py \
  --dataset_name wikipedia \
  --model_name EleutherAI/pythia-410m-deduped \
  --lambda_max 0.8 \
  --num_trials 1000 \
  --online_epochs 30 \
  --metrics_path ./data/metrics/final_metrics/metrics_sigmoid/EleutherAI_pythia-410m-deduped \
  --output_dir ./data/results/final_results/e_values

python3 ./main/aggregate_evalue_results.py \
  --root ./data/results/final_results/e_values \
  --alpha1 0.05 \
  --alpha2 0.01

# 3b. step - for p-values
python3 ./main/batch_linear_di.py \
          --results_dir ./data/metrics/final_metrics/metrics_sigmoid \
          --output_dir ./data/results/final_results/stats_sigmoid \
          --model_name EleutherAI/pythia-410m-deduped \
          --num_random 4 \
          --percent_to_train 0.5 \
          --outliers "mean" \
          --no-test 

python3 ./utils/plot_linear_di_heatmap.py \
          --base_dir ./data/results/final_results/stats_sigmoid \
          --pvalues_subdir "p_values/mean-outliers/train-normalize-selected_features/EleutherAI_pythia-410m-deduped" \
          --output_name "linear_di_heatmap.pdf" \
          --number_to_combine 4 \
          --title "heatmap"
```

This code is inspired by the work from this repository: [pratyushmaini/llm_dataset_inference](https://github.com/pratyushmaini/llm_dataset_inference). Many thanks to pratyushmaini for the valuable contribution and inspiration.