# RL Training and Attacking Framework for LLM Search Agents

This repository contains scripts and tools for training language models to search with reinforcement learning (PPO) and for developing prompt- and prefill-based attacks against them.

## Script Locations

### Retrieval Scripts
- **Local Retrieval Setup**: `example/retriever/` - Scripts for launching different types of retrievers (BM25, e5, etc.)
- **Retrieval Documentation**: `docs/retriever.md` - Detailed guide on setting up retrieval systems

### PPO Training Scripts
- **Main Training Scripts**: `scripts/nq_hotpotqa/` - PPO training scripts for different model versions
- **Multinode Training**: `example/multinode/` - Scripts for distributed training across multiple nodes
- **Training Configuration**: `verl/trainer/config/` - YAML configuration files for training

### Attack Scripts
- **Local Attack Scripts**: `attack_scripts/infer_*_local_ppo_scripts/` - Scripts for running attacks with local search
- **Web Attack Scripts**: `attack_scripts/infer_*_web_ppo_scripts/` - Scripts for running attacks with web search
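
The two directory patterns above differ only in the search mode (`local` or `web`), so the matching directories can be listed with a glob. The following is a minimal sketch using only the standard library; the helper name `find_attack_script_dirs` is hypothetical and not part of this repository.

```python
from pathlib import Path


def find_attack_script_dirs(repo_root, search_mode):
    """Return the sorted attack-script directories for one search mode.

    search_mode is "local" or "web", matching the two patterns
    attack_scripts/infer_*_local_ppo_scripts/ and
    attack_scripts/infer_*_web_ppo_scripts/.
    This helper is an illustration, not a script shipped with the repo.
    """
    if search_mode not in ("local", "web"):
        raise ValueError(f"unknown search mode: {search_mode!r}")
    pattern = f"infer_*_{search_mode}_ppo_scripts"
    base = Path(repo_root) / "attack_scripts"
    # Keep directories only; stray files matching the pattern are ignored.
    return sorted(p for p in base.glob(pattern) if p.is_dir())


if __name__ == "__main__":
    for mode in ("local", "web"):
        for directory in find_attack_script_dirs(".", mode):
            print(mode, directory)
```

Run from the repository root, this prints one line per matching directory, which is a quick way to see which attack variants are available for each search mode.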

### Evaluation Scripts
- **Evaluation Scripts**: `eval_scripts/` - Scripts for evaluating three safety metrics: refusal, answer safety, and search safety
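
To illustrate how three per-example safety judgments could be summarized, here is a small sketch. It assumes each metric reduces to a boolean per example; the record keys `refused`, `answer_safe`, and `search_safe` and the function `safety_rates` are hypothetical and do not describe the actual output format of the scripts in `eval_scripts/`.

```python
def safety_rates(records):
    """Return the fraction of examples flagged True for each metric.

    Each record is a dict with three boolean fields (assumed names):
    "refused", "answer_safe", "search_safe".
    """
    metrics = ("refused", "answer_safe", "search_safe")
    if not records:
        raise ValueError("no records to aggregate")
    total = len(records)
    return {m: sum(bool(r[m]) for r in records) / total for m in metrics}


if __name__ == "__main__":
    # Made-up judgments for illustration only; not real results.
    example = [
        {"refused": True, "answer_safe": True, "search_safe": True},
        {"refused": False, "answer_safe": True, "search_safe": False},
    ]
    print(safety_rates(example))
```

Reporting the three rates separately keeps a model that refuses but still issues unsafe searches distinguishable from one that is safe on all three counts.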

### Datasets
- **Harmful Instruction Dataset**: `harmful_dataset/` - Contains 299 harmful instructions used for testing model safety

## Getting Started

1. **Setup Retrieval**: Follow the guide in `docs/retriever.md` to set up your retrieval system
2. **Training**: Use scripts in `scripts/nq_hotpotqa/` to train your search models
3. **Attacking**: Use the scripts in `attack_scripts/` to attack trained models through prompt changes or prefills
4. **Evaluation**: Run the scripts in `eval_scripts/` to measure model safety on the three metrics: refusal, answer safety, and search safety
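
Before working through the steps above, it can help to confirm that the checkout contains the paths they refer to. The sketch below checks only the four paths named in this section; the function `missing_step_paths` is a hypothetical helper, not something provided by the repository.

```python
from pathlib import Path

# Paths named in the four Getting Started steps, in order.
STEP_PATHS = {
    "Setup Retrieval": "docs/retriever.md",
    "Training": "scripts/nq_hotpotqa",
    "Attacking": "attack_scripts",
    "Evaluation": "eval_scripts",
}


def missing_step_paths(repo_root="."):
    """Return {step: relative_path} for every step whose path is absent."""
    root = Path(repo_root)
    return {
        step: rel
        for step, rel in STEP_PATHS.items()
        if not (root / rel).exists()
    }


if __name__ == "__main__":
    missing = missing_step_paths()
    if missing:
        for step, rel in missing.items():
            print(f"{step}: missing {rel}")
    else:
        print("All Getting Started paths are present.")
```

An empty result means every referenced file or directory exists; it does not verify that the retriever, training, or evaluation dependencies are installed.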

## Requirements

See `requirements.txt` for Python dependencies and `VERL_README.md` for detailed installation instructions.