## Overview

BAPO is a post-training framework built on `verl`, a library for reinforcement learning with language models. It extends on-policy algorithms (e.g., GRPO, DAPO) with off-policy controls that enable efficient reuse of historical data and adaptive sample selection.


## 1. Core Off-Policy Components and Corresponding Functions of BAPO

BAPO's off-policy mechanism is built around key components that manage off-policy data storage, selection, and integration. Below are the core controls and their functionalities:


### 1.1 `OffPolicyDataSample`
A data structure to encapsulate individual off-policy samples, storing critical information for adaptive filtering:
- **Attributes**:
  - `uid`: Unique identifier for the sample (e.g., prompt ID).
  - `score`: Reward score of the sample.
  - `is_alpha`: Boolean indicating if the sample is generated by the "alpha" policy (reference policy).
  - `sample_data`: Raw data of the sample (e.g., prompt, response).
  - `timestamp`: Creation time, used for recency-based eviction.
- **Method**: `to_dict()` serializes the sample to a dictionary for storage or transmission.
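
The attribute list above maps naturally onto a dataclass. The following is a minimal sketch, not BAPO's actual class: the field types, the default timestamp, and the example values are assumptions made for illustration.

```python
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class OffPolicyDataSample:
    """Illustrative stand-in for the sample container described above."""

    uid: str                     # e.g., the prompt ID
    score: float                 # reward score of the sample
    is_alpha: bool               # True if produced by the "alpha" policy
    sample_data: Dict[str, Any]  # raw data such as prompt and response
    # Assumption: the creation time defaults to "now".
    timestamp: float = field(default_factory=time.time)

    def to_dict(self) -> Dict[str, Any]:
        """Serialize all fields to a plain dictionary."""
        return asdict(self)


# Hypothetical example values.
sample = OffPolicyDataSample(
    uid="prompt-0001",
    score=1.0,
    is_alpha=True,
    sample_data={"prompt": "2 + 2 = ?", "response": "4"},
)
record = sample.to_dict()
```

Here `record` is a plain dictionary with the five keys listed above, so it can be stored or sent without depending on the class itself.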


### 1.2 `OffPolicyGroupBuffer`
A buffer to manage a collection of `OffPolicyDataSample` instances, with mechanisms to control capacity and sample quality:
- **Key Methods**:
  - `add_score(uid_str, scores, is_alpha)`: Adds reward scores (from either alpha or current policy) to track sample quality.
  - `add_sample(uid_str, sample)`: Adds a new off-policy sample to the buffer, with automatic eviction when capacity is exceeded.
  - `_check_and_evict(target_range, buffer_range)`: Evicts samples to maintain buffer capacity, preferentially retaining samples in the target (high-quality) and buffer (difficult) ranges.
- **Purpose**: Maintains a balanced set of off-policy samples by limiting UID count, controlling per-UID sample numbers, and filtering based on reward ranges.


### 1.3 `OffPolicyGroupBuffer` (Core Off-Policy Management Component)  
`OffPolicyGroupBuffer` is the core component for managing off-policy data in BAPO. It handles storage, filtering, updating, and sampling of historical data to ensure the quality and diversity of off-policy data. Its key functionalities include:  

#### Data Storage Structure  
- Organizes data by unique identifiers (UIDs, typically corresponding to prompts), maintaining:  
  - `alpha_scores`: Scores of samples generated by the alpha policy (the rollout policy delayed by v steps).  
  - `current_scores`: Scores of samples generated by the current policy.  
  - `sample_data`: Raw sample data (e.g., prompts and responses).  
  - `uid_order`: Records the order in which UIDs are added, enabling recency-based eviction.  

#### Core Methods  
- `add_score(uid_str, scores, is_alpha)`: Adds sample scores (distinguishing between reference policy and current policy) and limits the maximum number of scores stored per UID (`max_scores_per_uid`).  
- `add_sample(uid_str, sample)`: Adds off-policy samples to the buffer, automatically enforces the maximum number of samples per UID (`max_samples_per_uid`), and triggers cleanup logic when needed.  
- `_check_and_evict(target_range, buffer_range)`: Cleans up samples when the buffer exceeds the maximum UID capacity (`max_total_uids`). Prioritizes retaining UIDs associated with "target range" (high-quality samples) and "buffer range" (hard samples) to balance data distribution.  
- `sample_groups_by_filter(filter_func, target_groups)`: Samples a specified number of sample groups from the buffer using a filter function, facilitating integration of off-policy data during training.  
- `get_metrics()`: Returns statistical information about the buffer (e.g., total UIDs, sample count, average score), supporting buffer status monitoring.  
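
To make the interplay of these methods concrete, here is a simplified toy buffer. It is a sketch under stated assumptions, not the BAPO implementation: the class name `ToyGroupBuffer`, the use of mean alpha scores to decide which UIDs are protected, the range values, the filter signature `(uid, alpha_scores)`, and the deterministic (insertion-order) selection are all hypothetical choices for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple


class ToyGroupBuffer:
    """Toy illustration of UID-keyed storage with range-aware eviction."""

    def __init__(self, max_total_uids=2, max_samples_per_uid=2,
                 max_scores_per_uid=4, target_range=(0.5, 1.0),
                 buffer_range=(0.0, 0.2)):
        self.max_total_uids = max_total_uids
        self.max_samples_per_uid = max_samples_per_uid
        self.max_scores_per_uid = max_scores_per_uid
        self.target_range = target_range  # assumed placeholder values
        self.buffer_range = buffer_range  # assumed placeholder values
        self.alpha_scores: Dict[str, List[float]] = {}
        self.current_scores: Dict[str, List[float]] = {}
        self.sample_data: Dict[str, List[Any]] = {}
        self.uid_order: List[str] = []    # oldest UID first

    def add_score(self, uid_str, scores, is_alpha):
        store = self.alpha_scores if is_alpha else self.current_scores
        kept = store.setdefault(uid_str, [])
        kept.extend(scores)
        del kept[:-self.max_scores_per_uid]   # keep only the newest scores

    def add_sample(self, uid_str, sample):
        if uid_str not in self.sample_data:
            self.sample_data[uid_str] = []
            self.uid_order.append(uid_str)
        group = self.sample_data[uid_str]
        group.append(sample)
        del group[:-self.max_samples_per_uid]  # keep only the newest samples
        self._check_and_evict(self.target_range, self.buffer_range)

    def _protected(self, uid, target_range, buffer_range):
        scores = self.alpha_scores.get(uid, [])
        if not scores:
            return False
        mean = sum(scores) / len(scores)
        return any(lo <= mean <= hi for lo, hi in (target_range, buffer_range))

    def _check_and_evict(self, target_range, buffer_range):
        # Evict the oldest unprotected UID first; fall back to the oldest UID.
        while len(self.uid_order) > self.max_total_uids:
            unprotected = [u for u in self.uid_order
                           if not self._protected(u, target_range, buffer_range)]
            victim = unprotected[0] if unprotected else self.uid_order[0]
            self.uid_order.remove(victim)
            for store in (self.sample_data, self.alpha_scores, self.current_scores):
                store.pop(victim, None)

    def sample_groups_by_filter(
        self, filter_func: Callable[[str, List[float]], bool], target_groups: int
    ) -> List[Tuple[str, List[Any]]]:
        # Deterministic for clarity; a real buffer may sample randomly.
        picked = [(u, self.sample_data[u]) for u in self.uid_order
                  if filter_func(u, self.alpha_scores.get(u, []))]
        return picked[:target_groups]

    def get_metrics(self):
        scores = [s for store in (self.alpha_scores, self.current_scores)
                  for values in store.values() for s in values]
        return {
            "total_uids": len(self.uid_order),
            "total_samples": sum(len(g) for g in self.sample_data.values()),
            "avg_score": sum(scores) / len(scores) if scores else 0.0,
        }


buf = ToyGroupBuffer(max_total_uids=2)
for uid, scores in [("a", [1.0, 0.5]), ("b", [0.25, 0.25]), ("c", [0.0, 0.25])]:
    buf.add_score(uid, scores, is_alpha=True)
    buf.add_sample(uid, {"prompt": uid, "response": "..."})

metrics = buf.get_metrics()
hard = buf.sample_groups_by_filter(
    lambda uid, s: bool(s) and sum(s) / len(s) < 0.5, target_groups=1
)
```

In this toy run, UID `b` is evicted when `c` arrives because its mean score falls in neither range, while `a` (target range) and `c` (buffer range) are retained.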


#### Purpose  
By dynamically maintaining high-quality and hard samples, `OffPolicyGroupBuffer` provides stable and effective data support for BAPO's off-policy learning, balancing data utilization and training stability.


## 2. How to Train with BAPO

### 2.1 Prerequisites
- Install dependencies (refer to `verl` framework requirements).
- Prepare training/validation data (in Parquet format) and a pre-trained base model (e.g., Qwen).


### 2.2 Key Configuration Parameters
BAPO's off-policy behavior is controlled via configuration parameters (see `config/ppo_trainer.yaml` and training scripts):

- `algorithm.enable_off_policy_samples`: Set to `True` to allow using off-policy data.
- `algorithm.enable_off_policy_rollout`: Set to `True` to generate off-policy rollouts.
- `algorithm.max_total_uids`: Maximum number of unique prompts (UIDs) to retain in the off-policy buffer (e.g., 256).
- `algorithm.off_policy_update_freq`: Frequency (in iterations) of updating the off-policy buffer (e.g., 5).
- `algorithm.adv_estimator`: Set to `grpo_adaptive_filter` to enable BAPO's off-policy logic.
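
As a rough sketch, the parameters above can be collected in one place and rendered as `key=value` overrides. The helper `to_cli_overrides` is hypothetical, and the assumption that the training scripts accept overrides in this form should be checked against the provided scripts; the values shown are the example values from the list above.

```python
from typing import Any, Dict, List

# Example values taken from the parameter list above.
off_policy_overrides: Dict[str, Any] = {
    "algorithm.adv_estimator": "grpo_adaptive_filter",
    "algorithm.enable_off_policy_samples": True,
    "algorithm.enable_off_policy_rollout": True,
    "algorithm.max_total_uids": 256,
    "algorithm.off_policy_update_freq": 5,
}


def to_cli_overrides(overrides: Dict[str, Any]) -> List[str]:
    """Render a flat dict as 'key=value' strings (hypothetical helper)."""
    return [f"{key}={value}" for key, value in overrides.items()]


cli_args = to_cli_overrides(off_policy_overrides)
for arg in cli_args:
    print(arg)
```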


### 2.3 Training Script
Use the provided script to start BAPO training (example for math tasks with DeepSeek-1.5B):

```bash
# Run from the directory that contains the script.
bash run_deepseek-1.5b_math_off_policy.sh
```


## Notes
- For on-policy baseline training, use `algorithm.adv_estimator=grpo` and disable off-policy flags (see `run_deepseek-1.5b_math_on_policy.sh`).
- Adjust `max_total_uids` and `off_policy_update_freq` based on dataset difficulty and computational resources.

