# Improving Blackbox LLMs with Feedback Reinforcement Learning

## Abstract

Optimizing blackbox large language models (LLMs) presents significant challenges. Existing chain-of-thought (CoT) prompting and feedback mechanisms help occasionally, but these approaches still suffer from unreliable feedback and fail to leverage training data. In this work, we propose Feedback Reinforcement Learning (FRL): training a separate feedback model through reinforcement learning to improve the main blackbox LLM.

FRL divides self-correction into two stages: our trained feedback model identifies errors and generates corresponding feedback, while the blackbox LLM generates corrections based on this feedback. During training, a fixed pretrained model produces initial responses, the feedback model generates feedback rollouts for them, and the fixed model then produces revised responses conditioned on that feedback. The improvement between initial and revised responses serves as the reward signal. This approach treats the solver model as a blackbox and optimizes it with a separate feedback provider, enabling targeted improvement without modifying the base model.
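The training signal described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function names, the prompt templates, and the `exact_match` scorer are all assumptions made for the example, and each model is stood in for by a plain text-to-text function.

```python
from typing import Callable

# Hypothetical aliases: a "model" here is any function from prompt text to output text.
TextModel = Callable[[str], str]
Scorer = Callable[[str, str], float]  # (response, reference_answer) -> score


def exact_match(response: str, reference: str) -> float:
    """Assumed scorer: 1.0 if the response ends with the reference answer, else 0.0."""
    return 1.0 if response.strip().endswith(reference.strip()) else 0.0


def feedback_rollout_reward(
    question: str,
    reference: str,
    solver: TextModel,
    feedback_model: TextModel,
    score: Scorer = exact_match,
) -> float:
    """Reward for one feedback rollout: score(revised) - score(initial)."""
    # The fixed blackbox solver produces an initial response.
    initial = solver(question)
    # Stage 1: the trainable feedback model critiques the initial response.
    feedback = feedback_model(
        f"Question: {question}\nResponse: {initial}\nFeedback:"
    )
    # Stage 2: the same fixed solver revises its answer given the feedback.
    revised = solver(
        f"Question: {question}\nPrevious response: {initial}\n"
        f"Feedback: {feedback}\nRevised response:"
    )
    # Only the improvement is rewarded; the solver itself is never updated.
    return score(revised, reference) - score(initial, reference)


# Toy stand-ins for the two models (illustrative only).
def toy_solver(prompt: str) -> str:
    # Wrong at first, correct once the prompt contains the right hint.
    return "The answer is 4" if "should be 4" in prompt else "The answer is 5"


def toy_feedback(prompt: str) -> str:
    return "2 + 2 should be 4, not 5."


def unhelpful_feedback(prompt: str) -> str:
    return "Looks fine."


reward = feedback_rollout_reward("What is 2 + 2?", "4", toy_solver, toy_feedback)
no_gain = feedback_rollout_reward("What is 2 + 2?", "4", toy_solver, unhelpful_feedback)
print(reward, no_gain)
```

In this sketch, feedback that leads the solver to a correct revision earns a positive reward, while feedback that leaves the answer unchanged earns zero, so the feedback model is credited only for changes it actually causes in the solver's output.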

We evaluate FRL on generated Sudoku puzzles, GSM8K, and MMLU-STEM questions, where it consistently improves on the initial language model's performance, by 10% on average. By leveraging training data through reinforcement learning, our method outperforms both non-learning self-correction approaches and oracle-based verification methods.

## Installation

### Prerequisites
- Python 3.8+
- CUDA-compatible GPU (for training)
- Hugging Face Accelerate (`accelerate`) for multi-GPU training

### Setup
Install dependencies:
```bash
pip install -r requirements.txt
```

## Usage

### Feedback Inference
Run feedback inference on test datasets:

```bash
python src/feedback_inference.py \
    --initial_model_name "Qwen/Qwen2.5-1.5B" \
    --verifier_model_name "Qwen/Qwen2.5-1.5B" \
    --dataset_name "GSM" \
    --stop 9999999 \
    --max_new_tokens 1024 \
    --output_dir "experiment/" \
    --prompt_dir "prompts/base_model/feedback_GSM" \
    --skip_if_verified \
    --split "test" \
    --sudoku_num_prefilled 12
```

### Training FRL Models
Train a feedback model using FRL with multiple GPUs:

```bash
accelerate launch --config_file path_to_your_config.yaml src/server_train.py \
    --prompt_dir "$PROMPT_DIR" \
    --stop "$STOP" \
    --base_model_name "$BASE_MODEL_NAME" \
    --train_model_name "$TRAIN_MODEL_NAME" \
    --output_dir "$OUTPUT_DIR" \
    --n_epoch "$N_EPOCH" \
    --learning_rate "$LEARNING_RATE" \
    --batch_size "$BATCH_SIZE" \
    --gradient_accumulation_steps "$GRADIENT_ACCUMULATION_STEPS" \
    --num_generations "$NUM_GENERATIONS" \
    --temperature "$TEMPERATURE" \
    --max_new_tokens "$MAX_NEW_TOKENS" \
    --max_seq_length "$MAX_SEQ_LENGTH" \
    --how_many_checkpoints "$HOW_MANY_CHECKPOINTS" \
    --reward_to_use "$REWARD_TO_USE" \
    --beta "$BETA"
```
