## Part 1: Main Workflow & Validation Scripts

These five Python scripts form a complete pipeline for benchmarking and validating Reward Models (RMs) for Large Language Models. The pipeline covers data generation, benchmark curation, automated evaluation, and downstream task validation.

### 1. Tiered Candidate Generation (`data_generation.py`)

-   **Functionality**: This is the starting point of the pipeline. It takes a set of user prompts and uses a generator model (e.g., GPT-4o) to produce multiple candidate responses for each prompt.
-   **Key Feature**: For each prompt, it deliberately generates answers in three quality tiers (LOW, MEDIUM, HIGH), each at several temperature settings.
-   **Purpose**: To create a large, diverse candidate pool with a controlled quality gradient and stylistic variety, which serves as the foundation for building a robust benchmark.
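
As a rough illustration of the tier-by-temperature expansion described above, the sketch below builds the list of generation jobs without calling any model. The tier instructions, the temperature values, and the names `GenerationJob` and `build_jobs` are assumptions made for this example; they are not taken from `data_generation.py`.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical tier instructions; the real prompts are not shown in this README.
TIER_INSTRUCTIONS = {
    "LOW": "Write a weak answer with noticeable errors or omissions.",
    "MEDIUM": "Write an acceptable but unremarkable answer.",
    "HIGH": "Write the best, most complete answer you can.",
}
TEMPERATURES = (0.3, 0.7, 1.0)  # assumed values


@dataclass(frozen=True)
class GenerationJob:
    prompt_id: str
    tier: str
    temperature: float
    system_message: str
    user_message: str


def build_jobs(prompts, tiers=TIER_INSTRUCTIONS, temperatures=TEMPERATURES):
    """Expand every prompt into one job per (tier, temperature) pair."""
    jobs = []
    for (prompt_id, text), tier, temp in product(prompts.items(), tiers, temperatures):
        jobs.append(GenerationJob(prompt_id, tier, temp, tiers[tier], text))
    return jobs


jobs = build_jobs({"p1": "Explain what a reward model is."})
print(len(jobs))  # one prompt x 3 tiers x 3 temperatures
```

Each job would then be sent to the generator model, and the tier label stored alongside the response so later stages can check the intended quality gradient.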

### 2. Core Benchmark Curation (`rvb_select_core_from_grouped.py`)

-   **Functionality**: This script samples from the full candidate pool to create a smaller, representative, and challenging public benchmark subset named `Eval-Core`.
-   **Process**:
    1.  **Scoring & Normalization**: It loads scores from multiple RMs for all candidates and applies quantile normalization to make them comparable.
    2.  **Meta-Metric Calculation**: It computes two meta-metrics for each prompt: **Span** (the RM's ability to separate good/bad responses) and **Consensus** (the level of agreement among different RMs).
    3.  **Two-Stage Sampling**: It first selects prompts with high span, then performs stratified sampling to balance prompts with high RM consensus (easy cases) and low RM consensus (hard/ambiguous cases). Finally, it selects 9 diverse candidates for each chosen prompt, ensuring a balanced quality distribution.
-   **Purpose**: To build a high-quality, reproducible benchmark that effectively tests RM performance in both common and challenging scenarios.
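
The normalization and meta-metric steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's actual formulas: normalization is done as a percentile rank within each RM, Span is the mean per-RM range of normalized scores within a prompt, and Consensus is one minus the mean cross-RM standard deviation. The function names and the toy scores are hypothetical.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def quantile_normalize(scores):
    """Map raw scores to (0, 1] by percentile rank within one RM."""
    ordered = sorted(scores)
    n = len(ordered)
    return [bisect_right(ordered, s) / n for s in scores]


def prompt_meta_metrics(prompt_ids, raw_scores):
    """raw_scores: {rm_name: [score per candidate]}, aligned with prompt_ids."""
    normalized = {rm: quantile_normalize(s) for rm, s in raw_scores.items()}

    # Group candidate indices by the prompt they belong to.
    by_prompt = {}
    for idx, pid in enumerate(prompt_ids):
        by_prompt.setdefault(pid, []).append(idx)

    metrics = {}
    for pid, idxs in by_prompt.items():
        # Span (assumed definition): per-RM range of normalized scores, averaged over RMs.
        spans = [
            max(norm[i] for i in idxs) - min(norm[i] for i in idxs)
            for norm in normalized.values()
        ]
        # Consensus (assumed definition): 1 minus the mean cross-RM spread per candidate.
        disagreement = [
            pstdev([norm[i] for norm in normalized.values()]) for i in idxs
        ]
        metrics[pid] = {"span": mean(spans), "consensus": 1.0 - mean(disagreement)}
    return metrics


prompt_ids = ["a", "a", "b", "b"]
raw_scores = {
    "rm1": [0.1, 0.9, 0.4, 0.5],
    "rm2": [-3.0, 5.0, 1.0, 0.0],
}
metrics = prompt_meta_metrics(prompt_ids, raw_scores)
print(metrics)
```

In this toy data, prompt `a` has a wide span and full agreement, while prompt `b` has a narrow span and the two RMs disagree on the ordering, so it would land in the low-consensus stratum.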

### 3. Automated Evaluation Pipeline (`rvb_eval.py`)

-   **Functionality**: An all-in-one script that automates the RM evaluation process.
-   **Process**:
    1.  **Data Flattening**: Converts the complex JSON data structure into a simple JSONL format (one prompt-response pair per line), which is compatible with standard evaluation tools.
    2.  **Scoring**: Provides two modes for scoring the flattened data: using the `rewardbench` CLI for multiple RMs or running a local inference pass with `transformers`.
    3.  **Metric Aggregation**: Gathers all scores, groups them by prompt, and calculates variance metrics like Range (RSI) and P90-P10 spread, exporting the results to a CSV.
-   **Purpose**: To provide a convenient and repeatable workflow for assessing the variance characteristics of any RM on a given dataset.
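
A minimal sketch of the flattening and aggregation steps, using only the standard library. The grouped input layout (`prompt_id`, `prompt`, `candidates`) and the output field names are assumptions for illustration; the actual schema used by `rvb_eval.py` may differ. The range is the quantity the pipeline reports as RSI.

```python
import json
from statistics import quantiles


def flatten(grouped):
    """Yield one JSON line per prompt-response pair."""
    for item in grouped:
        for idx, cand in enumerate(item["candidates"]):
            yield json.dumps({
                "prompt_id": item["prompt_id"],
                "candidate_idx": idx,
                "prompt": item["prompt"],
                "response": cand["response"],
            })


def spread_metrics(scores):
    """Range and P90-P10 spread of one prompt's candidate scores."""
    deciles = quantiles(scores, n=10, method="inclusive")  # 9 cut points: P10..P90
    return {
        "range": max(scores) - min(scores),
        "p90_p10": deciles[8] - deciles[0],
    }


# Hypothetical grouped record with two candidates.
grouped = [{
    "prompt_id": "p1",
    "prompt": "Explain what a reward model is.",
    "candidates": [{"response": "Short answer."}, {"response": "Longer answer."}],
}]
lines = list(flatten(grouped))
print(len(lines))
print(spread_metrics([float(x) for x in range(11)]))
```

The P90-P10 spread is less sensitive to a single outlier score than the raw range, which is one reason to report both.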

### 4. Downstream RLHF Validation (`ppo_...py` & `grpo_...py`)

-   **Functionality**: These two scripts validate the practical utility of the benchmark's variance metrics using downstream RLHF (Reinforcement Learning from Human Feedback) tasks. They use two different algorithms: PPO and GRPO.
-   **Process**:
    1.  **RLHF Training**: A base language model is fine-tuned using either PPO or GRPO, with the RM being tested acting as the "teacher" that provides the reward signal.
    2.  **Performance Tracking**: Key convergence metrics like reward mean and KL divergence are logged throughout the training process.
    3.  **Joint Analysis**: After training, the scripts compute metrics on training efficiency (e.g., reward AUC, steps to a KL threshold, early learning slope). Crucially, they merge these "teaching effectiveness" metrics with the RM's pre-computed variance scores from the benchmark.
-   **Purpose**: To test empirically whether an RM's variance profile (as measured by the benchmark) is linked to how well it guides an RLHF process. A strong correlation would indicate that the benchmark's metrics are predictive of real-world training efficiency.
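
The training-efficiency metrics named above can be computed from a logged training curve roughly as shown below. This sketch assumes a trapezoidal reward AUC, "steps to a KL threshold" as the first logged step at or above the threshold, and a least-squares slope over the first `k` logged points. The function names, the toy log, and the `variance_row` values are hypothetical.

```python
def reward_auc(steps, rewards):
    """Trapezoidal area under the reward-vs-step curve."""
    return sum(
        (steps[i + 1] - steps[i]) * (rewards[i] + rewards[i + 1]) / 2
        for i in range(len(steps) - 1)
    )


def steps_to_kl(steps, kls, threshold):
    """First logged step whose KL divergence reaches the threshold, else None."""
    for step, kl in zip(steps, kls):
        if kl >= threshold:
            return step
    return None


def early_slope(steps, rewards, k):
    """Least-squares slope of reward over the first k logged points."""
    xs, ys = steps[:k], rewards[:k]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    denom = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / denom


# Toy training log (made-up numbers for illustration only).
steps = [0, 10, 20, 30]
rewards = [0.0, 1.0, 1.5, 1.5]
kls = [0.0, 0.02, 0.06, 0.11]

efficiency = {
    "reward_auc": reward_auc(steps, rewards),
    "steps_to_kl": steps_to_kl(steps, kls, threshold=0.05),
    "early_slope": early_slope(steps, rewards, k=2),
}

# Join with the RM's pre-computed variance scores (placeholder values).
variance_row = {"rm": "example-rm", "range": 0.8, "p90_p10": 0.6}
joint_row = {**variance_row, **efficiency}
print(joint_row)
```

With one such row per RM, the correlation between the variance columns and the efficiency columns can be computed across RMs.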

## Part 2: Benchmark Construction & Core Analysis Scripts (in `rb_work_grouped/` subdirectory)

This specialized toolchain is responsible for building the official benchmark dataset and generating the final results and figures for the paper.

1.  **`rvb_select_core_from_grouped.py` - Benchmark Construction Script**
    * **Function**: Constructs the final `eval_core.jsonl` benchmark dataset from the large candidate pool using stratified sampling over span and consensus.
    * **Process**: It normalizes scores from multiple RMs, filters out low-quality/duplicate responses, and then performs stratified sampling on prompts based on two meta-metrics: **score span** (discriminative power) and **inter-model consensus** (agreement). This ensures the final dataset contains a balanced mix of "easy" and "hard" cases. It then selects a fixed number of candidates for each prompt, covering a clear quality gradient.

2.  **`rb_run_one.py` - Single-Model Evaluation Runner**
    * **Function**: A utility script that runs the evaluation for a single RM on the `eval_core.jsonl` dataset. It is designed for batch processing in automated or cluster environments: it skips jobs that have already completed and retries failed ones.

3.  **`rb_variance_pipeline.py` - Core Metric Analysis & Visualization Script**
    * **Function**: This is the primary script for generating the paper's final results. It processes all collected scores, calculates the core variance-oriented metrics (e.g., nGMD, SEI, DCI), computes a composite score to rank all RMs, and generates all key figures (leaderboard bar charts, variance profile scatter plots, task heatmaps, etc.).

4.  **`rvb_eval_core.py` - Auxiliary/Preliminary Evaluation Script**
    * **Function**: A self-contained, simplified evaluation script. It focuses on more traditional statistical metrics like score Range (RSI) and Bandwidth (BW) and provides a detailed analysis of inter-model ranking consensus (Kendall's Tau). It's suitable for quick preliminary analysis or specific consensus studies.
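
To make the inter-model ranking consensus concrete, here is a minimal pure-Python sketch of Kendall's Tau between two RMs' scores for the same candidates, averaged over all RM pairs. It implements the tau-a variant (tied pairs count as neither concordant nor discordant), which is an assumption; `rvb_eval_core.py` may use a library routine or the tau-b variant instead. The function names and scores are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(a, b):
    """Kendall's tau-a between two equal-length score lists."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / pairs


def mean_pairwise_tau(scores_by_rm):
    """Average tau-a over every pair of RMs scoring the same candidates."""
    taus = [
        kendall_tau_a(scores_by_rm[x], scores_by_rm[y])
        for x, y in combinations(sorted(scores_by_rm), 2)
    ]
    return sum(taus) / len(taus)


# Toy scores for four candidates of one prompt.
scores_by_rm = {
    "rm1": [1.0, 2.0, 3.0, 4.0],
    "rm2": [1.0, 2.0, 4.0, 3.0],  # swaps the top two candidates
    "rm3": [4.0, 3.0, 2.0, 1.0],  # fully reversed ranking
}
print(kendall_tau_a(scores_by_rm["rm1"], scores_by_rm["rm2"]))
print(mean_pairwise_tau(scores_by_rm))
```

A value near 1 means the RMs rank the candidates almost identically, a value near -1 means they rank them in opposite order, and values near 0 indicate little agreement.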