# Judge's Verdict

A comprehensive framework for evaluating Large Language Models (LLMs) as judges for assessing the quality of AI-generated responses.

## Overview

This repository contains the evaluation pipeline for running Judge's Verdict on annotated datasets. The framework uses LiteLLM as the officially supported approach for interfacing with various LLM providers through a unified API.

**Supported Providers via LiteLLM:**
- OpenAI (GPT-4, GPT-4o, etc.)
- Anthropic (Claude models)
- NVIDIA NIM (Llama, Nemotron, Gemma, etc.)
- Local models (via vLLM or other compatible servers)
- Many other providers (see [LiteLLM docs](https://docs.litellm.ai/docs/providers))

> **⚠️ Internal Repository Note**  
> This repository contains internal configuration files and features that will not be included in the public release:
> - `config/judge_config_internal.yaml` - Internal testing configuration (will be removed)
> - Any references to internal models or endpoints
> 
> The public repository will only include the LiteLLM-based configurations.

## Features

- **Flexible Judge Configuration**: YAML-based configuration system for managing multiple judge models
- **Multiple Serving Frameworks**: Support for various LLM serving backends
- **Annotation Preparation**: Scripts for preparing and merging human annotations
- **Batch Processing**: Efficient batch evaluation with configurable workers
- **Token Usage Tracking**: Built-in token usage monitoring for cost analysis
- **Extensible Architecture**: Easy to add new judge models and metrics

## Installation

```bash
# Clone the repository
git clone https://github.com/your-username/judges-verdict-internal.git
cd judges-verdict-internal

# Install using Poetry (recommended)
poetry install

# Or install with pip
pip install -e .
```

## Project Structure

```
judges-verdict-internal/
├── llm_judge_benchmark/         # Core package
│   ├── scoring/                 # Judge scoring implementation
│   ├── utils/                   # Utility functions
│   ├── judge_config.py          # Judge configuration classes
│   └── judge_config_manager.py  # YAML config management
├── scripts/                     # Standalone scripts
│   └── data_prep.py            # Unified data preparation script
├── config/                      # Configuration files
│   ├── judge_config_internal.yaml   # Internal use only - will be removed in public repo
│   ├── judge_config_litellm.yaml
│   └── judge_config_litellm_example.yaml
├── data/                        # Data directory
│   ├── annotations_4.json      # Pre-existing annotations
│   ├── annotations_full.json   # Merged annotations (created by data_prep.py)
│   ├── annotations_sample.json # Sample annotations
│   ├── coral_annotations.json  # CORAL dataset annotations
│   └── dc767_annotations.json  # DC767 dataset annotations
├── results/                     # Evaluation results
└── docs/                        # Documentation
```

## Quick Start

### 1. Install dependencies

```bash
# Using Poetry (recommended)
poetry install

# Or using pip
pip install -e .
```

### 2. Set up API keys

The project uses the LiteLLM framework as the officially supported approach for judge scoring, which provides a unified interface for multiple LLM providers.

```bash
# For OpenAI models (gpt-4o, gpt-4, etc.)
export OPENAI_API_KEY="your-openai-api-key"

# For Anthropic models (claude-sonnet-4, etc.)
export ANTHROPIC_API_KEY="your-anthropic-api-key"

# For NVIDIA NIM models (llama, nemotron, gemma, etc.)
export NVIDIA_NIM_API_KEY="your-nvidia-nim-api-key"
export NVIDIA_NIM_API_BASE="https://integrate.api.nvidia.com/v1"  # or your custom endpoint

# For other providers supported by LiteLLM, see: https://docs.litellm.ai/docs/providers
```

Note: The LiteLLM framework automatically handles provider-specific authentication and API formatting.

### 3. Prepare the data

```bash
# Run the unified data preparation script
python scripts/data_prep.py
```

This will:
- Load pre-existing annotations from `annotations_4.json`
- Download CORAL dataset from HuggingFace and join with `coral_annotations.json`
- Download DC767 dataset from GitHub and join with `dc767_annotations.json`
- Merge all datasets into `annotations_full.json`

### 4. Run Judge Scoring

```bash
# Using the installed script
poetry run llm-judge-score \
    --annotation-file data/annotations_full.json \
    --output-dir results/

# Or run directly
python -m llm_judge_benchmark.scoring.llm_judge_scoring \
    --annotation-file data/annotations_full.json \
    --config config/judge_config_litellm.yaml \
    --output-dir results/
```


## Configuration

### Judge Configuration

Judges are configured using YAML files with LiteLLM as the primary framework. Each judge requires:

- `identifier`: Unique identifier for the judge (it's also used as the folder name for judge evaluation results)
- `framework`: Always use `litellm` for the officially supported approach
- `model`: Model name in LiteLLM format (e.g., `openai/gpt-4o`, `anthropic/claude-sonnet-4-20250514`, `nvidia_nim/meta/llama-3.1-70b-instruct`)
- `temperature`: Temperature setting (usually 0.0 for consistency)
- `max_tokens`: Maximum tokens for response
- `num_workers`: Number of parallel workers
- `timeout`: Request timeout in seconds

The model identifiers in the configuration YAML will be the folder names where results will be stored. For example:

```yaml
models:
  gpt-4o:  # This creates results in results/gpt-4o/
    framework: litellm
    model: openai/gpt-4o
    
  meta_llama-3.1-70b-instruct:  # This creates results in results/meta_llama-3.1-70b-instruct/
    framework: litellm
    model: nvidia_nim/meta/llama-3.1-70b-instruct
```

### Local Model Configuration

Local models are also supported by the `litellm` framework. When configuring a local model, you need to specify:
- `framework: litellm`
- `model`: The model name in litellm format (e.g., `hosted_vllm/...`)
- `base_url`: The URL of your local model server
- `api_key`: Usually set to "EMPTY" for local models

Example configuration for a local Qwen model:

```yaml
local/qwen-0.5b:
  framework: litellm
  model: hosted_vllm/Qwen/Qwen2-0.5B-Instruct
  base_url: http://localhost:8000/v1
  api_key: EMPTY
  num_workers: 1
```

This assumes you have a VLLM server running locally on port 8000 serving the Qwen2-0.5B-Instruct model.

The repository includes configuration files for the LiteLLM framework:
- `config/judge_config_litellm.yaml`: Main configuration file with all supported judge models using LiteLLM
- `config/judge_config_litellm_example.yaml`: Minimal example configuration with verified working models
- `config/judge_config_internal.yaml`: **Internal use only** - Legacy configuration for internal testing (will be removed in public repo)

### Environment Variables

The framework requires API keys to be set as environment variables. See the Quick Start section above for details on setting up API keys for different providers.

### LiteLLM Model Name Format

When configuring models in the YAML files, use the following format for model names:

- **OpenAI**: `openai/model-name` (e.g., `openai/gpt-4o`, `openai/gpt-4`)
- **Anthropic**: `anthropic/model-name` (e.g., `anthropic/claude-sonnet-4-20250514`)
- **NVIDIA NIM**: `nvidia_nim/provider/model-name` (e.g., `nvidia_nim/meta/llama-3.1-70b-instruct`)
- **Local Models**: `hosted_vllm/model-path` (e.g., `hosted_vllm/Qwen/Qwen2-0.5B-Instruct`)

For a complete list of supported providers and their formats, see the [LiteLLM Provider Documentation](https://docs.litellm.ai/docs/providers).

## Annotation Data Format

The framework expects annotation data in the following format:

```json
[
  {
    "item_name": "unique_identifier",
    "question": "The question text",
    "answer": "The AI-generated answer",
    "reference_answer": "The reference answer",
    "annotations": [
      {
        "annotator_id": "annotator_1",
        "score": 85,
        "feedback": "Optional feedback text"
      }
    ]
  }
]
```

The `data_prep.py` script handles downloading and merging data from multiple sources:
- **annotations_4.json**: Pre-existing annotations with questions and answers
- **coral_annotations.json**: Annotations for the CORAL dataset (questions/answers downloaded from HuggingFace)
- **dc767_annotations.json**: Annotations for the DC767 dataset (questions/answers downloaded from GitHub)

## Command Line Options

The main scoring script supports the following options:

```bash
llm-judge-score \
    --annotation-file PATH      # Path to annotation JSON file (default: ./data/annotations_full.json)
    --output-dir PATH          # Output directory for results (default: ./results/)
    --judges NAME [NAME ...]   # Specific judges to run (space-separated list, optional)
    --max-trials N             # Maximum trials per judge (default: 3)
    --default-workers N        # Default workers for judges not in config (default: 16)
    --config-file PATH         # Path to judge config YAML file (optional)
```

## Troubleshooting

- **Import errors**: Make sure to install the package with `poetry install` or `pip install -e .`
- **Missing API keys**: Set the appropriate environment variables for your judge models
- **Missing data files**: Run `python scripts/data_prep.py` to download and prepare all annotation data
- **File not found errors**: Ensure the data preparation script has been run to create `annotations_full.json`

## Extending the Framework

### Adding a New Judge Model

1. Add the model configuration to your YAML config file using the LiteLLM framework
2. Ensure you use the correct LiteLLM model name format (e.g., `provider/model-name`)
3. Set up the required API keys for your model provider

### Adding New Metrics

1. Create a new metric class following the RAGAS metric interface
2. Update the scoring script to use the new metric
3. Add configuration options as needed

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

