---
title: Judge's Verdict Leaderboard
emoji: ⚖️
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: apache-2.0
short_description: "Judge's Verdict: Benchmarking LLM as a Judge"
sdk_version: 5.19.0
---

# ⚖️ Judge's Verdict: LLM-as-a-Judge Leaderboard

A comprehensive benchmark for evaluating how well LLM judges align with human preferences when assessing AI-generated responses.

## 🎯 Overview

As Large Language Models (LLMs) are increasingly used to evaluate other AI systems, understanding their alignment with human judgment becomes critical. **Judge's Verdict** provides a systematic framework for measuring this alignment through:

- **Correlation Analysis**: Measuring how well judge scores correlate with human ratings
- **Cohen's Kappa**: Assessing inter-rater agreement beyond chance
- **Outlier Detection**: Identifying cases where judges significantly disagree with humans
- **Multi-Domain Evaluation**: Testing across diverse datasets and question types
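
As a rough illustration of the outlier-detection idea above, the sketch below flags items where a judge's score is far from the mean human score. The function name `find_outliers`, the 2-point threshold, and the example scores are assumptions made for illustration; the benchmark's actual outlier criterion is not specified in this README.

```python
def find_outliers(judge_scores, human_scores, threshold=2.0):
    """Return indices where the judge differs from the human mean by >= threshold.

    The threshold value is hypothetical, not the benchmark's setting.
    """
    if len(judge_scores) != len(human_scores):
        raise ValueError("judge_scores and human_scores must have the same length")
    return [
        i
        for i, (judge, human) in enumerate(zip(judge_scores, human_scores))
        if abs(judge - human) >= threshold
    ]
```

Calling `find_outliers([5, 1, 3, 4], [4.5, 4.0, 3.0, 1.5])` returns the indices of the second and fourth items, where the gap is 3.0 and 2.5 points respectively.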

## 📊 Leaderboard Metrics

Our leaderboard ranks LLM judges based on:

1. **Overall Correlation** (Pearson r): How well judge scores correlate with average human scores
2. **Overall Cohen's Kappa**: Agreement with human annotators accounting for chance
3. **Dataset-Specific Performance**: Correlation scores on each benchmark dataset
4. **Score Calibration**: How the average judge score compares with the average human score
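
The metrics listed above can be sketched in plain Python as follows. This is a minimal illustration, not the leaderboard's implementation: the function names are hypothetical, the kappa shown is the unweighted two-rater form (the README does not say whether a weighted variant is used), and calibration is assumed here to be a simple difference of means.

```python
import math
from collections import Counter


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters assigning categorical labels."""
    n = len(rater_a)
    if n != len(rater_b) or n == 0:
        raise ValueError("raters must label the same non-empty set of items")
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def calibration_gap(judge_scores, human_scores):
    """Mean judge score minus mean human score (positive: judge is more lenient)."""
    if not judge_scores or not human_scores:
        raise ValueError("score lists must be non-empty")
    return sum(judge_scores) / len(judge_scores) - sum(human_scores) / len(human_scores)
```

For example, with human labels `[1, 1, 0, 0]` and judge labels `[1, 1, 0, 1]`, observed agreement is 0.75 and chance agreement is 0.5, so `cohens_kappa` gives 0.5.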

## 🗂️ Benchmark Datasets

Judge's Verdict evaluates judges on 6 datasets:

- **CORAL**: Complex reasoning and analysis questions
- **DC767**: Domain-specific technical queries
- **EKRAG**: Business and enterprise Q&A
- **HotpotQA**: Multi-hop reasoning tasks
- **SQuAD**: Reading comprehension
- **TechQA**: Technical documentation Q&A

## 🤝 Contributing

We welcome contributions! You can help by:

- Adding new judge models to the leaderboard
- Improving the evaluation methodology
- Adding new datasets or metrics
- Enhancing visualizations

## 📬 Contact

- **Issues**: [GitHub Issues](https://github.com/NVIDIA/judges-verdict-internal/issues)
- **Discussions**: [HuggingFace Community](https://huggingface.co/spaces/NVIDIA/judges-verdict/discussions)

## 🙏 Acknowledgments

This work builds on the collaborative efforts of the AI evaluation community. Special thanks to all contributors and dataset creators.