# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES.
# All rights reserved.
# SPDX-License-Identifier: Apache-2.0

SCRIPT_CODE = """
# Evaluation Script

To evaluate your LLM-as-a-judge, you must use the [judges-verdict-internal](https://github.com/NVIDIA/judges-verdict-internal) repository:

```bash
# Clone the evaluation framework
git clone https://github.com/NVIDIA/judges-verdict-internal
cd judges-verdict-internal

# Install dependencies
pip install -e .

# Run evaluation
python -m llm_judge_benchmark.scoring.llm_judge_scoring \\
    --config your_judge_config.yaml \\
    --output-dir results/your-model-name/  # Must match config identifier
```

**Important:** The output directory name must exactly match your model identifier in the config file.

For detailed instructions and additional parameters, visit our [evaluation guide](https://github.com/NVIDIA/judges-verdict-internal).
"""

TITLE = "<h1 style='text-align: center; font-size: 40px;'>⚖️ Judge's Verdict: Benchmarking LLM as a Judge</h1>"

INTRO_TEXT = """
<div style='text-align: center; margin: 20px 0;'>
<p style='font-size: 20px; margin-bottom: 15px;'>
<strong>Judge's Verdict</strong> is a comprehensive benchmark for evaluating how well LLM judges align with human preferences when assessing AI-generated responses.
</p>
</div>
"""

MOTIVATION_TEXT = """
# 💡 Why Judge's Verdict?

As LLMs are increasingly used to evaluate other AI systems, understanding their alignment with human judgment becomes critical. **Judge's Verdict** provides:

- 📊 **Comprehensive Metrics**: Correlation analysis, Cohen's Kappa, and outlier detection to measure judge-human alignment
- 🎯 **Multi-Domain Coverage**: Evaluation across diverse datasets including technical Q&A, factual accuracy, and creative tasks
- 🔍 **Detailed Analysis**: Insights into where LLM judges agree and disagree with human annotators
- 🚀 **Easy Integration**: Simple APIs for evaluating new judge models

Our benchmark helps researchers and practitioners:
- Select the most human-aligned LLM judges for their use cases
- Understand the strengths and limitations of different judge models
- Develop better evaluation methods for AI systems
"""

SUBMISSION_INSTRUCTIONS = """
# 📝 How to Submit Your Judge Results

We welcome contributions to the Judge's Verdict leaderboard! Submit your LLM judge results by following these steps:

## 1. 🔧 Prepare Your Judge Model

Configure your LLM judge in a YAML configuration file:

```yaml
models:
  your-judge-identifier:  # This identifier MUST match your results folder name
    framework: litellm  # or "openai", "anthropic", etc.
    model: provider/model-name  # litellm-compatible format, e.g., "openai/gpt-4o", "nvidia_nim/meta/llama-3.1-70b-instruct"
    temperature: 0.0
    max_tokens: 8
    num_workers: 16
```

**Important:** The judge identifier must **exactly match** the folder name where your results will be stored.

## 2. ▶️ Run Evaluation

To evaluate your LLM judge, follow the instructions in the [judges-verdict-internal](https://github.com/NVIDIA/judges-verdict-internal) repository.

## 3. 📤 Submit Results

1. **Fork** this Hugging Face Leaderboard repository
2. **Add** your results to `benchmark/judge_results/your-judge-identifier/`
   - The folder name must **exactly match** your judge identifier from the config
3. **Include** the following files:
   - `trial1.json`, `trial2.json`, etc. - Raw scoring results from each trial
4. **Create a PR** with the title: `Add [Your Judge Identifier] judge results`

**Example Structure:**
```
benchmark/judge_results/
├── gpt-4o/              # Judge identifier from config
│   ├── trial1.json
│   ├── trial2.json
│   └── trial3.json
└── your-judge-identifier/     # Your judge identifier
    ├── trial1.json
    └── trial2.json
```
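
Before opening a PR, it can help to sanity-check the folder layout locally. The snippet below is a minimal sketch and is not part of the official tooling: the helper name `check_results_folder` is hypothetical, and it assumes only the layout shown above (a folder named after your judge identifier that contains `trial1.json`, `trial2.json`, and so on). It does not validate the contents of the trial files beyond checking that they parse as JSON.

```python
import json
import re
from pathlib import Path

# Matches trial1.json, trial2.json, ... (the naming used in the example structure).
TRIAL_PATTERN = re.compile(r"trial[0-9]+[.]json")


def check_results_folder(results_root, judge_identifier):
    # Hypothetical helper: returns a list of problems found (empty if none).
    folder = Path(results_root) / judge_identifier
    if not folder.is_dir():
        # The folder name must exactly match the judge identifier.
        return [f"missing folder: {folder}"]

    problems = []
    trial_files = sorted(
        p for p in folder.iterdir() if TRIAL_PATTERN.fullmatch(p.name)
    )
    if not trial_files:
        problems.append("no trialN.json files found")

    for path in trial_files:
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{path.name} is not valid JSON: {exc}")
    return problems
```

For example, `check_results_folder("benchmark/judge_results", "your-judge-identifier")` returns an empty list when the folder exists and every trial file parses.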

## 📋 Requirements

- Use [judges-verdict-internal](https://github.com/NVIDIA/judges-verdict-internal) for evaluation
- Run the evaluation on the complete Judge's Verdict dataset
- Provide a valid judge configuration file with a matching model identifier

Questions? Open an issue or contact us!
"""

ABOUT_TEXT = """
## 🎯 About Judge's Verdict

**Judge's Verdict** is a comprehensive benchmark designed to evaluate the alignment between LLM judges and human preferences. As LLMs increasingly serve as evaluators in AI systems, understanding their judgment quality becomes crucial.

### Key Features:

- **📊 Human Alignment Metrics**: Measure how closely LLM judges correlate with human annotators
- **🔍 Multi-Dimensional Analysis**: Evaluate agreement using Pearson correlation, Cohen's Kappa, and outlier detection
- **🌐 Diverse Datasets**: Test across multiple domains including factual accuracy, reasoning, and creative tasks
- **📈 Transparent Evaluation**: Full visibility into judge performance across different question types

### Why It Matters:

Using LLMs as judges offers scalability and consistency, but their alignment with human judgment varies. Judge's Verdict helps:
- **Select Better Judges**: Choose models that best align with human preferences for your use case
- **Understand Limitations**: Identify where LLM judges diverge from human assessments
- **Drive Research**: Advance the development of more human-aligned evaluation methods

### Evaluation Dimensions:

1. **Correlation Analysis**: How well do judge scores correlate with human ratings?
2. **Inter-Rater Agreement**: Using Cohen's Kappa to measure agreement beyond chance
3. **Outlier Detection**: Identifying cases where judges significantly disagree with humans
4. **Dataset-Specific Performance**: Understanding judge performance across different domains
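
To make the first two dimensions concrete, here is a minimal, standard-library sketch of Pearson correlation and unweighted Cohen's Kappa. It illustrates the textbook definitions only and is not necessarily how the benchmark computes its scores; the function names and the sample ratings are made up for illustration.

```python
from collections import Counter
from math import sqrt


def pearson_correlation(xs, ys):
    # Pearson r between two equal-length score lists.
    # Raises ZeroDivisionError if either list is constant.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def cohens_kappa(labels_a, labels_b):
    # Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the
    # observed agreement and p_e the agreement expected by chance.
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Made-up ratings for illustration only (not benchmark data).
human = [1, 0, 1, 1, 0, 1]
judge = [1, 0, 1, 0, 0, 1]
print(pearson_correlation(human, judge))
print(cohens_kappa(human, judge))
```

Kappa discounts the agreement two raters would reach by chance given their label frequencies, which is why it can be noticeably lower than raw percent agreement.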
"""

CITATION_TEXT = """
## 📚 Citation

If you use Judge's Verdict in your research, please cite:

```bibtex
@misc{judgesverdict2025,
  author = {Steve Han and Gilberto Titericz Junior and Tom Balough and Wenfei Zhou},
  title = {Judge's Verdict: Benchmarking LLM-as-a-Judge Alignment with Human Preferences},
  year = {2025},
  url = {https://github.com/nvidia/judges-verdict},
  note = {Version 1.0.0}
}
```

**Links**: 
- [GitHub Repository](https://github.com/NVIDIA/judges-verdict-internal)
- [Hugging Face Space](https://huggingface.co/spaces/NVIDIA/judges-verdict)
- Research Paper (coming soon)
"""
