# Submission Format

To submit your system to the CKM Benchmark, produce two artifacts:

1. **A per-topic summary JSON** at `results/<your-system>_summary.json`.
2. **Per-topic hypothesis files** following the schema below. These may be released alongside the summary or kept private: the leaderboard requires only the summary JSON, but spot-check audits may request the hypothesis files.

## 1. Summary JSON (required)

A single JSON file containing an array with one record per topic. Each record must include the following fields; additional fields are permitted but ignored by the leaderboard:

```json
[
  {
    "slug": "long_context_understanding_in_large_language_models",
    "name": "Long-context understanding in large language models",
    "duration": 2851.4,
    "yield": 17,
    "hit_rate": 17.6,
    "best_match_score": 4.46,
    "unique_hit_papers": 7,
    "init_tokens": 53000,
    "evolution_tokens": 442000,
    "total_generation_tokens": 495000
  },
  ...
]
```

Field definitions:

| Field | Type | Definition |
|---|---|---|
| `slug` | string | Topic identifier; must match `data/topics.csv` |
| `name` | string | Human-readable topic name |
| `duration` | float | Wallclock seconds for the full per-topic run |
| `yield` | int | Total hypotheses your system generated for this topic |
| `hit_rate` | float | Percentage (0-100) of hypotheses judged as hits by the canonical judge |
| `best_match_score` | float | Mean of best-match alignment scores across hypotheses |
| `unique_hit_papers` | int | Distinct future papers your system's hits matched |
| `init_tokens` | int | Tokens consumed in the initialization phase |
| `evolution_tokens` | int | Tokens consumed across all evolution windows |
| `total_generation_tokens` | int | Sum of generation tokens (excludes judge) |
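
The numeric fields above can be derived from per-hypothesis judgments. The following is a minimal sketch, not the benchmark's own aggregation code: the input keys `is_hit`, `best_match_score`, and `matched_papers` are hypothetical names for whatever your judging step emits, the rounding to one and two decimals simply mirrors the example record, and treating `total_generation_tokens` as `init_tokens + evolution_tokens` is an assumption that happens to agree with that example.

```python
from statistics import mean


def summarize_topic(judged, init_tokens, evolution_tokens):
    """Aggregate per-hypothesis judgments into the numeric summary fields.

    `judged` is a list of dicts with hypothetical keys:
      is_hit (bool), best_match_score (float), matched_papers (list of str).
    """
    n = len(judged)
    hits = [h for h in judged if h["is_hit"]]
    # Distinct papers matched by any hit, counted once each.
    papers = {p for h in hits for p in h["matched_papers"]}
    return {
        "yield": n,
        # Percentage of all generated hypotheses that were judged as hits.
        "hit_rate": round(100.0 * len(hits) / n, 1) if n else 0.0,
        # Mean over all hypotheses, not only over hits.
        "best_match_score": round(mean(h["best_match_score"] for h in judged), 2) if n else 0.0,
        "unique_hit_papers": len(papers),
        "init_tokens": init_tokens,
        "evolution_tokens": evolution_tokens,
        # Assumption: the total is the sum of the two phases (judge tokens excluded).
        "total_generation_tokens": init_tokens + evolution_tokens,
    }
```

With 17 hypotheses of which 3 are hits, this yields a `hit_rate` of 17.6, matching the example record.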

The summary must cover all 50 topics in `data/topics.csv`. Missing topics are treated as zero-yield.
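
Before opening a pull request, it may help to check the summary locally. The sketch below is an illustrative pre-submission check, not an official validator: `validate_summary` and `load_slugs` are hypothetical helper names, and `load_slugs` assumes that `data/topics.csv` has a header row with a `slug` column.

```python
import csv

# Required fields and their expected JSON types, taken from the table above.
REQUIRED_FIELDS = {
    "slug": str,
    "name": str,
    "duration": (int, float),
    "yield": int,
    "hit_rate": (int, float),
    "best_match_score": (int, float),
    "unique_hit_papers": int,
    "init_tokens": int,
    "evolution_tokens": int,
    "total_generation_tokens": int,
}


def load_slugs(topics_csv):
    """Read topic slugs; assumes a header row containing a 'slug' column."""
    with open(topics_csv, newline="", encoding="utf-8") as fh:
        return {row["slug"] for row in csv.DictReader(fh)}


def validate_summary(records, known_slugs):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    seen = set()
    for i, rec in enumerate(records):
        slug = rec.get("slug", f"<record {i}>")
        for field, expected in REQUIRED_FIELDS.items():
            if field not in rec:
                problems.append(f"{slug}: missing field '{field}'")
            # bool is a subclass of int in Python, so reject it explicitly.
            elif isinstance(rec[field], bool) or not isinstance(rec[field], expected):
                problems.append(f"{slug}: field '{field}' has unexpected type")
        if slug in seen:
            problems.append(f"{slug}: duplicate record")
        seen.add(slug)
        if slug not in known_slugs:
            problems.append(f"{slug}: slug not found in topics list")
        rate = rec.get("hit_rate")
        if isinstance(rate, (int, float)) and not 0 <= rate <= 100:
            problems.append(f"{slug}: hit_rate outside 0-100")
    for missing in sorted(set(known_slugs) - seen):
        problems.append(f"{missing}: no record; will be treated as zero-yield")
    return problems
```

A typical call would load the array with `json.load`, pass it together with `load_slugs("data/topics.csv")`, and print each returned problem; extra fields are left alone because the leaderboard ignores them.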

## 2. Hypothesis schema (recommended)

Each generated hypothesis should follow this schema (used by the canonical judge for re-grading and by readers reviewing your submission):

```markdown
# hyp-2024-11-015.md
## Statement
A novel RLHF framework will integrate adaptive entropy regularization
(H-DPO/SEE-DPO style) with theoretically grounded KL-regularization
and dynamic uncertainty-aware policy optimization, demonstrating enhanced
stability and reduced reward hacking compared to standard DPO or PPO.

## Research Claim
- Problem: RLHF and DPO methods often struggle with reward overoptimization,
  mode collapse, and instability...
- Method Delta: Integrate adaptive entropy regularization with theoretically
  grounded KL-regularization and dynamic uncertainty-aware policy optimization.
- Target Setting: LLM alignment (math, code, instruction following) and
  text-to-image diffusion alignment.
- Baseline: Standard DPO or PPO-based RLHF with fixed KL regularization.
- Expected Observable: Smoother reward curves, lower rates of repetitive/
  nonsensical high-reward outputs, higher pass@k, better diversity metrics.
- Evaluation Plan: Implement and evaluate on GSM8K, HumanEval, MMLU-Pro,
  IFEval (LLMs) and Pick-a-Pic-V1 (diffusion); ablate the regularization
  components.
- Failure Mode: Interplay between multiple regularizers may introduce
  hyperparameter tuning challenges.

## Source Papers
- arXiv:2411.07595 -- Entropy Controllable Direct Preference Optimization (H-DPO)
- arXiv:2411.04712 -- SEE-DPO: Self Entropy Enhanced Direct Preference Optimization
- arXiv:2411.04625 -- Sharp Analysis for KL-Regularized Contextual Bandits and RLHF
- arXiv:2403.05171 -- Overcoming Reward Overoptimization via Adversarial Policy Optimization

## Trigger
- Type: CONVERGENCE
- Source: Multiple recent papers addressing RLHF stability via adaptive regularization.

## Self-Assessment
- Novelty: 4
- Feasibility: 3
- Impact: 5
```

The schema is written in Markdown so that it stays human-readable while remaining machine-parseable with simple regular expressions. A JSON-equivalent representation is also accepted; see `examples/example_hypothesis.json`.

## Submission process

1. Fork the benchmark repository.
2. Add your summary JSON under `results/<your-system>_summary.json`.
3. Add a one-paragraph entry to `docs/LEADERBOARD.md` describing your system.
4. Open a pull request.
5. Maintainers run the canonical judge on your summary's hit lists and confirm or revise the reported numbers; only re-judged numbers go on the leaderboard.

## Validity requirements

A submission is rejected if:

- It uses any validation-phase paper as input to generation (data leakage).
- It uses a generation model whose training cutoff post-dates the validation period (memorization risk).
- It uses the same model family for generation and judging (self-preference risk).
- The summary JSON's per-topic numbers cannot be reproduced from the per-hypothesis files when re-judged.

The first three conditions are usually documented in the submission's accompanying `README` and verified via spot-check. The last is enforced when maintainers re-run the canonical judge on the submitted hits.
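
For the data-leakage condition, a coarse self-check on the `Source Papers` entries is possible because modern arXiv identifiers begin with a two-digit year and two-digit month. The sketch below flags any source whose identifier month falls on or after the start of the validation period. The `validation_start` parameter is hypothetical (use whatever the benchmark defines), the check is only month-granular, and it is a convenience for submitters rather than the maintainers' audit procedure.

```python
import re

# New-style arXiv identifiers look like YYMM.NNNN or YYMM.NNNNN.
ARXIV_ID_RE = re.compile(r"arXiv:(\d{2})(\d{2})\.\d{4,5}")


def leaked_sources(source_lines, validation_start):
    """Return source entries dated on or after validation_start.

    validation_start is a (year, month) tuple such as (2024, 12);
    this is a hypothetical parameter, not a value defined by the benchmark.
    """
    leaked = []
    for line in source_lines:
        m = ARXIV_ID_RE.search(line)
        if not m:
            continue  # non-arXiv sources need a separate date lookup
        year, month = 2000 + int(m.group(1)), int(m.group(2))
        # Conservative: a paper from the boundary month itself is flagged.
        if (year, month) >= validation_start:
            leaked.append(line)
    return leaked
```

An empty result does not prove the absence of leakage; it only shows that no listed arXiv source is dated inside the assumed validation window.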
