# CoT Verifier Accuracy - Task-Specific Prompts

This directory contains task-specific prompt customizations for the `cot_verifier_accuracy` evaluation metric.

## Overview

The `cot_verifier_accuracy` metric tests whether a model can recover the correct answer from a reasoning trace alone, without seeing the original question. By default it uses a generic prompt; to customize the prompt for a specific task, create a YAML configuration file in this directory.

## Structure

Each task can have an optional YAML file: `{task_name}.yaml`

### Configuration Schema

```yaml
# Optional: Custom instructions specific to this task
custom_instructions: |
  Additional instructions for this specific task.
  These will be inserted after the base job description.

# Optional: Few-shot examples showing expected behavior
examples:
  - reasoning: "Example reasoning trace 1"
    expected_answer: "Expected answer 1"
    notes: "Optional explanation of what's important"
  - reasoning: "Example reasoning trace 2"
    expected_answer: "Expected answer 2"

# Optional: Evaluation criteria or guidance
evaluation_criteria: |
  - Criterion 1
  - Criterion 2

# Optional: Override the default format instructions
# If not provided, uses FORMATING_PROMPT from prompt_manager.py
format_override: |
  Your custom format instructions here.
```
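
Before committing a new config, it can help to check that its shape matches the schema above. The following is a minimal sketch, not part of the existing codebase: `validate_task_config` is a hypothetical helper that operates on an already-parsed mapping, and the rule that each example needs both `reasoning` and `expected_answer` is an assumption inferred from the schema (only `notes` is marked optional there).

```python
from typing import Any

# Field names come from the schema above; the validation rules are assumptions.
TEXT_FIELDS = ("custom_instructions", "evaluation_criteria", "format_override")


def validate_task_config(config: Any) -> list[str]:
    """Return a list of problems; an empty list means the config looks well-formed."""
    if not isinstance(config, dict):
        return ["top level must be a mapping"]

    problems = []
    known = set(TEXT_FIELDS) | {"examples"}
    for key in config:
        if key not in known:
            problems.append(f"unknown field: {key}")

    # All free-text sections are expected to be strings (YAML block scalars).
    for key in TEXT_FIELDS:
        if key in config and not isinstance(config[key], str):
            problems.append(f"{key} must be a string")

    examples = config.get("examples", [])
    if not isinstance(examples, list):
        return problems + ["examples must be a list"]
    for i, example in enumerate(examples):
        if not isinstance(example, dict):
            problems.append(f"examples[{i}] must be a mapping")
            continue
        # Assumption: reasoning and expected_answer are required per example.
        for required in ("reasoning", "expected_answer"):
            if not isinstance(example.get(required), str):
                problems.append(f"examples[{i}].{required} must be a string")
        if "notes" in example and not isinstance(example["notes"], str):
            problems.append(f"examples[{i}].notes must be a string")
    return problems
```

Flagging unknown fields is a deliberate choice here: a misspelled key such as `format_overide` would otherwise be ignored silently and the default format instructions would be used instead.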

## How It Works

1. When `cot_verifier_accuracy` runs, it calls `create_answer_removed_explanation_question_to_own_question_prompt()`
2. The method checks whether a task-specific config exists at `evaluate_efficient/prompts/cot_verifier/{task_name}.yaml`
3. If the config exists, it builds a modular prompt with:
   - Base template (always included)
   - Custom instructions (if provided)
   - Examples (if provided)
   - Evaluation criteria (if provided)
   - Format instructions (custom or default `FORMATING_PROMPT`)
4. If no config exists, it falls back to the generic prompt (backward compatible)
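
The assembly order in step 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual body of `create_answer_removed_explanation_question_to_own_question_prompt()`: the helper names `build_prompt` and `render_examples`, the section labels, and the blank-line separators are all hypothetical, and `default_format` stands in for whatever entry of `FORMATING_PROMPT` applies to the task.

```python
from typing import Optional


def render_examples(examples: list) -> str:
    """Render few-shot examples as labeled text blocks (the labels are assumptions)."""
    blocks = []
    for i, ex in enumerate(examples, start=1):
        lines = [
            f"Example {i}:",
            f"Reasoning: {ex['reasoning']}",
            f"Expected answer: {ex['expected_answer']}",
        ]
        if ex.get("notes"):
            lines.append(f"Notes: {ex['notes']}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


def build_prompt(
    base_template: str,
    config: Optional[dict],
    default_format: str,
    generic_prompt: str,
) -> str:
    """Assemble the prompt in the order listed above, or fall back to the generic one."""
    if config is None:
        # No task-specific config: keep the existing behavior.
        return generic_prompt

    sections = [base_template.strip()]  # base template is always included
    if config.get("custom_instructions"):
        sections.append(config["custom_instructions"].strip())
    if config.get("examples"):
        sections.append(render_examples(config["examples"]))
    if config.get("evaluation_criteria"):
        sections.append("Evaluation criteria:\n" + config["evaluation_criteria"].strip())
    # A custom format_override wins over the default format instructions.
    sections.append((config.get("format_override") or default_format).strip())
    return "\n\n".join(sections)
```

In this sketch an empty config still counts as "config exists", so it yields the base template followed by the default format instructions rather than the generic prompt.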

## Example: mini_sudoku.yaml

```yaml
custom_instructions: |
  For Mini Sudoku tasks, pay special attention to:
  - The grid structure must be valid
  - Each row, column, and 2x2 subgrid must contain each of the digits 1-4 exactly once
  - Follow the step-by-step reasoning to fill in each cell

examples:
  - reasoning: |
      Row 1 has 4 in column 2. Missing 1,2,3.
      Column 1 has 3,4 so needs 1,2.
      Subgrid constraint forces row1,col1 = 2.
    expected_answer: |
      2 4 3 1
      3 1 2 4
      4 3 1 2
      1 2 4 3
    notes: "Shows how to reason about grid constraints"

format_override: |
  Format your response as the puzzle above, with spaces separating each number within a row, and newlines separating rows.
```
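
Because few-shot examples are shown to the model as ground truth, it is worth confirming that each `expected_answer` actually satisfies the task's constraints. Below is a small sketch of such a check for the Mini Sudoku example; `parse_grid` and `is_valid_mini_sudoku` are hypothetical helpers written for this README, not functions from the evaluation code.

```python
def parse_grid(text: str) -> list:
    """Parse a grid written as space-separated numbers, one row per line."""
    return [[int(tok) for tok in line.split()] for line in text.strip().splitlines()]


def is_valid_mini_sudoku(grid: list) -> bool:
    """Check that every row, column, and 2x2 subgrid contains 1-4 exactly once."""
    digits = {1, 2, 3, 4}
    if len(grid) != 4 or any(len(row) != 4 for row in grid):
        return False
    rows = [set(row) for row in grid]
    cols = [{grid[r][c] for r in range(4)} for c in range(4)]
    boxes = [
        {grid[r][c] for r in range(br, br + 2) for c in range(bc, bc + 2)}
        for br in (0, 2)
        for bc in (0, 2)
    ]
    # Each group has four cells, so equality with {1, 2, 3, 4} implies no repeats.
    return all(group == digits for group in rows + cols + boxes)


expected_answer = """
2 4 3 1
3 1 2 4
4 3 1 2
1 2 4 3
"""
print(is_valid_mini_sudoku(parse_grid(expected_answer)))  # True
```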

## Creating a New Task Configuration

1. Create `{task_name}.yaml` in this directory
2. Add only the sections you need to customize (all fields are optional)
3. Test your evaluation to ensure the prompt works as expected
4. If the config is malformed, the system falls back gracefully instead of failing
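
One way the graceful fallback could look is sketched below. This is a guess at the behavior, not the actual loader: `load_task_config` is a hypothetical name, returning `None` to signal "use the generic prompt" is an assumption, and the parser is passed in as `parse` (in practice this would typically be PyYAML's `yaml.safe_load`) so the sketch carries no third-party dependency.

```python
import logging
from pathlib import Path
from typing import Callable, Optional

logger = logging.getLogger(__name__)

# Directory taken from the path documented in this README.
CONFIG_DIR = Path("evaluate_efficient/prompts/cot_verifier")


def load_task_config(
    task_name: str,
    parse: Callable[[str], object],
    config_dir: Path = CONFIG_DIR,
) -> Optional[dict]:
    """Return the parsed config, or None when the caller should use the generic prompt."""
    path = config_dir / f"{task_name}.yaml"
    if not path.is_file():
        # Covers both "never created" and "renamed", e.g. mini_sudoku.yaml.disabled.
        return None
    try:
        data = parse(path.read_text(encoding="utf-8"))
    except Exception as exc:  # parser error types vary by library
        logger.warning("Ignoring malformed config %s: %s", path, exc)
        return None
    if not isinstance(data, dict):
        # An empty file or a top-level list is not a usable config.
        logger.warning("Ignoring config %s: top level is not a mapping", path)
        return None
    return data
```

Logging a warning rather than raising keeps a typo in one task's config from aborting a full evaluation run, while still leaving a trace that the customization was not applied.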

## Backward Compatibility

- Tasks without config files continue to use the generic prompt
- The existing `FORMATING_PROMPT` dictionary still works
- No code changes are needed to add or remove task configs
- All existing evaluations continue to work unchanged

## Tips

- Start simple: add just `custom_instructions` first
- Use `examples` sparingly: too many can confuse the model
- Test with a few samples before running full evaluations
- You can temporarily disable a config by renaming it (e.g., `mini_sudoku.yaml.disabled`)
