---
task_categories:
- question-answering
- text-generation
license: mit
language:
- en
---

# Cost-of-Pass: An Economic Framework for Evaluating Language Models

This dataset contains the benchmark records for the evaluations reported in our paper.

## 📌 Intended Use

The dataset is shared to support reproducibility of the results and analyses presented in our paper. For detailed instructions on replicating them, please refer to our codebase.

## 🗂️ Dataset Structure

### Directory Layout

Benchmark record folders are organized as:

```
dataset_name/model_name/inference_time_method/
```

Within each such directory you will find:

- **full_records/**: All raw records from model runs
- **metric_records/**: Evaluations of those records under a specific metric
- **metadata.json**: High-level summary including the number of records, completed runs, and metadata stats
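
The layout above can be traversed with a few lines of Python. This is a minimal sketch, not part of the official code-base: `RecordDir`, `iter_record_dirs`, and `load_metadata` are hypothetical helper names, the root path is whatever folder the dataset was downloaded to, and no keys inside `metadata.json` are assumed.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, NamedTuple


class RecordDir(NamedTuple):
    """One dataset_name/model_name/inference_time_method/ folder (hypothetical helper)."""
    dataset_name: str
    model_name: str
    inference_time_method: str
    path: Path


def iter_record_dirs(root: Path) -> Iterator[RecordDir]:
    """Yield every three-level benchmark record folder found under root."""
    root = Path(root)
    for path in sorted(root.glob("*/*/*")):
        parts = path.relative_to(root).parts
        # Skip plain files and hidden folders such as .git.
        if not path.is_dir() or any(p.startswith(".") for p in parts):
            continue
        yield RecordDir(*parts, path)


def load_metadata(record_dir: RecordDir) -> Dict[str, Any]:
    """Read metadata.json as a plain dict; its keys are not assumed here."""
    with open(record_dir.path / "metadata.json", encoding="utf-8") as f:
        return json.load(f)
```

Calling `iter_record_dirs(Path("path/to/download"))` then gives one entry per dataset, model, and inference-time method combination, which is a convenient key for grouping results.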

---

### 📄 Record Format

Both full_records and metric_records share the following core fields:

| Field                     | Type  | Description                                                             |
| ------------------------- | ----- | ----------------------------------------------------------------------- |
| model_name                | str   | Identifier for the model used                                           |
| task_name                 | str   | Identifier for the evaluated task                                       |
| tt_method_name            | str   | Inference-time method (e.g., VanillaPromptMethod, SelfRefinementMethod) |
| input_idx                 | int   | Index of the problem instance within the task                           |
| answer                    | str   | Model's final answer                                                    |
| num_input_tokens          | int   | Token count for the problem input                                       |
| num_prompt_tokens         | int   | Token count for the full prompt(s)                                      |
| num_completion_tokens     | int   | Total number of tokens generated                                        |
| num_answer_tokens         | int   | Token count of the final answer                                         |
| cost_per_prompt_token     | float | Cost per prompt token (incurred by the model)                           |
| cost_per_completion_token | float | Cost per completion token (incurred by the model)                       |
| completed                 | bool  | Whether the run or evaluation completed successfully                    |
| timestamp                 | float | Generation timestamp                                                    |
| uid                       | str   | Unique identifier for the record                                        |
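
The token counts and per-token prices in the table are enough to estimate what a single record cost. The sketch below assumes the cost is simply prompt tokens times prompt price plus completion tokens times completion price; `record_cost` is a hypothetical helper and the example values are made up for illustration, not taken from the dataset.

```python
from typing import Any, Mapping


def record_cost(record: Mapping[str, Any]) -> float:
    """Estimate the cost of one record from its core fields.

    Assumption: cost = prompt tokens * prompt price
                     + completion tokens * completion price.
    """
    prompt_cost = record["num_prompt_tokens"] * record["cost_per_prompt_token"]
    completion_cost = (
        record["num_completion_tokens"] * record["cost_per_completion_token"]
    )
    return prompt_cost + completion_cost


# Illustrative record containing only the fields used above (values are invented).
example_record = {
    "num_prompt_tokens": 1000,
    "num_completion_tokens": 500,
    "cost_per_prompt_token": 1e-6,
    "cost_per_completion_token": 2e-6,
}
example_cost = record_cost(example_record)
```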

#### Fields Exclusive to full_records

| Field       | Type        | Description                         |
| ----------- | ----------- | ----------------------------------- |
| input       | str         | Problem input (description)         |
| target      | str         | Ground-truth answer               |
| prompts     | List[str]   | Prompts used during interaction     |
| responses   | List[str]   | Model responses across interactions |
| metadata    | dict        | Additional metadata about the run or evaluation |

#### Fields Exclusive to metric_records

| Field           | Type    | Description                                     |
| --------------- | ------- | ----------------------------------------------- |
| metric_name     | str     | Name of the evaluation metric                   |
| metric_score    | float   | Score from the metric (1 = correct, 0 = wrong)  |
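
Because metric records also carry the core token and price fields, a group of them can be reduced to a pass rate and a cost figure. The sketch below makes several assumptions: cost-of-pass is taken as mean cost per attempt divided by pass rate, records with `completed` set to false are dropped, and the function names are hypothetical. The paper and code-base remain the authoritative definition.

```python
import math
from typing import Any, Iterable, Mapping


def record_cost(record: Mapping[str, Any]) -> float:
    """Assumed per-record cost: token counts times their per-token prices."""
    return (
        record["num_prompt_tokens"] * record["cost_per_prompt_token"]
        + record["num_completion_tokens"] * record["cost_per_completion_token"]
    )


def cost_of_pass(metric_records: Iterable[Mapping[str, Any]]) -> float:
    """Mean cost per attempt divided by pass rate (assumed definition).

    Returns math.inf when no attempt passed.
    """
    # Assumption: incomplete runs are excluded from both averages.
    records = [r for r in metric_records if r["completed"]]
    if not records:
        raise ValueError("no completed records to aggregate")
    mean_cost = sum(record_cost(r) for r in records) / len(records)
    pass_rate = sum(r["metric_score"] for r in records) / len(records)
    return math.inf if pass_rate == 0 else mean_cost / pass_rate
```

Records should be grouped by task, model, and inference-time method (and by `metric_name`, if several metrics are present) before calling `cost_of_pass`, so that each figure describes a single configuration.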

---