FEval-TTC: Fair Evaluation Protocol for Test-Time Compute

Published: 24 Sept 2025 · Last Modified: 24 Sept 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · CC BY 4.0
Keywords: llm, test-time compute, llm evaluation
TL;DR: We present a Fair Evaluation Protocol for Test-Time Compute (FEval-TTC) to enable fair and consistent evaluation of test-time compute techniques.
Abstract: The performance of Large Language Models (LLMs) and the associated dollar costs of API calls can fluctuate over time, potentially invalidating conclusions drawn in prior research. To address this, we propose a _**F**air **Eval**uation protocol for **T**est-**T**ime **C**ompute_ (FEval-TTC), designed to ensure consistent assessment of test-time compute (TTC) methods regardless of such fluctuations. FEval-TTC focuses on the evaluation of TTC methods that utilize underlying Chains-of-Thought (CoT). It supports evaluation across multiple LLMs on a diverse set of mathematical and commonsense reasoning datasets. The few-shot prompting and answer extraction processes are standardized across datasets, reducing both the time and monetary overhead for researchers. Furthermore, we provide a cost-modeling procedure that estimates both the token and dollar cost per query, facilitating equitable comparisons of prevalent TTC methods. We open-source FEval-TTC for public use at [anonymized code link](https://drive.google.com/file/d/1DUeteFA7lnx5MubuR0lh6OPN6XKfpqGC/view?usp=sharing).
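The cost-modeling idea described in the abstract, pricing each query by its token usage against a fixed rate card so that comparisons between TTC methods survive provider price changes, can be illustrated with a minimal sketch. Everything below (the `PriceSheet` class, `ttc_method_cost`, and the dollar rates) is an illustrative assumption, not FEval-TTC's actual API or pricing:

```python
# Hypothetical sketch of a per-query cost model for TTC methods.
# Names and prices are illustrative assumptions, not taken from FEval-TTC.

from dataclasses import dataclass


@dataclass
class PriceSheet:
    """Assumed provider prices, in dollars per 1M tokens."""
    input_per_m: float   # price of prompt tokens
    output_per_m: float  # price of completion tokens


def query_cost(prompt_tokens: int, completion_tokens: int,
               prices: PriceSheet) -> dict:
    """Estimate the token and dollar cost of a single LLM call."""
    dollars = (prompt_tokens * prices.input_per_m
               + completion_tokens * prices.output_per_m) / 1_000_000
    return {"tokens": prompt_tokens + completion_tokens, "dollars": dollars}


def ttc_method_cost(calls: list[tuple[int, int]],
                    prices: PriceSheet) -> dict:
    """Aggregate cost over all CoT calls a TTC method issues for one query,
    e.g. the N sampled chains drawn by self-consistency."""
    totals = {"tokens": 0, "dollars": 0.0}
    for prompt_toks, completion_toks in calls:
        c = query_cost(prompt_toks, completion_toks, prices)
        totals["tokens"] += c["tokens"]
        totals["dollars"] += c["dollars"]
    return totals


if __name__ == "__main__":
    prices = PriceSheet(input_per_m=0.50, output_per_m=1.50)  # assumed rates
    # Self-consistency with 5 sampled chains over the same 900-token prompt:
    calls = [(900, 250)] * 5
    print(ttc_method_cost(calls, prices))
    # {'tokens': 5750, 'dollars': 0.004125}
```

Fixing the rate card in the protocol rather than reading live API prices is what makes reported dollar costs reproducible across time and across methods that issue different numbers of CoT calls per query.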
Submission Number: 133