Keywords: Tool-integrated Reasoning, Code-integrated Reasoning, TIR Benchmark, Large Language Model Evaluation, LLM Benchmark, Large Language Model
Abstract: We introduce TIR-Bench, a benchmark designed to evaluate Tool-integrated Reasoning (TIR) in large reasoning models (LRMs). TIR-Bench addresses the limitations of existing TIR evaluations, such as narrow task coverage and a lack of fine-grained analysis. The benchmark spans diverse domains, including number theory, cryptography, and neuro-symbolic tasks; every task requires tool usage, effectively decoupling intrinsic reasoning abilities from TIR capabilities. It also incorporates automated failure-mode analysis, offering insights into model performance. We employ an automated, bottom-up pipeline that generates complex tasks by composing atomic tasks into Directed Acyclic Graphs (DAGs). Evaluation results on multiple LLMs highlight the varying TIR capabilities across models.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Tool-integrated Reasoning, Code-integrated Reasoning, TIR Evaluation, TIR Benchmark
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 5133