KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks

Kaijing Ma; Xeron Du; Yunran Wang; Haoran Zhang; Zhoufutu Wen; Xingwei Qu; Jian Yang; Jiaheng Liu; Minghao Liu; Xiang Yue; Wenhao Huang; Ge Zhang

Published: 22 Jan 2025, Last Modified: 01 Mar 2025

Venue: ICLR 2025 Poster

License: CC BY 4.0

Keywords: Reasoning; Knowledge-Orthogonal; Rule-Based

TL;DR: We introduce the concept of Knowledge-Orthogonal Reasoning (KOR) and propose five types of rule-based reasoning tasks that together form KOR-Bench, a benchmark for evaluating models' intrinsic reasoning ability.

Abstract: In this paper, we introduce Knowledge-Orthogonal Reasoning (KOR), a concept aimed at minimizing reliance on domain-specific knowledge, enabling more accurate evaluation of models' reasoning abilities in out-of-distribution settings. Based on this concept, we propose the Knowledge-Orthogonal Reasoning Benchmark (KOR-Bench), encompassing five task categories: Operation, Logic, Cipher, Puzzle, and Counterfactual. KOR-Bench emphasizes models' effectiveness in applying new rule descriptions to solve novel rule-driven questions. O1-Preview and O1-Mini achieve accuracies of 72.88% and 70.16%, surpassing Claude-3.5-Sonnet and GPT-4o (58.96% and 58.00%), highlighting the effectiveness of KOR-Bench. We perform detailed analyses, identifying bottlenecks in the Cipher task with Stepwise Prompting, where two rounds of Self-Correction yield optimal results. We evaluate performance across three integrated tasks, explore the impact of Tricks on the Puzzle task, and visualize rule-focused attention. Additionally, we conduct an ablation study on dataset size, benchmark correlations, and zero-shot and three-shot "only questions" experiments. KOR-Bench aims to enhance reasoning evaluation and support further research in this area.

Primary Area: datasets and benchmarks

Submission Number: 11089
