LLM-ORBench: Designing a Benchmark Dataset for Complex Ontology-Based Reasoning Tasks in Large Language Models

19 Sept 2025 (modified: 24 Feb 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Neurosymbolic Artificial Intelligence, Benchmark, Ontology Reasoning, Large Language Model
TL;DR: We present LLM-ORBench, a systematic benchmark framework for assessing large language models on ontology-based reasoning tasks.
Abstract: Large Language Models (LLMs) are increasingly applied to tasks requiring complex reasoning, yet their capabilities in formal logical reasoning remain underexplored. Existing benchmarks often focus on pattern recognition and fail to adequately assess symbolic reasoning, abstraction, or noise handling. To address this, we introduce \textit{LLM-ORBench}, a benchmark framework for evaluating LLMs on structured, ontology-based tasks with verifiable multi-step inferences generated by a symbolic reasoner. The framework combines natural-language and formal SPARQL questions, and systematically removes domain knowledge (abstraction) to isolate formal logical reasoning. We evaluated \textit{GPT-5-mini}, \textit{DeepSeek-V3-0324}, and \textit{LLaMA-4-Maverick-17B-128E-Instruct} on two ontologies—\textit{Family} and \textit{OWL2Bench}—across binary and open-ended question-answering tasks. Our results show that reasoning complexity, abstraction, and question type strongly affect accuracy and reliability: reasoning on abstracted tasks yields low accuracy and overconfidence, and open-ended tasks exhibit substantial hallucination rates.
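As a rough illustration of the abstraction step the abstract describes — removing domain knowledge so a model must rely on formal structure rather than world knowledge — the sketch below consistently renames domain terms in a small rule set to opaque symbols. The example facts (`hasParent`, the grandparent rule) and the renaming scheme are hypothetical assumptions for illustration, not the paper's actual pipeline.

```python
import itertools
import re

def abstract_terms(statements):
    """Replace every domain term with a fresh opaque symbol (C1, C2, ...),
    consistently across statements, so logical structure is preserved
    while domain meaning is removed. Single uppercase letters are treated
    as logic variables and left unchanged (an assumed convention)."""
    counter = itertools.count(1)
    mapping = {}

    def rename(match):
        term = match.group()
        if re.fullmatch(r"[A-Z]", term):  # keep variables like X, Y, Z
            return term
        if term not in mapping:
            mapping[term] = f"C{next(counter)}"
        return mapping[term]

    # Word-like identifiers are renamed; punctuation and operators stay.
    abstracted = [re.sub(r"[A-Za-z_]\w*", rename, s) for s in statements]
    return abstracted, mapping

facts = [
    "hasParent(alice, bob)",
    "hasParent(bob, carol)",
    "hasGrandparent(X, Z) :- hasParent(X, Y), hasParent(Y, Z)",
]
abstracted, mapping = abstract_terms(facts)
# The rule's shape survives, but its domain vocabulary does not:
# "C5(X, Z) :- C1(X, Y), C1(Y, Z)"
```

Because renaming is consistent (every occurrence of `hasParent` maps to the same symbol), any inference derivable from the original facts is derivable in the abstracted version, which is what lets such a benchmark isolate formal reasoning from memorized domain knowledge.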
Primary Area: datasets and benchmarks
Submission Number: 19950