TuringQ: Benchmarking AI Comprehension in Theory of Computation

ACL ARR 2024 June Submission 1930 Authors

15 Jun 2024 (modified: 02 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: We present TuringQ, to the best of our knowledge the first benchmark for evaluating the reasoning capabilities of large language models (LLMs) in the theory of computation. TuringQ consists of 4,006 question-answer pairs spanning undergraduate- and graduate-level problems collected from a diverse set of universities. It covers three difficulty levels and six main concept areas, including a valuable subset of axioms and essential theoretical concepts. We evaluate several open-source LLMs, as well as GPT-4, using Chain-of-Thought prompting and expert human assessment. In addition, we explore an automated LLM-Judge, demonstrating that its assessments can approach human-level accuracy. We show that fine-tuning a LLaMA-3B model on TuringQ improves its reasoning ability. TuringQ thus serves both as a benchmark and as a fine-tuning resource for strengthening LLM reasoning in this complex domain. Our comparative analysis offers insights into LLM performance and contributes to advancing AI comprehension of theoretical computer science. The dataset, code, and fine-tuned model will be made publicly available upon publication.
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: NLP datasets, benchmarking, evaluation, fine-tuning, automatic evaluation of datasets
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English, Formal Languages
Submission Number: 1930