TheoremQA: A Theorem-driven Question Answering Dataset

Wenhu Chen; Ming Yin; Max Ku; Pan Lu; Yixin Wan; Xueguang Ma; Jianyu Xu; Xinyi Wang; Tony Xia

TheoremQA: A Theorem-driven Question Answering Dataset

Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, Tony Xia

Published: 07 Oct 2023, Last Modified: 01 Dec 2023EMNLP 2023 MainEveryoneRevisionsBibTeX

Submission Type: Regular Long Paper

Submission Track: Resources and Evaluation

Submission Track 2: Question Answering

Keywords: Language Model, Question Answering, Math, Theorem

TL;DR: We collect the first theorem-driven question answering dataset to investigate the capabilities of large language models to understand and integrate STEM theorems to solve real-world problems.

Abstract: The recent LLMs like GPT-4 and PaLM-2 have made tremendous progress in solving fundamental math problems like GSM8K by achieving over 90% accuracy. However, their capabilities to solve more challenging math problems which require domain-specific knowledge (i.e. theorem) have yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed to evaluate AI models' capabilities to apply theorems to solve challenging science problems. TheoremQA is curated by domain experts containing 800 high-quality questions covering 350 theorems from Math, Physics, EE&CS, and Finance. We evaluate a wide spectrum of 16 large language and code models with different prompting strategies like Chain-of-Thoughts and Program-of-Thoughts. We found that GPT-4's capabilities to solve these problems are unparalleled, achieving an accuracy of 51% with Program-of-Thoughts Prompting. All the existing open-sourced models are below 15%, barely surpassing the random-guess baseline. Given the diversity and broad coverage of TheoremQA, we believe it can be used as a better benchmark to evaluate LLMs' capabilities to solve challenging science problems.

Submission Number: 1768

Loading