Evaluating Inductive Reasoning Capabilities of Large Language Models With The One Dimensional Abstract Reasoning Corpus
Abstract: We present an initial automated test for evaluating the capacity of large language models (LLMs) to perform inductive reasoning tasks. Using the GPT-3.5 and GPT-4 models, we build a system that generates Python code as hypotheses for transforming the sequences of the One Dimensional Abstract Reasoning Corpus (1D-ARC) challenge. We experiment with three prompting techniques: standard prompting, Chain of Thought (CoT), and direct feedback. We report results along with an analysis of the cost-to-success rate and the benefit-cost ratio. Our best result is an overall 25% success rate with CoT prompting on GPT-4, significantly surpassing the standard prompting approach. We discuss potential avenues for improving our experiments, testing other strategies, and combining deductive reasoning with LLM-based inductive reasoning.
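To make the idea of "Python code as hypotheses" concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothesis arrives as source text defining a function named `transform`, and that it is accepted only if it reproduces every training output. The task data, the name `transform`, and the helpers `load_hypothesis` and `passes_all` are all hypothetical, and the candidate source is hard-coded where a model response would normally go.

```python
from typing import Callable, List, Tuple

Seq = List[int]
Pair = Tuple[Seq, Seq]

# Hypothetical 1D-ARC-style task (not taken from the corpus):
# shift the coloured block one cell to the right.
train_pairs: List[Pair] = [
    ([0, 1, 1, 0, 0], [0, 0, 1, 1, 0]),
    ([2, 2, 0, 0, 0], [0, 2, 2, 0, 0]),
]

# Stand-in for the code a model might return as its hypothesis.
candidate_source = '''
def transform(seq):
    return [0] + seq[:-1]
'''


def load_hypothesis(source: str) -> Callable[[Seq], Seq]:
    """Execute the candidate source and return its `transform` function.

    Assumption: the code is trusted. Model-generated code should be run
    in a sandbox in any real setting.
    """
    namespace: dict = {}
    exec(source, namespace)
    return namespace["transform"]


def passes_all(hypothesis: Callable[[Seq], Seq], pairs: List[Pair]) -> bool:
    """Return True only if the hypothesis reproduces every expected output."""
    for inp, expected in pairs:
        try:
            if hypothesis(list(inp)) != expected:
                return False
        except Exception:
            # A hypothesis that crashes counts as a failed hypothesis.
            return False
    return True


hypothesis = load_hypothesis(candidate_source)
accepted = passes_all(hypothesis, train_pairs)
print("hypothesis accepted:", accepted)
```

In a feedback-style loop, a rejected hypothesis and its failing pair could be passed back to the model for another attempt. That step is left out here because the abstract does not specify how it works.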