Keywords: large language models, economics, microeconomics, agents, MCQA, free-text
Abstract: Large language models (LLMs) are increasingly being asked to make economically rational decisions and indeed are already being applied to economic tasks like stock picking and financial analysis. Existing LLM benchmarks tend to focus on specific applications, making them insufficient for characterizing economic reasoning more broadly. In previous work, we offered a blueprint for comprehensively benchmarking $\textit{strategic}$ decision-making (Raman et al., 2024). However, this work did not engage with the even larger microeconomic literature on $\textit{non-strategic}$ settings. We address this gap here, taxonomizing microeconomic reasoning into $58$ distinct elements, each grounded in up to $10$ domains, $5$ perspectives, and $3$ types. The generation of benchmark data across this combinatorial space is powered by a novel LLM-assisted data generation protocol that we dub auto-STEER, which generates a set of questions by adapting handwritten templates to target new domains and perspectives. By generating fresh questions for each element, auto-STEER induces diversity, which could help to reduce the risk of data contamination. We use this benchmark to evaluate $27$ LLMs spanning a range of scales and adaptation strategies, comparing performance across multiple formats (multiple-choice and free-text question answering) and scoring schemes. Our results surface systematic limitations in current LLMs' ability to generalize economic reasoning across types, formats, and textual perturbations, and establish a foundation for evaluating and improving economic competence in foundation models.
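To make the combinatorial generation idea concrete, below is a minimal sketch of how an auto-STEER-style protocol might enumerate the element/domain/perspective space and prompt an LLM to adapt each handwritten template. The function name `adapt_templates`, the prompt wording, and the `complete` callable are all hypothetical placeholders, not the authors' implementation.

```python
# Sketch of combinatorial question generation in the style of auto-STEER.
# `complete` stands in for any prompt -> completion LLM call; it is an
# assumed interface, not an API from the paper or a specific library.
from itertools import product
from typing import Callable, Iterator

def adapt_templates(
    templates: list[str],           # handwritten templates, one per reasoning element
    domains: list[str],             # up to 10 grounding domains
    perspectives: list[str],        # 5 perspectives
    complete: Callable[[str], str], # any text-completion function
) -> Iterator[dict]:
    """Yield one freshly generated question per (template, domain, perspective)."""
    for template, domain, perspective in product(templates, domains, perspectives):
        prompt = (
            "Rewrite the following economics question template so that it is "
            f"set in the domain of {domain} and told from the perspective of "
            f"{perspective}, preserving the underlying reasoning element:\n\n"
            f"{template}"
        )
        yield {
            "domain": domain,
            "perspective": perspective,
            "question": complete(prompt),
        }
```

Because each call yields newly worded questions rather than fixed strings, a pipeline built this way can regenerate the benchmark on demand, which is the property the abstract credits with reducing data-contamination risk.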
Croissant File: json
Dataset URL: https://huggingface.co/datasets/narunraman/steer_me
Supplementary Material: pdf
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 1140