Keywords: The Bitter Lesson, Human Expertise, Large Language Models, LLMs
TL;DR: We argue that with the emergence of large language models, it is time to recognize human expertise as data in AI.
Abstract: Artificial intelligence (AI) and machine learning (ML) have long treated data as clean numeric features and labels, with progress driven by ever-larger models and datasets, a view crystallized in Sutton’s “Bitter Lesson”. In this paper, we contend that human expertise, often encoded in natural language, mathematical formalisms, and software, should itself be regarded as a vital form of data. First, we survey physics-informed ML, geometric deep learning, and safe reinforcement learning to show how embedding expert knowledge narrows hypothesis spaces, reduces sample and computational complexity, and improves out-of-distribution generalization. Next, we trace the expanding scope of data in ML, demonstrating how integrating text, images, actions, and other data modalities can transform previously transductive learners into increasingly inductive ones. We then highlight large language models (LLMs) as the nexus of these trends, illustrating how reinforcement learning from human feedback and in-context learning let LLMs integrate human expertise as data for general-purpose computation. To measure current practice, we analyze 1,000 NeurIPS papers from 2020 to 2024, finding that explicit domain-expert integration remains low, at 12–18%, while LLM-based methods for incorporating expertise have surged from 1% in 2022 to 8% in 2024. We revisit the Bitter Lesson amid slowing Moore’s Law and real-world, non-i.i.d. data challenges, survey alternative perspectives, and propose new directions for dataset documentation, model design, and curated knowledge repositories. By recognizing human domain expertise and insights about tasks as first-class data, we envision a foundation for developing more efficient and powerful AI.
Supplementary Material: zip
Submission Number: 495