CALM: Critic Automation with Language Models

28 Sept 2024 (modified: 28 Nov 2024) | ICLR 2025 Conference Withdrawn Submission | CC BY 4.0
Keywords: automatic scientific discovery, model criticism
TL;DR: automatic methods for criticizing scientific models
Abstract: Understanding the world through models is a fundamental goal of scientific research. While large language model (LLM) based approaches show promise in automating scientific discovery, they often overlook the importance of criticizing scientific models. Criticizing models deepens scientific understanding and drives the development of more accurate models. Moreover, criticism can improve the reliability of LLM-based scientist systems by acting as a safeguard against hallucinations. Automating model criticism is difficult because it traditionally requires a human expert to define how to compare a model with data and to evaluate whether the discrepancies are significant; both rely heavily on understanding the modeling assumptions and domain. Although LLM-based critic approaches are appealing, they introduce new challenges: LLMs might hallucinate the critiques themselves. Motivated by this, we introduce CALM (Critic Automation with Language Models). CALM uses LLMs to generate summary statistics that highlight discrepancies between model predictions and data, and applies hypothesis tests to evaluate their significance. We can view CALM as a verifier that validates models and critiques by embedding them in a hypothesis testing framework. In experiments, we evaluate CALM across key quantitative and qualitative dimensions. In settings where we synthesize discrepancies between models and datasets, CALM reliably generates correct critiques without hallucinating incorrect ones. We show that both human and LLM judges consistently prefer CALM's critiques over alternative approaches in terms of transparency and actionability. Finally, we show that CALM's critiques enable an LLM scientist to improve upon human-designed models on real-world datasets.
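To make the abstract's description concrete, below is a minimal, hypothetical sketch of the general idea it describes: a summary statistic that highlights a model-data discrepancy, checked for significance with a hypothesis test. In CALM the statistic would be proposed by an LLM; here it is hard-coded, and all names and data are illustrative assumptions, not the authors' implementation.

```python
"""Sketch: a parametric-bootstrap check of a proposed discrepancy statistic.
Hypothetical scenario: the scientist's model assumes Gaussian data, but the
observed data are heavy-tailed, so a tail-heaviness statistic (which an LLM
critic might propose) should reveal the mismatch."""
import numpy as np

rng = np.random.default_rng(0)

observed = rng.standard_t(df=3, size=500)          # "real" data: heavy-tailed
mu, sigma = observed.mean(), observed.std(ddof=1)  # model fit under the Gaussian assumption

def summary_statistic(x):
    # Excess kurtosis: one possible discrepancy measure a critic could suggest
    z = (x - x.mean()) / x.std(ddof=1)
    return np.mean(z**4) - 3.0

t_obs = summary_statistic(observed)

# Distribution of the statistic under the fitted model (replicate datasets)
t_rep = np.array([
    summary_statistic(rng.normal(mu, sigma, size=observed.size))
    for _ in range(2000)
])

# Two-sided p-value: how extreme is the observed statistic under the model?
p_value = np.mean(np.abs(t_rep) >= np.abs(t_obs))
print(f"statistic = {t_obs:.2f}, p = {p_value:.4f}")
# A small p-value supports the critique (a real discrepancy); a large one
# rejects it, which is how a test like this can guard against hallucinated
# criticisms.
```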
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12781