AutoAxiom: Automated Axiom Modification for Scientific Discovery with LLMs

19 Sept 2025 (modified: 30 Sept 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Axiom Modification, LLMs, Multi-agent, DSL, Benchmark
Abstract: Large language models (LLMs) have advanced fact-level reasoning, but remain largely untested on a deeper capability: modifying the \emph{axioms} that define a domain’s feasible solutions. Yet such rule revision, rather than mere optimization within fixed constraints, is often the engine of human invention. Existing benchmarks (accuracy tests, similarity metrics, and LLM-as-judge scoring) do not assess whether a model can propose, evaluate, and stabilize axiom changes. We present \emph{AutoAxiom}, an end-to-end framework for axiom modification with LLMs that unifies data, algorithms, and evaluation. We formalize a domain as $(A, S)$, with axiom set $A$ and success function $S$. Our insight is to \emph{modify the axiom set $A$}, rather than maximize the success function $S$ under fixed $A$, thereby reshaping the feasible region and exposing solutions unreachable under the original rules. Specifically, we present: (i)~\emph{AutoAxiom}, a DSPy-based multi-agent pipeline that generates, edits, and vets axioms, with optimizers and classifiers trained on a set of human-verified axiom transformations; (ii)~\emph{AxiomsEval}, a dual benchmark coupling scalable rubric-driven LLM judgments with objective simulator validation; and (iii)~\emph{Axioms100}, a multi-domain corpus of human- and model-curated axiom modifications. Across four simulators and 100+ domains, AutoAxiom achieves the best scores on all simulator tasks, with relative gains of 15–180\% over the next-best baseline, and improves compatibility from 98.9\% to 99.5\%. We will release code, datasets, operator definitions, prompts, simulators, and evaluation scripts upon publication.
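To make the $(A, S)$ formalism concrete, below is a minimal Python sketch of axiom modification versus optimization under fixed axioms, using a toy integer domain. The `Domain` class, its methods, and the two toy axioms are hypothetical illustrations chosen for this sketch, not the paper's released code.

```python
# A minimal sketch of the (A, S) formalism: a domain is an axiom set A
# (predicates defining the feasible region) plus a success function S.
# All names here are illustrative assumptions, not the paper's code.

from dataclasses import dataclass
from typing import Callable, Iterable, List

Axiom = Callable[[int], bool]     # an axiom admits or rejects a candidate
Success = Callable[[int], float]  # the success function S scores candidates


@dataclass
class Domain:
    axioms: List[Axiom]   # axiom set A defines the feasible region
    success: Success      # success function S

    def feasible(self, candidates: Iterable[int]) -> List[int]:
        # The feasible region: candidates satisfying every axiom in A.
        return [x for x in candidates if all(a(x) for a in self.axioms)]

    def best_solution(self, candidates: Iterable[int]) -> int:
        # Optimization within fixed A: maximize S over the feasible region.
        return max(self.feasible(candidates), key=self.success)


candidates = range(100)
domain = Domain(
    axioms=[lambda x: x % 2 == 0,   # toy axiom: only even candidates
            lambda x: x < 50],      # toy axiom: bounded above by 50
    success=lambda x: float(x),     # toy S: larger is better
)

# Maximizing S under fixed A can reach at most 48 here.
print(domain.best_solution(candidates))  # -> 48

# Axiom modification: revise a rule instead of searching harder.
# This reshapes the feasible region and exposes solutions (e.g. 98)
# that were unreachable under the original axioms.
domain.axioms[1] = lambda x: x < 100
print(domain.best_solution(candidates))  # -> 98
```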
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19366