Compare-and-Correct: Test-Time Scaling for Scientific Coding Agents without Gold Verification Signals

ACL ARR 2025 May Submission 2768 Authors

19 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Scientific coding tasks often require navigating vast, uncertain solution spaces, where initial hypotheses may be incomplete, flawed, or contradictory. A key challenge in automating these tasks is the lack of clear-cut success signals, such as the gold labels or unit tests common in traditional programming or supervised learning. In scientific domains, correctness is often context-dependent, diverse in form, and rarely captured by a single metric, which makes it difficult for agents to determine when a solution is sufficient or how to refine it. We introduce Compare-and-Correct (C&C), a verifier-guided agent framework for scalable test-time trajectory exploration. Instead of relying on single-pass inference, C&C leverages test-time compute scaling by generating a diverse set of candidate solutions and iteratively refining them through self-debugging and self-improvement mechanisms. An Elo-rating-based verifier ranks candidates by relative quality, guiding the agent to backtrack, correct, and converge on the most promising solutions without relying on explicit success criteria. We demonstrate C&C's effectiveness across a range of ScienceAgentBench tasks, including machine learning engineering and visualization. Experiments show that C&C significantly outperforms direct prompting, prior agents such as OpenHands and Self-Debug, and alternative verifiers such as random selection and LLM-as-a-Judge, confirming the strength of both our agent design and our verification approach.
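To make the Elo-rating-based verification described in the abstract concrete, the sketch below shows how pairwise comparisons between candidate solutions could be turned into relative rankings. This is a minimal illustration, not the authors' implementation: the `compare` judge, the hyperparameters (K-factor, initial rating, number of rounds), and the selection step are all assumptions.

```python
# Minimal sketch of an Elo-rating-based verifier over candidate solutions.
# `compare(a, b)` is a hypothetical pairwise judge (e.g., an LLM comparison)
# returning 1.0 if `a` wins, 0.0 if `b` wins, and 0.5 for a tie.
import itertools
import random
from typing import Callable, Dict, List


def elo_rank(candidates: List[str],
             compare: Callable[[str, str], float],
             k: float = 32.0,
             initial: float = 1000.0,
             rounds: int = 3) -> Dict[str, float]:
    ratings = {c: initial for c in candidates}
    pairs = list(itertools.combinations(candidates, 2))
    for _ in range(rounds):
        random.shuffle(pairs)  # vary comparison order across rounds
        for a, b in pairs:
            # Expected score of `a` under the standard Elo formula.
            expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
            score_a = compare(a, b)  # relative quality signal, no gold labels needed
            ratings[a] += k * (score_a - expected_a)
            ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return ratings


# The top-rated candidate would then be passed on for further
# self-debugging and self-improvement, e.g.:
#   ratings = elo_rank(candidates, compare)
#   best = max(ratings, key=ratings.get)
```

Using relative (pairwise) comparisons rather than an absolute score is what lets this kind of verifier operate without the gold labels or unit tests the abstract identifies as missing in scientific coding tasks.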
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: LLM/AI agents; scaling; applications;
Languages Studied: English
Submission Number: 2768