Keywords: Test Oracle Generation; Multi-Agent Collaboration; Software Testing
TL;DR: Existing AI test oracle generators are limited because they rely only on textual specifications. We propose Nexus, a multi-agent system that deliberates on candidate oracles, validates them by executing them against a generated candidate implementation, and self-refines those that fail, significantly improving oracle accuracy.
Abstract: Test oracle generation in non-regression testing is a longstanding challenge in software engineering. The goal is to produce oracles that accurately determine whether a function under test (FUT) behaves as intended for a given input. In this paper, we introduce Nexus, a novel multi-agent framework in which a diverse set of specialized agents synthesize test oracles through a structured process of deliberation, validation, and iterative self-refinement. In the deliberation phase, a panel of four specialist agents, each embodying a distinct testing philosophy, collaboratively critiques and refines an initial set of test oracles. In the validation phase, Nexus generates a plausible candidate implementation of the FUT and executes the proposed oracles against it in a secure sandbox. For any oracle that fails this execution-based check, Nexus activates an automated self-refinement loop that uses the specific runtime error to debug and correct the oracle before re-validating it. Our extensive evaluation on seven diverse benchmarks demonstrates that Nexus consistently and substantially outperforms state-of-the-art baselines. For instance, with GPT-4.1-Mini, Nexus improves test-level oracle accuracy on LiveCodeBench from 46.30% to 57.73%. The improved accuracy also significantly benefits downstream tasks: the bug detection rate of test oracles generated by GPT-4.1-Mini on HumanEval rises from 90.91% with the baselines to 95.45% with Nexus, and the success rate of automated program repair improves from 35.23% to 69.32%.
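The validation and self-refinement phases can be illustrated with a minimal Python sketch. It is not the Nexus implementation: the function names (`run_oracle`, `validate_and_refine`, `toy_refine`), the `max_rounds` limit, and the representation of oracles as assertion strings are hypothetical assumptions. The deliberation phase and candidate generation are omitted, the LLM-driven refinement agent is replaced by a caller-supplied function, and `exec` in a fresh namespace only stands in for the secure sandbox (it provides no isolation).

```python
from typing import Callable, Optional

# Hypothetical: an oracle is a Python assertion string, e.g. "assert add(2, 3) == 5".
# A refiner takes (oracle, runtime_error) and returns a corrected oracle.
Refiner = Callable[[str, str], str]


def run_oracle(candidate_src: str, oracle: str) -> Optional[str]:
    """Run one oracle against the candidate FUT; return None on success, else the error.

    NOTE: exec() in a fresh namespace is NOT a secure sandbox; it only stands in
    for the isolated execution environment described in the abstract.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate implementation
        exec(oracle, namespace)         # execute the assertion against it
    except Exception as exc:            # AssertionError, NameError, TypeError, ...
        return f"{type(exc).__name__}: {exc}"
    return None


def validate_and_refine(candidate_src: str, oracles: list[str],
                        refine: Refiner, max_rounds: int = 2) -> tuple[list[str], list[str]]:
    """Validate each oracle; on failure, refine it from the runtime error and re-validate."""
    accepted: list[str] = []
    rejected: list[str] = []
    for oracle in oracles:
        current = oracle
        error = run_oracle(candidate_src, current)
        rounds = 0
        while error is not None and rounds < max_rounds:
            current = refine(current, error)             # debug using the specific error
            error = run_oracle(candidate_src, current)   # re-validate the corrected oracle
            rounds += 1
        (accepted if error is None else rejected).append(current)
    return accepted, rejected


# Toy usage (hypothetical): the refiner only repairs a misspelled function name.
CANDIDATE = "def add(a, b):\n    return a + b\n"
ORACLES = ["assert add(2, 3) == 5", "assert ad(1, 1) == 2", "assert add(2, 2) == 5"]


def toy_refine(oracle: str, error: str) -> str:
    if error.startswith("NameError"):
        return oracle.replace("ad(", "add(")
    return oracle  # cannot fix other errors; the oracle stays as-is


accepted, rejected = validate_and_refine(CANDIDATE, ORACLES, toy_refine)
print(accepted)  # ['assert add(2, 3) == 5', 'assert add(1, 1) == 2']
print(rejected)  # ['assert add(2, 2) == 5']
```

Oracles that still fail after `max_rounds` refinements are rejected in this sketch. Because the candidate implementation may itself be wrong, a real system would need to weigh such failures rather than trust the candidate unconditionally.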
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8825