Keywords: property-based testing (PBT), LLM agents, test generation, adversarial refinement, static analysis, mutation testing, Python, software testing, bug finding
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in code generation, yet their application to Property-Based Testing (PBT) suffers from a “superficiality gap”. While LLMs readily generate syntactically correct tests, they often fail to bridge the semantic gap between a code implementation and its intended invariants, producing weak properties that give a false sense of security. To address this, we introduce PROBE, an agentic framework that hardens software properties through Adversarial Refinement. Unlike traditional generation approaches, PROBE treats test generation as a game of “semantic asymmetry”: it employs a Validator agent that actively generates counter-implementations, i.e., semantically incorrect implementations that nonetheless satisfy the generated property, to expose loopholes in the specification. Furthermore, PROBE constructs a cross-functional semantic graph to capture deep dependencies often missed by local analysis. Extensive evaluation shows that PROBE improves mutation scores by 9.79% over baselines. In real-world deployment, PROBE identified 45 previously unknown, developer-confirmed bugs in top-tier libraries, demonstrating its ability to uncover deep semantic defects.
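The counter-implementation idea in the abstract can be illustrated with a minimal, self-contained sketch (not taken from the paper; all names here, such as `bogus_sort` and the property functions, are hypothetical). A weak property for a sorting function, e.g. "the output is sorted", is satisfied by a trivially wrong implementation; hardening the property with a permutation check closes that loophole:

```python
import random

def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

# Weak property: only requires that the output be sorted.
def weak_property(sort_fn, xs):
    return is_sorted(sort_fn(xs))

# Hardened property: output must be sorted AND a permutation of the input.
def strong_property(sort_fn, xs):
    out = sort_fn(xs)
    return is_sorted(out) and sorted(out) == sorted(xs)

# Counter-implementation: semantically wrong, yet satisfies the weak property.
def bogus_sort(xs):
    return []  # an empty list is vacuously sorted, but never a correct answer

# Crude stand-in for a PBT engine's random-input loop (e.g. Hypothesis).
for _ in range(100):
    xs = [random.randint(-50, 50) for _ in range(random.randint(1, 10))]
    assert weak_property(bogus_sort, xs)        # loophole goes undetected
    assert not strong_property(bogus_sort, xs)  # hardened property rejects it
```

In a real PBT setup the random loop would be replaced by a library such as Hypothesis; the sketch only shows why a property that a counter-implementation can satisfy is too weak.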
Paper Type: Long
Research Area: Code Models
Research Area Keywords: code generation, software engineering automation
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 7332