CONV-TO-BENCH: EVALUATING LANGUAGE MODELS VIA USER–ASSISTANT DIALOGUES IN CODE TASKS

Published: 02 Mar 2026, Last Modified: 05 Mar 2026 · ICLR 2026 Workshop VerifAI-2 · CC BY 4.0
Track: long paper (up to 8 pages)
Keywords: Automated Benchmarking, Multi-turn Dialogue, Instructional Evolution, Requirement Checklists, LLM Evaluation
Abstract: The rapid advancement of Large Language Models (LLMs) has outpaced the scalability of traditional evaluation benchmarks, which remain heavily dependent on labor-intensive expert curation. We address this bottleneck with Conv-to-Bench, a multi-stage framework that automatically transforms authentic multi-turn user–assistant dialogues into structured, verifiable requirement checklists. By leveraging the "instructional evolution" found in real-world conversational logs, our approach deconstructs fragmented user intent into consolidated instructions and binary evaluation criteria. Applied to the programming domain, Conv-to-Bench produces evaluation sets that demonstrate near-perfect alignment with human-authored standards like BigCodeBench, achieving Spearman correlations of up to ρ = 1.000 with significantly lower computational overhead. Our comprehensive ablation studies reveal that while multi-turn interactions capture the iterative evolution of user intent, instruction-centric extraction provides a more robust foundation. Ultimately, Conv-to-Bench provides a scalable, cost-effective paradigm for maintaining high-fidelity evaluation standards as user-centric AI applications continue to diversify.
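
The abstract describes scoring models against binary requirement checklists and comparing the resulting rankings to a human-authored benchmark via Spearman correlation. The sketch below illustrates that comparison only; all model names, scores, and the checklist_score helper are hypothetical placeholders, not artifacts from the paper.

# Hypothetical sketch: aggregate binary checklist judgments into per-model
# scores, then check rank agreement with a reference benchmark (e.g. BigCodeBench).
from scipy.stats import spearmanr

def checklist_score(criteria_results):
    """Fraction of binary requirement criteria a response satisfies."""
    return sum(criteria_results) / len(criteria_results)

# Illustrative per-model pass/fail judgments over a small checklist.
conv_to_bench_scores = {
    "model_a": checklist_score([1, 1, 0, 1, 1]),
    "model_b": checklist_score([1, 0, 0, 1, 0]),
    "model_c": checklist_score([1, 1, 1, 1, 1]),
}

# Hypothetical scores from a human-authored reference benchmark.
reference_scores = {"model_a": 0.61, "model_b": 0.42, "model_c": 0.78}

models = sorted(conv_to_bench_scores)
rho, _ = spearmanr([conv_to_bench_scores[m] for m in models],
                   [reference_scores[m] for m in models])
print(f"Spearman rho between benchmark rankings: {rho:.3f}")

With these illustrative numbers the two benchmarks rank the models identically, so rho comes out as 1.000, mirroring the kind of agreement the abstract reports.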
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Victor_Moreli_dos_Santos1
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 38