Keywords: Code-LLMs, Multi-Turn, Benchmark, Functional Correctness, Security
Abstract: While existing benchmarks evaluate the correctness and security of LLM-generated code, they are typically limited to single-turn tasks that do not reflect the iterative nature of real-world development. We introduce MT-Sec, the first benchmark to systematically evaluate both correctness and security in multi-turn coding scenarios. We construct it using a synthetic data pipeline that transforms existing single-turn tasks into semantically aligned multi-turn interaction sequences, allowing reuse of the original test suites while modeling the complexity of real-world coding processes. We evaluate 30 open- and closed-source models on MT-Sec and observe a consistent 15-20% drop in "correct & secure" outputs from single-turn to multi-turn settings, even among state-of-the-art models. Beyond full-program generation, we also evaluate models on multi-turn code-diff generation, an unexplored yet practically relevant setting, and find that models produce functionally incorrect and insecure outputs at higher rates. Finally, we analyze agent scaffolding in multi-turn generation and find that while it improves correctness, it can come at the cost of security. Together, these findings highlight the need for benchmarks that jointly evaluate correctness and security in multi-turn, real-world coding workflows.
Submission Number: 178