This paper focuses on test-driven development (TDD) tasks, where test cases serve as both the instruction and the verification for LLM code generation. We build a TDD benchmark to evaluate frontier models, on which OpenAI's reasoning models achieve state-of-the-art results. We identify instruction following and in-context learning as the critical abilities for all models to succeed at TDD tasks, and we further reveal their vulnerability to long instructions as an area for improvement.
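To make the task setup concrete, the following is a minimal sketch (with hypothetical names, not the paper's actual harness) of how a single test case can play both roles: it is the only specification shown to the model, and it is also the check run against the model's output.

```python
# Hypothetical TDD task sketch: the test case is both the instruction
# given to the model and the verifier applied to its output.

TEST_CASE = """
assert add(2, 3) == 5
assert add(-1, 1) == 0
"""

def build_prompt(test_case: str) -> str:
    # The test alone serves as the task specification for the model.
    return f"Write a Python function that passes these tests:\n{test_case}"

def verify(candidate_code: str, test_case: str) -> bool:
    # The same test verifies the generated code: run the candidate,
    # then run the assertions in the same namespace.
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)
        exec(test_case, namespace)
        return True
    except Exception:
        return False

# A hand-written candidate standing in for a model completion:
candidate = "def add(a, b):\n    return a + b"
print(verify(candidate, TEST_CASE))  # True when all assertions pass
```

In a real benchmark the candidate would come from the model's response to `build_prompt(TEST_CASE)`, and `verify` would run in a sandboxed subprocess rather than a bare `exec`.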