FeatBench: Evaluating Coding Agents on Feature Implementation for Vibe Coding

Haorui Chen; Chengze Li; Jia Li

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0

Keywords: Language models, Natural language processing, Software engineering, Vibe coding

Abstract: The rapid advancement of Large Language Models (LLMs) has given rise to a novel software development paradigm known as “vibe coding,” in which users interact with coding agents through high-level natural language. Existing code generation benchmarks, however, do not adequately assess an agent’s vibe coding capabilities: they either require code-level specifications or focus narrowly on issue-solving, neglecting the critical scenario of feature implementation within the vibe coding paradigm. To address this gap, we propose FeatBench, a novel benchmark for vibe coding that focuses on feature implementation. Our benchmark is distinguished by several key features: **❶ Pure Natural Language Prompts.** Task inputs consist solely of abstract natural language descriptions, devoid of any code or structural hints. **❷ A Rigorous & Evolving Data Collection Process.** FeatBench is built on a multi-level filtering pipeline that ensures quality and a fully automated pipeline that evolves the benchmark over time, mitigating data contamination. **❸ Comprehensive Test Cases.** Each task includes Fail-to-Pass (F2P) and Pass-to-Pass (P2P) tests to verify correctness and prevent regressions. **❹ Diverse Application Domains.** The benchmark includes repositories from diverse domains so that it reflects real-world scenarios. We evaluate two state-of-the-art agent frameworks with four leading LLMs on FeatBench. Our evaluation reveals that feature implementation within the vibe coding paradigm is a significant challenge: the highest success rate is only 29.94%. Our analysis also reveals a tendency toward “aggressive implementation,” a strategy that paradoxically leads to both critical failures and superior software design. We release FeatBench, our automated collection pipeline, and all experimental results to facilitate further community research. Our code is available at https://anonymous.4open.science/r/FeatBench-D3C5.
“The hottest new programming language is English.” —Andrej Karpathy (Karpathy, 2025b)
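
The abstract describes Fail-to-Pass (F2P) and Pass-to-Pass (P2P) tests but does not spell out how they combine into a success rate. The following is a minimal sketch under one plausible reading, assuming a task counts as resolved only when every F2P test and every P2P test passes after the agent's patch. The names `TaskResult`, `is_resolved`, and `success_rate` are hypothetical and are not taken from the FeatBench code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical record of one agent attempt (field names are illustrative)."""
    f2p: dict[str, bool]  # Fail-to-Pass test id -> passed after the agent's patch
    p2p: dict[str, bool]  # Pass-to-Pass test id -> still passing after the patch


def is_resolved(result: TaskResult) -> bool:
    # Assumption: the feature counts as implemented only if at least one
    # F2P test exists and all F2P tests pass (correctness), and no P2P
    # test has started failing (no regression).
    return (
        bool(result.f2p)
        and all(result.f2p.values())
        and all(result.p2p.values())
    )


def success_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks resolved under the assumed criterion."""
    if not results:
        return 0.0
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / len(results)
```

Under this reading, a patch that implements the feature but breaks an existing test is scored as a failure, which is consistent with the abstract's stated purpose for P2P tests (preventing regressions).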

Primary Area: datasets and benchmarks

Submission Number: 23391
