Keywords: Natural Language-driven Feature Addition, Benchmark, Large Language Model
Abstract: LLMs have demonstrated remarkable capabilities in supporting software developers, e.g., by automating code generation and code editing.
In contrast, their effectiveness and limitations in enabling software users to incrementally improve a piece of software are currently underexplored.
A promising paradigm toward this end is natural language-driven feature addition, which allows users to specify and modify software functionality purely through natural language (NL) descriptions, an approach sometimes also called ``no-code development''.
This paper introduces NoCode-bench, a benchmark designed to evaluate LLMs on real-world NL-driven software feature addition tasks.
NoCode-bench consists of 634 tasks across 10 popular projects, each pairing a user-oriented documentation change with the corresponding code implementation, which can be validated against developer-written test cases.
To facilitate lightweight and reliable evaluation, we further curate a human-validated subset, NoCode-bench Verified, comprising 114 high-quality tasks across projects whose task clarity and evaluation validity have been manually verified.
We use NoCode-bench to assess a range of state-of-the-art LLMs.
Experimental results show that, despite significant token consumption, the best task success rate remains as low as 37.72\%, achieved with the OpenHands scaffold combined with Qwen3-Coder-480B.
Our analysis reveals that LLMs face key challenges in performing cross-file edits, understanding modular design, and accurately calling tools.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, NLP datasets, evaluation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 4222