Keywords: Natural Language-driven Feature Addition, Benchmark, Large Language Model
Abstract: LLMs have demonstrated remarkable capabilities in supporting software developers, e.g., by automating code generation and code editing.
In contrast, their effectiveness and limitations in enabling software users to incrementally improve a piece of software are currently underexplored.
A promising paradigm toward this end is natural language-driven feature addition, which allows users to specify and modify software functionality purely through natural language (NL) descriptions, an approach sometimes also called ``no-code development''.
This paper introduces NoCode-bench, a benchmark designed to evaluate LLMs on real-world NL-driven software feature addition tasks.
NoCode-bench consists of 634 tasks across 10 popular projects, each pairing a user-oriented documentation change with the corresponding code implementation, which can be validated against developer-written test cases.
To facilitate lightweight and reliable evaluation, we further curate a human-validated subset, NoCode-bench Verified, comprising 114 high-quality tasks across projects whose task clarity and evaluation validity have been manually verified.
We use NoCode-bench to assess a range of state-of-the-art LLMs.
Experimental results show that, despite significant token consumption, the best task success rate remains as low as 37.72\%, achieved with the OpenHands scaffold combined with Qwen3-Coder-480B.
Our analysis reveals that LLMs face key challenges in performing cross-file edits, understanding modular design, and accurately calling tools.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, NLP datasets, evaluation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 4222