Keywords: Code LLMs; Large language model; Benchmark
Abstract: As the code generation capabilities of large language models (LLMs) advance, their dependence on the premises embedded in user inputs has grown. When a user's input contains faulty premises, the probability of code generation hallucinations rises sharply, exposing deficiencies in the models' capacity for self-scrutiny. This paper proposes Faulty Premises Bench (FPBench), the first code generation evaluation framework specifically targeting faulty premises. By systematically constructing three categories of faulty premises and integrating multi-dimensional evaluation metrics, we conduct in-depth assessments of 15 representative LLMs. Our key findings are: (1) most models reason poorly and generate suboptimal code under faulty premises, relying heavily on explicit prompts to detect errors; (2) faulty premises trigger a point of diminishing returns in resource investment: blindly increasing response length fails to improve quality; (3) the three types of faulty premises activate distinct defect patterns, revealing a triple dissociation in the models' cognitive mechanisms. This study highlights the need for LLMs to proactively verify premises during code generation, and through FPBench it provides a theoretical foundation and practical pathway for developing reliable, human-centric code generation models.
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Evaluation methods; Code models; Human evaluation
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English; Python
Submission Number: 1024