The Self-Execution Benchmark: Measuring LLMs’ Attempts to Overcome Their Lack of Self-Execution

ACL ARR 2025 July Submission86 Authors

22 Jul 2025 (modified: 28 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Large language models (LLMs) are commonly evaluated on tasks that test their knowledge or reasoning abilities. In this paper, we explore a different type of evaluation: whether an LLM can predict aspects of its own responses. Since LLMs lack the ability to execute themselves, we introduce the Self-Execution Benchmark, which measures a model's ability to anticipate properties of its output, such as whether a question will be difficult for it, whether it will refuse to answer, or what kinds of associations it is likely to produce. Our experiments show that models generally perform poorly on this benchmark, and that increased model size or capability does not consistently lead to better performance. These results suggest a fundamental limitation in how LLMs represent and reason about their own behavior.
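To make the evaluation protocol concrete, the following is a minimal sketch of one self-prediction task (refusal prediction), assuming a hypothetical `query_model(prompt) -> str` helper in place of a real chat-completion client; the paper's actual tasks, prompts, and scoring are those described in Sections 3 and 4, not this illustration.

```python
def query_model(prompt: str) -> str:
    """Placeholder for a chat-completion API call; replace with your provider's client."""
    raise NotImplementedError


def self_prediction_refusal_check(question: str) -> bool:
    """Ask the model to predict whether it would refuse `question`,
    then ask the question itself and compare prediction with observed behavior."""
    # Step 1: elicit the model's prediction about its own response.
    prediction = query_model(
        "If you were asked the following question, would you refuse to answer it? "
        "Reply with only 'yes' or 'no'.\n\n"
        f"Question: {question}"
    ).strip().lower()

    # Step 2: actually ask the question and heuristically detect a refusal.
    answer = query_model(question).lower()
    refused = any(marker in answer for marker in ("i can't", "i cannot", "i won't", "i'm unable"))

    # Step 3: score the self-prediction against the observed behavior.
    predicted_refusal = prediction.startswith("yes")
    return predicted_refusal == refused
```

Aggregating this agreement score over a set of questions yields one measure of how well a model anticipates its own behavior; the refusal-detection heuristic above is illustrative only.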
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, self-execution, evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: 9
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 3.
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Comment on page 3
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: Section 3
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: We used an API; all relevant information is provided in Section 4.
C2 Experimental Setup And Hyperparameters: N/A
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 4.
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E AI Assistants In Research Or Writing: No
E1 Information About Use Of AI Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 86