Can LLMs really reason about Code? Studying how well LLMs understand the relation between Input, Code, and Output
Keywords: Learning from demonstrations, Software testing and debugging, Neural networks, code generation, behavior modeling, reverse engineering
TL;DR: We let LLMs predict one of Code, Inputs, or Outputs from the other two and see how well this works
Abstract: In recent years, large language models (LLMs) have demonstrated remarkable progress in code generation. However, their ability to reason about program behavior remains an open challenge - an ability that is relevant for applications including reverse engineering, debugging, secure code generation, test-driven synthesis, input reconstruction, reverse fuzzing, behavioral monitoring, and safe execution modeling.
To study this ability, we examine the capacity of LLMs to reason about the _semantics of code_ - specifically, their ability to _relate_ code, its inputs, and its outputs to each other. To this end, we investigate whether and how well LLMs can predict one of these three components given the other two - that is,
1. predict the _input_ given code and output,
2. predict the _output_ given code and input, and
3. predict the _code_ given input and output.
This way, we assess how well LLMs can reason about and understand the underlying relationships that govern program execution.
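The three settings can be sketched with a toy example (illustrative only; the function and values below are hypothetical and not taken from the paper's datasets):

```python
# A minimal illustration of the code/input/output triple and the three
# prediction settings, using a made-up string-processing task.

def program(s: str) -> str:      # the "code" component
    return s[::-1].upper()       # reverse the string, then uppercase it

inp = "hello"                    # the "input" component
out = program(inp)               # the "output" component

# Each setting hides one component and asks the model to recover it
# from the remaining two:
settings = {
    "input prediction":  {"given": ("code", "output"), "predict": "input"},
    "output prediction": {"given": ("code", "input"),  "predict": "output"},
    "code prediction":   {"given": ("input", "output"), "predict": "code"},
}

print(out)  # -> OLLEH
```

Solving all three directions for the same triple requires the model to treat program execution as an invertible relation rather than a one-way mapping.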
We construct four datasets covering string processing, array operations, and coding challenges in JavaScript and Python to evaluate diverse program-understanding capabilities, incorporating various code mutation techniques to increase complexity.
In our evaluation, we find that _closed-weight models_ achieve the strongest performance across all datasets, including perfect input recovery on deterministic string tasks. Across tasks, _output prediction_ is comparatively stable, whereas _code prediction_ remains the hardest setting and often fails for smaller models. Finally, _cross-codebase_ transfer is feasible, especially for input prediction, but highly sensitive to model capacity and fine-tuning strategy.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public.
Paper Type: Full-length papers (i.e. case studies, theoretical, applied research papers). 8 pages
Submission Number: 61