Can Large Language Models Understand Intermediate Representations in Compilers?

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Evaluating LLMs’ ability to understand intermediate representations (IRs) from structural, syntactic, semantic, and reasoning perspectives.
Abstract: Intermediate Representations (IRs) play a critical role in compiler design and program analysis, yet their comprehension by *Large Language Models* (LLMs) remains underexplored. In this paper, we present an exploratory empirical study evaluating the capabilities of six state-of-the-art LLMs—GPT-4, GPT-3, DeepSeek, Gemma 2, Llama 3, and Code Llama—in understanding IRs. Specifically, we assess model performance across four core tasks: *control flow graph reconstruction*, *decompilation*, *code summarization*, and *execution reasoning*. While LLMs exhibit competence in parsing IR syntax and identifying high-level structures, they consistently struggle with instruction-level reasoning, especially control flow reasoning, loop handling, and dynamic execution. Common failure modes include misinterpreting branching instructions, omitting critical operations, and relying on heuristics rather than precise instruction-level logic. Our findings highlight the need for IR-specific enhancements in LLM design. We recommend fine-tuning on structured IR datasets and integrating control-flow-sensitive architectures to improve model effectiveness on IR-related tasks. All experimental data and source code are publicly available at [https://github.com/hjiang13/LLM4IR](https://github.com/hjiang13/LLM4IR).
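For concreteness, the sketch below is a minimal, hand-written LLVM IR function, not drawn from the paper's benchmark, whose four basic blocks illustrate what the *control flow graph reconstruction* task asks a model to recover: the blocks (`entry`, `neg`, `pos`, `done`) and the edges induced by the branch instructions.

```llvm
; Hypothetical illustration (not from the paper's dataset).
; C++ source:  int abs_val(int x) { return (x < 0) ? -x : x; }
; Reconstructing the CFG  entry -> {neg, pos} -> done  is an
; instance of the control-flow-graph-reconstruction task.
define i32 @abs_val(i32 %x) {
entry:
  %cmp = icmp slt i32 %x, 0            ; is x < 0 ?
  br i1 %cmp, label %neg, label %pos   ; conditional branch: two CFG edges

neg:                                   ; taken when x < 0
  %m = sub nsw i32 0, %x               ; compute -x
  br label %done

pos:                                   ; taken when x >= 0
  br label %done

done:                                  ; join point of both paths
  %r = phi i32 [ %m, %neg ], [ %x, %pos ]
  ret i32 %r
}
```

The paper reports that misinterpreting exactly this kind of branching instruction is a common failure mode, so even a small function like this can separate syntactic pattern matching from genuine control-flow understanding.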
Lay Summary: Intermediate Representations (IRs) are a key component of modern compilers, enabling deep program optimization and analysis. At the same time, Large Language Models (LLMs) have shown remarkable capabilities in understanding and generating high-level code. Motivated by this progress, we asked: **can these powerful models also understand IRs and be used for IR-level tasks?** To answer this, we designed four evaluation tasks — control flow reconstruction, IR decompilation, function summarization, and execution reasoning — and tested six leading LLMs on hundreds of real-world IR programs generated from C++ code; a hedged illustration of the execution reasoning task follows below. Our study reveals that while LLMs can handle surface-level syntax and recognize basic structures, they struggle with instruction-level reasoning, control flow logic, and program simulation. Models often rely on pattern matching instead of truly understanding IR semantics. This matters because it highlights a key gap in applying LLMs to compiler technologies and low-level software analysis. By identifying where current models fall short, we offer guidance for future development, such as IR-specific training, structural prompting, and new evaluation strategies. Our work takes an important first step toward building trustworthy, LLM-powered tools that can assist with the low-level foundations of modern software systems.
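As an illustration of the *execution reasoning* task, the hypothetical IR loop below (again hand-written, not from the paper's dataset) shows the kind of instruction-level simulation the study probes: predicting the return value for a given input requires tracing the `phi` updates across iterations rather than matching surface patterns.

```llvm
; Hypothetical illustration (not from the paper's dataset).
; C++ source:  int sum_to(int n) { int s = 0; for (int i = 1; i <= n; ++i) s += i; return s; }
; Execution-reasoning question: what does @sum_to return for %n = 3?
; Tracing the loop: (i=1, acc=0) -> (i=2, acc=1) -> (i=3, acc=3) -> exit with 6.
define i32 @sum_to(i32 %n) {
entry:
  br label %loop

loop:
  %i   = phi i32 [ 1, %entry ], [ %i.next, %loop ]    ; induction variable
  %acc = phi i32 [ 0, %entry ], [ %acc.next, %loop ]  ; running sum
  %acc.next = add nsw i32 %acc, %i
  %i.next   = add nsw i32 %i, 1
  %cond = icmp sle i32 %i.next, %n                    ; continue while i+1 <= n
  br i1 %cond, label %loop, label %exit

exit:
  ret i32 %acc.next                                   ; returns 6 for %n = 3
}
```

Loop handling and dynamic execution of this sort are precisely where the evaluated models most often fall back on heuristics instead of stepping through the instructions.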
Link To Code: https://github.com/hjiang13/LLM4IR
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Models (LLMs), Intermediate Representations (IRs), Code Comprehension
Submission Number: 11636