IR-OptSet: An Optimization-Sensitive Dataset for Advancing LLM-Based IR Optimizer

Published: 18 Sept 2025 · Last Modified: 30 Oct 2025 · NeurIPS 2025 Datasets and Benchmarks Track (poster) · CC BY 4.0
Keywords: Compiler Optimization, LLVM IR, Large Language Models (LLMs), Dataset
TL;DR: We present IR-OptSet, a public LLVM IR dataset tailored for optimization-sensitive LLM training, significantly improving compiler code generation performance.
Abstract: Compiler optimization is essential for improving program performance, yet modern compilers still depend on manually crafted transformation rules over intermediate representations (IRs). As compilers grow in complexity, maintaining these rule-based optimizations becomes increasingly labor-intensive and difficult to scale. Recent advances in large language models (LLMs) offer a promising alternative, but their effectiveness in compiler optimization remains limited, primarily due to the lack of IR-oriented datasets that expose models to diverse transformation samples from real-world scenarios (*optimization-sensitive samples*); this gap hinders LLMs from learning rich and generalizable optimization strategies. In this paper, we introduce IR-OptSet, the first public optimization-sensitive dataset for advancing LLM-based IR optimizers. It comprises 170K LLVM IR samples from open-source repositories across 8 representative optimization domains. IR-OptSet defines two core tasks, Code Analysis and Optimized Code Generation, and provides tools for correctness verification, performance evaluation, and dataset expansion. In our experiments, fine-tuning three representative LLMs on IR-OptSet leads to significant accuracy improvements on both tasks. Moreover, the LLM fine-tuned on IR-OptSet *outperforms the traditional compiler with the -O3 option* on 64 test cases in terms of performance. Further analysis reveals that IR-OptSet provides greater transformation diversity and representativeness than three widely used IR-oriented datasets, highlighting its potential to drive model-based IR optimization. IR-OptSet is publicly available at [https://huggingface.co/datasets/YangziResearch/IR-OptSet](https://huggingface.co/datasets/YangziResearch/IR-OptSet).
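To make the notion of an optimization-sensitive sample concrete, the sketch below pairs unoptimized and optimized LLVM IR for the same source file by invoking `clang` and `opt`. This is an illustrative assumption about how such before/after pairs can be produced, not the paper's actual construction pipeline; the file names are placeholders, and the `opt` pass syntax varies across LLVM versions.

```python
# Minimal sketch (assumptions: clang/opt on PATH with new-pass-manager
# support; "example.c" is a hypothetical input file).
import subprocess

SRC = "example.c"  # placeholder source file

# Unoptimized textual IR. -Xclang -disable-O0-optnone keeps opt from
# skipping functions that -O0 would otherwise mark `optnone`.
subprocess.run(
    ["clang", "-O0", "-Xclang", "-disable-O0-optnone",
     "-S", "-emit-llvm", SRC, "-o", "before.ll"],
    check=True,
)

# Optimized IR via the -O3 pipeline; on older LLVM releases the legacy
# flag spelling `opt -O3` plays the same role.
subprocess.run(
    ["opt", "-S", "-passes=default<O3>", "before.ll", "-o", "after.ll"],
    check=True,
)

before = open("before.ll").read()
after = open("after.ll").read()
print(f"IR pair ready: {len(before)} vs. {len(after)} characters")
```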
Dataset URL: https://huggingface.co/datasets/YangziResearch/IR-OptSet
Code URL: https://github.com/yilingqinghan/IR-OptSet
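Since the dataset is hosted on the Hugging Face Hub, a plain `datasets` load should work. A minimal sketch follows; the repo id comes from the Dataset URL above, while the split and column names printed at runtime are discovered from the hub rather than assumed here.

```python
# Minimal sketch: pull IR-OptSet from the Hugging Face Hub and peek at
# one record (requires `pip install datasets` and network access).
from datasets import load_dataset

ds = load_dataset("YangziResearch/IR-OptSet")
print(ds)                      # shows the available splits and columns
first_split = next(iter(ds))   # e.g. "train", depending on how it is published
print(ds[first_split][0])      # one raw sample
```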
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 732