Meta-Reinforcement Learning for Compiler Optimization: A Kernel-Embedded CompilerLLM with Verified Assumptions and Practical Guarantees
Keywords: Meta-Normalization, Reinforcement Learning
Abstract: Modern compilers carry out sophisticated transformation passes, but they rely predominantly on static heuristics for optimization decisions. We introduce GMPO, which treats each compilation instance as a distinct task on a similarity manifold and uses a kernel embedding to transfer experiential knowledge among related programs. We propose a Cross-Group Meta-Normalization (CG-MN) scheme that aggregates gradient information from intra-batch neighbors specified by a normalized similarity operator, and we design a surprise-aware reward modulation mechanism that selectively amplifies learning signals for atypical yet successful compiler transformations. All theoretical claims are stated under explicit assumptions and with a deliberately conservative scope: (i) a batch-centered mean-squared variance-reduction result for CG-MN gradients, in which the demeaned component decays with the square of the magnitude of the leading non-trivial eigenvalue; (ii) a local, KL-constrained performance lower bound for natural-gradient dynamics; (iii) a PAC-style generalization bound under an independence assumption. We instantiate GMPO with a 7-billion-parameter code model trained on 5894 C programs, operating at the assembly level within a validator-guarded action space comprising peephole rewrites, instruction substitutions, addressing-mode changes, and basic-block-local scheduling. On a held-out suite of 250 programs, GMPO achieves compilation success on 246/250 examples (98.4%), test passes on 244/250 examples (97.6%), and a median speedup of 1.53x over GCC -O3 under a protocolized measurement pipeline. Ablation experiments show that CG-MN and surprise modulation each contribute to overall performance.
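The two mechanisms named in the abstract can be sketched concretely. The following is a minimal illustration, not the paper's implementation: it assumes CG-MN aggregates per-task gradients through a row-normalized similarity operator, and that surprise modulation scales positive rewards by the negative log-probability of the chosen action; the function names and the `alpha` parameter are hypothetical.

```python
import numpy as np

def cg_mn_aggregate(grads, sim):
    """Cross-Group Meta-Normalization sketch (assumed form).

    grads: (B, D) array of per-task gradients within a batch.
    sim:   (B, B) nonnegative pairwise similarity matrix (kernel values).
    Each task's gradient is replaced by a neighbor-weighted average
    under the row-normalized similarity operator S.
    """
    S = sim / sim.sum(axis=1, keepdims=True)  # rows sum to 1
    return S @ grads

def surprise_modulated_reward(reward, action_logprob, alpha=0.5):
    """Surprise-aware reward modulation sketch (assumed form).

    Amplifies the learning signal for low-probability ("surprising")
    actions, but only when they succeed (reward > 0); failures pass
    through unchanged. `alpha` controls the amplification strength.
    """
    surprise = -action_logprob  # large when the action was improbable
    return reward * (1.0 + alpha * surprise) if reward > 0 else reward
```

With a uniform similarity matrix, `cg_mn_aggregate` collapses to the batch-mean gradient, which is the intuition behind the variance-reduction claim: averaging over similar neighbors shrinks the demeaned component of each task's gradient.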
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 9889