Exploiting Code Symmetries for Learning Program Semantics

24 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: learning on graphs and other geometries & topologies
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Code Symmetry, Program Representation, Code Modeling, Group-Equivariance, Robustness
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We formalize learning code semantics as remaining equivariant to semantics-preserving code transformations, and develop a group-equivariant self-attention layer that achieves better robustness and generalization than state-of-the-art code LLMs.
Abstract: Large Language Models (LLMs) hold significant potential for automating program analysis, but current code LLMs struggle to grasp program semantics. Our paper addresses this by formalizing program semantics through code symmetries and integrating them into LLM architectures for code analysis. We introduce a group-theoretic framework that defines code symmetries as semantics-preserving transformations, enabling precise reasoning within LLMs. Our solution, SymC, employs a novel variant of group-equivariant self-attention that is provably equivariant to code symmetries. We extensively evaluate SymC on four program analysis tasks, comparing it with eight baselines under eight code transformations. Our results show that SymC generalizes to unseen code transformations, outperforming state-of-the-art code models by 30.7%. SymC, by design, stays invariant to semantics-preserving permutations, while state-of-the-art code models such as WizardCoder and GPT-4 violate these invariances at a high rate (14% and 43% of cases, respectively).
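The abstract's central claim is equivariance to semantics-preserving permutations: reordering independent statements should permute the model's per-token outputs in the same way rather than change them arbitrarily. The sketch below is not the paper's SymC layer; it is a minimal, self-contained illustration of the underlying property, namely that single-head self-attention without positional encodings is permutation-equivariant. All names and shapes are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the SymC implementation): plain
# self-attention with no positional encoding is equivariant to permutations
# of its input tokens, mirroring the notion of staying equivariant to
# semantics-preserving statement reorderings.
import torch


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over token embeddings x (n_tokens, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
n_tokens, d = 5, 8                          # e.g., 5 statement embeddings
x = torch.randn(n_tokens, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))

perm = torch.randperm(n_tokens)             # a semantics-preserving reordering
out_then_perm = self_attention(x, w_q, w_k, w_v)[perm]
perm_then_out = self_attention(x[perm], w_q, w_k, w_v)

# Equivariance check: permuting before or after the layer gives the same result.
print(torch.allclose(out_then_perm, perm_then_out, atol=1e-5))  # True
```

A downstream prediction that pools these outputs with a permutation-invariant reduction (e.g., a mean) would then be invariant to such reorderings, which is the robustness property the abstract measures for WizardCoder and GPT-4.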
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8801