How Can LLMs Serve as Experts in Malicious Code Detection? A Graph Representation Learning Based Approach
Keywords: Large language model, Code, Graph Representation Learning
Abstract: Large Language Models (LLMs) excel at code processing yet struggle with malicious code detection, primarily because of their limited ability to capture long-range dependencies within large and complex codebases. To address this limitation, we propose a graph representation learning-based attention acquisition framework that enhances the malicious code detection capabilities of LLMs. Specifically, our method constructs a graph representation of the code, extracts semantic and structural features using an LLM, and trains a Graph Neural Network (GNN) with minimally labeled data. The GNN performs an initial detection and, by backtracking its predictions, identifies the code sections most likely to contain malicious behavior. These sections then guide the LLM's attention for in-depth analysis. By concentrating the LLM's processing on these critical regions, our approach reduces interference from redundant or irrelevant data, improving detection accuracy and efficiency while keeping annotation costs low. Extensive evaluation on both custom-built and public datasets demonstrates that our approach consistently outperforms existing detection methods, highlighting its potential for practical deployment in software security scenarios.
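The abstract describes a pipeline of graph construction, GNN classification, and prediction backtracking. The following is a minimal sketch of one plausible reading of that pipeline, assuming PyTorch Geometric; the class and function names (CodeGraphGNN, top_k_salient_nodes) and the use of gradient saliency as the "backtracking" step are illustrative assumptions, not the paper's confirmed method.

```python
# Sketch: GNN over LLM-embedded code-graph nodes, plus a gradient-saliency
# "backtracking" pass that ranks nodes to guide LLM attention.
# Assumes node features x are LLM embeddings of code sections.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool


class CodeGraphGNN(torch.nn.Module):
    """Two-layer graph classifier: benign vs. malicious."""

    def __init__(self, in_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.head = torch.nn.Linear(hidden_dim, 2)

    def forward(self, x, edge_index, batch):
        h = F.relu(self.conv1(x, edge_index))
        h = F.relu(self.conv2(h, edge_index))
        return self.head(global_mean_pool(h, batch))  # graph-level logits


def top_k_salient_nodes(model, x, edge_index, batch, k: int = 5):
    """Backtrack the 'malicious' logit to input nodes via gradient
    saliency (an assumed stand-in for the paper's backtracking step).
    Returns indices of the top-k nodes, i.e., the code sections that
    would be handed to the LLM for focused analysis. Assumes a
    single-graph batch."""
    x = x.detach().clone().requires_grad_(True)
    logits = model(x, edge_index, batch)
    logits[0, 1].backward()          # gradient of the malicious score
    saliency = x.grad.norm(dim=1)    # per-node importance
    return saliency.topk(min(k, x.size(0))).indices.tolist()
```

In this reading, the GNN acts as a cheap pre-filter trained on a small labeled set, and the saliency ranking narrows the LLM's context to a handful of suspect regions instead of the full codebase.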
Supplementary Material: zip
Primary Area: learning on graphs and other geometries & topologies
Submission Number: 5505