Keywords: security, security patch detection, machine learning, vulnerability-fixing commits, benchmark, code representation
TL;DR: An evaluation of the current state of code language models for security patch classification.
Abstract: Detecting vulnerability-fixing commits (VFCs) is critical for timely security patch deployment, yet advisory databases lag patch releases by a median of 25 days and many fixes never receive advisories, driving interest in automated detection methods. We present a comprehensive evaluation of transformer-based VFC detection through a unified framework that consolidates 20 fragmented datasets spanning more than \num{180000} commits. Our analysis reveals that high performance metrics mask fundamental limitations. Models achieve F1 scores of 0.9 when given full commits with messages but drop to 0.6 on code alone, while message-only models retain close to the original performance, demonstrating reliance on textual patterns rather than semantic code understanding. We evaluate code language models (code LMs) ranging from \num{125}M to \num{15.5}B parameters across multiple architectures and find only marginal improvements with scale. Repository-based splits expose 10--11\% performance drops compared to temporal splits, revealing that models learn project-specific patterns rather than security semantics. We introduce a lightweight intra-procedural context enrichment method that achieves a \num{33}$\times$ speedup over existing approaches, yet the additional intra-procedural context fails to improve detection. High inter-model agreement rates indicate convergence on similar patterns rather than complementary understanding. Overall, our findings suggest that code LMs are fundamentally limited for code-centric security patch detection. We release our unified framework and evaluation suite to enable future security patch detection research.
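To make the evaluation-split contrast concrete, below is a minimal sketch (not the paper's released framework) of the difference between a temporal split and a repository-based split over a commit dataset. The column names ("repo", "committed_at", "is_vfc") and the synthetic data are illustrative assumptions; only the grouping-vs-time-ordering logic is the point.

```python
# Sketch: temporal vs. repository-based splitting of a VFC commit dataset.
# Assumed (hypothetical) columns: "repo", "committed_at", "is_vfc".
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def temporal_split(df: pd.DataFrame, test_frac: float = 0.2):
    """Hold out the most recent commits; train and test may share repositories."""
    df_sorted = df.sort_values("committed_at")
    cutoff = int(len(df_sorted) * (1 - test_frac))
    return df_sorted.iloc[:cutoff], df_sorted.iloc[cutoff:]


def repository_split(df: pd.DataFrame, test_frac: float = 0.2, seed: int = 0):
    """Hold out entire repositories so no project appears in both splits."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_frac, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df["repo"]))
    return df.iloc[train_idx], df.iloc[test_idx]


if __name__ == "__main__":
    # Tiny synthetic example with two repositories.
    commits = pd.DataFrame({
        "repo": ["a", "a", "b", "b", "a", "b"],
        "committed_at": pd.to_datetime(
            ["2021-01-01", "2021-06-01", "2021-03-01",
             "2022-01-01", "2022-02-01", "2022-03-01"]),
        "is_vfc": [0, 1, 0, 1, 1, 0],
    })
    train_t, test_t = temporal_split(commits)
    train_r, test_r = repository_split(commits, test_frac=0.5)
    print("temporal test repos:", sorted(test_t["repo"].unique()))
    print("repository test repos:", sorted(test_r["repo"].unique()))
```

Under the repository-based split, held-out projects are entirely unseen at training time, which is why a model that relies on project-specific patterns loses accuracy relative to a temporal split.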
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 19672