A Comprehensive Evaluation of Code Language Models for Security Patch Detection

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: security, security patch detection, machine learning, vulnerability fixing commits, benchmark, code representation
TL;DR: Evaluation of the current state of code language models for classification of security patches.
Abstract: Detecting vulnerability-fixing commits (VFCs) is critical for timely security patch deployment, yet advisory databases lag patch releases by a median of 25 days and many fixes never receive advisories, driving interest in automated detection methods. We present a comprehensive evaluation of code language model (code LM) based VFC detection through a unified framework consolidating 20 fragmented datasets spanning more than 180,000 commits. Our analysis strengthens existing observations through a systematic evaluation and reveals that high performance metrics mask fundamental limitations. Models achieve F1 scores of 0.9 when using full commits with messages, but drop to 0.6 on code alone, while message-only models stay close to the original performance. This demonstrates reliance on textual patterns rather than semantic code understanding. We evaluate code LMs ranging from 125M to 30B parameters across multiple architectures, finding only marginal improvements with scale. Estimating out-of-distribution performance using repository-based splits exposes 10-11% performance drops compared to temporal splits, revealing that models learn project-specific patterns rather than security semantics. Even additional intra-procedural context fails to improve detection. Prompt-based classification with models up to 480B parameters also underperforms fine-tuned approaches, indicating limitations beyond model scale. High inter-model agreement rates indicate convergence on similar patterns rather than complementary understanding. Overall, our findings suggest that code LMs appear fundamentally limited for code-centric security patch detection. We release our unified framework and evaluation suite to enable future security patch detection research.
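The contrast between temporal and repository-based splits mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' framework: the commit records are assumed to be dictionaries with hypothetical `repo` and `timestamp` fields, and the function names and the 20% test fraction are illustrative choices only.

```python
import random
from collections import defaultdict


def temporal_split(commits, test_fraction=0.2):
    """Train on the oldest commits and test on the newest ones.

    Assumes each commit is a dict with a sortable 'timestamp' field
    (hypothetical field name, not taken from the paper).
    """
    ordered = sorted(commits, key=lambda c: c["timestamp"])
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered[:cut], ordered[cut:]


def repository_split(commits, test_fraction=0.2, seed=0):
    """Hold out whole repositories so no project appears in both sets.

    Assumes each commit is a dict with a 'repo' field (hypothetical).
    """
    by_repo = defaultdict(list)
    for commit in commits:
        by_repo[commit["repo"]].append(commit)

    # Shuffle repository names deterministically, then reserve a fraction.
    repos = sorted(by_repo)
    random.Random(seed).shuffle(repos)
    n_test = max(1, round(len(repos) * test_fraction))
    test_repos = set(repos[:n_test])

    train = [c for c in commits if c["repo"] not in test_repos]
    test = [c for c in commits if c["repo"] in test_repos]
    return train, test
```

Under a temporal split the same project can contribute commits to both training and test sets, so project-specific conventions seen in training remain useful at test time. A repository-based split removes that overlap, which is why it serves as a rough proxy for out-of-distribution performance.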
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 19672