Learning to Reason About Code Insecurity: Composite-Reinforcement Fine-Tuning for Cognitive Alignment
Keywords: Code Insecurity, Transfer Learning, Reinforcement Learning
Abstract: Automated vulnerability analysis increasingly relies on language models, yet even strong LLMs exhibit unstable security reasoning: they either over-flag benign code or miss critical flaws, particularly under cross-language shifts. We present \textbf{\method}---\emph{Composite-Reinforcement Fine-Tuning for Cognitive Alignment}---a label-efficient training framework that explicitly optimizes a composite reward combining (i) \emph{label-based decision scoring} via a strictly proper scoring rule on predicted probabilities, (ii) \emph{explanation grounding and consistency} through structure- and code-referencing heuristics that do not use \cwe{} labels or definitions, and (iii) \emph{output-format coherence} through a strict schema validator. This moves the objective from bare classification toward deliberative, auditable analysis while explicitly acknowledging and isolating the supervised component of the reward. We cast each example as a short two-phase episode: first, the policy produces an explanation; then it emits a calibrated probability through a regression head. The binary decision is derived deterministically from this probability at inference time (thresholding) rather than sampled as a separate action. Policy updates are stabilized via batch-level affinity-weighted neighborhood smoothing over deterministic encodings and a KL trust term to a reference policy. Across \bigvul, \divvul, and \cleanvul, \method consistently improves macro-F1 over strong baselines (e.g., up to 0.71 in-distribution, with substantial gains under cross-language transfer). Compared to standard supervised fine-tuning, \method reduces catastrophic bias toward predicting the vulnerable class and improves recognition of benign code without relying on \cwe{} supervision. We report duplicate-controlled splits, ablations of reward components, and significance testing.
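The three-part composite reward described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the weights, function names, and the stand-ins for the grounding heuristics and schema validator are all assumptions; only the use of a strictly proper scoring rule (here, the negated Brier score) on the predicted probability follows directly from the text.

```python
def brier_reward(p_vuln: float, label: int) -> float:
    """Strictly proper scoring rule on the predicted probability:
    the negated Brier score, so higher is better (maximum 0.0)."""
    return -((p_vuln - label) ** 2)

def composite_reward(p_vuln: float, label: int,
                     grounding_score: float, format_ok: bool,
                     w_label: float = 1.0, w_ground: float = 0.5,
                     w_format: float = 0.25) -> float:
    """Hypothetical composite reward combining the three components
    named in the abstract. Weights are illustrative placeholders."""
    # (i) label-based decision scoring via a proper scoring rule
    r_label = brier_reward(p_vuln, label)
    # (ii) explanation grounding/consistency, assumed to be a score
    #      in [0, 1] produced by code-referencing heuristics
    r_ground = grounding_score
    # (iii) output-format coherence from a strict schema validator
    r_format = 1.0 if format_ok else 0.0
    return w_label * r_label + w_ground * r_ground + w_format * r_format

# Example: a confident, well-grounded, schema-valid prediction on a
# vulnerable example (label = 1).
r = composite_reward(p_vuln=0.9, label=1, grounding_score=0.8, format_ok=True)
```

Because the Brier term is strictly proper, the policy maximizes expected reward only by reporting its true belief about the label, which is what makes the emitted probability calibrated rather than an overconfident binary guess.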
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 21728