Risk-Controlled CI Gating for LLM Code via Noisy-Analyzer Fusion

20 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: conformal risk control, CI/CD, LLM-generated code, code security, analyzer fusion, noisy annotators, instance-dependent reliability, vulnerability detection, CWE, dynamic testing, SAST cross-validation, calibration, OOD generalization
TL;DR: A model learns when to trust each analyzer on LLM-generated code and powers a conformal, risk-controlled CI gate that caps the escaped-vulnerability rate at α. On a multi-language benchmark with dynamic tests + SAST, it beats union/unanimity and generalizes OOD.
Abstract: Security tools frequently disagree on vulnerabilities in LLM-generated code, leaving CI pipelines to trade off developer friction against escaped defects. We introduce a risk-controlled CI framework that learns when to trust which tool and provides finite-sample guarantees on the escaped-vulnerability rate (EVR). First, we propose a Latent Vulnerability Model that fuses code representations with instance-dependent reliabilities of multiple analyzers (e.g., CodeQL, Semgrep, Bandit), estimated from a small gold set. The model outputs calibrated p(vuln∣x) and per-CWE reliability maps. Second, we derive a cost-aware CI policy and wrap it with Conformal Risk Control, yielding a deployment-time certificate that the EVR is at most α, without distributional assumptions. To support reproducibility, we release a multi-language benchmark (Python, JavaScript, and a typed language) comprising LLM-generated code across model families, dynamic tests and SAST cross-validation for ground truth, and loaders/CLI for evaluation. Across languages and unseen model families, our method dominates union/unanimity and stacking baselines at matched break-rates, maintains its EVR guarantee under stratified OOD calibration, and improves downstream repair ROI in a small CI pilot. Our results position risk-controlled tool fusion as a practical path to measurably safer LLM code generation in real engineering pipelines.
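To make the abstract's guarantee concrete, here is a minimal sketch of how a conformal risk control step could calibrate such a CI gate. It assumes a gate that blocks a sample when its calibrated score p(vuln∣x) meets a threshold t, defines the per-sample loss as "vulnerable but allowed through," and picks the most permissive t whose conformal bound on the EVR stays below α. The function name, gate semantics, and 0/1 loss are illustrative assumptions, not the paper's released code.

```python
import numpy as np

def calibrate_gate_threshold(p_cal, y_cal, alpha):
    """Hypothetical sketch of conformal risk control for a CI gate.

    p_cal: calibrated vulnerability scores p(vuln|x) on a gold calibration set.
    y_cal: ground-truth labels (1 = vulnerable, 0 = clean).
    alpha: target bound on the escaped-vulnerability rate (EVR).

    The gate blocks a sample when its score >= t. The per-sample loss at
    threshold t is 1 if the sample is vulnerable but its score < t (it escapes),
    which is nondecreasing in t, so we scan thresholds from permissive to strict.
    """
    p_cal = np.asarray(p_cal, dtype=float)
    y_cal = np.asarray(y_cal, dtype=int)
    n = len(p_cal)
    # Conformal risk control bound for a monotone loss in [0, 1]:
    #   (sum of calibration losses + 1) / (n + 1) <= alpha
    # so the calibration set may contain at most alpha*(n+1) - 1 escapes.
    budget = alpha * (n + 1) - 1.0
    if budget < 0:
        return 0.0  # alpha is unattainable with this set: block everything
    # Candidate thresholds: the observed scores plus the endpoints 0 and 1,
    # scanned from largest (most permissive gate) to smallest (strictest).
    for t in sorted(np.unique(np.concatenate([p_cal, [0.0, 1.0]])), reverse=True):
        escapes = np.sum((y_cal == 1) & (p_cal < t))
        if escapes <= budget:
            return float(t)
    return 0.0
```

Because the loss is monotone in the threshold, the first (largest) t satisfying the bound is the least-friction gate with a certified EVR ≤ α on exchangeable future samples; no distributional assumptions are needed beyond exchangeability.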
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Submission Number: 23502