A Preliminary Study on Explaining Risk of Code Changes using LLM-based Prediction Models

Published: 28 Mar 2026, Last Modified: 28 Mar 2026 · AIware 2026 · CC BY 4.0
Keywords: Code Risk Score, LLMs, Explainability, Applied Research.
Abstract: Predictions by machine learning (ML) and artificial intelligence (AI) models are often received skeptically unless they are paired with intelligible explanations. Many definitions of and approaches to ML model explainability have been proposed. However, in the context of bug prediction, highlighting the small portions of a software change (diff) where risk is concentrated, beyond rule-based lints, has not yet been investigated. It can be argued that pragmatic "highlighting explanations" may help developers focus their testing and inspection efforts on the highest-leverage parts of the code, potentially as effectively as, or more effectively than, theory-based explanations. Unlike statistical models, which yield theory-based explanations such as "many authors might indicate coordination failure," Large Language Models (LLMs) do not directly provide theory-based explanations. In this work, we identify which parts of a code change are risky by utilizing attention weights from an LLM-based Diff Risk Score (DRS) model. We evaluate our approach using expert-labeled changes that have caused real outages. Results show that code snippets highlighted by the LLM cover expert-labeled outage-causing change lines 53.85% of the time. In addition to providing developers with essential clues, this approach also has the potential to improve trust in the DRS model's predictions. Furthermore, because attention weights are generated during inference, attention-based explanations are highly scalable and efficient for real-world, large-scale software development workflows.
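The core idea of the abstract — aggregating per-token attention weights onto diff lines to highlight where risk is concentrated — can be sketched roughly as follows. This is a minimal illustration, not the paper's actual method: the function name, the token-to-line mapping, and the top-k selection rule are all hypothetical, and a real DRS model would produce the attention weights during inference.

```python
# Hypothetical sketch: aggregate per-token attention weights onto diff
# lines, then highlight the highest-scoring lines. The paper's actual
# aggregation and selection scheme may differ.

def highlight_risky_lines(diff_lines, token_line_ids, attention_weights, top_k=2):
    """Sum each token's attention weight into its diff line, then
    return the indices of the top_k highest-scoring lines."""
    line_scores = [0.0] * len(diff_lines)
    for line_id, weight in zip(token_line_ids, attention_weights):
        line_scores[line_id] += weight
    ranked = sorted(range(len(diff_lines)),
                    key=lambda i: line_scores[i], reverse=True)
    return sorted(ranked[:top_k])

# Toy example: four diff lines; tokens mapped to lines with attention weights.
diff = ["-retry = 0", "+retry = None", "+if retry > 3:", " log.info('ok')"]
token_lines = [0, 1, 1, 2, 2, 2, 3]
weights = [0.05, 0.20, 0.15, 0.30, 0.10, 0.15, 0.05]
print(highlight_risky_lines(diff, token_lines, weights))  # → [1, 2]
```

Coverage as reported in the abstract would then be the fraction of expert-labeled outage-causing lines that fall inside the highlighted set.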
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public.
Paper Type: Full-length papers (i.e. case studies, theoretical, applied research papers). 8 pages
Reroute: true
Submission Number: 16