Binary Diff Summarization using Large Language Models

Meet Udeshi; Venkata Sai Charan Putrevu; Prashanth Krishnamurthy; Prashant Anantharaman; Sean Carrick; Ramesh Karri; Farshad Khorrami

19 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: binary analysis, supply chain security, binary diffing, large language models

TL;DR: A novel framework for binary diff summarization using LLMs and a novel functional sensitivity score help automate malware detection for software supply chain security.

Abstract: Security of software supply chains is necessary to ensure that software updates do not contain maliciously injected code or introduce vulnerabilities that may compromise the integrity of critical infrastructure. Verifying the integrity of software updates involves binary differential analysis (binary diffing), which combines binary analysis and reverse engineering to highlight the changes between two binary versions. Large language models (LLMs) have been applied to binary analysis to augment traditional tools by producing natural language summaries that cybersecurity experts can use for further analysis. Combining LLM-based binary code summarization with binary diffing can improve the LLM's focus on critical changes and enable complex tasks such as automated malware detection. To this end, we propose a novel framework for binary diff summarization using LLMs. We introduce a novel *functional sensitivity score* (FSS) that helps with automated triage of sensitive binary functions for downstream detection tasks. We create a *software supply chain security* benchmark by injecting 3 different malware samples into 6 open-source projects, which yields 104 versions, 392 binary diffs, and 46,023 functions. On this benchmark, our framework achieves a precision of 0.95 and a recall of 0.71 for malware detection, showing high accuracy with few false positives. We outperform an existing industry-style rule-based baseline, achieving $\approx 4\times$ higher recall on malware detection while maintaining high precision. Across malicious and benign functions, we achieve an FSS separation of 3.0 points, confirming that FSS categorization can classify sensitive functions. We conduct a case study on the real-world XZ Utils supply chain attack; our framework correctly detects the injected backdoor functions with high FSS.
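As a reading aid for the metrics quoted in the abstract, the following is a minimal sketch of how per-function scores could be turned into precision, recall, and a malicious-versus-benign score separation. It is not the authors' implementation: the function name `triage_metrics`, the numeric score scale, the threshold, the use of a difference of means for "separation", and the toy data are all assumptions made for illustration.

```python
from statistics import mean


def triage_metrics(scores, labels, threshold):
    """Compute precision, recall, and mean-score separation.

    scores    -- per-function sensitivity scores (hypothetical scale)
    labels    -- True if the function is malicious, False if benign
    threshold -- hypothetical cutoff: scores >= threshold are flagged
    Assumes both classes are present in `labels`.
    """
    flagged = [s >= threshold for s in scores]
    tp = sum(f and l for f, l in zip(flagged, labels))          # flagged, malicious
    fp = sum(f and not l for f, l in zip(flagged, labels))      # flagged, benign
    fn = sum((not f) and l for f, l in zip(flagged, labels))    # missed, malicious

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # Separation taken here as the difference of class means (an assumption).
    malicious = [s for s, l in zip(scores, labels) if l]
    benign = [s for s, l in zip(scores, labels) if not l]
    separation = mean(malicious) - mean(benign)
    return precision, recall, separation


# Toy data, invented purely for illustration (not from the benchmark).
toy_scores = [9, 8, 3, 2, 1, 7]
toy_labels = [True, True, True, False, False, False]
precision, recall, separation = triage_metrics(toy_scores, toy_labels, threshold=6)
print(precision, recall, separation)
```

In this toy setting, raising the threshold trades recall for precision, which is the same trade-off behind a result that pairs high precision with more moderate recall.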

Supplementary Material: zip

Primary Area: other topics in machine learning (i.e., none of the above)

Submission Number: 20988
