arXiv:2506.18394v1  [cs.SE]  23 Jun 2025
Tracing Errors, Constructing Fixes: Repository-Level Memory
Error Repair via Typestate-Guided Context Retrieval
XIAO CHENG∗, University of New South Wales, Australia
ZHIHAO GUO∗, University of Technology Sydney, Australia
HUAN HUO, University of Technology Sydney, Australia
YULEI SUI, University of New South Wales, Australia
Memory-related errors in C programming continue to pose significant challenges in software development,
primarily due to the complexities of manual memory management inherent in the language. These errors
frequently serve as vectors for severe vulnerabilities, while their repair requires extensive knowledge of
program logic and C’s memory model. Automated Program Repair (APR) has emerged as a critical research
area to address these challenges. Traditional APR approaches rely on expert-designed strategies and predefined
templates, which are labor-intensive and constrained by the effectiveness of manual specifications. Deep
learning techniques offer a promising alternative by automatically extracting repair patterns, but they require
substantial training datasets and often lack interpretability.
This paper introduces LTFix, a novel approach that harnesses the potential of Large Language Models (LLMs)
for automated memory error repair, especially for complex repository-level errors that span multiple functions
and files. We address two fundamental challenges in LLM-based memory error repair: a limited understanding
of interprocedural memory management patterns and context window limitations for repository-wide analysis.
Our approach utilizes a finite typestate automaton to guide the tracking of error-propagation paths and
context trace, capturing both spatial (memory states) and temporal (execution history) dimensions of error
behavior. This typestate-guided context retrieval strategy provides the LLM with concise yet semantically rich
information relevant to erroneous memory management, effectively addressing the token limitation of LLMs.
Our framework has successfully repaired 37 out of 49 real-world memory errors derived from 14 open-source
projects that collectively comprise over a million lines of code. Compared to state-of-the-art memory error
APR tools, SAVER and ProveNFix, our approach correctly fixes 14.50× and 2.36× more errors, respectively.
Moreover, LTFix outperforms current open-source state-of-the-art LLM-based SWE-agent 1.0 by repairing
94% more errors while consuming 17M (41×) less tokens. We have also successfully repaired three critical
zero-day memory errors, with fixes that have been accepted and implemented by the original developers. These
results highlight a promising paradigm for repository-level program repair through program analysis-guided,
retrieval-augmented LLMs, combining formal verification strengths with neural model adaptability.
1
Introduction
Memory-related errors in C programming constitute a persistent and formidable challenge in
software development [6, 38, 69, 84, 88]. These errors frequently serve as vectors for zero-day attacks,
resulting in severe consequences such as data corruption [76, 86], denial-of-service incidents [5, 8],
and information leakage [37, 65]. The intrinsic complexity of memory management in C, coupled
with the language’s low-level semantics, renders the detection and repair of these errors both
labor-intensive and error-prone. Consequently, automated program repair (APR), which mitigates
the need for exhaustive manual error analysis and repair, has emerged as a critical research domain
in software engineering over the past decade [26, 35, 36, 52].
Conventional APR approaches [25, 28, 38, 46, 47, 55, 66, 72] rely on predefined repair templates
and expert-crafted heuristics. The manual creation of repair rules is labor-intensive, and the
∗Xiao Cheng and Zhihao Guo contributed equally to this work.
Authors’ Contact Information: Xiao Cheng, jumormt@gmail.com, University of New South Wales, Sydney, NSW, Australia;
Zhihao Guo, guodududu@gmail.com, University of Technology Sydney, Sydney, NSW, Australia; Huan Huo, huan.huo@
uts.edu.au, University of Technology Sydney, Sydney, NSW, Australia; Yulei Sui, y.sui@unsw.edu.au, University of New
South Wales, Sydney, NSW, Australia.
, Vol. 1, No. 1, Article . Publication date: June 2025.

Xiao Cheng
effectiveness of repairs is constrained by the precision and completeness of human-specified
rules. Moreover, even when memory errors can be successfully replayed and their root causes
identified, the repair process can be still challenging without a comprehensive understanding of
repository-level context and program semantics [28]. For instance, addressing memory leaks may
require meticulous handling of complex data structure deallocation, while resolving use-after-
free errors demands careful consideration of pointer aliasing relationships and strategic control
flow restructuring (see examples in Figures 2 and 7). These complexities necessitate not only an
understanding of program-specific memory management paradigms but also a comprehensive
grasp of broader program semantics to ensure overall program correctness rather than merely
eliminating the immediate error manifestation.
Recent advancements in deep learning (DL) have emerged as a promising direction to address
these limitations through their capability to learn complex program semantics and repair patterns
from large-scale codebases [9, 14, 31, 40, 41, 44, 71, 95, 96]. In contrast to rule-based approaches, DL
methods can automatically extract and generalize fix patterns across diverse contexts, eliminating
the need for manually crafted repair templates. These approaches excel at capturing intricate
relationships between code structure, program semantics, and memory management patterns,
potentially enabling more sophisticated and context-aware repairs. However, DL-based approaches
require substantial amounts of memory error repair data for training, which is particularly chal-
lenging to obtain given the relative scarcity and complexity of these errors in real-world codebases.
Additionally, the inherent opacity of deep learning models—their "black box" nature—obscures the
reasoning behind generated fixes, potentially undermining developer confidence in their effective-
ness and making it difficult to validate the correctness of proposed repairs.
Large Language Models (LLMs) have emerged as a compelling alternative to both traditional rule-
based and DL-based approaches for automated memory error repair. Unlike conventional methods
constrained by predefined templates and expert-crafted rules, LLMs leverage their extensive training
on vast code repositories and natural language corpora to comprehend complex program semantics.
Furthermore, in contrast to DL approaches that demand substantial memory-error-specific training
data, LLMs can utilize their generalized understanding of comprehensive codebases to address
repair tasks with limited domain-specific examples. However, despite these inherent advantages, the
potential of LLMs for memory error repair remains largely unexplored. While existing LLM-based
approaches [33, 34, 79, 81–83, 94] have demonstrated success in general bug-fixing scenarios, they
fall short in addressing the unique challenges of memory error repair.
Memory errors frequently manifest as interprocedural phenomena requiring comprehensive
semantic understanding across entire repositories—a complexity that exceeds the capabilities of
current approaches primarily focused on localized fixes within isolated functions or files. To illus-
trate this challenge empirically, consider a use-after-free error (detailed in §3). This error manifests
across six distinct functions, with a correct repair requiring synchronized modifications at three
separate locations. In our experimental evaluation, we found that LLMs, even when provided with
both the error-triggering function and its complete calling chain alongside a precise error root cause
analysis, consistently failed to generate semantically correct patches. This limitation stems funda-
mentally from the inherent complexity of repository-level memory error repair, which demands a
comprehensive understanding of long-range memory management contexts and interprocedural
dependencies. These challenges can be characterized as follows:
Challenge#1: Limited understanding of interprocedural memory management patterns.
Memory errors in production systems frequently manifest through sophisticated management
patterns that span multiple functions, involve intricate data structures, and exhibit complex pointer
aliasing relationships. While LLMs have demonstrated remarkable capability in comprehending
, Vol. 1, No. 1, Article . Publication date: June 2025.

Tracing Errors, Constructing Fixes: Repository-Level Memory Error Repair via Typestate-Guided Context Retrieval
intraprocedural program structures [39], their ability to grasp the nuanced semantics of memory
errors—particularly those with complex interprocedural triggering patterns—remains limited.
Challenge#2: Context window limitations for repository-wide analysis. Memory errors
often span multiple functions and execution contexts across a repository, with both error-triggering
logic and corresponding repair patches potentially distributed throughout various functions in
large-scale production codebases. This presents a fundamental challenge for LLMs, which are
constrained by token limitations and exhibit performance degradation with extensive prompts—a
phenomenon known as “lost in the middle” [43]. These constraints significantly impair LLMs’ ability
to process and reason about the comprehensive contextual information necessary for effective
memory error analysis and repair.
Drawing inspiration from established developer practices for memory error resolution, we pro-
pose an approach that methodically emulates this systematic debugging process. When addressing
memory errors, developers typically follow three essential steps: 1) reproducing the error through
test cases and specialized program analysis tools such as ASan [63]; 2) deploying debugging utilities
like GDB [21] to establish strategic breakpoints and comprehensively analyze program dependen-
cies and execution context; and 3) applying domain-specific knowledge of safety specifications to
implement semantically robust fixes. Based on these observations, we present LTFix, a novel LLM-
based approach for automatic memory error repair in C programs that leverages context-aware
retrieval augmented by typestate analysis [68]. Our architecture positions the LLM as a reasoning
engine that operates synergistically with established program analysis techniques and debugging
infrastructure to facilitate comprehensive error comprehension and correction. This integration
effectively bridges the gap between LLMs’ linguistic capabilities and the specialized contextual
understanding required for effective memory error repair.
To address Challenge#1, we formalize memory error semantics through tracking the error-
propagation path and monitoring program context transitions. The propagation path encodes
the temporal execution history leading to error manifestation, while the context at each critical
program point encapsulates both memory management states and detailed backtrace informa-
tion of interprocedural calling chains. This comprehensive representation enables LLMs to model
complex state evolution patterns across function boundaries by preserving both spatial (memory
states) and temporal (execution history) dimensions of error behavior. In response to Challenge#2,
we introduce an efficient context collection strategy governed by a finite typestate automaton
(FTA) [4, 11, 13, 18, 68]. Rather than exhaustively capturing contexts at every execution point—which
would overwhelm LLMs’ context windows—the FTA guides the collection process by monitoring
object lifetime phases from allocation to error manifestation, triggering context snapshots exclu-
sively at semantically significant state transitions. This selective approach generates a concise yet
semantically rich context trace that preserves essential memory management semantics while
enabling LLM-based repair to scale effectively to repository-wide analysis.
Figure 1 provides an overview of our framework consisting of three phases:
(a) Error Replay. The input to this framework is a flawed code repository and its proof of
concept (PoC) input, which then undergo a preliminary analysis that uses a dynamic analysis tool
to replay the memory error and pinpoint the error-triggering location and the specific error type.
(b) Typestate-Guided Context Retrieval. Guided by a finite typestate automaton (FTA), we
use a debugger to execute the program step-by-step, extracting the error-propagation path from the
nearest related memory allocation to the error-triggering point. We then construct a context trace
of the memory error, starting from the memory allocation and following each memory operation
across the program. The context tracing is guided by the FTA, allowing for efficient tracking of
program contexts only at typestate-changing breakpoints. Each context includes three elements:
, Vol. 1, No. 1, Article . Publication date: June 2025.

Xiao Cheng
(b) Typestate-Guided Context Retrieval
(a) Error Replay
(c) Prompting LLM for Program Repair 
Finite Typestate Automaton
2
3
1
Context Trace
#1
#2
#3
alloc
free
use
Backtrace
Location
...
...
...
Flawed
Repository
Error
Replay
Error Report
Error: "Use-after-free"
Line: 18
Method: "foo"
File: "error.c"
Debugger
Error Report
Role Play
Structured Prompting
Error-Propagation
Path
Patch and
Explanation(s)
live
dead
UAF
PoC
Input
...
2
3
1
alloc
free
use
...
...
...
Error-Propagation
Path
alloc
use/realloc
free
use
set_null
*
alloc
uninit
:uninit
:live
:dead
:UAF
Typestate
Transition
Context Trace
Fig. 1. An overview of our framework.
the location of the current point, the typestate transition, and the backtrace of the calling stack
gathered at the current breakpoint.
(c) Prompting LLM for Program Repair. Finally, we design a multi-step structured prompting
method that incrementally deliver role and task description [64], error report, context trace and
error-propagation path to the LLM for generating an appropriate patch and explanation(s).
Our major contributions are as follows:
• We introduce LTFix, the first repository-level memory error repair system that synergizes Large
Language Models (LLMs) with context-aware semantic retrieval and typestate-guided program
analysis. Our methodology advances conventional APR by enabling LLMs to systematically
derive correct patches through establishing memory error semantics via interaction with runtime
debuggers and formal typestate verification.
• We propose a novel typestate-guided context retrieval methodology that precisely captures
and synthesizes critical memory error contextual information during typestate transitions. This
targeted approach enables LLMs to comprehend complex memory error semantics while main-
taining focus on concise, semantically relevant code segments, thereby significantly improving
repair accuracy and efficiency.
• We have established a real-world memory error database encompassing 49 errors, their proofs of
concept (PoCs), and their respective fixes across 14 projects, collectively containing over a million
lines of code. Our framework successfully repairs 37 memory errors, surpassing the performance
of current automated memory error repair tools. We also successfully addressed three zero-day
memory errors, with our solutions accepted and implemented by the original developers. Full
details will be disclosed upon paper acceptance.
2
Preliminaries and Problem Formulation
This section establishes the theoretical foundation for memory error detection using typestate
analysis and examines the capabilities of Large Language Models (LLMs) in code comprehension.
We then formulate the automated program repair problem for memory errors within the context of
LLM-based approaches.
2.1
Memory Errors and Typestate Analysis
Memory errors in C programming often stem from improper memory management, necessitating
careful tracking of memory’s temporal properties along the program’s control flow. Typestate
analysis [4, 13, 18, 68] emerges as an effective method for detecting and understanding these errors,
as it monitors execution logic by tracking the temporal state changes of memory objects. This
approach represents different states of a given memory object and their transitions using a finite
typestate automaton (Definition 1), allowing for precise modeling of memory object lifecycles. By
, Vol. 1, No. 1, Article . Publication date: June 2025.

Tracing Errors, Constructing Fixes: Repository-Level Memory Error Repair via Typestate-Guided Context Retrieval
Table 1. Finite typestate automata of use-after-frees (UAF), double-frees (DF) and memory leaks (ML).
AUAF = ⟨Σ, T,𝑇𝑢,𝛿,𝑇UAF⟩
ADF = ⟨Σ, T,𝑇𝑢,𝛿,𝑇DF⟩
AML = ⟨Σ, T,𝑇𝑢,𝛿,𝑇ML⟩
T = {𝑇𝑢,𝑇𝑙,𝑇𝑑,𝑇UAF}
T = {𝑇𝑢,𝑇𝑙,𝑇𝑑,𝑇DF}
T = {𝑇𝑢,𝑇𝑙,𝑇𝑑,𝑇ML}
Σ = {alloc, free, use, realloc, set_null}
Σ = {alloc, free, realloc, set_null}
Σ = {alloc, free, realloc, exit}
alloc
use/realloc
free
use
set_null
alloc
*
alloc
realloc
free
free
set_null
alloc
*
alloc
realloc
free
exit
alloc
*
𝑇UAF: Use-after-free error state.
𝑇DF: Double-free error state.
𝑇ML: Memory leak error state.
𝑇𝑢: Uninitialized state (memory object is not yet allocated); 𝑇𝑙: Live state (memory object is allocated and in use);
𝑇𝑑: Dead state (memory object is released).
use: Use a heap object; set_null: Set the pointer pointing the heap object to null.
exit: Return from main function
alloc: Allocate a heap memory object; realloc: Reallocate a heap memory object; free: Free a heap object;
capturing the various states a memory object can occupy (such as allocated, initialized, or freed)
and tracking transitions between these states during program execution, typestate analysis can
identify violations of expected state sequences—often indicative of memory errors.
Definition 1 (Finite Typestate Automaton). A finite typestate automaton (FTA) for an error ET
is a quintuple denoted as AET = ⟨Σ, T,𝑇𝑢,𝛿,𝑇ET⟩. The language Σ signifies the operations (e.g.,
function calls) that can be performed on the typestates. T encompasses all the possible typestates,
with 𝑇𝑢∈T representing the initial state. 𝛿: (T × Σ) →T is the state-transition table encoding
the effects of operations in Σ. 𝑇ET is the error typestate indicating a potential error detected. For a
program statement 𝑠, we use op(𝑠) to retrieve its corresponding operation in Σ.
Specifications for Memory Errors. In this paper, we focus on three critical yet difficult-to-fix
memory errors [28]: use-after-frees [51], double-frees [50], and memory leaks [49]. Table 1 presents
the specifications of these errors in the form of finite typestate automata. Each automaton is
represented as a graph where each node corresponds to a specific typestate, and the edges between
these nodes are annotated with transition operations, illustrating the transition relationships
between different states. The analysis process begins with an uninitialized typestate, denoted as 𝑇𝑢,
and then may advance through different typestate transitions depending on the program statements
encountered. For example, 𝑇𝑢transitions to 𝑇𝑙upon encountering a memory allocation statement,
such as the primitive heap allocation API malloc. If released memory (𝑇𝑑) is used or freed again, it
transitions to 𝑇UAF and 𝑇DF, representing a use-after-free or double-free error as per AUAF and ADF
respectively. Similarly, a live heap memory object (𝑇𝑙) transitioning to 𝑇ML at program exit as per
AML indicates a memory leak.
2.2
Large Language Models for Code Comprehension
Large Language Models (LLMs) demonstrate exceptional proficiency in code comprehension and
manipulation, establishing themselves as powerful tools for software engineering tasks. Trained
on extensive corpora encompassing both natural language and source code, these models exhibit
sophisticated capabilities in parsing and interpreting complex program structures. Recent research
has shown that LLMs excel in several domains critical for program analysis: they effectively inter-
face with external APIs [73], resolve indirect call relationships [10], and conduct nuanced data flow
analyses [74, 75]. Their capacity extends to recognizing intricate program dependencies, evaluat-
ing path conditions, and inferring control flow constructs [30], which enables robust reasoning
about diverse execution paths, including those within iterative structures [39]. Moreover, these
, Vol. 1, No. 1, Article . Publication date: June 2025.

Xiao Cheng
models demonstrate significant utility in understanding and synthesizing formal verification com-
ponents, such as loop invariants [59] and contractual specifications including preconditions and
postconditions [80]. These advanced code comprehension capabilities of LLMs present compelling
opportunities for their application in automated memory error repair—the primary focus of our
research.
2.3
Problem Formulation
Our objective is to automatically generate a patch to repair a flawed C repository with memory
errors by leveraging the power of LLM and the prompts constructed based on FTA specifications
(Table 1). The patch should fix the memory error and not introduce new bugs. Formally, let 𝑃be
the original flawed C repository (the error type is ET) and 𝐼be a specific proof of concept (PoC)
input. We generate a set of prompts ⟦𝑄⟧via the function M:
⟦𝑄⟧= M(𝑃, AET, 𝐼)
The LLM’s role is to take ⟦𝑄⟧to generate a correct patch Δ𝑃:
Δ𝑃= LLM(⟦𝑄⟧)
The patch Δ𝑃, when applied to 𝑃, should yield a updated repository 𝑃′ = 𝑃+ Δ𝑃that fixes the
error without introducing new bugs.
To ensure the patch’s effectiveness and reliability, we impose these constraints:
(1) The patched repository 𝑃′ should not exhibit the memory error when executed with 𝐼:
𝑓(𝑃′, 𝐼) ≠memory error
where 𝑓(𝑃, 𝐼) denotes the execution of 𝑃with input 𝐼.
(2) Let I represent the available test-suite for the repository. The patched repository 𝑃′ should
not introduce new bugs, meaning that for any test case in the provided test-suite 𝐼′ ∈I, the
execution 𝑓(𝑃′, 𝐼′) should produce correct results:
∀𝐼′ ∈I, 𝑓(𝑃′, 𝐼′) = expected result
We formulate our memory error APR problem as follows:
Given a flawed C repository 𝑃and a PoC input 𝐼, we design a method M to construct prompts
⟦𝑄⟧that guide the LLM to generate a patch Δ𝑃. The goal is to ensure that the patch Δ𝑃, when
applied to 𝑃, fixes the memory error for 𝐼and maintains correctness for its test-suite I.
3
A Motivating Example
Figure 2 illustrates the pipeline of LTFix, walking through the three phases depicted in Figure 1.
These phases are demonstrated using a use-after-free error. A heap memory object is initially
allocated by the create_context function, which wraps a malloc call at Line 27 in the test.c
file ( 1○). This memory is freed by invoking the release_context method, which calls a free
function at Line 37 in test.c ( 2○). However, the released memory is erroneously used in the
method clone_data at Line 21 in test.c ( 3○). This use-after-free error traverses six functions from
allocation ( 1○) to the error-triggering point ( 3○).
Challenges. The successful repair of this memory error poses four fundamental challenges. First,
it demands a thorough understanding of the allocated memory structure (Context), necessitating
proper deallocation and null pointer validation mechanisms for both the base object (Context) and
its associated fields (data). Second, the repair requires careful restructuring of operation sequences
to maintain program correctness, particularly ensuring that release_context executes after its
dependent operation copy_ctx, thus preventing critical logic in copy_ctx from being invalidated
, Vol. 1, No. 1, Article . Publication date: June 2025.

Tracing Errors, Constructing Fixes: Repository-Level Memory Error Repair via Typestate-Guided Context Retrieval
by premature null pointer checks. Third, the fix requires semantic comprehension of the codebase
to implement a deep copy operation using memcpy for src, rather than a potentially unsafe shallow
copy of src->data. Finally, the repair must correctly handle interprocedural interactions along the
error-propagation path, ensuring consistent memory management across function boundaries.
Existing Efforts. Notably, state-of-the-art memory error APR tools, including ProveNFix [66]
and SAVER [28], fail to repair this memory error even when being provided with the precise error
location and configured with their most sophisticated options, such as flow-sensitive analysis and
header file parsing. They fall short in understanding the hierarchical structure of the Context
object, consequently implementing only superficial null checks on src rather than comprehensively
addressing the whole structure. Their repair strategies further exhibit semantic miscomprehension
by introducing premature returns after the release_context call which, while eliminating the
immediate error, prevents the execution of subsequent critical operations and compromises program
integrity. The LLM-based approach SWE-agent 1.0 [89] similarly fails to generate an appropriate
patch due to its inability to identify and reason about the interprocedural execution logic of this error.
Furthermore, our attempts to generate a patch using only the error report and the error-triggering
function clone_data, or even the complete calling chain or test.c file, were unsuccessful.
LTFix Approach. In Phase (a), the use-after-free error is replayed and confirmed. Phase (b)
identifies the relevant memory operations according to the finite typestate automaton AUAF in
Table 1, extracting the error-propagation path and relevant program contexts. Phase (c) feeds the
error report, context trace, and error-propagation path from the previous phases into the LLM, which
infers the correct patch and provides clear explanations for its decisions, as shown at the bottom of
Figure 2. The generated patch demonstrates sophisticated interprocedural context and semantics
awareness, implementing coordinated modifications across three distinct code segments spanning
two different functions. This solution aligns precisely with the ground truth that eliminating this
error while preserving logical correctness of the project.
(a) Error Replay. Understanding the root cause of a memory error often requires replaying the
error. As illustrated in Figure 2(a), we reproduce this error using its proof of concept (PoC) input,
which is a specific assembly file. We employ Valgrind [54], a widely-used dynamic analysis tool, to
capture the core dump of the PoC input at the moment the error is triggered—specifically, at Line
21 in the test.c file. This core dump provides a snapshot of the memory state at the time of the
crash. Based on the core dump, we generate an error report that offers the specific error type and
error location to the LLM.
(b) Typestate-Guided Context Retrieval. We employ a finite typestate automaton (AUAF in
Table 1) to facilitate the extraction of the error-propagation path using the GDB debugger [21].
This path begins with the memory allocation at 1○, proceeds through the memory free at 2○, and
ultimately stops at the error-triggering point at 3○, as indicated by an error typestate. Along the path,
we use AUAF to guide typestate transitions of the erroneously accessed memory object obtained from
the previous step. This allows us to monitor the typestate changes and the corresponding contexts
at each typestate change breakpoint with the GDB debugger. This rigorous tracing enables the LLM
to comprehend the structure of the allocated memory object (e.g., Context), the interprocedural
evolution of the error across the code repository, and the broader program semantics (logical
correctness) along the execution trace.
The key transitions in this model are triggered by the malloc, free, and use statements, which
occur at 1○, 2○, and 3○, respectively. Upon invoking the malloc function, denoted as the alloc
operation in our FTA, the typestate of the memory object shifts to live, indicating that the memory
is currently in use. Subsequently, when the free API is called, the typestate changes to dead,
signifying that the memory has been released. However, if the same memory object is used after
it is released, the typestate changes to error, indicating a use-after-free error. At each typestate
, Vol. 1, No. 1, Article . Publication date: June 2025.

Xiao Cheng
(c) Prompting LLM for Program Repair 
(b) Typestate-Guided Context Retrieval
(a) Error Replay
Patch and Explanations
@@ test.c
    ... 
36  free(ctx->data);
37  free(ctx);
Error Report
Code
Sanitizer
live
UAF
from uninit to live via alloc
test.c:27
test.c:36
from live to dead via free
Backtrace: 
#0 ...ctx.c:36:free(ctx);
#1 ...ctx.c:48:release_context(ctx);
#...
test.c:21
from dead to error via use
Backtrace:
#0 ...ctx.c:21: new_ctx->data = src->data;
#1 ...ctx.c:41: Context* copy_ctx = clone_data(ctx);
#...
Context Trace
You are a good
programmer specialized
in C programming. You
can fix...
Here is the
context
trace...
Here is the error report...
Here is the error-
triggering path...
1
2
3
4
--- a/test.c +++ b/test.c
@@ -47,2 +48,3 @@ void put_ctx() {
-    release_context(ctx);  
     Context* ctx2 = copy_ctx(ctx);
+    release_context(ctx);  
+    ctx = NULL; 
Explanation:  It ensures source validity before allocation and allocates
memory only for the data size. It uses memcpy instead of direct
assignment to safely copy data.
--- a/test.c +++ b/test.c
@@ -21,2 +21,4 @@ Context* clone_data(Context* src) {
+    if (!src || !src->data) return NULL;
     Context* new_ctx = malloc(sizeof(Context));
-    new_ctx->data = src->data;    
+    new_ctx->data = malloc(sizeof(int) * src->size);  
+    memcpy(new_ctx->data,src->data,sizeof(int)*src->size);
Explanation: it ensures
"ctx" is copied before
being released. It also
sets "ctx" pointer to
NULL after release.
2
@@ test.c
   ...
26 Context* create_ctx(...){
27 Context* ctx = malloc
(sizeof(Context));
@@ test.c
   ...
21 new_ctx->size = src->size;
3
2
1
......
......
Error-Propagation Path
Source Code and Typestate
Transitions
...
...
...
test.c:27...
......
test.c:36...
......
test.c:21...
uninit
Backtrace: 
#0 ...ctx.c:27:Context* ctx=malloc(sizeof(Context));
#1 ...ctx.c:47:Context* ctx=create_ctx(10);
#...
Error: Use-after-free
Line: 21
Method: clone_data
File: test.c
Error: Use-after-free
Line: 21
Method: clone_data
File: test.c
1
3
dead
#1
#2
#3
#1
...
...
...
#2
...
...
...
#3
GDB
Flawed
Repository
PoC
Input
Fig. 2. A motivating example illustrating how LTFix repairs a use-after-free error.
change point, we extract its associated context, which includes the typestate transition, the location,
and the backtrace. The program contexts at the three typestate change points collectively form
a context trace. For instance, in the final error context, the typestate transition indicates a shift
from dead to error due to a use operation. The location, test.c:21, specifies the file path and
line number. The backtrace, gathered from the debugger at this breakpoint, provides the call stack
details, illustrating how the use operation was invoked by clone_data and other higher-level
callers in the codebase.
(c) Prompting LLM for Program Repair. In this phase, depicted in Figure 2(c), we employ
structured prompting [27] to break down the prompts into four structured steps: role and task
description [64], error report, context trace, and error-propagation path. These segments of infor-
mation are systematically fed to the LLM step by step. This method effectively deconstructs the task
of program repair into increasingly detailed and specific stages, allowing the LLM to progressively
comprehend and tackle the error.
4
LTFix Approach
In this section, we detail our LTFix approach. We first identify the specifics of the memory error
(§4.1), then utilize typestate-guided context retrieval to understand its semantics (§4.2). Finally, we
leverage this information to generate a patch and explanation(s) via LLM (§4.3).
, Vol. 1, No. 1, Article . Publication date: June 2025.

Tracing Errors, Constructing Fixes: Repository-Level Memory Error Repair via Typestate-Guided Context Retrieval
4.1
Error Replay
In this paper, our emphasis is not on detecting memory errors but on repairing validated true
errors. The first step of our approach involves reproducing the error and identifying the associated
bug report, which includes the type of memory error and its location. To achieve this, we utilize a
dynamic analysis tool (DAT) such as ASan [63] or Valgrind [54] during compilation and generate
debuggable files following the DWARF 5 [15] standard. We then use specific proofs of concept
(PoCs) to reproduce the error and employ DAT to generate detailed error reports. At the error-
triggering point, the DAT outputs information about the type of memory error, the location, and
the specific memory address that is erroneously accessed (i.e., the error address).
4.2
Typestate-Guided Context Retrieval
This section elaborates on how we enhance LLM prompts with a deep understanding of memory
error semantics while optimizing token usage. We achieve this by employing typestate finite state
automata (Definition 1) to guide the derivation of a comprehensive yet precise error-propagation
path (Definition 5) and context trace (Definition 6) that capture the interprocedural contextual
evolution linked with the memory error. This serves as a key component for patch generation in
the subsequent phase.
We first demonstrate how to construct the full execution path of the error (Definition 2) and the
typestate-changing context map (Definition 4) using Algorithm 1. The context map records the
program context (Definition 3) specifically at the statements introducing typestate changes. Using
the full path and context map as a foundation, we then illustrate how to build the error-propagation
path as explained in the rule in Figure 4 and the context trace as outlined in Figure 6. We also
discuss the correctness of each algorithm and inference rule.
Definition 2 (Full Execution Path 𝜋). A full execution path, denoted as 𝜋= (𝑠𝑖)𝑛
𝑖=1, is a chronolog-
ically ordered sequence of program statements, each assigned an index, such that 𝑠𝑖= ⟨sc(𝑠𝑖),𝑖⟩.
Here, sc(𝑠𝑖) retrieves the source code associated with 𝑠𝑖, while 𝑖represents the position of this
statement in the sequence, indicating the execution order. This sequence starts from the program’s
entry point and extends all the way to the error-triggering statement.
Note that for any 𝑠𝑖,𝑠𝑗in 𝜋where 𝑖≠𝑗, their corresponding source code can be identical, i.e.,
sc(𝑠𝑖) = sc(𝑠𝑗), because a single code line can be invoked or executed multiple times throughout
the program’s run, depending on the control flow of the program.
Definition 3 (Program Context Ctx). Given a full execution path 𝜋, for any statement 𝑠𝑖∈𝜋, the
program context of 𝑠𝑖is defined as:
Ctx𝑠𝑖= ⟨lc, tr, cp⟩
where:
• lc denotes the program location (file path and line number) of 𝑠𝑖,
• tr represents the typestate transition at 𝑠𝑖, encapsulating the pre- and post-typestates and
the memory operation, and
• cp signifies the backtrace of the call path, showing the call sequence leading up to the current
point of execution. Specifically, it records each function call along with its source location
and the corresponding source code at that location.
, Vol. 1, No. 1, Article . Publication date: June 2025.

Xiao Cheng
Algorithm 1: Full execution path and typestate-changing context map construction.
Input: addr𝑒: Memory address erroneously accessed (error address); 𝑃: Target program; 𝐼: PoC input;
AET = ⟨Σ, T,𝑇𝑢,𝛿,𝑇ET⟩: Finite typestate automaton;
Output: 𝜋: Full execution path; tMap: Map from typestate-changing statements to program contexts
1 Function constructPi(addr𝑒, P, I, AET):
2
GDB.execute(“file ”+ 𝑃);
// Set the program to execute
3
GDB.execute(“set arg ”+ 𝐼);
// Set the program PoC input
4
GDB.execute(“start”);
5
𝜋←(); tMap ←{}; 𝑇←𝑇𝑢; 𝑖←1;
6
frame ←GDB.selected_frame();
7
while 𝑇≠𝑇ET do
8
ℓ←frame.code_line;
9
𝑠←⟨ℓ,𝑖⟩;
10
if op(𝑠) ∈Σ ∧frame.addr = addr𝑒then
// Whether the operation of 𝑠belongs to
the FTA language and 𝑠manipulates the error address
11
if (𝑇′ ←𝛿(𝑇, op(𝑠))) ≠𝑇then
// Typestate changes
12
backtrace ←GDB.execute("backtrace");
13
Ctx𝑠←{frame.location, transition(𝑇,𝑇′, op(𝑠)), backtrace};
14
tMap ←tMap ∪{𝑠↦→Ctx𝑠};// Record typestate-changing context in tMap
15
𝑇←𝑇′;
16
𝜋←𝜋◦𝑠;
// Append 𝑠to 𝜋
17
𝑖++;
18
GDB.execute("step");
19
frame ←GDB.selected_frame();
20
return 𝜋, tMap;
Definition 4 (Typestate-Changing Context Map tMap). A typestate-changing context map tMap
associates statements inducing typestate changes with their respective program contexts. Impor-
tantly, each memory object manipulated by the statements in tMap’s key set must be aliased with
the object at the error-triggering point. This aliasing is evidenced by their shared error address.
Full Execution Path and Typestate-Changing Context Map Construction. Algorithm 1
outlines the construction of the full execution path (𝜋) and the typestate-changing context map
(tMap) using the GNU Debugger (GDB) [21]. The algorithm initializes GDB and runs the program
with input 𝐼. Within the loop (Lines 7-19), it constructs 𝜋and tMap step-by-step until an error
state occurs. Each loop iteration, representing a program execution step, appends the statement to
𝜋(Line 16), and collects data from the current stack frame. If an operation manipulates the error
address and causes a typestate change (Lines 10-11), the algorithm inserts the program context,
including the current source code line, typestate transition and backtrace, into tMap (Lines 13-14).
Example 1. Figure 3(a) shows a double-free error at ℓ3 (invoked from ℓ15). In this example, the
program execution flows through 10 steps as follows:
𝜋= (𝑠𝑖)10
𝑖=1 = (⟨ℓ9, 1⟩, ⟨ℓ6, 2⟩, ⟨ℓ10, 3⟩, ⟨ℓ11, 4⟩, ⟨ℓ12, 5⟩, ⟨ℓ6, 6⟩, ⟨ℓ13, 7⟩, ⟨ℓ14, 8⟩, ⟨ℓ15, 9⟩, ⟨ℓ3, 10⟩)
where ℓ𝑛refers to the source code at Line 𝑛for this example. The process of typestate transition
and the corresponding results are illustrated in Figure 3(b). The statements that induce typestate
, Vol. 1, No. 1, Article . Publication date: June 2025.

Tracing Errors, Constructing Fixes: Repository-Level Memory Error Repair via Typestate-Guided Context Retrieval
1   void dsy(int* i)
2   {
3      free(i);
4   }
5   int* xmalloc(int size){
6      return malloc(size);
7   }
8   int main(){
9     int* i = xmalloc(10);
10    free(i);       //safe
11    i = NULL;
12    i = xmalloc(10);
13    int* j = malloc(10);
14    free(i);      // safe
15    dsy(i);       // error
16  }
(alloc)
 
entry
(a) Source Code
(b) Typestate Transitions
(free)
(set_null)
(free)
(free)
(c) Program Contexts
                   Ctxs6 
Location: df.c:6                     
Typestate transition: 
    from unint to live via alloc
Backtrace:
    #0 ... in xmalloc (...) at df.c:6:
         return malloc(size);
     #1 ... in main () at df.c:12: 
        i = xmalloc(10);
......
......
(alloc)
Fig. 3. An example of typestate transitions and program contexts.
changes are as follows:
𝑠2 : ⟨ℓ6, 2⟩,
𝑠3 : ⟨ℓ10, 3⟩,
𝑠4 : ⟨ℓ11, 4⟩,
𝑠6 : ⟨ℓ6, 6⟩,
𝑠8 : ⟨ℓ14, 8⟩,
𝑠10 : ⟨ℓ3, 10⟩
𝑠7 : ⟨ℓ13, 7⟩does not belong to these statements because it operates on a different address from
addr𝑒. Accordingly, we construct the typestate context map tMap as follows:
tMap = {𝑠2 ↦→Ctx𝑠2,𝑠3 ↦→Ctx𝑠3,𝑠4 ↦→Ctx𝑠4,𝑠6 ↦→Ctx𝑠6,𝑠8 ↦→Ctx𝑠8,𝑠10 ↦→Ctx𝑠10}
where Ctx𝑠𝑖denotes the program context at statement 𝑠𝑖. The mapping (𝑠𝑛↦→Ctx𝑠𝑛) represents
the association between the statement 𝑠𝑛and its corresponding program context Ctx𝑠𝑛, as defined
in Definition 4. For instance, the program context Ctx𝑠6, depicted in Figure 3(c), includes the
statement’s location, detailed typestate transition information, and backtrace data. The backtrace
clearly shows the call stack from the main function to the xmalloc function as well as the exact
line of source code at each stack frame.
Lemma 4.1 (Correctness of Algorithm 1). Algorithm 1 correctly constructs 𝜋according to
Definition 2 and tMap according to Definition 4.
Proof Sketch. In Algorithm 1, Lines 2-4 ensure that 𝑠1 is the program’s entry point. The
condition in Line 7 guarantees loop termination at the error state, which indicates an error-
triggering statement (according to Definition 1). Lines 18-19 ensure that the loop follows the
program execution order. Thus, the construction of 𝜋aligns with Definition 2. Additionally, the
conditions in Lines 10 and 11 conform to Definition 4 for constructing tMap.
□
Definition 5 (Error-Propagation Path 𝜋𝑒). An error-propagation path 𝜋𝑒= (𝑠𝑖)𝑛
𝑖=𝑚is a subsequence
of indexed program statements that extends from the memory allocation statement 𝑠𝑚, where
0 ≤𝑚< 𝑛∧op(𝑠𝑚) = alloc, to the error-triggering point 𝑠𝑛. 𝑠𝑚is the closest memory allocation to
𝑠𝑛on the full path 𝜋where the allocated memory address is the error address accessed at 𝑠𝑛.
Typestate-Guided Error-Propagation Path Extraction. The inference rule depicted in Figure 4
is used to extract the error-propagation path, 𝜋𝑒, from the full execution path, 𝜋(Definition 2).
This rule helps reduce the code lines and isolate the error code from the executed code. It works
by identifying a particular statement, 𝑠𝑚, associated with an operation that changes the typestate
allocation, guided by the tMap, which tracks these typestate changes (as per Definition 3). The rule
ensures that any statements on the execution path 𝜋falling between𝑠𝑚and𝑠𝑛do not manipulate the
error address or involve memory allocation. This isolates the statements likely to be the root cause
of the error, similar to a debugging process that narrows down and focuses only on problematic
code segments, effectively reducing the length of the message input into the large language model.
, Vol. 1, No. 1, Article . Publication date: June 2025.

Xiao Cheng
[ETP] 𝜋= (𝑠𝑖)𝑛
𝑖=1 𝑠𝑚∈KS(tMap) ∧op(𝑠𝑚) = alloc ∧(∀𝑘,𝑚< 𝑘≤𝑛: 𝑠𝑘∉KS(tMap) ∨op(𝑠𝑘) ≠alloc)
𝜋𝑒←(𝑠𝑖)𝑛
𝑖=𝑚
Fig. 4. The inference rule for typestate-guided error-propagation path extraction. KS(tMap) = {𝑠𝑖| (𝑠𝑖↦→
Ctxsi) ∈tMap} represents the key set of tMap.
Location: df.c:6                     
Typestate transition: 
    from unint to live via alloc
Backtrace:
     #0 ... in xmalloc (...) at df.c:6: 
        return malloc(size);
     #1 ... in main () at df.c:12: i = xmalloc(10);
#1
 Location: df.c:14                   
Typestate transition:                            
    from live to dead via free  
Backtrace:
      #0 ... in main () at df.c:14: free(i); 
Location: df.c:3                     
Typestate transition: 
    from dead to error via free
Backtrace:
     #0 ... in dsy (i=...) at df.c:3:free(i);         
     #1 ... in main () at df.c:15:dsy(i);
#2
#3
Fig. 5. An example of context trace by revisiting Example 1.
Example 2. The error-propagation path of Example 1 is:
𝜋𝑒= (𝑠𝑖)10
𝑖=6 = (⟨ℓ6, 6⟩, ⟨ℓ13, 7⟩, ⟨ℓ14, 8⟩, ⟨ℓ15, 9⟩, ⟨ℓ3, 10⟩)
The path starts from 𝑠6, which is the nearest allocation operation preceding the error-triggering
point 𝑠10 on 𝜋. Although 𝑠7 : ⟨ℓ13, 7⟩lies between these two points, it is not designated as the
starting point because the variable 𝑗does not point to the error address. The operation at 𝑠8 : ⟨ℓ14, 8⟩
does manipulate the error address, but it does not qualify as a starting point because it does not
involve a memory allocation operation.
Correctness of Rule [ETP]. According to Definition 4 and Lemma 4.1, tMap records only the
statements related to the erroneously accessed memory address addr𝑒. Therefore, the constraints
in the rule guarantee that 𝑠𝑚is the error memory allocation (𝑠𝑚∈KS(tMap) ∧op(𝑠𝑚) = alloc) that
is the closest to 𝑠𝑛on 𝜋(∀𝑘,𝑚< 𝑘≤𝑛: 𝑠𝑘∉KS(tMap) ∨op(𝑠𝑘) ≠alloc).
□
Definition 6 (Context Trace g
Ctx𝑒). The context trace of an error-propagation path 𝜋𝑒is formally
defined as g
Ctx𝑒= (Ctx𝑠𝑖| 𝑠𝑖∈𝜋𝑒∧Sel(𝑠𝑖)), where each Ctx𝑠𝑖is the program context at 𝑠𝑖, and the
sequence follows the order of statements in 𝜋𝑒. Without loss of generality, the selection function
Sel is a predicate over the statements in 𝜋𝑒. In our definition, Sel(𝑠𝑖) returns true if 𝑠𝑖introduces
typestate changes of the memory object operated at the error-triggering point 𝑒.
[CXT]
𝑠𝑖∈𝜋𝑒
𝑠𝑖∈KS(tMap)
g
Ctx𝑒←g
Ctx𝑒◦tMap[𝑠𝑖]
Fig. 6. The inference rule for typestate-guided context trace construction. tMap[𝑠𝑖] retrieves the program
context related to 𝑠𝑖in tMap.
Typestate-Guided Context Trace Construction. Figure 6 shows the inference rule for con-
structing the context trace, denoted as g
Ctx𝑒, which captures the program contexts associated with
typestate-changing statements along an error-propagation path 𝜋𝑒. The rule iterates through each
statement 𝑠𝑖in 𝜋𝑒from index 𝑚to 𝑛. For each statement 𝑠𝑖, if it is a typestate-changing operation,
then the corresponding context is appended to g
Ctx𝑒.
Example 3. Figure 5 presents the context trace of Example 2. By iterating through the error-
propagation path, we append the relevant context information to the trace whenever a typestate-
changing operation is encountered in tMap. The resulting context trace, depicted in Figure 5,
consists of three program contexts, clearly showing the lifecycle of the memory, starting from its
allocation, through its release, and finally to its erroneous access.
, Vol. 1, No. 1, Article . Publication date: June 2025.

Tracing Errors, Constructing Fixes: Repository-Level Memory Error Repair via Typestate-Guided Context Retrieval
Correctness of Rule [CXT]. The rule ensures the order of context trace follows 𝜋𝑒(𝑠𝑖∈𝜋𝑒),
and also matches the semantics of Sel in Definition 6 (𝑠𝑖∈KS(tMap)) according to the definition
of tMap in Definition 4 and Lemma 4.1.
□
4.3
Prompting LLM for Error Repair
This section discusses how we utilize the information collected from the typestate-guided analysis
to prompt LLM for efficient memory error repair. The process begins by employing the role-play
technique [64] to define and motivate our task, followed by the use of structured prompting [27] to
direct the LLM in crafting a patch and explanation(s).
Role Play. Role play is an effective strategy known to enhance the precision of LLM outputs [64].
In our case, we configure the LLM to function as an APR tool, where it interprets error information
from the typestate-guided analysis and generates an appropriate patch with intuitive explanations.
Structured Prompting. We structure prompts to guide the LLM in generating patches by provid-
ing the necessary information to know the error location, understand the error context and path,
and then create a resolution. The error reports, along with the context trace and error-propagation
path guided by the typestate analysis, are formatted as prompts and fed into the LLM.
The error report guides the LLM in pinpointing the exact location of the memory error in the
codebase (§4.1). The initiation of the error report typically begins with a statement like, “Here is
the location of the use-after-free error in the provided code snippet”. The second component is the
context trace g
Ctx𝑒(Definition 6) that captures the context sequence of memory state transitions
leading to the error, enabling the LLM to comprehend the execution logic of the error. The final
component presents the error-propagation path 𝜋𝑒(Definition 5) to the LLM. This path, a critical
subsequence of the full execution path 𝜋, extends from the nearest memory allocation statement
to the error-triggering point on 𝜋, showing the program dependencies along an intact erroneous
memory management. Upon presenting all the necessary information for understanding the error
semantics and triggering logic, the LLM is then tasked with generating a patch to repair the error.
5
Evaluation
This section evaluates LTFix’s performance in repairing memory errors in real-world projects by
comparing it with two state-of-the-art memory error APR tools: SAVER [28] and ProveNFix [66].
LTFix successfully repairs 14.50× and 2.36× more memory errors than SAVER and ProveNFix,
respectively, while introducing no new errors. We also compare LTFix with LLM-based baselines
and conduct an ablation analysis to understand the contribution of each component.
5.1
Datasets and Implementation
Datasets. We first compare LTFix with SAVER and ProveNFix using the same dataset as
SAVER [28]. We meticulously reverse-engineer the vulnerability triggering conditions from SAVER’s
vulnerability set and construct proof of concept inputs to reproduce identical errors. From the
original collection, we exclude multi-threaded programs because our debugger-based approach
cannot effectively trace typestate transitions across multiple execution contexts, resulting in eight
projects for comparative evaluation. However, we identify limitations in SAVER’s dataset, including
its homogeneity and insufficient representation of real-world vulnerabilities. For instance, the same
version of recutiles [23] contains CVE-2019-6455 [57], which is absent from SAVER’s dataset. Fur-
thermore, fixes in SAVER’s dataset rely solely on manual verification without the crucial validation
provided by developer acceptance in real-world scenarios.
, Vol. 1, No. 1, Article . Publication date: June 2025.

Xiao Cheng
Table 2. The statistics and description of the real-world projects used in the evaluation. LoC stands for lines
of code. #File and #Error represent the number of files and memory errors, respectively.
No.
Project (version)
LoC
#File
#Error
Description
1
ls_extended (9d899c8) [16]
1,352
25
3
ls with coloring and icons
2
xHTTP (72f82d) [20]
1,493
6
1
HTTP server library
3
tree (v1.8) [22]
3,435
13
3
utility to display a tree view of folders
4
chibicc (90df7f) [61]
9,688
71
5
C compiler
5
stb (v2.8) [62]
12,076
2
3
single-file public domain libraries for C/C++
6
scrot (b5e5f0d) [70]
13,130
36
1
command line screen capture utility
7
mjs (b1b6eac) [7]
32,116
202
1
embedded JavaScript engine
8
SmallerC (b120a9c) [2]
58,535
510
14
C compiler
9
MyHTML (90a853e) [1]
63,617
168
2
Fast C/C++ HTML 5 Parser
10
quickjs (d378a9f) [17]
86,281
53
2
embedded JavaScript engine
11
recutiles (v1.8) [23]
92,000
757
5
tools and libraries to access recfiles
12
wasm3 (139076a) [67]
111,616
696
4
WebAssembly interpreter
13
Yasm (ffbd22c) [60]
201,975
945
4
assembler
14
radare2 (8644a29) [24]
879,785
2,953
1
reverse engineering framework
Total
1,567,099
6,437
49
To address these limitations, we further construct a comprehensive benchmark comprising 14
real-world, open-source C projects, encompassing approximately 1.57 million lines of code across
diverse application domains including system utilities, network protocols, and media processing
libraries. Table 2 presents detailed statistics and descriptions for these projects. All memory errors in
our benchmark have been confirmed by respective project developers, with 9 vulnerabilities assigned
CVE identifiers and others documented in official commit histories. To establish reliable ground truth
for validating correct fixes, we collect patches approved and implemented by project maintainers.
Additionally, we curate error-inducing test cases (proof of concept inputs) to consistently reproduce
each error during evaluation.
Implementation. Our experiments are conducted on an Ubuntu 22.04 server with a 24-cores 5.60
GHz Intel CPU and 64 GB of memory. We employ Valgrind (version 3.18.1) [54] and ASan [63] for
error replay (§4.1). We use the GDB Python API (with GDB version 12.1) and the GNU C Library
(version 2.35) to perform typestate-guided context retrieval (§4.2). To reduce the time overhead
for extracting error-propagation paths and context traces, we set the initial breakpoint at memory
allocations instead of the program’s entry point. We use Claude 3.5 Sonnet [3, 12] as our LLM (§4.3).
For each prompt, we apply the chain-of-thought [78] method to encourage the model to think
step-by-step, thereby generating coherent and logically consistent responses. Given the inherent
randomness of LLMs, it is necessary to ensure the reliability of our results. Therefore, we conduct
each experiment five times for every project. We only consider a result to be valid if it exhibits
consistency in at least four out of the five runs.
5.2
Baselines
We compare LTFix with six baselines spanning traditional state-of-the-art memory APR tools and
LLM-based approaches:
SAVER and ProveNFix. To the best of our knowledge, SAVER [28] and ProveNFix [66] rep-
resent the current state-of-the-art in memory error APR, capable of addressing memory leaks,
use-after-frees, and double-frees. We employ their open-source implementations for comparative
analysis. These tools rely on detection results from the static analysis tool Infer [48]. To ensure
fair comparison, we augment these tools with precise error locations, identical to those provided to
LTFix, to direct targeted repairs. Futhermore, to mitigate potential limitations in static analysis,
, Vol. 1, No. 1, Article . Publication date: June 2025.

Tracing Errors, Constructing Fixes: Repository-Level Memory Error Repair via Typestate-Guided Context Retrieval
we configure SAVER and ProveNFix with the most advanced parameters for Infer as described
in their respective publications. This configuration encompasses whole-program analysis of the
linked program, flow-sensitive analysis distinguishing control flows, and comprehensive header
file parsing [28, 66].
SWE-agent 1.0. Existing LLM-based APR tools primarily target Java and Python bugs and pre-
dominantly address single-hunk program repairs [85]. We evaluate LTFix against SWE-agent
1.0 [89] (hereafter referred to as SWE-agent), an open-source state-of-the-art LLM-based approach
that employs Chain-of-Thought reasoning and sophisticated prompt engineering to autonomously
invoke tools for error repair. To ensure methodological consistency and experimental validity, we
provide SWE-agent with identical reproducible errors, comprising comprehensive reproduction
workflows, toolchains, proof-of-concept inputs, compiled error versions, and precise trigger and
compilation commands.
LTFix-F and LTFix-M. To evaluate the capabilities of the base LLM used in LTFix without any
contextual retrieval from error trace, we implement two comparative baselines. The first baseline,
LTFix-F, incorporates the error report alongside program files directly implicated in the error-
triggering point. The second baseline, LTFix-M, extends this approach by including all functions
identified in the error backtrace in addition to the error report. Both baselines employ the same
structured role-play and prompting techniques delineated in §4.3 that are utilized in LTFix, thus
isolating the effect of context retrieval.
LTFix-NT. To understand the relative contributions of typestate-guided context retrieval to the
overall performance of LTFix, we conduct comprehensive ablation studies using LTFix-NT (LTFix
without context trace). LTFix-NT omits the context trace in the prompts while maintaining all
other components unchanged (such as bug report and error-propagation path), allowing us to
measure the specific impact of typestate-guided context trace information.
5.3
Evaluation Metrics
We collect the number of patches generated by each APR tool, denoted as #Δ. The number of
correct patches is represented as #Δ✓. To be classified as correct, a patch must: 1) successfully fix
the memory error; 2) preserve the expected outcomes for the project’s test suite; and 3) be manually
validated to align with the ground truth, adhering to the standards outlined in [42]. We use #Δ✗
to denote the count of patches introducing new errors, as confirmed through fuzzing tests [19].
Conversely, #ΔO denotes the quantity of patches that, while failing to correct the error, do not
introduce new ones. Finally, #𝐸✓represents the number of fixed errors. In terms of performance
evaluation, a higher value for both #Δ✓and #𝐸✓suggests superior performance, as it implies a
greater number of errors have been correctly repaired. Conversely, #Δ✗should ideally be minimized
to avoid the introduction of new errors. Although #ΔO does not pose a severe risk of introducing
new memory errors, a smaller value is still preferable. Note that we count memory errors by logical
"blocks" rather than by individual pointers. For memory leaks, each distinct allocation of a data
structure that fails to be deallocated is counted as a separate error, as each requires an independent
deallocation operation. This block-based accounting methodology accurately reflects the granularity
at which memory management operations must be applied in practice. Consequently, a single
patch that properly deallocates a complex data structure may fix multiple memory error blocks
simultaneously. Therefore, it is possible that the number of correct patches #Δ✓is smaller than the
number of fixed errors #𝐸✓.
, Vol. 1, No. 1, Article . Publication date: June 2025.

Xiao Cheng
Table 3. Comparing LTFix with SAVER and ProveNFix using SAVER’s dataset [28].
Project
#E
SAVER [28]
#E
ProveNFix [66]
LTFix
rappel (ad8efd7)
1
1
1
1
1
lxc (72cc48f)
3
3
23
22
22
WavPack (22977b2)
1
0
12
12
12
flex (d3de49f)
3
0
4
4
4
p11-kit (ead7ara)
33
24
28
27
27
recutils (v1.8)
10
8
42
36
37
snort (v2.9.13)
16
10
42
13
18
grub
0
0
1
1
1
Total (Fixing Rate)
67
46 (68%)
153
116 (75.8%)
122 (79.7%)
5.4
Research Questions
We address the following research questions in our evaluation:
RQ1 Comparison with traditional APR tools: How effectively does LTFix repair memory
errors compared to state-of-the-art tools such as SAVER and ProveNFix, in terms of both
repair accuracy and error introduction?
RQ2 Comparison with LLM-based approach: To what extent does LTFix improve repair per-
formance compared to an open-source state-of-the-art LLM-based approach SWE-agent [89]?
RQ3 Ablation analysis: What is the relative contribution of typestate-guided context retrieval to
LTFix’s overall repair effectiveness?
5.5
Comparison with Traditional APR Tools (RQ1)
5.5.1
Comparison Results. Table 3 presents the detailed results of our comparative evaluation with
state-of-the-art memory error APR tools SAVER and ProveNFix using SAVER’s dataset. Our analysis
demonstrates that LTFix significantly outperforms both baseline tools, successfully addressing
122 out of 153 errors. The superior repair capabilities of LTFix are consistent across all projects
in the comparative evaluation. Table 4 provides a comprehensive comparison of LTFix against
SAVER and ProveNFix on our benchmark of 14 real-world projects. The empirical evidence clearly
establishes LTFix’s superior performance in error repair. Our approach successfully repairs 37 out
of 49 identified memory errors, representing 14.50× more repairs than SAVER and 2.36× more than
ProveNFix. Notably, LTFix achieves these substantial improvements without introducing any new
errors, unlike the baseline tools. The project-level analysis further confirms LTFix’s consistency, as
it repairs at least as many errors as the baseline tools across all projects while maintaining nearly
zero error introduction rates.
5.5.2
Case Study. We showcase the superior performance of our tool compared to SAVER and
ProveNFix through four typical code scenarios depicted in Figure 7.
Complex Data Structure. Figure 7(a) demonstrates LTFix’s in-context repair capability to handle
complex data structure. This example presents a memory leak issue where primitive deallocation
functions (free(db)) are inappropriately used for complex data structures. SAVER and ProveNFix,
despite their sophisticated constraint-based and specification-driven methodologies, fail to gen-
erate patches because their analysis erroneously concludes that memory deallocation is already
addressed—the code already contains deallocation logic (free(db)), which superficially satisfies
basic memory management constraints. These tools typically excel at adding missing deallocation
functions when none exist but struggle with identifying and replacing inadequate deallocation
implementations that fail to account for complex data structures with nested memory allocations.
, Vol. 1, No. 1, Article . Publication date: June 2025.

Tracing Errors, Constructing Fixes: Repository-Level Memory Error Repair via Typestate-Guided Context Retrieval
Table 4. Comparing LTFix with SAVER and ProveNFix using the 14 real-world projects in Table 2. For simplicity,
we omit the project versions. #𝐸denotes the number of memory errors.
SAVER [28]
ProveNFix [66]
LTFix
Project
#𝐸
#Δ
#Δ✓
#ΔO
#Δ✗
#𝐸✓
#Δ
#Δ✓
#ΔO
#Δ✗
#𝐸✓
#Δ
#Δ✓
#ΔO
#Δ✗
#𝐸✓
ls_extended
3
1
1
0
0
1
1
1
0
0
3
1
1
0
0
3
xHTTP
1
0
0
0
0
0
1
0
1
0
0
1
1
0
0
1
tree
2
0
0
0
0
0
0
0
0
0
0
1
1
0
0
2
chibicc
5
0
0
0
0
0
2
1
0
1
3
3
2
1
0
5
stb
3
0
0
0
0
0
0
0
0
0
0
3
3
0
0
3
scrot
1
0
0
0
0
0
1
0
1
0
0
1
1
0
0
1
mjs
1
1
1
0
0
1
1
1
0
0
1
1
1
0
0
1
smallC
14
1
0
0
1
0
2
1
0
1
1
7
4
2
1
12
MyHTML
2
0
0
0
0
0
1
1
0
0
2
1
1
0
0
2
quickjs
2
0
0
0
0
0
0
0
0
0
0
2
2
0
0
2
recutiles
6
0
0
0
0
0
0
0
0
0
0
2
2
0
0
4
wasm3
4
0
0
0
0
0
0
0
0
0
0
1
1
0
0
2
yasm
4
0
0
0
0
0
0
0
0
0
0
2
1
1
0
2
radare2
1
0
0
0
0
0
1
1
0
0
1
1
1
0
0
1
Total
49
3
2
0
1
2
10
6
2
2
11
27
22
4
1
37
In contrast, LTFix successfully identifies that the existing deallocation strategy is insufficient and
generates the correct fix by replacing free(db) with the structure-specific rec_db_destroy(db)
function. This solution properly addresses the memory leaks by recursively deallocating all in-
ternal components of the db structure before releasing the main object itself. The distinctive
performance of LTFix stems from its advanced context-awareness mechanisms that not only de-
tect missing deallocations but also evaluate the adequacy of existing memory management code
against codebase-specific patterns. By analyzing allocation-deallocation pairings throughout the
codebase and understanding the structural complexity of data types, LTFix can determine when
simple deallocation functions are insufficient and recommend appropriate domain-specific alterna-
tives. Furthermore, since the allocation and deallocation of the object occur in different functions,
LTFix’s interprocedural memory error semantics understanding allows it to track the object’s
lifecycle across function boundaries, enabling the generation of semantically integrated patches
that correctly address the memory leak.
Intertwined Logic and Memory. Figure 7(b) illustrates an example of intertwined logic and mem-
ory repair that presents significant challenges for SAVER and ProveNFix. Addressing this memory
leak necessitates a three-step coordinated modification across disjoint code regions: (1) introducing
a temporary variable (str) to track memory allocated by strdup(tmpl), (2) inserting deallocation
instructions (free(str)) before the function returns, and (3) restructuring the control/data flow by
placing free(str) after the use of filename and storing the return value in an intermediate vari-
able result rather than immediately returning it. This case represents a complex interdependence
between program logic and memory management, where an incorrect modification to either aspect
could result in functional errors.
In this scenario, LTFix demonstrates superior performance by accurately identifying and resolv-
ing both the file handling logic and memory management issues. Neither SAVER nor ProveNFix
could repair the error within the original code structure where memory allocation occurs in a
nested function call. To further evaluate their capabilities, we manually restructured the code
by splitting char *filename = basename(strdup(tmpl)); at origin ℓ365 into two distinct state-
ments: char *str = strdup(tmpl); and char *filename = basename(str);. With this simplifi-
cation, both SAVER and ProveNFix could generate patches trying to address the memory leak.
However, they erroneously positioned the free(str); statement before the return format("%s%s",
, Vol. 1, No. 1, Article . Publication date: June 2025.

Xiao Cheng
--- a/xhttp.c
+++ b/xhttp.c
@@ -835,8 +835,9 @@parse(..., req)
   req->headers=headers; // alias
   req->method =xh_string_new(...);
+  headers.list = NULL;
  ...
866  if (...){
867     free(headers.list); // free
868     return FAILURE(...);
(c) Subtle Pointer Aliasing
--- a/main.c
+++ b/main.c
@@ -365,11 +365,13 @@replace_extn(...) {
-   char *filename = basename(strdup(tmpl));
+   char *str = strdup(tmpl);
+   char *filename = basename(str);
    char *dot = strrchr(filename, '.');
    if (dot)
        *dot = '\0';
-   return format("%s%s", filename, extn);
+   char *result = format("%s%s", filename, extn);
+   free(str);
+   return result;
(b) Intertwined Logic and Memory 
(d) Cyclic Allocation
--- a/src/note.c
+++ b/src/note.c
    while (token) {
       case ...:
@@ -119,7 +119,10 @@scrotNoteNew(...)
      switch (type) {
       case 'f':
+        if (note) 
+          free(note->font);
         // note->font alloc in parseText
         note->font = parseText(&token, end);
(a) Complex Data Structure
--- a/recutl.c
+++ b/recutl.c
   rec_db_t db;               
   db = rec_db_new (); // alloc
@@ -331,7 +331,6 @@recutl_build_db (...)
   if (!recutl_parse_db_from_file (...))
   {
-     free (db);
-     db = NULL;
+     rec_db_destroy(db);
   }
Fig. 7. LTFix’s patches for: (a) a memory leak in the recutiles [23] project; (b) a memory leak in the
chibicc [61] project; (c) a double-free vulnerability (CVE-2023-38434 [58]) in the xHTTP [20] project; and (d)
a memory leak error in the scrot [70] project.
filename, extn); at origin ℓ369, introducing a use-after-free vulnerability since filename refer-
ences memory within str, which would be prematurely deallocated. LTFix, conversely, correctly
determines that memory deallocation must occur after all uses of dependent pointers derived from
the allocated memory, and generates a semantically sound patch that preserves functionality while
eliminating the memory leak.
Subtle Pointer Aliasing. Figure 7(c) demonstrates LTFix’s capability in addressing subtle pointer
aliasing scenarios. This vulnerability exemplifies a subtle memory management issue where two
pointers, headers and req->headers at ℓ835, reference the same memory region, potentially result-
ing in double deallocation when memory is freed in both the close_connection() and parse()
functions. The challenge here lies not in the syntactic complexity of the code, but in the sophisti-
cated semantic understanding required to trace pointer relationships across function boundaries
and execution paths. Such scenarios necessitate comprehension of pointer aliasing to properly
identify shared memory references across different control points and contexts.
SAVER reports "failed to convert labeling operators," indicating its inability to identify the
appropriate location for memory deallocation. ProveNFix generates a patch that introduces an early
return before the first free() operation at ℓ867, rather than implementing the necessary memory
lifecycle management by setting essential pointer to NULL after deallocation. This approach,
while preventing the immediate crash, introduces premature termination of function execution,
potentially leading to resource leaks and incomplete functionality. ProveNFix excels at identifying
error conditions but struggles with synthesizing correct fixes that require understanding subtle
memory sharing semantics. Consequently, ProveNFix opts for a conservative solution that avoids
error manifestation rather than addressing the underlying memory management issue at the
exact program point where nullification is required. In contrast, LTFix comprehends the aliasing
relationships between the pointers through the context trace, enabling it to generate a precise and
contextually appropriate patch. It correctly inserts a nullification for headers.list after the pointer
, Vol. 1, No. 1, Article . Publication date: June 2025.

Tracing Errors, Constructing Fixes: Repository-Level Memory Error Repair via Typestate-Guided Context Retrieval
#∆
#∆✓
#∆O
#∆X
#E✓
0
5
10
15
20
25
30
35
40
45
Count
17
11
1
5
19
27
22
4
1
37
SWE-agent
LTFix (Ours)
Fig. 8. Comparison of fixing effectivenss between LTFix and SWE-agent [89] in our dataset.
copy operation, rather than the conventional practice of inserting NULL after freeing, thereby
effectively resolving the double-free vulnerability.
Cyclic Allocation. Figure 7(d) illustrates LTFix’s effectiveness in addressing cyclic allocation
scenarios. The error manifests within a while loop where, upon consecutive iterations through the
same conditional branch, the program repeatedly allocates memory for the note->font pointer
without first deallocating previously allocated memory resources. This implementation deficiency
creates orphaned memory blocks, as each subsequent allocation overwrites the reference to previ-
ously allocated memory without proper deallocation, thereby causing a persistent memory leak.
Unlike conventional memory management errors where resources simply remain unreleased, this
particular error requires understanding that deallocation must precede reallocation within a cyclic
execution context. The non-local nature of the required fix—inserting logic before allocation rather
than at the typical post-operation release points—exceeds the pattern-matching capabilities of
SAVER and ProveNFix. The patch generated by LTFix establishes a robust resource management
guard pattern that first validates the existing resource allocation status if (note) (preventing
potential double-free vulnerabilities) followed by explicit deallocation free(note->font) prior to
subsequent memory allocation operations within the same scope.
Answer to RQ1: LTFix outperforms traditional state-of-the-art memory error APR tools SAVER
and ProveNFix across all projects in both SAVER’s dataset and our comprehensive benchmark.
This superior performance stems from LTFix’s typestate-guided context retrieval that enables
sophisticated reasoning about cross-procedural complex memory error semantics.
5.6
Comparison with LLM-based Approach (RQ2)
We conduct a comprehensive comparative analysis between LTFix and SWE-agent [89], an open-
source state-of-the-art LLM-based program improvement tool.
Comparison Results. As illustrated in Figure 8, LTFix demonstrates substantial performance
improvements over SWE-agent across critical evaluation metrics. LTFix successfully repairs 37
memory errors, representing a 94.7% increase compared to the 19 errors addressed by SWE-agent.
Quantitatively, our approach generated 27 patches versus SWE-agent’s 17 patches. The qualitative
differential is even more pronounced: LTFix produces 22 correct patches—representing an 81.5%
accuracy rate—compared to SWE-agent’s 10 correct patches (58.8% accuracy). Furthermore, LTFix
substantially reduces the generation of harmful patches that introduce new errors, with only 1
instance compared to SWE-agent’s 5. These results empirically demonstrate that augmenting LLMs
with typestate-guided context retrieval yields significantly higher repair precision than relying
solely on the LLM’s intrinsic reasoning capabilities.
, Vol. 1, No. 1, Article . Publication date: June 2025.

Xiao Cheng
1
2
3
4
5
6
7
8
9
10
11
12
13
14
Projects No.
104
105
106
Number of Tokens
67,940
5,897
(-91.3%)
1,678,179
5,564
(-99.7%)
108,336
93,282
(-13.9%)
2,197,113
87,350
(-96.0%)
531,668
22,397
(-95.8%)
1,689,888
4,422
(-99.7%)
471,488
3,306
(-99.3%)
517,255
51,180
(-90.1%)
115,533
3,254
(-97.2%)
1,598,832
10,937
(-99.3%)
3,392,231
92,271
(-97.3%)
3,423,175
9,762
(-99.7%)
1,596,581
19,948
(-98.8%)
48,025
2,330
(-95.1%)
SWE-agent
LTFix (Ours)
Fig. 9. Comparison of the number of tokens consumed by LTFix and SWE-agent [89] for memory error repair
in our dataset.
SWE-agent is unable to repair the memory leak in Figure 7(a) because it lacks awareness of the
database object’s allocation/deallocation typestate transitions context. Without comprehensive
visibility into the db structure’s allocation pattern and complete memory lifecycle, SWE-agent
cannot reliably determine whether rec_db_destroy(db) constitutes the appropriate replacement,
as it must consider potential risks such as double-free vulnerabilities. Similarly, SWE-agent fails
to comprehend the intricate pointer alias relationships in Figure 7(c) without the execution con-
text that captures memory references and state transitions. Moreover, it demonstrates inadequate
performance when addressing the cyclic allocation scenario in Figure 7(d), primarily due to its
inability to perform interprocedural memory lifecycle tracking and to identify the recurring al-
location pattern that characterizes this particular vulnerability. These limitations underscore the
fundamental advantage of typestate-guided context retrieval in providing the necessary semantic
understanding for effective memory error repair, highlighting the qualitative difference between
LTFix’s targeted approach and the quantitative expansion of LLM context windows alone.
Token Consumption Analysis. As illustrated in Figure 9, LTFix consumes significantly fewer
tokens—approximately 42 times less than SWE-agent’s total token usage. This substantial efficiency
gap is consistently observed across all evaluated projects. The remarkable reduction in token
consumption can be attributed to our typestate-guided context retrieval mechanism, which precisely
extracts contextually relevant information necessary for generating accurate patches without
incurring excessive computational overhead. In contrast, SWE-agent relies solely on the LLM’s
intrinsic reasoning capabilities without any specialized guidance for contextual prioritization.
These empirical results demonstrate that LTFix’s strength extends beyond merely producing more
patches; it generates substantially more accurate and less harmful repairs while addressing a greater
number of errors—all with significantly higher token efficiency. This synergistic combination of
repair quality and computational efficiency renders LTFix both more effective and more economical
for practical deployment of LLM-based APR in real-world memory error repair scenarios.
Answer to RQ2: LTFix outperforms SWE-agent [89] in terms of both repair accuracy and
efficiency. Our approach generates substantially more correct patches, fewer harmful patches,
and fixes more errors while consuming significantly fewer tokens, demonstrating superior
performance in memory error repair.
, Vol. 1, No. 1, Article . Publication date: June 2025.

Tracing Errors, Constructing Fixes: Repository-Level Memory Error Repair via Typestate-Guided Context Retrieval
LTFix-F
LTFix-M
LTFix-NT
LTFix
0
5
10
15
20
25
#∆✓
7
11
15
22
(a) correct patches
LTFix-F
LTFix-M
LTFix-NT
LTFix
0
1
2
3
4
5
6
7
#∆X
3
6
4
1
(b) introducing new errors
LTFix-F
LTFix-M
LTFix-NT
LTFix
0
5
10
15
20
25
30
35
40
#E✓
15
18
22
37
(c) ﬁxed errors
Fig. 10. Ablation analysis result.
5.7
Ablation Analysis (RQ3)
Figure 10 presents the ablation analysis result comparing LTFix with three variants (see §5.2):
LTFix-M (LTFix with the methods containing the error), LTFix-F (LTFix with the file containing
the error) and LTFix-NT (LTFix without context trace). We aim to understand the capability of the
base LLM used in LTFix and the benefits brought by the typestate-guided context retrieval.
Correct Patches. As shown in Figure 10(a), LTFix-F and LTFix-M, which provide file-level
and method-level information respectively, demonstrate the lowest efficacy in generating correct
patches, producing only 7 and 11 correct patches. LTFix-NT generates a higher number of correct
patches than both baseline variants, achieving 15 correct patches. However, the complete LTFix
system with intact typestate-guided contextual information significantly outperforms all variants
with 22 correct patches, demonstrating a 47% improvement over LTFix-NT and a 214% improvement
over LTFix-F.
Introducing New Errors. Figure 10(b) illustrates the number of new errors introduced by each
variant. LTFix-F and LTFix-M introduce 3 and 6 new errors respectively, while LTFix-NT introduces
4 new errors. In contrast, LTFix introduces only 1 new error, representing an 83% reduction
compared to the other variants. This significant decrease demonstrates that the integration of
typestate-guided context retrieval is critical for preventing the introduction of new errors during
the repair process.
Fixed Errors. Figure 10(c) presents the total number of errors fixed by each variant. The baseline
variants LTFix-F and LTFix-M fix 15 and 18 errors respectively, while the enhanced variant LTFix-
NT shows incremental improvement by fixing 22 errors. The complete LTFix system demonstrates
superior performance by fixing 37 errors, which represents a 68% improvement over LTFix-NT and
a 147% improvement over LTFix-F. This gap confirms that each component contributes significantly
to the overall effectiveness of our approach, with typestate-guided context retrieval yielding
synergistic benefits beyond what either component achieves independently.
To illustrate the critical importance of typestate-guided context retrieval, we examine the cases
in Figure 7 (c) and (d). Neither these variants can successfully repair these complex memory
errors, as they lack the necessary execution context. Without typestate-guided context tracing,
LLMs cannot identify critical semantic relationships—specifically, that pointer aliasing causes the
double-free vulnerability in case (c), and that note->font undergoes multiple allocations without
proper deallocation in case (d). These examples demonstrate the fundamental necessity of context
tracing in memory error repair. In Figure 7 (c), context tracing enables the system to observe subtle
changes in multiple pointers’ states and track function call propagation across execution boundaries.
Similarly, in Figure 7 (d), it facilitates the tracking of note->font’s allocation state across diverse
, Vol. 1, No. 1, Article . Publication date: June 2025.

Xiao Cheng
execution paths. In the absence of such context tracing, the LLM lacks comprehensive visibility
into the object’s complete lifecycle, rendering it impossible to determine that subsequent memory
allocations occur without proper deallocation of previously allocated resources. This limitation
fundamentally impedes the LLM’s ability to generate semantically correct patches that address the
underlying memory management deficiencies.
Token Consumption Analysis. To quantitatively assess the efficiency of typestate-guided
context retrieval, we conduct a comparative analysis between our approach and the use of full
context traces. The experimental results demonstrate that LTFix consumes only 411,900 tokens in
total, whereas utilizing the complete context trace requires 22,950,414 tokens—a reduction factor
exceeding 50×. This substantial efficiency gain empirically validates the effectiveness of our targeted
context retrieval strategy. By selectively extracting only the most semantically relevant information,
LTFix maintains superior repair quality while dramatically reducing computational overhead,
making it both more performant and more economical for practical deployment of LLM-based APR
in real-world environments.
Answer to RQ3: The complete system’s integration of all components in LTFix yields superior
results across metrics, notably reducing new error introduction by enabling better understanding
of error semantics, generating more precise repairs and consuming significantly less tokens.
6
Threats to Validity
Dataset. A potential threat to validity concerns the possible inclusion of our evaluated open-
source projects and patches in the training dataset of the employed LLMs, which could introduce
evaluation bias. Ideally, experiments would utilize new, previously unseen memory errors to
completely eliminate this possibility. However, this limitation affects all LLM-based baselines
equally in our comparative analysis, and our approach consistently demonstrates significant
performance improvements over these baselines, suggesting that the core contributions of our
typestate-guided context retrieval mechanism extend beyond any potential advantages from data
exposure.
LLM Selection. Our evaluation primarily utilizes Claude 3.5 Sonnet, although empirical evidence
suggests that comparable LLMs (e.g., GPT-4o) demonstrate similar performance on memory error
APR tasks. While more sophisticated models might offer improvements, the focal point of our
research is the enhancement of memory error APR through typestate-guided context retrieval,
rather than a comparative assessment of performance across different LLM architectures.
Repair Scope. Our approach does not aim to detect or repair all possible memory errors within a
repository, but rather provides a targeted solution for memory errors that have been reproduced
with a specific proof of concept. This may constrain the generalizability to memory errors beyond
the provided proof of concept. However, to ensure methodological fairness in our comparative
evaluation, all baseline approaches are provided with identical proof-of-concept demonstrations,
and our approach consistently outperformed them under these controlled conditions.
7
Related Work
We review relevant literature across three primary domains: specialized techniques for automated
memory error repair, the emerging integration of large language models in program repair frame-
works, and the application of large language models for advanced program analysis. Through this
examination, we position our approach within the broader research landscape while highlighting
the limitations of existing methods when addressing complex memory management challenges.
, Vol. 1, No. 1, Article . Publication date: June 2025.

Tracing Errors, Constructing Fixes: Repository-Level Memory Error Repair via Typestate-Guided Context Retrieval
Automated Memory Error Repair. Repairing memory errors is a complex task due to the
non-local nature of memory management and its temporal properties. Various efforts have been
proposed to address this issue [25, 28, 38, 47, 53, 55, 56, 66, 72]. SemFix [55] and Angelix [47]
are general-purpose repair techniques that, while broadly applicable, demonstrate lower efficacy
compared to specialized approaches tailored for specific error categories such as memory errors [28]
and null dereferences [87]. AddressWatcher [53] can only fix memory leaks and cannot be applied
to use-after-frees and double-frees. While MemFix [38] is effective for small-scale programs, it
struggles to scale up for larger applications and fails to generate patches that include conditional
deallocation for safety checks. FootPatch [72] requires templated annotations at the bug locations
and can inadvertently introduce double-free errors when addressing memory leaks. SAVER [28]
is more scalable for larger applications, but it does not take advantage of the intermediate bug
information provided by Infer, which inhibits its effectiveness. ProveNFix [66] uses temporal
property-based specifications, referred to as future conditions, to repair memory errors and other
temporal bugs. However, a common limitation among these existing memory error APR tools is
their dependence on manually crafted specifications. In contrast, our tool can infer a correct fix
without the need for explicitly defined rules. Moreover, compared to LTFix, existing solutions often
struggle to fix inter-procedural multi-hunk bugs and have difficulty leveraging in-context repair
beyond using primitive APIs.
LLMs for Automated Program Repair. Recent advancements in Automated Program Repair
(APR) [36] have witnessed the emergence of LLM-based techniques, which can be categorized into
two primary approaches: Open-Source-LLM-based [29, 32, 45, 81, 90, 91, 93] and Closed-Source-
LLM-based [33, 34, 77, 79, 82, 83, 89, 92, 94]. Open-Source-LLM-based methods typically necessitate
substantial additional data for model fine-tuning. For instance, Mashhadi et al. [45] curated over
600,000 samples to enhance Java single-statement bug repair capabilities. However, such extensive
datasets are particularly challenging to assemble for memory error repairs due to their specialized
nature. Our approach aligns more closely with Closed-Source-LLM-based methodologies, which
leverage the inherent capabilities of pre-trained LLMs as their foundation, subsequently augmenting
them with external contextual information [34, 82] or sophisticated decision-making chains [83, 94].
Recent research [77, 89, 92] has further evolved this paradigm by employing agent-based frame-
works that enhance the repair process through interactive engagement with isolated computational
environments. While our approach shares the utilization of Closed-Source LLMs with existing
methods, it fundamentally differs in both focus and implementation. Prior approaches predomi-
nantly target general bug fixing (mostly in Java and Python) but face limitations when addressing
memory errors due to the extensive contextual information required. Our novel contribution lies in
the development of typestate-guided context retrieval, which effectively compresses error-related
contexts to lengths suitable for LLM processing while preserving critical semantic information
necessary for accurate memory error repair.
LLMs for Program Analysis. LLMs have demonstrated remarkable capabilities in reasoning
about complex program semantics and performing sophisticated program analysis. For instance,
Li et al. [39] effectively combine static analysis with LLMs to detect Use Before Initialization
(UBI) bugs within the Linux kernel, exemplifying the potential of LLMs in understanding complex
programming semantics. Similarly, Huang et al. [30] utilize the in-context learning capability
of LLMs to elucidate program dependencies, highlighting their potential in clarifying intricate
program structures. In another study, Wen et al. [80] decompose programs and employ LLMs to
synthesize specifications for automated program verification. Cheng et al. [10] propose a semantic-
enhanced approach based on LLMs to improve the efficiency of indirect call analysis. Wang et
al. [74] present an LLM-powered compilation-free and customizable dataflow analysis. While the
, Vol. 1, No. 1, Article . Publication date: June 2025.

Xiao Cheng
focus of these studies differs from ours, the combination of program analysis for understanding
program semantics and the in-context learning ability of LLMs has inspired us to leverage LLMs in
the development of LTFix.
8
Conclusion
The paper presents LTFix, a novel approach to automated memory error repair in C programs using
Large Language Models (LLMs). By leveraging LLMs’ extensive knowledge of code and natural
language, guided by a finite typestate automaton and structured prompting, LTFix addresses the
limitations of traditional automated memory error repair methods and previous deep learning
approaches. Our tool demonstrates significant success in repairing real-world memory errors across
large-scale open-source projects, outperforming existing state-of-the-art tools and even fixing
three zero-day memory errors. This approach shows promise in advancing the field of automated
program repair, particularly for complex memory-related errors in C programming.
Data Availability Statement
This paper is currently under review. All implementation details and associated data are available
to reviewers and will be made publicly available upon acceptance.
References
[1] Alexander Borisov. 1999. Fast C/C++ HTML 5 Parser. https://github.com/lexborisov/myhtml
[2] Alexey Frunze. 2021. Smaller C is a simple and small single-pass C compiler. https://github.com/alexfru/SmallerC
[3] Anthropic. 2024. Claude 3.5 Sonnet. https://www.anthropic.com/news/claude-3-5-sonnet
[4] Eric Bodden. 2010. Efficient hybrid typestate analysis by determining continuation-equivalent states. In 2010 ACM/IEEE
32nd International Conference on Software Engineering (ICSE ’12). ACM.
[5] Nikita Borisov, George Danezis, Prateek Mittal, and Parisa Tabriz. 2007. Denial of service or denial of security?. In
Proceedings of the 14th ACM conference on Computer and communications security (CCS ’07). ACM.
[6] Juan Caballero, Gustavo Grieco, Mark Marron, and Antonio Nappa. 2012. Undangle: early detection of dangling
pointers in use-after-free and double-free vulnerabilities. In Proceedings of the 2012 International Symposium on Software
Testing and Analysis (ISSTA ’12).
[7] Cesanta Software Limited. 2023. mJS: Restricted JavaScript engine. https://github.com/cesanta/mjs
[8] Haogang Chen, Yandong Mao, Xi Wang, Dong Zhou, Nickolai Zeldovich, and M Frans Kaashoek. 2011. Linux kernel
vulnerabilities: State-of-the-art defenses and open problems. In Proceedings of the Second Asia-Pacific Workshop on
Systems. ACM.
[9] Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noël Pouchet, Denys Poshyvanyk, and Martin Monperrus.
2021. SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair. IEEE Trans. Software Eng. (2021).
[10] Baijun Cheng, Cen Zhang, Kailong Wang, Ling Shi, Yang Liu, Haoyu Wang, Yao Guo, and Xiangqun Chen. 2024.
Semantic-Enhanced Indirect Call Analysis with Large Language Models. In 39th IEEE/ACM International Conference on
Automated Software Engineering (ASE ’24). IEEE/ACM.
[11] Xiao Cheng, Jiawei Ren, and Yulei Sui. 2024. Fast Graph Simplification for Path-Sensitive Typestate Analysis through
Tempo-Spatial Multi-Point Slicing. Proc. ACM Softw. Eng. FSE (2024).
[12] CISA. 2023. The Urgent Need for Memory Safety in Software Products. https://www.cisa.gov/news-events/news/
urgent-need-memory-safety-software-products
[13] Manuvir Das, Sorin Lerner, and Mark Seigle. 2002. ESP: Path-Sensitive Program Verification in Polynomial Time.
In Proceedings of the ACM SIGPLAN 2002 conference on Programming language design and implementation (PLDI ’02).
ACM.
[14] Yangruibo Ding, Baishakhi Ray, Premkumar T. Devanbu, and Vincent J. Hellendoorn. 2020. Patching as Translation:
the Data and the Metaphor. In 35th IEEE/ACM International Conference on Automated Software Engineering (ASE ’20).
[15] DWARF Debugging Information Format Committee. 2017. DWARF Debugging Information Format Version 5. https:
//dwarfstd.org/doc/DWARF5.pdf
[16] Electrux. 2024. ls with coloring and icons. https://github.com/Electrux/ls_extended
[17] Fabrice Bellard. 2021. QuickJS Javascript Engine. https://github.com/bellard/quickjs
[18] Stephen J. Fink, Eran Yahav, Nurit Dor, G. Ramalingam, and Emmanuel Geay. 2006. Effective typestate verification in
the presence of aliasing. In Proceedings of the ACM/SIGSOFT International Symposium on Software Testing and Analysis
, Vol. 1, No. 1, Article . Publication date: June 2025.

Tracing Errors, Constructing Fixes: Repository-Level Memory Error Repair via Typestate-Guided Context Retrieval
(ISSTA ’06). ACM.
[19] Andrea Fioraldi, Dominik Maier, Heiko Eißfeldt, and Marc Heuse. 2020. AFL++ : Combining Incremental Steps of
Fuzzing Research. In 14th USENIX Workshop on Offensive Technologies (WOOT 20). USENIX Association.
[20] Francesco Cozzuto. 2022. A lightweight HTTP server as a library. https://github.com/cozis/xHTTP
[21] Free Software Foundation. 2011. Debugging with GDB. https://sourceware.org/gdb/current/onlinedocs/gdb.html
[22] Free Software Foundation, Inc. 1991. A handy little utility to display a tree view of directories. https://github.com/
execjosh/tree
[23] Free Software Foundation, Inc. 2007. GNU Recutils. https://www.gnu.org/software/recutils/
[24] Free Software Foundation, Inc. 2007. Radare2: Libre Reversing Framework for Unix Geeks.
https://github.com/
radareorg/radare2
[25] Qing Gao, Yingfei Xiong, Yaqing Mi, Lu Zhang, Weikun Yang, Zhaoping Zhou, Bing Xie, and Hong Mei. 2015. Safe
Memory-Leak Fixing for C Programs. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering
(ICSE ’15).
[26] Luca Gazzola, Daniela Micucci, and Leonardo Mariani. 2019. Automatic Software Repair: A Survey. IEEE Transactions
on Software Engineering (2019).
[27] Yaru Hao, Yutao Sun, Li Dong, Zhixiong Han, Yuxian Gu, and Furu Wei. 2022. Structured Prompting: Scaling In-Context
Learning to 1,000 Examples.
[28] Seongjoon Hong, Junhee Lee, Jeongsoo Lee, and Hakjoo Oh. 2020. SAVER: scalable, precise, and safe memory-error
repair. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (ICSE ’20). ACM.
[29] Kai Huang, Xiangxin Meng, Jian Zhang, Yang Liu, Wenjie Wang, Shuhao Li, and Yuqing Zhang. 2023. An Empirical
Study on Fine-Tuning Large Language Models of Code for Automated Program Repair. In 38th IEEE/ACM International
Conference on Automated Software Engineering, ASE 2023, Luxembourg, September 11-15, 2023. IEEE.
[30] Qing Huang, Zhiwen Luo, Zhenchang Xing, Jinshan Zeng, Jieshan Chen, Xiwei Xu, and Yong Chen. 2024. Revealing
the Unseen: AI Chain on LLMs for Predicting Implicit Data Flows to Generate Data Flow Graphs in Dynamically-Typed
Code. ACM Transactions on Software Engineering and Methodology (2024).
[31] Nan Jiang, Thibaud Lutellier, Yiling Lou, Lin Tan, Dan Goldwasser, and Xiangyu Zhang. 2023. KNOD: Domain
Knowledge Distilled Tree Decoder for Automated Program Repair. In 45th IEEE/ACM International Conference on
Software Engineering (ICSE ’23). IEEE.
[32] Nan Jiang, Thibaud Lutellier, and Lin Tan. 2021. CURE: Code-Aware Neural Machine Translation for Automatic
Program Repair. In 43rd IEEE/ACM International Conference on Software Engineering (ICSE ’21). IEEE.
[33] Matthew Jin, Syed Shahriar, Michele Tufano, Xin Shi, Shuai Lu, Neel Sundaresan, and Alexey Svyatkovskiy. 2023.
InferFix: End-to-End Program Repair with LLMs. In Proceedings of the 31st ACM Joint European Software Engineering
Conference and Symposium on the Foundations of Software Engineering (FSE ’23). ACM.
[34] Harshit Joshi, José Pablo Cambronero Sánchez, Sumit Gulwani, Vu Le, Gust Verbruggen, and Ivan Radicek. 2023.
Repair Is Nearly Generation: Multilingual Program Repair with LLMs. In Thirty-Seventh AAAI Conference on Artificial
Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth
Symposium on Educational Advances in Artificial Intelligence, Brian Williams, Yiling Chen, and Jennifer Neville (Eds.).
AAAI Press.
[35] Xuan Bach D. Le, David Lo, and Claire Le Goues. 2016. History Driven Program Repair. In 2016 IEEE 23rd International
Conference on Software Analysis, Evolution, and Reengineering (SANER ’16).
[36] Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. 2019. Automated program repair. Commun. ACM (2019).
[37] Byoungyoung Lee, Chengyu Song, Yeongjin Jang, Tielei Wang, Taesoo Kim, Long Lu, and Wenke Lee. 2015. Preventing
Use-after-free with Dangling Pointers Nullification. In NDSS.
[38] Junhee Lee, Seongjoon Hong, and Hakjoo Oh. 2018. MemFix: static analysis-based repair of memory deallocation
errors for C. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and
Symposium on the Foundations of Software Engineering (FSE ’18). ACM.
[39] Haonan Li, Yu Hao, Yizhuo Zhai, and Zhiyun Qian. 2024. Enhancing Static Analysis for Practical Bug Detection: An
LLM-Integrated Approach. Proceedings of the ACM on Programming Languages OOPSLA1 (2024).
[40] Yi Li, Shaohua Wang, and Tien N. Nguyen. 2020. DLFix: Context-based Code Transformation Learning for Automated
Program Repair. In 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE). 602–614.
[41] Yi Li, Shaohua Wang, and Tien N. Nguyen. 2022. DEAR: a novel deep learning-based approach for automated program
repair. In Proceedings of the 44th International Conference on Software Engineering (ICSE ’22). ACM.
[42] Kui Liu, Shangwen Wang, Anil Koyuncu, Kisub Kim, Tegawendé F. Bissyandé, Dongsun Kim, Peng Wu, Jacques
Klein, Xiaoguang Mao, and Yves Le Traon. 2020. On the efficiency of test suite based program repair: A Systematic
Assessment of 16 Automated Repair Systems for Java Programs. In Proceedings of the ACM/IEEE 42nd International
Conference on Software Engineering (ICSE ’20). ACM.
, Vol. 1, No. 1, Article . Publication date: June 2025.

Xiao Cheng
[43] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost
in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics
(2024).
[44] Thibaud Lutellier, Hung Viet Pham, Lawrence Pang, Yitong Li, Moshi Wei, and Lin Tan. 2020. CoCoNuT: combining
context-aware neural translation models using ensemble for program repair. In Proceedings of the 29th ACM SIGSOFT
International Symposium on Software Testing and Analysis (ISSTA ’20). ACM.
[45] Ehsan Mashhadi and Hadi Hemmati. 2021. Applying CodeBERT for Automated Program Repair of Java Simple Bugs.
In 18th IEEE/ACM International Conference on Mining Software Repositories (MSR ’21). IEEE.
[46] Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury. 2015. DirectFix: Looking for Simple Program Repairs. In 2015
IEEE/ACM 37th IEEE International Conference on Software Engineering (ICSE ’15).
[47] Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury. 2016. Angelix: Scalable multiline program patch synthesis
via symbolic analysis. In Proceedings of the 38th International Conference on Software Engineering. ACM, 691–701.
[48] Meta. 2021. A static analyzer for Java, C, C++, and Objective-C. https://fbinfer.com/
[49] MITRE. 2024. CWE-401: Missing Release of Memory after Effective Lifetime. https://cwe.mitre.org/data/definitions/
401.html
[50] MITRE. 2024. CWE-415: Double Free. https://cwe.mitre.org/data/definitions/415.html
[51] MITRE. 2024. CWE-416: Use After Free. https://cwe.mitre.org/data/definitions/416.html
[52] Martin Monperrus. 2018. Automatic Software Repair: A Bibliography. ACM Comput. Surv. (2018).
[53] A. Murali, M. Alfadel, M. Nagappan, M. Xu, and C. Sun. 2024. AddressWatcher: Sanitizer based Localization of Memory
Leak Fixes. IEEE Transactions on Software Engineering (2024).
[54] Nicholas Nethercote and Julian Seward. 2007. Valgrind: a framework for heavyweight dynamic binary instrumentation.
In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’07).
ACM.
[55] Hoang Duong Thien Nguyen, Dawei Qi, Abhik Roychoudhury, and Satish Chandra. 2013. SemFix: Program repair via
semantic analysis. In 2013 35th International Conference on Software Engineering (ICSE). 772–781. https://doi.org/10.
1109/ICSE.2013.6606623
[56] Thanh-Toan Nguyen, Quang-Trung Ta, Ilya Sergey, and Wei-Ngan Chin. 2021. Automated Repair of Heap-Manipulating
Programs Using Deductive Synthesis. In Verification, Model Checking, and Abstract Interpretation, Fritz Henglein, Sharon
Shoham, and Yakir Vizel (Eds.).
[57] NIST. 2019. CVE-2019-6455. https://nvd.nist.gov/vuln/detail/CVE-2019-6455
[58] NIST. 2023. CVE-2023-38434. https://nvd.nist.gov/vuln/detail/CVE-2023-38434
[59] Kexin Pei, David Bieber, Kensen Shi, Charles Sutton, and Pengcheng Yin. 2023. Can large language models reason
about program invariants?. In International Conference on Machine Learning. PMLR, 27496–27520.
[60] Peter Johnson and other Yasm developers. 2014. Yasm Assembler mainline development tree. https://yasm.tortall.net/
[61] Rui Ueyama. 2019. chibicc: A Small C Compiler. https://github.com/rui314/chibicc.git
[62] Sean Barrett. 2017. stb single-file public domain libraries for C/C++. https://github.com/nothings/stb
[63] Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitriy Vyukov. 2012. AddressSanitizer: A Fast
Address Sanity Checker. In 2012 USENIX Annual Technical Conference (USENIX ATC 12). USENIX Association.
[64] Murray Shanahan, Kyle McDonell, and Laria Reynolds. 2023. Role play with large language models. Nature 623, 7987
(2023), 493–498.
[65] Congzheng Song and Ananth Raghunathan. 2020. Information leakage in embedding models. In Proceedings of the
2020 ACM SIGSAC conference on computer and communications security. 377–390.
[66] Yahui Song, Xiang Gao, Wenhua Li, Wei-Ngan Chin, and Abhik Roychoudhury. 2024. ProveNFix: Temporal Property-
Guided Program Repair. Proc. ACM Softw. Eng. FSE (2024).
[67] Steven Massey, Volodymyr Shymanskyy. 2019. A fast WebAssembly interpreter and the most universal WASM runtime.
https://github.com/wasm3/wasm3
[68] Robert E. Strom and Shaula Yemini. 1986. Typestate: A programming language concept for enhancing software
reliability. IEEE Transactions on Software Engineering (1986).
[69] Yulei Sui, Ding Ye, and Jingling Xue. 2012. Static memory leak detection using full-sparse value-flow analysis. In
Proceedings of the 2012 International Symposium on Software Testing and Analysis (ISSTA ’12).
[70] Tom Gilbert. 2000. SCReenshOT - command line screen capture utility. https://github.com/resurrecting-open-source-
projects/scrot
[71] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019.
An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation. ACM Trans. Softw.
Eng. Methodol. (2019).
[72] Rijnard van Tonder and Claire Le Goues. 2018. Static automated program repair for heap properties. In Proceedings of
the 40th International Conference on Software Engineering (ICSE ’18). ACM.
, Vol. 1, No. 1, Article . Publication date: June 2025.

Tracing Errors, Constructing Fixes: Repository-Level Memory Error Repair via Typestate-Guided Context Retrieval
[73] Chengpeng Wang, Jipeng Zhang, Rongxin Wu, and Charles Zhang. 2024. DAInfer: Inferring API Aliasing Specifications
from Library Documentation via Neurosymbolic Optimization. Proc. ACM Softw. Eng. FSE (2024).
[74] Chengpeng Wang, Wuqi Zhang, Zian Su, Xiangzhe Xu, Xiaoheng Xie, and Xiangyu Zhang. 2025. LLMDFA: Analyzing
Dataflow in Code with Large Language Models. Advances in Neural Information Processing Systems 37 (2025), 131545–
131574.
[75] Chengpeng Wang, Wuqi Zhang, Zian Su, Xiangzhe Xu, and Xiangyu Zhang. 2024. Sanitizing Large Language Models
in Bug Detection with Data-Flow. In Findings of the Association for Computational Linguistics: EMNLP 2024, Yaser
Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA,
3790–3805. https://doi.org/10.18653/v1/2024.findings-emnlp.217
[76] Haijun Wang, Xiaofei Xie, Shang-Wei Lin, Yun Lin, Yuekang Li, Shengchao Qin, Yang Liu, and Ting Liu. 2019. Locating
vulnerabilities in binaries via memory layout recovering. In Proceedings of the 2019 27th ACM Joint Meeting on European
Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’19).
[77] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li,
Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe
Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. 2024. OpenHands: An Open
Platform for AI Software Developers as Generalist Agents. arXiv:2407.16741 [cs.SE] https://arxiv.org/abs/2407.16741
[78] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022.
Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing
systems 35 (2022), 24824–24837.
[79] Yuxiang Wei, Chunqiu Steven Xia, and Lingming Zhang. 2023. Copiloting the Copilots: Fusing Large Language Models
with Completion Engines for Automated Program Repair. In Proceedings of the 31st ACM Joint European Software
Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023). Association for
Computing Machinery.
[80] Cheng Wen, Jialun Cao, Jie Su, Zhiwu Xu, Shengchao Qin, Mengda He, Haokun Li, Shing-Chi Cheung, and Cong Tian.
2024. Enchanting Program Specification Synthesis by Large Language Models Using Static Analysis and Program
Verification. In Computer Aided Verification. Springer Nature Switzerland.
[81] Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated Program Repair in the Era of Large
Pre-trained Language Models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE ’23).
[82] Chunqiu Steven Xia and Lingming Zhang. 2023. Keep the Conversation Going: Fixing 162 out of 337 bugs for $0.42
each using ChatGPT. CoRR (2023).
[83] Jiahong Xiang, Xiaoyang Xu, Fanchu Kong, Mingyuan Wu, Haotian Zhang, and Yuqun Zhang. 2024. How Far Can We
Go with Practical Function-Level Program Repair? arXiv preprint arXiv:2404.12833 (2024).
[84] Yichen Xie and Alex Aiken. 2005. Context-and path-sensitive memory leak detection. In Proceedings of the 10th
European software engineering conference held jointly with 13th ACM SIGSOFT international symposium on Foundations
of software engineering. 115–125.
[85] Qi Xin, Haojun Wu, Jinran Tang, Xinyu Liu, Steven P. Reiss, and Jifeng Xuan. 2024. Detecting, Creating, Repairing,
and Understanding Indivisible Multi-Hunk Bugs. Proc. ACM Softw. Eng. FSE (2024).
[86] Wen Xu, Juanru Li, Junliang Shu, Wenbo Yang, Tianyi Xie, Yuanyuan Zhang, and Dawu Gu. 2015. From collision
to exploitation: Unleashing use-after-free vulnerabilities in linux kernel. In Proceedings of the 22nd ACM SIGSAC
Conference on Computer and Communications Security. 414–425.
[87] Xuezheng Xu, Yulei Sui, Hua Yan, and Jingling Xue. 2019. VFix: Value-Flow-Guided Precise Program Repair for
Null Pointer Dereferences. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). 512–523.
https://doi.org/10.1109/ICSE.2019.00063
[88] Hua Yan, Yulei Sui, Shiping Chen, and Jingling Xue. 2018. Spatio-temporal context reduction: a pointer-analysis-based
static approach for detecting use-after-free vulnerabilities. In Proceedings of the 40th International Conference on
Software Engineering (ICSE ’18). ACM.
[89] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press.
2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. In The Thirty-eighth Annual
Conference on Neural Information Processing Systems. https://arxiv.org/abs/2405.15793
[90] He Ye, Matias Martinez, Xiapu Luo, Tao Zhang, and Martin Monperrus. 2022. SelfAPR: Self-supervised Program Repair
with Test Execution Diagnostics. In 37th IEEE/ACM International Conference on Automated Software Engineering (ASE
’22). ACM.
[91] He Ye, Matias Martinez, and Martin Monperrus. 2022. Neural Program Repair with Execution-based Backpropagation.
In 44th IEEE/ACM 44th International Conference on Software Engineering (ICSE ’22). ACM.
[92] Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. 2024. CodeAgent: Enhancing Code Generation with Tool-Integrated
Agent Systems for Real-World Repo-level Coding Challenges. arXiv:2401.07339 [cs.SE] https://arxiv.org/abs/2401.07339
, Vol. 1, No. 1, Article . Publication date: June 2025.

Xiao Cheng
[93] Quanjun Zhang, Chunrong Fang, Tongke Zhang, Bowen Yu, Weisong Sun, and Zhenyu Chen. 2023. Gamma: Revisiting
Template-Based Automated Program Repair Via Mask Prediction. In 38th IEEE/ACM International Conference on
Automated Software Engineering (ASE ’23). IEEE.
[94] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous Program
Improvement. CoRR (2024).
[95] Qihao Zhu, Zeyu Sun, Yuan-an Xiao, Wenjie Zhang, Kang Yuan, Yingfei Xiong, and Lu Zhang. 2021. A syntax-guided
edit decoder for neural program repair. In 29th ACM Joint European Software Engineering Conference and Symposium
on the Foundations of Software Engineering (ESEC/FSE ’21). ACM.
[96] Qihao Zhu, Zeyu Sun, Wenjie Zhang, Yingfei Xiong, and Lu Zhang. 2023. Tare: Type-Aware Neural Program Repair. In
45th IEEE/ACM International Conference on Software Engineering (ICSE ’23). IEEE.
, Vol. 1, No. 1, Article . Publication date: June 2025.
