# Reviewer QA1g

* Summary: The paper provides a neural-network-based approach to learn the solution to the Fokker-Planck equation. It provides error analysis and experimental results on relatively low-dimensional data.
* Main Strengths: The paper presents a new loss function to train a neural network to learn the solution of a Fokker-Planck equation, as well as a regularization term controlling the error.
  
* Main Weakness: The main idea is straightforward to some extent; however, the paper provides a good analysis of the problem. I also think more evaluations on higher-dimensional data would be beneficial. For example, the Fokker-Planck equation appears in score-based diffusion models as the PDF of the denoising process (as we have the starting PDF for the backward denoising process), and this method can be benchmarked to see if we are learning the true PDFs modeled by the diffusion model.

* Detailed Comments:
  * Can the authors benchmark this method on diffusion models and see what is the final time likelihood recovered by solving the Fokker-Planck vs. ODE-based approaches?

  * Do the authors think that the proof techniques used here for Fokker-Planck to develop the error bounds can be used to generalize these results to general PDEs or a class of PDEs (parabolic PDEs)?


# Author Response to Reviewer QA1g

We sincerely appreciate the reviewer's careful reading of our manuscript and insightful suggestions.

% ## Summary
% LL: The reviewer does mention error bounds, so I would not be correcting his summary because he has not emphasized enough the error bounds. I would rather focus on the two questions he make. In particular, I would rephrase the answer to Comment 1, as right now it seems that we do not do diffusion models because we are not expert on them
% We appreciate the reviewer’s effort to summarize our paper. However, we respectfully note that the summary does not fully capture the central focus of our contribution. 
% Our core objective is to derive rigorous approximation error bounds for a learned solution to the Fokker–Planck equation using neural networks. Specifically, we prove the theoretical feasibility of our error-bounding framework and demonstrate its practicality through numerical experiments.
% We believe these theoretical and experimental results represent the key advances of our work, distinguishing it from other approaches that simply propose neural network-based PDE solvers without offering quantified error guarantees.

### Diffusion models Benchmark
> Can the authors benchmark this method on diffusion models and see what is the final time likelihood recovered by solving the Fokker-Planck vs. ODE-based approaches?

### Response:
We appreciate the reviewer’s suggestion to benchmark our method on diffusion models. Currently, our work centers on the theoretical derivation of error bounds for PINN solutions to Fokker–Planck equations (along with a potential extension to linear PDEs). Diffusion models lie outside our domain of expertise. Nevertheless, we will study these models and attempt relevant experiments, to be included in this paper as well as in future investigations. Thank you for this constructive suggestion.

% [ML: Say diff. models are outside our domain of expertise. We will study it and attempt to run experiments with it, however, ....]

### Error bounds generalization to other PDEs
> Do the authors think that the proof techniques used here for the Fokker-Planck equation to develop the error bounds can be used to generalize these results to general PDEs or a class of PDEs (parabolic PDEs)?

### Response:
Our proof framework extends to linear PDEs under standard initial and boundary conditions, as discussed in Remark 3 and Appendices A.9, B.6, and C.8. For example, we have included a small-scale experiment on the heat equation (a parabolic PDE) to illustrate this point. Extending our method to nonlinear PDEs is ongoing work, and we believe it is a promising direction.

---

# Reviewer nA4b
* Summary: The authors propose the use of physics-informed neural networks to approximate the solution PDF of stochastic differential equations. Bounds on the worst-case approximation error are provided through a series of error functions learnt using additional physics-informed neural networks. These bounds are simplified in terms of two networks, and a bound requiring a single network is also presented. The error bounds and performance of the proposed method are shown on solvable, non-linear, and high-dimensional SDEs.
* Main weakness: 
  * The primary tuning parameters in the method appear to be the weights introduced in Eq. (6). The authors provide the values used in the experiments, but it is unclear to me how these should be tuned for general problems.
  * The presentation of Table 1 could be clearer. The time in seconds for training $\hat p$ and $\hat e_1$ is rounded to the nearest 10 for the non-linear and high-dimensional problems, suggesting this may not have been carefully monitored. Further, it is unclear if the units for the Monte Carlo method are presented as samples or computation time in seconds. This should be clarified, and if it is the former, time taken should also be reported.
* Detailed comments:
  * In Figure 3a, is the Gaussian mixture a standard approach to this problem? If so, a reference should be provided; otherwise, it should be justified.
  * Figure 21 is missing labels.


# Author Response to Reviewer nA4b

We sincerely thank Reviewer nA4b for taking the time to read our manuscript and for providing such thoughtful comments. 

### Tuning the weights in Eq. 6
> The primary tuning parameters in the method appear to be the weights introduced in Eq. (6). The authors provide the values used in the experiments, but it is unclear how these should be tuned for general problems.

### Response:
Thank you for your comment. Weight selection in PINN training remains an open question with no universal rule. Numerous strategies, including automated tuning, exist in the literature.

In our work, the primary focus is the theoretical derivation of error bounds. We therefore assume that the weights are chosen well enough for the PINN training loss to converge to a small value. If the weights are not tuned appropriately for the PINN $\hat{p}$, a large training loss leads to a poor approximation of $p$ and ultimately to a larger error bound, regardless of how well the weights for training $\hat{e}_1$ are chosen.

It is important to note that our error-bound derivation does **not** require the weights to be optimally tuned. In all our experiments, we used a simple, fixed weighting scheme. The results show that this scheme suffices to construct the proposed error bound, provided the training loss is sufficiently small.

In the final version of the manuscript, we will state our weight selection explicitly so that researchers familiar with PINN training can build upon our simple approach. It would also be valuable to investigate how the error bound behaves under more advanced weight-tuning strategies for the same problem.
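
As a rough illustration of what a fixed weighting scheme means in practice, the sketch below combines a PDE-residual term and an initial-condition term with constant weights. The exact form of Eq. (6) is not reproduced here: the function name `weighted_pinn_loss`, the two-term structure, and the weight names `w_pde` and `w_ic` are assumptions made for illustration only.

```python
import numpy as np


def weighted_pinn_loss(residual_pde, residual_ic, w_pde=1.0, w_ic=1.0):
    """Weighted sum of mean-squared residuals (hypothetical two-term loss).

    residual_pde: PDE residual evaluated at collocation points.
    residual_ic:  mismatch with the initial condition at initial-time points.
    w_pde, w_ic:  fixed, user-chosen weights (not tuned during training).
    """
    loss_pde = np.mean(np.square(residual_pde))
    loss_ic = np.mean(np.square(residual_ic))
    return w_pde * loss_pde + w_ic * loss_ic


# Toy residuals, only to show how the weights enter the total loss.
example_loss = weighted_pinn_loss(
    residual_pde=np.array([1.0, -1.0]),
    residual_ic=np.array([2.0]),
    w_pde=0.5,
    w_ic=2.0,
)
```

With such a scheme, the weights stay constant throughout training; an adaptive strategy would instead update `w_pde` and `w_ic` between iterations.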


### Clarity in Table 1
> The presentation of Table 1 could be clearer. Time in seconds for training $\hat{p}$ and $\hat{e}_1$ are rounded to the nearest 10 for the non-linear and high-dimensional problems, suggesting this may not have been carefully monitored. Further, it is unclear if the units for the Monte Carlo method are presented as samples or computation time in seconds. If it is the former, time taken should also be reported.

### Response:
Thank you for these observations. We acknowledge that rounding training times to the nearest 10 seconds may appear coarse. More detailed timing data is recorded in our supplementary code and will be reported explicitly in the final version.

Regarding the Monte Carlo (M.C.) column:
- The reported values denote total simulation time (in seconds), also rounded to the nearest 10. Although this is mentioned in the Table 1 caption, we will clarify it further in both the caption and the main text.
- Information on sample size and integration time step for each experiment, currently in Appendix B, will be highlighted more clearly in the final version.

We hope these clarifications improve transparency regarding computational costs and timing comparisons.

### Gaussian Mixture Method in Figure 3a
> In Figure 3a, is the Gaussian mixture a standard approach to this problem? If so, a reference should be provided; otherwise, it should be justified.

### Response:
Thank you for your comment. We used a Gaussian mixture model (GMM) in Figure 3a as a complementary comparison since GMMs are a standard tool for uncertainty propagation in nonlinear dynamics (e.g., Terejanu et al., 2008; Vittaldev et al., 2016). Although these methods do not provide continuous-time error bounds, they serve as a classical alternative to our PINN-based approach. In the final version of the paper, we will add further references and clarify this choice in the Related Work section.

---


# Reviewer D41i
* Main strengths:
  * I have given the paper a 2 for the originality/novelty score, although this is because it is not stated in the paper whether the recursive error bound construction is novel or not. If it is novel I would increase to a 3, and if it is not I will leave the score where it is
  
* Main weakness: One main weakness is that it is not fully explored why the authors wish to bound the exact form of the error that they do as opposed to, say, the total error bound. Nor is it explored as to why exactly their form of error provides them with a much tighter bound than the existing literature on total error bounds. It would be nice to include a couple of paragraphs explaining why the particular error they study is important, and why it yields such nice bounds.

* Detailed comments
  * Page 3: "during training spatial-temporal data points ... are sampled" What distribution are they sampled from? And does this have an effect on e.g. the training? The loss functions are expectations with respect to the empirical measures defined by the sampled points, so it seems like the distribution these points are sampled from is important.
  

# Author Response to Reviewer D41i

We would like to thank Reviewer D41i for thoroughly reading our manuscript and providing constructive feedback. 

### Novelty of recursive error bound construction
> I have given the paper a 2 for the originality/novelty score, although this is because it is not stated in the paper whether the recursive error bound construction is novel or not. If it is novel I would increase to a 3, and if it is not I will leave the score where it is.

### Response:
Thank you for your comment. Our recursive error bound construction is novel. Rather than directly analyzing the residual of the learned solution (see Related Work), our method recursively constructs error functions, showing that only two additional networks are sufficient to achieve arbitrarily tight bounds under certain conditions. We will emphasize this novelty in the revised manuscript.

### Motivation of bounding worst-case error
> *One main weakness is that it is not fully explored why the authors wish to bound the exact form of the error that they do as opposed to, say, the total error bound. Nor is it explored as to why exactly their form of error provides them with a much tighter bound than the existing literature on total error bounds. It would be nice to include a couple of paragraphs explaining why the particular error they study is important, and why it yields such nice bounds.*

### Response:
Our method focuses on worst-case error bounds because they capture the maximum approximation error at specific times and regions, details that total error bounds overlook (see Related Work). For systems described by stochastic differential equations (equivalent to Fokker-Planck equations), this approach allows us to non-trivially upper or lower bound event probabilities (e.g., the chance of entering an unsafe or target region). In applications like autonomous driving, precise error bounds are essential for setting reliable safety margins, such as estimating the probability of pedestrian crossings. Similarly, in systems subject to chance constraints, knowing the worst-case error improves robustness. We will expand on these motivations in the revised manuscript.
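
To make the link between a worst-case bound and event probabilities concrete, the following short derivation may help. It is a sketch under assumptions not stated in this response: $A \subseteq X'$ is a measurable set with finite Lebesgue measure $|A|$, $x_t$ denotes the SDE state at time $t$, and $B(t)$ is any valid bound satisfying $\sup_{x \in X'} |p(x,t) - \hat{p}(x,t)| \leq B(t)$.
\[
\left| \Pr(x_t \in A) - \int_A \hat{p}(x,t)\,dx \right|
= \left| \int_A \bigl(p(x,t) - \hat{p}(x,t)\bigr)\,dx \right|
\leq B(t)\,|A|.
\]
Equivalently, $\int_A \hat{p}(x,t)\,dx - B(t)\,|A| \leq \Pr(x_t \in A) \leq \int_A \hat{p}(x,t)\,dx + B(t)\,|A|$, so a tighter $B(t)$ directly yields a tighter interval on the event probability.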

### Sampling Distribution
> *"During training, spatial-temporal data points … are sampled" What distribution are they sampled from? And does this have an effect on e.g. the training? The loss functions are expectations with respect to the empirical measures defined by the sampled points, so it seems like the distribution these points are sampled from is important.*

### Response:
Thank you for highlighting the importance of the sampling strategy. In our experiments, we initially use uniform and normally distributed sampling over the space-time domain, a common approach in PINN research. We then apply an adaptive sampling procedure (see Lu et al. 2021) to add collocation points where the residual is largest (details in Appendix C). We agree that the sampling distribution affects training efficiency and convergence. In the revised manuscript, we will clearly state our use of uniform, normal, and adaptive sampling, and include a brief discussion on advanced sampling techniques with appropriate references.

We hope these clarifications offer a deeper understanding of our sampling methodology and its potential extensions. Please note that our primary focus is on the theoretical derivation of error bounds and their validation through numerical experiments.
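
The adaptive step described above can be sketched as follows. This is a minimal illustration in the spirit of residual-based refinement, not the exact procedure of Appendix C: the function `adaptive_refine`, the callable `residual_fn` (a stand-in for the PDE residual of the current network), and the default candidate and batch sizes are all hypothetical.

```python
import numpy as np


def adaptive_refine(points, residual_fn, lower, upper,
                    n_candidates=1000, n_add=10, rng=None):
    """Append the candidate points with the largest absolute residual.

    points:      (n, d) array of current collocation points.
    residual_fn: maps an (m, d) array of points to m residual values
                 (hypothetical stand-in for the network's PDE residual).
    lower/upper: length-d bounds of the space-time box to sample from.
    """
    rng = np.random.default_rng() if rng is None else rng
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    # Draw uniform candidates inside the box [lower, upper].
    candidates = rng.uniform(lower, upper, size=(n_candidates, lower.size))
    scores = np.abs(residual_fn(candidates))
    # Keep the n_add candidates where the residual magnitude is largest.
    worst = np.argsort(scores)[-n_add:]
    return np.vstack([points, candidates[worst]])


# Toy usage: a fake residual that grows with the first coordinate.
initial_points = np.zeros((5, 2))
refined_points = adaptive_refine(
    initial_points,
    residual_fn=lambda z: z[:, 0],
    lower=[0.0, 0.0],
    upper=[1.0, 1.0],
    rng=np.random.default_rng(0),
)
```

In the toy usage, the added points cluster near the edge where the fake residual is largest, which mirrors how refinement concentrates collocation points in poorly fitted regions.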

---


# Reviewer DjEj
* Summary: This paper proposes a recursive approach to handle the error in PINNs. It develops a theoretical framework to construct tight error bounds using PINNs. It could induce tighter error bounds and better experimental results.
* Main strength: The method is new and tries to tackle the accuracy problem in PINNs, **which may attract a large audience in this field.**
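* Q2-4 Reproducibility: 2: Fair: key resources (e.g. proofs, code, data) are unavailable but key details (e.g. proof sketches, experimental setup) are sufficiently well-described for an expert to confidently reproduce the main results.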
* Main weakness: The motivation from a theoretical view is not clearly stated (see below), which makes it hard for me to justify the value of the proposed method.
* Detailed comments:
  * Could you explain the relation between the Fokker-Planck equation and the error analysis? Could the proposed method be applied to the original PDE instead of the FP-equation? For example, using a PINN method to learn the residuals of the original PDE.
  * Is it possible that a smaller approximation error leads to worse generalization or over-fitting? How about the generalization error of the proposed method?
  * I am not very clear about Definition 1; that is, if $e_{i-1}$ and $\hat{e}_{i-1}$ are both large, but $e_i$ is near zero, how does the algorithm improve the accuracy?
  * The motivation for the tighter bound is not stated in a mathematically rigorous way. Do you mean going from a mean square error to an $L_\infty$ error?


# Author Response to Reviewer DjEj

We sincerely appreciate the time you took to read our manuscript and provide insightful comments. 


### Reproducibility Assessment
> The method is new and tries to tackle the accuracy problem in PINNs, **which may attract a large audience in this field.**
> key resources (e.g. proofs, code, data) are unavailable but key details (e.g. proof sketches, experimental setup) are sufficiently well-described for an expert to confidently reproduce the main results.

### Response:
Thank you for your summary and for noting the potential impact of our recursive error handling approach for PINNs. We want to clarify that complete proofs, code, and data are available in the appendices and supplementary material, which we uploaded with the paper and believe are available to the reviewer. In the final version, we will highlight these resources even more to help experts replicate our results confidently.

### Motivation of the proposed error bound method
> *“The motivation from a theoretical view is not clearly stated, which makes me hard to justify the value of the proposed method.”*

### Response:
Our method focuses on worst-case error bounds because they capture the maximum approximation error at specific times and regions, details that total error bounds overlook (see Related Work). For systems described by stochastic differential equations (equivalent to Fokker-Planck equations), this approach allows us to non-trivially upper bound event probabilities (e.g., the chance of entering an unsafe or target region). In applications like autonomous driving, precise error bounds are essential for setting reliable safety margins, such as estimating the probability of pedestrian crossings. Similarly, in systems subject to chance constraints, knowing the worst-case error improves robustness. We will expand on these motivations in the final version.

### Relation Between Fokker–Planck and the Original PDE
> *“Could you explain the relation between the Fokker–Planck equation and the error analysis? Could the proposed method be applied to the original PDE instead of the FP-equation?”*

### Response:
The “original PDE” in our work **is** the Fokker–Planck equation itself, which governs the evolution of the PDF for a stochastic differential equation (SDE). Our error analysis is therefore directly tied to this PDE. More broadly, as noted in Remark 3 and Appendices A.9, B.6, and C.8 of the paper, our proof framework also applies to other linear PDEs under standard initial/boundary conditions. To illustrate this generalization, the paper includes a small-scale experiment on the heat equation (a parabolic PDE). Extending our approach to nonlinear PDEs is ongoing work and remains a promising direction.

### Possible Overfitting and Generalization Error
> *“Is it possible that smaller approximation error leads to worse generalization or over-fitting? How about the generalization error of the proposed method?”*

### Response:
Our approach bounds the worst-case error over a chosen subset $X' \times T$ of the domain. If generalization is defined on $X' \times T$, then our bound guarantees that the error remains within that limit. However, if the learned solution and its error bound are applied outside the training domain, no such guarantee holds. 

Overfitting can arise if the training samples (Eq. 7a–b) do not sufficiently cover the domain. To address this, we rely on uniform and normally distributed sampling over the space–time domain (common practice for PINNs) and further employ an adaptive sampling method (Lu et al., 2021) that adds collocation points where residuals are largest (see Appendix C).

In the final version, we will clarify our use of uniform, normal, and adaptive sampling. We will also reference relevant advanced training strategies that address overfitting in PINNs (e.g., Basir 2023).

### Clarification of Definition 1
> *“If $e_{i-1}$ and $\hat{e}_{i-1}$ are both large, but $e_i$ is near zero, how does the recursive algorithm improve accuracy?”*

### Response:
Our recursive error function approach is designed to rigorously bound the worst-case error between the learned PDF solution $\hat{p} := \hat{e}_0$ and the true solution $p := e_0$ (see Definition 1), rather than to directly improve approximation accuracy. 
When both $e_{i-1}$ and $\hat{e}_{i-1}$ are large but $e_i$ is near zero, it indicates that $\hat{e}_{i-1}$ is accurately capturing the true error $e_{i-1}$. Although training the next error function $\hat{e}_i$ can be numerically challenging when $e_i$ is very small, this does not undermine our theoretical framework for constructing the error bound of $p-\hat{p}$.  In fact, it further highlights our contribution in developing a method that constructs error bounds using a finite number of approximate error functions. We hope this clarifies your concern.

### Tighter Bound vs. $L_{\infty}$ Error
> *“The motivation on the tighter bound is not mathematically rigorous. Do you mean going from mean square error to $L_{\infty}$ error?”*

### Response:
Thank you for your question. Our worst-case error, as defined in Eq. (5), is the maximum deviation of $\hat{p}(x,t)$ from the true solution $p(x,t)$ over the spatial domain $X'$ at each time $t$, i.e.,
\[
\sup_{x \in X'} \bigl|p(x,t) - \hat{p}(x,t)\bigr|.
\]
This is effectively an $L_{\infty}$ error.

When comparing two valid error bounds, $B_1(t)$ and $B_2(t)$, such that
\[
\sup_{x \in X'} \bigl|p(x,t) - \hat{p}(x,t)\bigr| \leq B_1(t)
\quad \text{and}\quad
\sup_{x \in X'} \bigl|p(x,t) - \hat{p}(x,t)\bigr| \leq B_2(t),
\]
we say $B_2$ is tighter than $B_1$ if $B_2(t) < B_1(t)$ for each $t$.

**Motivation for a tighter error bound.** In practice, $B_1$ is easier to construct, as it involves training only one error function $\hat{e}_1$. However, $B_1$ may be overly conservative in certain applications. Therefore, we investigate the “best” possible bound that can be constructed with finitely many neural networks. Our main result shows that by training two error functions, we obtain a second-order bound, $B_2$, that can become arbitrarily tight under suitable conditions.
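
For intuition, the worst-case error defined above can be estimated numerically on a grid when a reference solution is available. The sketch below is an assumption-laden illustration: `worst_case_error`, the toy density `p_true`, and the perturbed approximation `p_approx` are hypothetical, and a grid maximum only approximates the supremum from below, so it serves as a diagnostic rather than a certified bound.

```python
import numpy as np


def worst_case_error(p, p_hat, xs, ts):
    """Grid estimate of sup_x |p(x, t) - p_hat(x, t)| for each t in ts."""
    X, T = np.meshgrid(xs, ts, indexing="ij")  # shape (len(xs), len(ts))
    return np.max(np.abs(p(X, T) - p_hat(X, T)), axis=0)


def p_true(x, t):
    """Toy reference density: zero-mean Gaussian with variance 1 + t."""
    var = 1.0 + t
    return np.exp(-x**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


def p_approx(x, t):
    """Toy approximation: the reference plus a small oscillating error."""
    return p_true(x, t) + 0.01 * np.cos(3.0 * x)


xs = np.linspace(-3.0, 3.0, 601)  # spatial grid standing in for X'
ts = np.linspace(0.0, 1.0, 5)     # time grid standing in for T
errors = worst_case_error(p_true, p_approx, xs, ts)  # one value per time
```

Any valid bound $B_1(t)$ or $B_2(t)$ must lie above such a grid estimate at every $t$, and the gap between them indicates how conservative the bound is.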

We hope these revisions help clarify the theoretical foundations and practical significance. We would be happy to further elaborate on any details if the reviewer feels it would be helpful.

---


# Reviewer B89t
* Main weakness: Corollary 1's proof is based on a seemingly unreasonable assumption. See comment #6. Corollary 1's proof is based on the assumption that there exists a "virtual" $\hat{e}_2 = e_2$. But the practical challenge argument in Sect. 4 is that it is extremely difficult to train $\hat{e}_2$. How does this make sense?
* Detailed Comments:
  * pg. 1, "introduce a framework for tightly bounding the worst-case approximation error as a function of time" => What about the subsets of space mentioned in the previous paragraph? I suggest that authors remove the subsets of space in the previous paragraph.
  * Problem 1, I wonder why X' was used to give the notion of bounded subset. On the contrary, T was straightly used with 'bounded' modifier.
  * Assumption 1, I wonder why non-negative real number condition for PDF R>=0 is not obeyed or discussed here.
  * pg. 4, "Although the weight association may affect convergence rate, Shin et al., Mishra and Molinaro justify the theoretical convergence of PINNs output to the true solution as the loss is minimized." => It's unclear what authors argue here. They justified the theoretical convergence REGARDLESS of chosen weights?
  * A.3, $\hat{e}^*_{i-1} = e^*_{n-2} > 0$ => There's no ground for $\hat{e}^*_{i-1} = e^*_{n-2}$. (short response: typo, it is just $\hat{e}^*_{i-1} > 0$ by Assumption 2.)
  * Eqs. 23b and 23c, It is misleading that these equations are based on Definition 1. In fact, these equations are based on the fact that $\hat{e}_1$ approximates $e_1$, along with Eqs. 8 and 9.
  * pg. 6, regularization loss to Eq. (6): => Isn't it supposed to be added to Eq. 23a? Or both?
  * pg. 6, Similarly, In Assumption 1 and its discussion, authors argue that $\hat{p}$ can be smooth by a fully connected NN with smooth activations. If this is the case, why do you even need to introduce gradient regularization?
  * Tbl. 1, I wonder why times are not reported for high-dim N.I. cases?


# Author Response to Reviewer B89t

Thank you for your positive overall evaluation of our work and for the detailed comments that will help us strengthen the manuscript.

---

### Corollary 1's Proof
> **“Corollary 1’s proof is based on a seemingly unreasonable assumption. See comment #6. Corollary 1's proof is based on the assumption that there exists a ‘virtual’ $\hat{e}_2 = e_2$. But the practical challenge argument in Sect. 4 is that it is extremely difficult to train $\hat{e}_2$. How does this make sense?”**

### Response:
Thank you for your careful review. To clarify, there are two points to note:

1. The second-order bound (Section 4) does require training $\hat{e}_2$, and we acknowledge that achieving the necessary accuracy for $\hat{e}_2$ is challenging in practice.

2. In contrast, our first-order bound (Section 5) assumes a “virtual” $\hat{e}_2 = e_2$ only for theoretical reasoning. In practice, the first-order bound does not require training $\hat{e}_2$, thus avoiding those challenges.

We will make this distinction clearer in the final version.

---

## Detailed Comments

1. **Page 1, “function of time” vs. “subsets of space.”**  
   Thank you for the suggestion. We will revise the text to:  
   “Introduce a framework for tightly bounding the worst-case approximation error as a function of time over the subset of interest…”  
   This clearly indicates that our analysis addresses both the time dimension and specific spatial subsets.

2. **Problem 1, the use of $\mathbf{X'}$.**  
   Thank you for your comment. We introduced $X'$ to emphasize that we focus on bounded spatial subsets. Meanwhile, $T$ is defined as a bounded time interval to avoid potential ill-posedness of the Fokker–Planck equation over an infinite horizon. We will add a clarifying sentence to ensure this distinction is clear to the readers.

3. **Assumption 1, non-negative real values for PDFs.**  
   Thank you for your comment. In our setup, $\hat{p}$ approximates the true PDF $p$, and a non-negative output transformation (e.g., exponential, softplus, or squared output) can be used to ensure $\hat{p} \geq 0$. We will add a note to clarify this in the final version.

4. **Page 4, weighting remark.**  
   Thank you for your comment. Our intention was to note that the theoretical convergence of PINNs, established by works such as Shin et al. (2017) and Mishra and Molinaro (2023), does not depend on optimal weight selection per se. While weight tuning can affect the convergence speed, the fundamental guarantee of convergence remains intact. We will clarify this point in the final version.

5. **A.3, $\hat{e}^*_{i-1} = e^*_{n-2}$ > 0.**  
   Thank you for catching that typo. We meant to state that $\hat{e}^*_{i-1} > 0$ by Assumption 2. We will correct this in the revised manuscript.

6. **Equations 23b and 23c.**  
   Thank you for pointing this out. Indeed, Definition 1 summarizes the general recursive approach, but Eqs. 23b and 23c are more specifically based on how $\hat{e}_1$ approximates $e_1$ (Eqs. 8 and 9). We will revise the text so that the references trace back to Eqs. 8 and 9, reducing any confusion.

7. **Page 6, adding regularization & “If the NN is smooth, why do we need gradient regularization?”**  
   - **Location of the regularization:** For constructing the **first-order** error bound, we only add gradient regularization to the training of $\hat{p}$ (Eq. 6), not to Eq. 23a for the training of $\hat{e}_1$. This is primarily because $\hat{e}_1$ does not need to feed into another error network (like $\hat{e}_2$).
   - **Why we still need regularization when $\hat{p}$ is smooth by construction:** Even though $\hat{p}$ is smooth due to its activation functions, the differential operator $\mathcal{D}[\hat{p}]$ can still oscillate significantly, making it hard to train subsequent networks (e.g., $\hat{e}_1$). Adding gradient regularization helps to regulate large variations in $\mathcal{D}[\hat{p}]$, thus making the physics-informed learning of $\hat{e}_1$ simpler. We will expand this discussion in the main text below Eq. 23c and note that additional techniques exist in the literature for handling oscillatory PDE residuals in PINNs.

8. **Table 1, timing for high-dimensional N.I.**  
   Thank you for your comment. In our study of high-dimensional time-varying Ornstein-Uhlenbeck (OU) processes, we rely on a semi-analytical integration method to calculate the “true” probability density function (PDF). This method directly integrates the Gaussian distributions by taking advantage of the time-varying linear dynamics. Unlike our PINN approach or the standard Monte Carlo simulation (both of which handle general nonlinear dynamics), the integration method we use does not involve simulating multiple trajectories. As a result, the timings for “N.I.” are not directly comparable to those for PINNs or Monte Carlo. We will update the manuscript with clearer explanations and include a footnote with the exact timings to prevent any further confusion.
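
To illustrate why the reference density in item 8 needs no trajectory simulation, the sketch below propagates the mean and covariance of a linear SDE $dx = A(t)\,x\,dt + G(t)\,dW$ with a Gaussian initial condition, whose density stays Gaussian. This is a generic textbook construction and not necessarily the integration scheme used in the paper: the function names, the RK4 integrator, and the example coefficients are assumptions.

```python
import numpy as np


def propagate_gaussian_moments(A, G, m0, P0, t_final, n_steps=1000):
    """Integrate dm/dt = A(t) m and dP/dt = A P + P A^T + G G^T with RK4.

    A, G: callables returning the drift and diffusion matrices at time t.
    m0, P0: initial mean vector and covariance matrix.
    """
    def rhs(t, m, P):
        At, Gt = A(t), G(t)
        return At @ m, At @ P + P @ At.T + Gt @ Gt.T

    m = np.asarray(m0, dtype=float)
    P = np.asarray(P0, dtype=float)
    h = t_final / n_steps
    t = 0.0
    for _ in range(n_steps):
        k1m, k1P = rhs(t, m, P)
        k2m, k2P = rhs(t + h / 2, m + h / 2 * k1m, P + h / 2 * k1P)
        k3m, k3P = rhs(t + h / 2, m + h / 2 * k2m, P + h / 2 * k2P)
        k4m, k4P = rhs(t + h, m + h * k3m, P + h * k3P)
        m = m + h / 6 * (k1m + 2 * k2m + 2 * k3m + k4m)
        P = P + h / 6 * (k1P + 2 * k2P + 2 * k3P + k4P)
        t += h
    return m, P


def gaussian_pdf(x, m, P):
    """Density of N(m, P) evaluated at a single point x."""
    diff = np.asarray(x, dtype=float) - m
    quad = diff @ np.linalg.solve(P, diff)
    log_det = np.linalg.slogdet(P)[1]
    return np.exp(-0.5 * (quad + m.size * np.log(2 * np.pi) + log_det))


# Example with constant coefficients (hypothetical values) in two dimensions.
m1, P1 = propagate_gaussian_moments(
    A=lambda t: -0.5 * np.eye(2),
    G=lambda t: 0.3 * np.eye(2),
    m0=[1.0, -1.0],
    P0=0.1 * np.eye(2),
    t_final=1.0,
)
density_at_mean = gaussian_pdf(m1, m1, P1)
```

The cost is that of solving a small ODE system for the moments, which is why its runtime is not comparable to methods that must handle general nonlinear dynamics.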

---

Once again, thank you for your thoughtful review. We will incorporate these clarifications and corrections into the revised manuscript to ensure it is both more precise and more transparent. We appreciate your helping us refine our work, and we look forward to finalizing these improvements.

---

# Reviewer B89t Second Comment

1. After reading the response, I still have doubts about the strong assumption in Corollary 1's proof. The authors' argument on the practicality of the first-order bound (not needing to worry about $\hat{e}_2$) is ONLY guaranteed after Corollary 1. In other words, the argument doesn't make sense when Corollary 1's soundness is still questioned.
   I think a better way to resolve this situation is for the authors to admit in the text that they opt for the practical first-order bound under this strong assumption. In other words, the authors should acknowledge that this is an approximation, rather than arguing that the first-order version is equivalent to the second-order one.

2. I still wonder what bounded subset of space authors are talking about. Authors used 'bounded' for time interval in more casual manner. But is there any definition for why considering X' instead of X is more relevant?


# Second Response to Reviewer B89t
We sincerely thank the reviewer for their time, prompt response, and willingness to engage in an active discussion.

### Response to Corollary 1's Proof
* We agree with the reviewer that the strong assumption, i.e., that there exists a virtual function $\hat{e}_2(x,t) = e_2(x,t)$, should be clarified further. Indeed, this assumption is only guaranteed to hold **approximately** (e.g., if the virtual function $\hat{e}_2$ is a **finite neural network**). Recognizing this limitation, we would like to correct the proof of Corollary 1, following a similar path but from a different perspective.
    1. $e_1 = p - \hat{p} = \hat{e}_1 + e_2$, which implies
    2. $|e_1(x,t)| \leq \max_x|\hat{e}_1(x,t)| + \max_x|e_2(x,t)|$.
    3. Instead of viewing $e_2(x,t)$ as a 'virtual and perfect' approximation via $\hat{e}_2(x,t)$ such that $\hat{e}_2(x,t) = e_2(x,t)$ over $x \in X^{'}$ at time $t$, we **keep $e_2(x,t)$ as exact**.
    4. By the definition of $\alpha_1$, with $0<\alpha_1(t) < 1$, and the recursive error definition (Def. 1), we have $\alpha_1(t) \max_x|\hat{e}_1(x,t)| = \max_x|e_1(x,t)-\hat{e}_1(x,t)| = \max_x|e_2(x,t)|$.
    5. Hence, step 2 becomes $|e_1(x,t)| \leq \max_x|\hat{e}_1(x,t)| + \alpha_1 (t)\max_x|\hat{e}_1(x,t)| = \max_x|\hat{e}_1(x,t)|(1+\alpha_1(t)) < 2 \hat{e}_1^*(t)$, which is essentially Eq. 41 of the original proof.

  This revised version therefore **does not require** the strong existence assumption $\hat{e}_2(x,t) = e_2(x,t)$. Once again, we thank the reviewer for the careful review, which helped us correct our errors and improve the manuscript.

### Response to why bounded subset $X^{'}$ ?
* $X^{'}$ can be any bounded subset of $X$. Intuitively, $X^{'}$ represents the set on which we are interested in computing the approximation error. In practice, we do not take $X^{'}=X$ because $X$ could be unbounded, and the boundedness of $X^{'}$ is required to guarantee that $\hat{p}$ remains bounded. If one additionally assumes that $\hat{p}$ remains bounded on $X$, then one can indeed simply take $X^{'}=X$.



