\documentclass{article}

\usepackage{amsfonts}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{xcolor}
\usepackage{todonotes}

\newcommand{\lc}[1]{{\color{blue}LC : #1}}
\newcommand{\gd}[1]{{\color{red}GD : #1}}
\newcommand{\fk}[1]{{\color{green}FK : #1}}
\newcommand{\lz}[1]{{\color{orange}LZ : #1}}
\newcommand{\bl}[1]{{\color{violet}BL : #1}}
\newcommand{\av}[1]{{\color{teal}AV : #1}}

\begin{document}

\lc{\textbf{Answer for all the reviewers :} We thank all the reviewers for their work evaluating our manuscript. Their comments and questions are extremely useful for improving the presentation of our work. In the revised version, we will integrate all the remarks to fix the remaining typos, inconsistencies in the notation, and unclear sentences.

* Reviewers pointed out that Appendices A and B require prior knowledge of the GAMP literature and are not self-contained. Following their remarks, we will make the appendices self-contained by providing extensive details on the different steps of the computations : in particular, we will first add the derivation of the state evolution equations (36) of GAMP; these computations are fairly standard and similar to e.g. [1], [2]. We will also show in detail how equations (36) simplify to equations (46) in the ridge case, and detail the derivations of equations (49 - 52) in Appendix B. Finally, we will provide intuition on the core concepts of GAMP, e.g. the "channel" and "denoising" functions and how they are linked to the prior and likelihood.

* We understand that reading the self-consistent equations (17, 18) and (25, 26) can be tedious. Concerning Theorem 4.1, while the equations cannot be simplified for general resampling, in the case of subsampling with ratio $r$ they can be written in a much simpler form, which we give below (omitting the $ss$ superscript on the overlaps) :
$$ v = \frac{1 - \lambda - \alpha r + \sqrt{(\alpha r + \lambda - 1)^2 + 4 \lambda}}{2 \lambda} $$
$$ m = \rho ( 1 - \lambda v)$$
$$ Q_{11} = m^2 \times \frac{\alpha \rho r + \rho + \Delta - 2 m}{\alpha \rho^2 r - m^2} $$
$$ Q_{12} = m^2 \times \frac{\alpha \rho + \rho + \Delta - 2m}{\alpha \rho^2 - m^2} $$

We note that in this form, the quantities $\hat{m}, \hat{Q}, \hat{V}$ do not appear. We will add a paragraph in section 4.1 dedicated to subsampling and containing these closed-form equations. Similarly, in Theorem 4.2 we will replace equations (25, 26) by the following equations, which are both equivalent and simpler :

$$    v_1 = v_2 = \frac{1 - \lambda - \alpha + \sqrt{(\alpha + \lambda - 1)^2 + 4 \lambda}}{2\lambda}  $$
$$    m_1 = m_2 = \rho(1 - \lambda v) $$
$$    Q_{11} = m^2 \times \frac{\alpha \rho + \rho + \Delta - 2 m}{\alpha \rho^2 - m^2} $$
$$    Q_{12} = m^2 \times \frac{\alpha \rho + \rho - 2m}{\alpha \rho^2 - m^2} $$

We stated these equations in Appendix C.1.2 (equation 79) but agree that putting them in the main text will make the theorem more concrete.
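
As a quick consistency check on the closed forms above (a sketch that uses only the displayed formulas, not an additional result), the expression for $v$ in the subsampling case is the positive root of the quadratic
$$ \lambda v^2 + (\alpha r + \lambda - 1) v - 1 = 0 $$
since the product of its two roots is $-1/\lambda < 0$ for any $\lambda > 0$. Setting $r = 1$ in the subsampling equations gives
$$ v = \frac{1 - \lambda - \alpha + \sqrt{(\alpha + \lambda - 1)^2 + 4 \lambda}}{2 \lambda}, \qquad Q_{11} = m^2 \times \frac{\alpha \rho + \rho + \Delta - 2 m}{\alpha \rho^2 - m^2} $$
which coincide with the expressions for $v_1 = v_2$ and $Q_{11}$ written for Theorem 4.2.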

[1] [2202.03295, Appendix A]

[2] [1806.05451, Appendix F]


}


\section{Reviewer Vccu, Score 7 confidence 4}


\lc{ \textbf{Answer : } We thank the reviewer for their comments, and their appreciation of our work. 

> It would have been nice to see some empirical evidence

We agree that it would be interesting to see to what extent our results hold on real datasets. However, the goal of our work is to compare the performance of resampling methods in a model where the data-generating process is known. This allows us, for instance, to compare with the performance of the Bayes-optimal estimator, which is not possible with real data.

> One question I had was whether similar results hold in the case of the multiplicative bootstrap

If the reviewer refers to the paper arxiv:2309.03354, then our theory can be used to study the multiplicative bootstrap. Indeed, it consists of minimizing 
$
L^k = \sum_{i = 1}^N w_{k, i} (y_i - x_i^Tb)^2
$
with a multiplier matrix $w_{k, i}$. Our model can accommodate any choice of distribution of $w_{k, i}$ under the conditions that 1) the $w_{k, i}$ are identically distributed, 2) for a fixed $k$, $w_{k, i}$ and $w_{k, i'}$ are independent for $i \neq i'$, and 3) for a fixed $i$, all the pairs $(w_{k, i}, w_{k', i})$ for ${k \neq k'}$ follow the same distribution. Of course, the performance of the bootstrap then depends on the choice of $w_{k, i}$.
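
As an illustration (a sketch; the two laws below are examples picked here, with the unit Poisson rate taken as an assumption, and not an exhaustive list), two weight distributions that satisfy conditions 1) to 3) when the weights are drawn independently across $k$ and $i$ are
$$ w_{k, i} \sim \mathrm{Bernoulli}(r) \quad \text{or} \quad w_{k, i} \sim \mathrm{Poisson}(1) $$
The first corresponds to subsampling with ratio $r$, and the second to the i.i.d. Poisson weights to which pair bootstrap is asymptotically equivalent.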
}

\newpage
\section{Reviewer zPQL, Score 7 confidence 3}

\lc{\textbf{Answer : }
> The appendices might be too extensive

Unfortunately, the length of the appendices is partially due to the nature of the mathematical tools. It is not possible to shorten the appendices, but we refer the reviewer to the common answer, where we explain how we will make them clearer in the revised version.

> Perhaps consider changing the title and the second paragraph in the second column on the first page to avoid readers thinking you stick to $y \in R$

We agree that the use of "regression" can sometimes be confusing. As suggested, we will make it clear in the introduction that we also study classification tasks.
}

\newpage
\section{Reviewer ELCh, Score 6 confidence 3}

\lc{\textbf{Answer : } We thank the reviewer for their constructive review and exhaustive list of comments. We will make sure to integrate them in the camera-ready version of the paper, in particular those concerning inconsistencies in the notation. To address the main weaknesses : 

> they only consider a fixed ratio of dimension and sample size

This assumption is made for convenience in our computations, but all of our results also hold if we assume a limit $n/d \rightarrow \alpha$ instead. 

> The distribution of their samples looks quite restricted

In our model, the teacher weights $\theta_{\star} \sim N(0, I_d)$ are sampled from a Gaussian distribution with identity covariance. The purpose of having the input $ {x}$ scale as $1/\sqrt{d}$ is to keep the local field $\theta_{\star}^T  {x}$ of order one.
This assumption is not restrictive : alternatively, we could define our model (1) as 
    $$
    y_i \sim p\left( \cdot \,\middle|\, \frac{ \theta_{\star}^T  {x}_i }{\sqrt{d}} \right), \qquad  {x}_i \sim N(0, I_d)
    $$
where the scaling by $\sqrt{d}$ is done in the likelihood function. Moreover, we would like to point out that this is a standard assumption in the high-dimensional statistics literature, see for example [1] or [2].
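
To make the scaling explicit (a short sketch, assuming the inputs are drawn as $x \sim N(0, I_d / d)$ independently of $\theta_{\star}$), the second moments are
$$ \mathbb{E}\left[ \| x \|^2 \right] = \sum_{j=1}^d \frac{1}{d} = 1, \qquad \mathbb{E}\left[ (\theta_{\star}^T x)^2 \,\middle|\, \theta_{\star} \right] = \frac{\| \theta_{\star} \|^2}{d} \xrightarrow[d \to \infty]{} 1 $$
so each coordinate of $x$ is of order $1/\sqrt{d}$, but neither the norm of $x$ nor the local field vanishes as $d$ grows.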

Finally, note that our tools allow us to extend our study to more complex and realistic data structures such as random features, as done in [3] or [4]. However, since the paper is already dense, we leave the study of more complex data distributions to future work.

To answer additional comments : 

> What do you mean by absolute performance? 

We refer to the generalization error of the estimator.

> What is "rr" standing for? I think you have not introduced that

rr refers to residual resampling (defined in Section 4); we will introduce the acronym in the revised version.

> You do not give much intuition behind the terms that appear in equation (16)

Intuitively, $m$ is the correlation between the teacher and the resample average, while the matrix $Q$ is essentially a Gram matrix between two independent resamplings. We will add this description in the camera-ready. 

> I am wondering whether it is possible to just define $Q^t_{11}$ and additionally allow $t = fr$ next to $t = \{ pb, ss, jk \}$

The full resampling $fr$ corresponds to a resampling of the whole dataset $D$ and not to a reweighting of the $ {x}_i$, contrary to $pb, ss, jk$, which is why it is not treated the same way. It would be possible to unify $fr$ and the other resampling methods, but this would unnecessarily complicate the notation.

> I find the logical flow of Theorem 4 not optimal [ ... ]

We will add the superscript $t$ in equations (17, 18) and explain how the matrix $ {Q}$ is defined from "stacking" the overlaps defined in equation (16) : 
$$
 {Q}^t = \begin{pmatrix} Q_{11}^t & Q_{12}^t \\ Q_{12}^t & Q_{11}^t \end{pmatrix}
$$
whereas 
$$
 {Q}^{t, fr} = \begin{pmatrix} Q_{11}^t & Q_{12}^{t,fr} \\ Q_{12}^{t, fr} & Q_{11}^{fr} \end{pmatrix}
$$
Likewise, the vectors are $ {m}^t = (m_1^t, m_1^t)$ and $ {m}^{fr} = (m_1^{fr}, m_1^{fr})$. To give an intuition for the results, we will explain that the matrix $ {Q}^t$ represents the Gram matrix of two estimators trained on the same dataset with two independent resamples, while $Q^{t, fr}$ is the Gram matrix of one resampled estimator and the ERM estimator trained on the full dataset. The vector $ {m}$ contains the correlation between each estimator and the teacher $\theta_{\star}$.
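
Concretely (a sketch, assuming the usual $1/d$ normalization of the overlaps; equation (16) remains the reference definition), for two estimators $\hat{\theta}_1, \hat{\theta}_2$ obtained from two independent resamples of the same dataset, the entries read
$$ m_1 \approx \frac{\theta_{\star}^T \hat{\theta}_1}{d}, \qquad Q_{11} \approx \frac{\| \hat{\theta}_1 \|^2}{d}, \qquad Q_{12} \approx \frac{\hat{\theta}_1^T \hat{\theta}_2}{d} $$
where $\approx$ denotes equality in the high-dimensional limit.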

> The proofs, especially in Sections A and B require substantial background knowledge, I am not familiar with these proof techniques, and I have only superficially worked through these proofs.

We refer the reviewer to the common answer to all the reviewers. 

[1] arxiv:0907.3574

[2] arxiv:1803.06964

[3] proceedings.mlr.press/v206/clarte23a/clarte23a.pdf

[4] proceedings.mlr.press/v162/loureiro22a.html
}

\paragraph{Q4 Main Weakness:}
The setting that the authors consider seems somewhat restricted: First, they only consider a fixed ratio of dimension and sample size $\alpha = \frac{n}{d}$, so not even convergence of $\frac{n}{d}$ to some $\alpha$.
Second, and this is my major concern at the moment, the distribution of their samples looks quite restricted: The $x_i$'s are sampled from a normal distribution where the variance decreases with $d$. So for moderately large $d$, the $x_i$'s should all be approximately $0$, right? In other existing work, for example Karoui and Purdom [2018] there is at least some discussion on the distribution of the samples and why the assumptions are as they are.

\paragraph{Q5 Detailed Comments To The Authors:}
My main concern at the moment is that the setting looks very restrictive:

Major concern: The distribution of the samples looks quite restricted: 
\begin{itemize}
    \item The $x_i$'s are sampled from a normal distribution where the variance decreases with $d$. So for moderately large $d$, the $x_i$'s should all be approximately $0$, right? In other existing work, for example Karoui and Purdom [2018] there at least seems to be some discussion on the distribution of the samples, and if I see it correctly, the variance also does not decrease with increasing dimension $d$.

\item Minor concern: The setting that $\alpha = n / d$ also looks somewhat restrictive to me. Karoui and Purdom [2018] were looking at convergence of $n/d$ to some $\alpha$, which is more general. However, I also see that the setting you consider is difficult and starting with fixed $\alpha = n / d$ is still valuable.
\end{itemize}

Moreover: I think the presentation of the results in Section 4 can be improved. That is partly due to the mathematics being difficult, but also because:

\begin{itemize}
    \item You use the abbreviation "fr" in equation (16) before you introduce it a page later. I think it would help if you first introduce "fr" and then write down equation (16) \lc{DONE}
    \item You do not give much intuition behind the terms that appear in equation (16); I think at least roughly explaining some of these terms would help. Also, some terms in equation (16) look quite similar, for example $Q^t_{11}$ and $Q^{fr}_{11}$, I am wondering whether it is possible to just define $Q^t_{11}$ and additionally allow $t = fr$ next to $t = \{ pb, ss, jk \}$.
    \item I find the logical flow of Theorem 4 not optimal. You first introduce these self-consistent equations. However, at that point, the reader does not know how 
 $Q$ and $m$ are defined. I find that particularly problematic because I would have expected some notation like $Q^t$ or $m^t$, but because this does not appear, I wonder at this point how you stack all the $Q$-elements in equation (16) in one $Q$-matrix. The relief only comes in equations (19)-(20). But also then, one somewhat needs to guess as a reader that one either puts $Q^t_{11}$,... and so on in the $Q$-matrix, or $Q^{fr}_{11}$.
\item I think you should introduce the particular product notation between 
 $p$ and $V$ when you introduce $G(p)$. \av{Fixed, removed the $\odot$ notation to make it clearer.}
\item In equation (21), you write $Q^{t,fr}_{12}$; that however has not been defined; you have only defined $Q^{fr,t}_{12}$. I suppose both terms are the same? \av{DONE}
\item In equation (23) you use "rb" and "rr" as abbreviations. What is "rr" standing for? I think you have not introduced that.
\item In equation (24) you call the refitted estimator $\hat{\theta}_{\lambda,k}$, however in Section 2.2, you call it $\hat{\theta}_k$ or $\hat{\theta}_\lambda( {D}_k^*)$. If both of the estimators are the same (which I think at the moment), why not use consistent notation?
\end{itemize}

Other detailed comments to the main paper:
\begin{itemize}
    \item Abstract: According to the UAI-template, mathematical formulas in the abstract are to be avoided. \lc{DONE}
    \item Introduction: You write: " ... not only in the absolute performance of $\hat{\theta}$ ...". What do you mean by absolute performance?
    \item Section 2: I find it somewhat confusing that you always write $p(y|z)$ for your likelihood because you have used $Z_i$ in the introduction, and now the $z$
 looks like there is some relation to $Z_i$, however, that is not really the case, right?
    \item Section 2: You denote the Gaussian likelihood by $p(y|z)=N(z|y,\Delta)$. First, I think you should at least once introduce the notation $N(z|y,\Delta)$ and particularly explain what the vertical bar means. Second, shouldn't it be $N(y|z,\Delta)$ instead? \lc{Fixed the typo}
 \item Section 2.2: You have not introduced $B$ before you use it in equations (11) and (12). \lc{DONE}
 \item Last paragraph of Section 2.2: You write $Bias_k$, shouldn't it be $Bias_t$ instead? \lc{DONE}
 \item Footnote 1: I think you should write "Since the $ {D}_k^*$'s are independent ..." to indicate that you mean several of the $ {D}_k^*$'s and not just one. \av{DONE}
\item Equation (14): Maybe you can add that $\theta\in\mathbb{R}^d$ as you have done for the other optimization problems. \lc{DONE}
 \item Section 4.1. Maybe you can refer to the exact section of Karoui and Purdom [2018] when you cite the fact about the multinomial and Poisson distribution, given that this referenced paper is quite long. \av{DONE}
 \item Section 4.2: You write "The key difference is that the covariate $x_i$
 remain constant ...". There is a singular-plural issue here. Either you have "the covariates 
 remain ..." or "the covariate $x_i$
 remains ...". \av{DONE}
 \item References: You have the Karoui-references not all at the same place because you have different spellings of that name. That is confusing.
\end{itemize}

Regarding the supplement: For me, it seemed that there are rather many typos (and other issues) in the appendix; I think the authors should do better proofreading here. For example:

\begin{itemize}
    \item Section A: You have not introduced the abbreviation GAMP, only AMP, and that only in the main paper. I think it is good to introduce that abbreviation here again. \lc{DONE}
    \item Section A: You write in a paragraph regarding subsampling: "... from a Bernoulli distribution with probability $r\in(0,1]$ ...". In the main paper you had the open interval $r\in(0,1)$. \lc{Does not change anything in practice anyway}
    \item Section A: In the paragraph that starts with "The novelty of ..." there are the following typos: "... study the performance of GAMP for several estimators" --> "...studying the performance of GAMP for several estimators". Next sentence, shouldn't it be "properties" instead of "property"? Then, at the end of the paragraph, you also have "paralel"-->"parallel", "guaranties" --> "guarantees" and "independantly" --> "independently". \lc{DONE}
    \item In the next paragraph: "...coupled system of estimate." --> "...coupled system of estimates." \lc{DONE} Moreover, in your channel function, you sometimes seem to write $V$ in bold font, sometimes not. \lc{}
    \item Before equation (48): You write: "... consider constant sample weights $p_i=1\forall i$." There could be more empty space here in that mathematical expression in front of $\forall i$. \av{DONE, added a $\backslash ;$}
    \item Section C.1.6: "cooresponds" --> "corresponds". \av{DONE}
\end{itemize}

The proofs, especially in Sections A and B require substantial background knowledge, I am not familiar with these proof techniques, and I have only superficially worked through these proofs.

References:
Karoui and Purdom [2018]: Noureddine El Karoui and Elizabeth Purdom. Can we trust the bootstrap in high-dimensions? The case of linear models. Journal of Machine Learning Research, 19(5):1–66, 2018.

\paragraph{Q7 Justification For Your Score:} [...] I need further clarification regarding my above remarks on the restrictiveness of the paper's assumptions. If that clarification has been made and the other points I made are also sufficiently clarified, I am willing to increase my score.

\newpage
\section{Reviewer hSUt, score 3 confidence 3}

\lc{\textbf{Answer} We thank the reviewer for their comments. To address the main weakness : 

* "Main results are stated in a very tedious and implicit form [...] " "The "proofs" in the appendix are basically impossible to check without doing all the work myself. [ ... ] "  We refer to the common answer to all reviewers where we address these two points : we will write the equations in Section 4 in a simpler form where possible, and we will add a section dedicated to the special case of subsampling where the equations of Theorem 4.1 are simpler and more understandable. Concerning the appendices, we will add all the steps required to derive the state evolution equations and explain better the core concepts of GAMP.

Concerning other comments : 

* "What is the black line in Figure 1?" This line corresponds to the posterior variance of the Bayes optimal estimator, defined in equation (8) and discussed in Section 5.

* "Theorem 4.2, variance (27): Does $Q^{rb}$ really solve (17), (18)? Or is it (25), (26)?" $Q^{rb}$ indeed solves (25), (26); we will fix this mistake.
}

\paragraph{Q5 Detailed Comments To The Authors:}
\begin{itemize}
    \item What is the black line in Figure 1? \av{It corresponds to the variance of the Bayes optimal estimator}
    \item Theorem 4.2, variance (27): Does $Q^{rb}$ really solve (17), (18)? Or is it (25), (26)? \av{Indeed, it should be (25) and (26), fixed.}
    \item p.4: You mention that the pair bootstrap is asymptotically equivalent to using iid Poisson weights. I think you need to be careful with the wording here. Indeed the sequence of weight vectors converges in distribution to iid Poisson weights. However, the "error" in this convergence may still affect the asymptotics for the bootstrap. In some cases, it was shown to be negligible (Van der Vaart and Wellner, Weak convergence, 2023, Chapter 3.7), but this is not automatic.
\end{itemize}

\paragraph{Q7 Justification For Your Score:}
[...] But as explained above, the authors don't provide enough detail to justify their theoretical results - at least to a degree I could reasonably check in this review. As a result, I consider them unproven claims for now, but am willing to change my score if appropriate details are added.

\newpage
\section{Reviewer 6Sqs, score 6 confidence 3}

\lc{\textbf{Answer : } 

x
}

\paragraph{Q4 Main Weakness:}
The authors do not fully explain how the resampling methods behave when $\alpha>1$ and $\alpha=O(1)$. Can the authors clarify this situation in the rebuttal stage?
\gd{That's what the plots are for, the asymptotics are large-alpha only}

\paragraph{Q5 Detailed Comments To The Authors:}
Can we use the derived theories to improve existing resampling methods?
\gd{An actually interesting question, which might give more weight to the paper if we can find a suitable answer. At least the AMP result helps choose between existing methods depending on the alpha regime, but can we do better?}

\end{document}