\documentclass{article}
\usepackage{enumitem}
\newenvironment{QandA}{\begin{enumerate}[label=\bfseries\alph*.]\bfseries}
                      {\end{enumerate}}
\newenvironment{answered}{\par\normalfont}{}
% For theorems and such
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{amsthm}
\begin{document}

\section{Reviewer vhhe}
We are grateful for your time and sincerely appreciate your thoughtful review and your encouraging evaluation of our submission.
\begin{QandA}
   \item Please think about applications for the paper.
     \begin{answered}
     We thank the reviewer for this suggestion; describing possible applications may encourage more people to explore this work. We will add the following text to the introduction to further motivate our proposed method.

     SPvR is particularly well-suited for deployment in resource-constrained environments, such as edge devices and mobile platforms, where model size and inference speed are critical. 
     For example, it can produce lightweight image classification models that run not only on the latest flagship mobile devices but also on older-generation phones.
     It also holds promise for large-scale ML services (MLaaS), where reducing computational overhead can lead to significant cost savings, 
     for example through faster and lighter image segmentation models.

     We would also like to note that Sections 5.4 and 5.5 are dedicated to applications of SPvR to mobile phones and image segmentation tasks.
     \end{answered}
\end{QandA}


\section{Reviewer LeYg}
Thank you for your thoughtful review and positive evaluation. We appreciate your insights and the time you invested in assessing our work.
\begin{QandA}
    \item The computational cost of the local grouping and global ranking modules, especially for very large models or datasets, is not thoroughly analyzed or compared against the overhead of other efficient pruning techniques.
    \begin{answered}
        % If $m_i$ is the number of neurons/filters in layer $i$ and $d$ is the group size then the number of comparisons required to generate all the groups would be $\sum_{i=1}^L\frac{m_i}{d}\log(d)$. 
        % As the group size increases, this quantity becomes significantly smaller. 
        % More importantly, the calculated cost for comparisons to produce the groupings is insignificant when compared to the computation cost for performing a forward pass through a network measured as FLOPS. 
        % Hence, in Section 3.2, we only talk about the number of forward passes required to produce the groupings which is $1$.
        % Similarly, as explained in Section 3.2, the ranking module requires $\sum_{i=1}^L\lceil\frac{m_i}{d}\rceil$ number of forward passes. 
        % Thus, the total number of forward passes required by SPvR to prune a network is $\sum_{i=1}^L\lceil\frac{m_i}{d}\rceil+1$. 
        % This is a one-time cost. 
        % Dynamic pruning methods such as TaylorFO-BN [Molchanov et al., 2019] and ThiNet [Luo et al., 2017]): These methods compute filter importance after each pruning step, often requiring an extra forward/backward pass and ranking for each layer multiple times.
        Let $n$ be the number of samples in a dataset, $b$ be the batch size, $L$ the number of layers in a deep neural network, $m_i$ the number of neurons/filters in $i$-th layer and $t$ the number of epochs. 
    For the CIFAR10 dataset and the VGG16 network, $n=50000$, $b=128$, $L=16$ and $t=200$. 
    The overhead computations for each method are as follows.
    \begin{enumerate}
        \item \textit{$\ell_1$ or $\ell_2$ Norm -} Simply iterate through each layer and prune the filters. 
        Since no data is required, the number of forward passes is $0$.
        
        \item \textit{Taylor (First Order) -}  An iterative process of fine-tuning followed by pruning some filters every $k=10$ mini-batches.
        \begin{equation}
            2\left\lceil\dfrac{n}{b}\right\rceil k = 2\left\lceil\dfrac{50000}{128}\right\rceil \times 10 = 7.82\times 10^3
        \end{equation}

        \item \textit{FPGM -} An iterative process of fine-tuning followed by pruning after every epoch. 
        The pruning process at every epoch requires iterating through the entire network, finding the geometric median in some constant time $c$ (a large positive quantity) for each layer and then pruning some of the filters.
        \begin{equation}
            2\left\lceil\dfrac{n}{b}\right\rceil + Lc = 2\left\lceil\dfrac{50000}{128}\right\rceil + 16c = 0.782\times 10^3 + 16c
        \end{equation}

        \item \textit{RCP -} Generate $N=20$ copies of the same network, randomly prune these copies and then train them. 
        Select the top $N'=10$ models, fine-tune them and select the best-performing pruned network. 
        Finally, re-train this model from scratch.
        We were unable to train the $20$ and $10$ sub-networks in parallel, as recommended by the original paper, due to a lack of resources. 
        Hence, we consider their sequential training.
        \begin{equation}
            2\left\lceil\dfrac{n}{b}\right\rceil t N + 2\left\lceil\dfrac{n}{b}\right\rceil t N' = 2\left\lceil\dfrac{50000}{128}\right\rceil \times 200 \times \left(20+10\right) = 4692\times 10^3
        \end{equation}

        \item \textit{HRank -} Perform $g=500$ forward passes and for each pass, compute the rank of each filter (which takes some large constant time $c$) in every layer, prune the least important filters and finally fine-tune the pruned network. 
        \begin{equation}
            g\sum_{i=1}^Lm_ic = 500\times 4224c = 2112c \times 10^3
        \end{equation}

        \item \textit{CURL -} Perform a forward pass for each filter in the network to compute its importance, prune the least important filters and finally fine-tune the pruned network. 
        \begin{equation}
            \sum_{i=1}^Lm_i = 4.224\times 10^3
        \end{equation}

        \item \textit{NISP -} Find the feature importance of the penultimate layer using all the samples in the dataset, back-propagate these importance scores in a single pass, prune the network and finally fine-tune the pruned model.
        \begin{equation}
            \left\lceil\dfrac{n}{b}\right\rceil + 1 = 0.392 \times 10^3
        \end{equation}

        \item \textit{OTOv2 -} The network is first warmed up for $t_w=100$ epochs, after which the model is trained for $t/2$ epochs. 
        \begin{equation}
            2\left\lceil\dfrac{n}{b}\right\rceil t_w = 2\left\lceil\dfrac{50000}{128}\right\rceil \times 100 = 78.2\times 10^3
        \end{equation}

        \item \textit{SPvR -} Perform a single forward pass to determine filter/neuron groups, then for each filter/neuron group of size $d=4$ perform a forward pass to determine its importance. Finally, prune the least important filters/neurons and train the pruned network.
        \begin{equation}
            1 + \sum_{i=1}^L\dfrac{m_i}{d} = 1 + 1056 = 1.057 \times 10^3
        \end{equation}
    \end{enumerate}
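
    As a quick arithmetic check on the expressions above (a sketch that uses only the CIFAR10 and VGG16 values already stated, together with $\sum_{i=1}^L m_i = 4224$ from the HRank and CURL entries), the term shared by most methods evaluates to
    \begin{equation*}
        \left\lceil\dfrac{n}{b}\right\rceil = \left\lceil\dfrac{50000}{128}\right\rceil = \left\lceil 390.625\right\rceil = 391,
        \qquad
        2\left\lceil\dfrac{n}{b}\right\rceil = 782,
    \end{equation*}
    while the SPvR count with $d=4$ is
    \begin{equation*}
        1 + \dfrac{1}{d}\sum_{i=1}^L m_i = 1 + \dfrac{4224}{4} = 1 + 1056 = 1057.
    \end{equation*}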

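    According to this analysis, $\ell_1$ and $\ell_2$ norm pruning incur no overhead, followed in increasing order by NISP, SPvR, CURL and Taylor (First Order). 
    OTOv2 and RCP are considerably more expensive, with RCP the costliest of the methods whose overhead does not involve $c$; the overheads of FPGM and HRank depend on the constant $c$, and HRank exceeds RCP whenever $c\geq 3$.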
    \end{answered}

    \item The paper does not explore the effectiveness of SPvR for other common tasks where resource-constrained HW is needed, such as keyword spotting, sensor data analysis, or anomaly detection, which have different network architectures and resource constraints. Different applications in Table 1 will bring more community to explore this work.
    \begin{answered}
        We thank the reviewer for this suggestion; describing possible applications may encourage more people to explore this work. We will add the following text to the introduction to further motivate our proposed method.

         SPvR is particularly well-suited for deployment in resource-constrained environments, such as edge devices and mobile platforms, where model size and inference speed are critical. 
         For example, it can produce lightweight image classification models that run not only on the latest flagship mobile devices but also on older-generation phones.
         It also holds promise for large-scale ML services (MLaaS), where reducing computational overhead can lead to significant cost savings, 
         for example through faster and lighter image segmentation models.

         We would also like to note that Sections 5.4 and 5.5 are dedicated to applications of SPvR to mobile phones and image segmentation tasks.
    \end{answered}
\end{QandA}

\section{Reviewer pNH6}
Thank you for taking the time to review our work. We value your feedback and appreciate your constructive criticism, which helps us improve the clarity and quality of our submission.
\begin{QandA}
    \item While, I could not find the exact method as described here during my literature search, More comparison is required with other ranking approaches such as Hrank. I could not understand the reason why this approach would outperform Hrank etc.
    \begin{answered}
        We clarify that our method, SPvR, introduces a structured ranking mechanism that combines local grouping with global evaluation to prune both filters/neurons and entire layers. 
        Unlike most existing methods, SPvR reduces not only the width but also the depth of the network, an aspect that significantly impacts the degree of parallelism and thus the throughput (TOPS) on hardware accelerators, as also noted by Reviewer LeYg. 
        To the best of our knowledge, such a structured pruning approach is novel.
        
        Regarding the comparison with HRank: we have included HRank as a baseline in all our experimental evaluations (see Tables 1 and 3 and Figure 1), and SPvR consistently outperforms HRank across multiple architectures and datasets. 
        We believe this improvement stems from the following two core differences:
        \begin{itemize}
            \item \textit{Rank vs. Response-based Evaluation:} HRank evaluates filters based on the rank of the layerwise activations, which has been shown experimentally to capture redundancy but does not explicitly quantify importance. 
            In contrast, SPvR evaluates each filter or group’s contribution to the model’s output through forward passes, providing a more direct measure of utility.
            \item \textit{Shallow but Wider Architectures:} 
            At higher pruning rates, HRank produces thinner networks while SPvR produces shallower networks that are relatively wider (see Section C in the Supplementary Material). 
            Wide, shallow networks have been shown to outperform their deeper, thinner counterparts [1].
        \end{itemize}
    \end{answered}

    \item I am also not sure about the computational complexity which the authors claim is $O(n)$. Constructing the matrix in grouping module is itself $\Omega(n^2)$ right?
    \begin{answered}
        We thank the reviewer for pointing out this potential confusion. 
        In Section 3.2, we define $n$ as the number of training samples. 
        The grouping module does not construct a matrix over these samples, and hence its complexity is not
        $\Omega(n^2)$. 
        Instead, the grouping operates over the neurons or filters within each layer, and the associated cost is independent of $n$.
        Specifically, for layer $i$ with $m_i$ filters or neurons and a group size of $d$, an efficient implementation of our grouping module requires the following number of comparisons: 
        \begin{equation}
            \begin{split}
                d\sum_{i=1}^L\sum_{j=1}^{k}\log(jd) &= d\sum_{i=1}^L\left(\log(k!) + k\log(d)\right)\\
                &\approx d\sum_{i=1}^L\left(k\log(k)-k + k\log(d)\right)\\
                &=d\sum_{i=1}^L\frac{m_i}{d}\left(\log\left(\frac{m_i}{d}\right)-1 + \log(d)\right)\\
                &=\sum_{i=1}^Lm_i\left(\log(m_i)-1\right)
            \end{split}
        \end{equation}
        Here, $k=\frac{m_i}{d}$. 
        The second line follows from Stirling's approximation.
    \end{answered}
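    \begin{answered}
        To give a sense of scale, consider a single hypothetical layer with $m_i = 512$ filters (an illustrative value, not one taken from the paper) and assume natural logarithms, as in the usual form of Stirling's approximation. The final expression of the derivation then gives approximately
        \begin{equation*}
            m_i\left(\log(m_i) - 1\right) = 512\left(\log(512) - 1\right) \approx 512 \times 5.238 \approx 2.68\times 10^3
        \end{equation*}
        comparisons for that layer, compared with $m_i^2 = 262144 \approx 2.62\times 10^5$ for a quadratic procedure over the same filters. The count does not depend on the number of training samples $n$.
    \end{answered}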
    
    \item Another confusing claim was that - The authors state they train the model from scratch using random initialisation not related to the original initialisation. This isn't a weakness by itself. I believe I am missing something small but important aspect here. Can the authors refute the following argument?
    There is no information as such which is used for training the model post pruning except the number of neurons/layers. This implies that, the authors basically identified a class of smaller models which is tuned for this dataset. This is more in tune with neural architecture search (NAS).
    \begin{answered}
        We appreciate the reviewer’s thoughtful observation and the opportunity to clarify this important point. 

        While it is true that our final model is trained from scratch using a random initialization unrelated to the original weights, the key insight is that the structure of this model is obtained through a data-driven pruning process applied to a trained overparameterized network. 
        This distinguishes our approach from traditional NAS methods, which often search over an architecture space and require training multiple candidate models during search. 
        
        In our case, the structure is discovered by evaluating the importance of groups of neurons/filters in the original network using forward passes on real data. 
        This retains useful dataset-specific inductive biases in the final architecture, even though the weights are re-initialized.
        
        Hence, SPvR can be viewed as a pruning-based compression method rather than a search-based architecture discovery method. 
        The goal is not to discover novel architectures but to identify a compact subnetwork of a pre-trained model with strong inductive alignment to the data. 
    \end{answered}

    \item Can the authors provide the comparison with fine-tuning on pruned model?
    \begin{answered}
        We thank the reviewer for raising this point. We would like to clarify that Table 2 in the main paper provides a direct comparison between fine-tuning and training from scratch on pruned models across multiple datasets. Additionally, Section 5.2 is specifically devoted to analyzing this comparison in greater detail.
    \end{answered}

    \item How can one adjust the pruning ratio using this method? Say one wants to obtain $90\%$ pruning?. What would be the accuracy of the pruned model Resnet50+Imagenet1k if the pruning is increases to $90\%$?
    \begin{answered}
        We thank the reviewer for this insightful question. 
        SPvR is built around the idea that the number of parameters to be pruned is determined by a user-supplied hyper-parameter. 
        As explained in Section 3.2, once the neurons/filters are grouped, the importance of each group across all layers is assessed using the ranking module.
        These groups are globally sorted throughout the model, and the least important ones are pruned away until the desired parameter count is reached. 
        In the case that a user wants to remove $90\%$ of the parameters, the same process is followed until only $10\%$ of the parameters remain. 
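        One way to write this stopping rule down (a sketch only: $P$, $r$, $p_{(j)}$ and $q$ are symbols introduced here purely for illustration and do not appear in the paper) is as follows. Let $P$ be the parameter count of the unpruned network, $r$ the user-supplied pruning ratio, and $p_{(j)}$ the number of parameters removed with the $j$-th least important group in the global ordering. Ignoring any knock-on savings in adjacent layers, the number of pruned groups is
        \begin{equation*}
            q = \min\Bigl\{\, q' : \sum_{j=1}^{q'} p_{(j)} \geq rP \Bigr\},
        \end{equation*}
        so that for $r = 0.9$ pruning stops as soon as at most $0.1P$ parameters remain.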

        Table 4 reports the performance of a $94\%$ pruned ResNet50 model in comparison with MobileNetV3 on the ImageNet1K dataset. 
        Our pruned network achieves a Top-1 score of $65.20\%$, which is $3.3\%$ higher than MobileNetV3-minimal, despite using $20\%$ fewer parameters.
    \end{answered}
\end{QandA}
\textbf{References}

\noindent [1] Zagoruyko, Sergey, and Nikos Komodakis. "Wide Residual Networks." In British Machine Vision Conference 2016. British Machine Vision Association, 2016.
\end{document}