\section{Appendix}
\label{sec:appendix}


\subsection{Pooling layers}
\label{app:pooling}
The training loss and validation accuracy curves for the different pooling types are given in Figures~\ref{fig:trainLoss-pool} and~\ref{fig:valAcc_pool}. Here, \textit{Base model} refers to the baseline LeNet-5, which uses max pooling; \textit{Avg pool} refers to the model in which the max-pooling layers are replaced with average-pooling layers; and \textit{s3p2\_s2} refers to the model in which strided convolutions replace the max-pooling layers, where the first convolution uses $\text{stride}=3$ with $\text{padding}=2$ and the second convolution uses $\text{stride}=2$.

\begin{figure}[htb]
    \centering
    \includegraphics[width=0.9\columnwidth]{Images/2.1_pooling_trainLoss.png}
    \caption{\centering Training loss with different pooling types}
    \label{fig:trainLoss-pool}
\end{figure}

\begin{figure}[htb]
    \centering
    \includegraphics[width=0.9\columnwidth]{Images/2.1_pooling_valAcc.png}
    \caption{\centering Validation accuracy with different pooling types}
    \label{fig:valAcc_pool}
\end{figure}

\subsection{Dropout}
\label{app:dropout}
In our dropout experiments, applying dropout before the fully connected layers performed best. However, both 1D and 2D dropout before the convolution layers gave competitive performance. The performance of the different dropout choices is given in Table~\ref{tab:dropout_app}. The validation accuracy curve for the best dropout value in each setting is given in Figure~\ref{fig:dropout}. \par

\begin{table}[htb]
\caption{Validation accuracy for different dropout settings}
\label{tab:dropout_app}
\centering\renewcommand\cellalign{lc}
\setcellgapes{2pt}\makegapedcells
\resizebox{\columnwidth}{!}{\begin{tabular}{l c c}
    \toprule
    \textbf{Dropout type} & \textbf{Dropout value} & \textbf{Validation accuracy (\%)}\\
    \midrule
    \multirow{3}{*}{1D before FC layers} & 0.1 & 65.79\\
    \cline{2-3}\\
    & 0.3 & 61.23\\
    \cline{2-3}\\
    & 0.5 & 60.56\\
    \midrule
    1D before all layers & 0.1 & 64.10\\
    \midrule
    \multirow{3}{*}{1D before CNN layers} & 0.1 & 63.38\\
    \cline{2-3}\\
    & 0.3 & 60.41\\
    \cline{2-3}\\
    & 0.5 & 57.18\\
    \midrule
    \makecell{1D for FC\\ 2D for CNN layers} & 0.1 & 62.76\\
    \bottomrule
\end{tabular}}
\end{table}

\begin{figure}[b]
    \centering
    \includegraphics[width=0.9\columnwidth]{Images/2.2_Dropout_valAcc.png}
    \caption{\centering Validation accuracy with different dropout settings.}
    \label{fig:dropout}
\end{figure}


\subsection{Batch normalization}
\label{app:BN}
Next, we examine the reason behind the improvement in performance when increasing the batch size (BS) to $80$. The statistical estimation that batch normalization (BN) performs becomes more accurate as the BS increases, since the batch becomes more representative of the distribution of the entire dataset. However, as the BS increases beyond a certain limit, mini-batch SGD behaves more like full-batch gradient descent, which tends to perform sub-optimally on non-convex optimization problems because the reduced gradient noise makes it harder to escape poor local minima. At $BS = 256$, validation accuracy dips to $69.28\%$, compared to $70.352\%$ with a BS of $64$. Figure~\ref{fig:trainLoss-bnbs} presents the training loss curves and Figure~\ref{fig:valAcc_bnbs} the validation accuracy curves of the baseline models that include BN layers with different BSs. Figure~\ref{fig:trainLoss-bnbs} suggests that, as the BS increases, the model's bias increases as well, since the weights are updated based on larger subsets of the dataset that provide a more general view of the population.


Using momentum with the optimizer, which is akin to the concept of velocity, accelerates convergence by taking into account past updates at each update step (weighted average). The momentum parameter determines the weight of the previous updates in calculating the new update. 
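
As a minimal sketch of this idea, assuming the common heavy-ball formulation (the symbols $v_t$, $g_t$, $\mu$, and $\eta$ are our own notation and are not taken from the experiments above), the update with momentum coefficient $\mu$ and learning rate $\eta$ can be written as
\[
    v_t = \mu\, v_{t-1} + g_t, \qquad \theta_t = \theta_{t-1} - \eta\, v_t,
\]
where $g_t$ is the mini-batch gradient at step $t$ and $\theta_t$ are the weights. Unrolling the recursion with $v_0 = 0$ gives
\[
    v_t = \sum_{i=0}^{t-1} \mu^{i}\, g_{t-i},
\]
so a gradient computed $i$ steps ago contributes with weight $\mu^{i}$, and setting $\mu = 0$ recovers plain SGD.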

Finally, BN should ideally have a regularization effect on the model. However, we note that for almost all models with BN layers, even though the validation accuracy improved, the gap between the training and validation accuracies was wider than for the base model with no BN layers. We cannot be certain whether this is a sign of over-fitting. Assuming it is, we speculate that tuning the momentum parameter of the BN layer, which controls the moving average, might help.

\begin{figure}[htb]
    \centering
    \includegraphics[width=\columnwidth]{Images/batch normalization/train-loss-bs.png}
    \caption{\centering Training loss with different batch sizes}
    \label{fig:trainLoss-bnbs}
\end{figure}

\begin{figure}[ht]
    \centering
    \includegraphics[width=\columnwidth]{Images/batch normalization/validation-accuracy-bs.png}
    \caption{\centering Validation accuracy with different batch sizes}
    \label{fig:valAcc_bnbs}
\end{figure}


\subsection{Depthwise separable convolution}
\label{app:depthwiseConv}
Depthwise separable convolutions split the filter into two: a depthwise filter and a pointwise filter. A depthwise operation applies convolution on one channel at a time, preserving the number of channels in the feature map. A pointwise operation applies a convolution filter of size $1 \times 1$ to all channels at once; the depth of the filter is exactly that of the feature map. The combination of these two operations serves as a replacement for the standard convolution operation, and the overall number of computations is greatly reduced.
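
To make the saving concrete, the following is a rough count under standard assumptions: a $K \times K$ kernel, $C_{in}$ input channels, $C_{out}$ output channels, and an $H \times W$ output feature map (these symbols are introduced here for illustration only, and bias terms are ignored). The separable and standard convolutions require, respectively,
\[
    \underbrace{K^{2} C_{in} H W}_{\text{depthwise}} + \underbrace{C_{in} C_{out} H W}_{\text{pointwise}}
    \quad \text{and} \quad
    K^{2} C_{in} C_{out} H W
\]
multiply-add operations, so their ratio is
\[
    \frac{K^{2} C_{in} H W + C_{in} C_{out} H W}{K^{2} C_{in} C_{out} H W} = \frac{1}{C_{out}} + \frac{1}{K^{2}}.
\]
For $K = 3$, this ratio approaches $1/9$ as $C_{out}$ grows, i.e., roughly eight to nine times fewer multiply-adds than the standard convolution.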

\section{Introduction}
Human pose estimation (HPE) is a classical task in computer vision that focuses on representing the orientation of a person by identifying the positions of their joints. HPE can be used to understand and analyze geometric and motion-related information of humans. The stacked-hourglass architecture presented by Newell et al. in \cite{hourglass} is one of the first compelling deep learning-based approaches to HPE, as classical approaches dominated HPE literature prior to it. In this work, repeated bottom-up and top-down processing is utilized to capture information from various scales and intermediate supervision is introduced to iteratively refine predictions at each stage. This led to a significant boost in accuracy as compared to state-of-the-art approaches at the time. \par 

However, HPE often needs to run in real time, since it is frequently used as a precursor to another module. Thus, a focus on computational efficiency is vital in this context. In this study, we implement architectural and non-architectural modifications on the stacked hourglass network to obtain a model that is both accurate and computationally efficient.

In what follows, we provide a brief description of the baseline model. The original architecture is made up of multiple stacked hourglass units, each of which is composed of four downsampling and upsampling levels. At each level, downsampling is achieved through a residual block and a max pooling operation, while upsampling is achieved with a residual block and nearest neighbor interpolation. This process ensures that the model captures both local and global information, which is important for coherently understanding the full body and producing an accurate final pose estimate. At each max pooling step, the network branches off to apply more convolutions through another residual block at the pre-pooling resolution, the result of which is added as a skip connection to the corresponding upsampled feature map in the second half of the hourglass. The output of the model is a heatmap for each joint that models the probability of the joint's presence at each pixel. Intermediate heatmaps are predicted after each hourglass, and a loss is applied to them. In addition, these predictions are projected to a larger number of channels and act as the input to the subsequent hourglass, along with the input of the current hourglass and its feature map outputs. The source code and trained models are available at \href{https://github.com/jameelhassan/PoseEstimation}{https://github.com/jameelhassan/PoseEstimation}.

\section{Design Choices}

\subsection{Depthwise Separable Convolutions}
Depthwise separable convolutions replace traditional convolutions to reduce the number of parameters of the convolution operation. This is done by splitting the convolution into a spatial (depthwise) convolution applied to each channel individually, followed by a pointwise convolution that aggregates information across channels, as in Figure~\ref{fig:separable}.
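
As an illustrative parameter count (the kernel size $K = 3$ and channel counts $C_{in} = C_{out} = 128$ are example values chosen here, and bias terms are ignored), a standard convolution and its depthwise separable replacement have
\[
    K^{2} C_{in} C_{out} = 9 \cdot 128 \cdot 128 = 147{,}456
    \quad \text{and} \quad
    K^{2} C_{in} + C_{in} C_{out} = 1{,}152 + 16{,}384 = 17{,}536
\]
weights, respectively, which is a reduction by a factor of about $8.4$.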


\begin{figure}[ht]
    \centering
    \includegraphics[width=0.55\linewidth]{Images/dw_sep.png}
    \caption{\centering Depthwise separable convolution}
    \label{fig:separable}
\end{figure}

\subsection{Dilated Convolution}
A dilated convolution, described in Eq.~\ref{eq:dilated}, is a variant of a regular convolution operation that can expand the receptive field exponentially without the loss of resolution or coverage that pooling operations incur.

\begin{equation}
    \label{eq:dilated}
    (F*_{l}k)(p) = \sum_{s+lt=p}F(s)k(t)
\end{equation}

where $k$ is a discrete filter, $l$ is the dilation factor, and $*_{l}$ is an $l$-dilated convolution operation. A regular convolution corresponds to a 1-dilated convolution. Dilated convolutions have little to no effect on computational complexity \cite{mask3d}.
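
As a small worked illustration (the symbol $n$ for the filter side length is introduced here and does not appear in Eq.~\ref{eq:dilated}), an $n \times n$ filter with dilation factor $l$ spans a window of side
\[
    n_{\text{eff}} = n + (n-1)(l-1),
\]
while still having only $n^{2}$ weights. For $n = 3$, the dilation factors $l = 1, 2, 3$ give windows of side $3$, $5$, and $7$. Stacking three $3 \times 3$ layers with dilation factors $1$, $2$, and $4$ yields receptive fields of side $3$, $7$, and $15$, whereas three undilated $3 \times 3$ layers reach only $7$.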

\subsection{Ghost Bottleneck}
The Ghost bottleneck proposed in \cite{ghostnet} also reduces the computational complexity of convolution operations, but splits the convolution differently. To produce a fixed number of channels, the Ghost bottleneck outputs a fraction of the channels using regular convolutions, and the rest are produced through cheaper linear operations, as in Figure~\ref{fig:ghost}. These are concatenated and convolved to output the required number of channels.

\begin{figure}[bth]
    \centering
    \includegraphics[width=0.7\linewidth]{Images/ghost.png}
    \caption{Ghost bottleneck}
    \label{fig:ghost}
\end{figure}

\subsection{DiCE Bottleneck}
A Dimension-wise Convolutions for Efficient Networks (DiCE) unit is a convolutional unit proposed by Mehta et al. in \cite{dicenet} that comprises dimension-wise convolutions followed by dimension-wise fusion (Figure~\ref{fig:dicenet}). The convolution operation is applied across each of the three input dimensions (width, height, and depth), and an efficient fusion unit then combines the representations encoded along each of these dimensions. Thus, a DiCE unit can efficiently capture information along the spatial and channel dimensions.

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{Images/dicenet.png}
    \caption{\centering The DiCE unit introduced in \cite{dicenet} compared to separable convolutions}
    \label{fig:dicenet}
\end{figure}

\subsection{Shuffle Bottleneck}
The shuffle unit, first presented in \cite{shufflenet}, uses pointwise \textbf{group} convolutions and channel shuffling to boost computational efficiency while maintaining accuracy.
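
For intuition, the following is a brief sketch under the usual definitions (the symbols $g$, $n$, $C_{in}$, and $C_{out}$ are introduced here for illustration). A pointwise convolution with $g$ groups connects each output channel only to the $C_{in}/g$ input channels of its own group, so its weight count drops from $C_{in} C_{out}$ to
\[
    \frac{C_{in} C_{out}}{g}.
\]
Channel shuffle then permutes the $g\,n$ output channels so that the next group convolution sees channels from every group. Writing a channel index as $a\,n + b$, with group index $a \in \{0, \dots, g-1\}$ and within-group index $b \in \{0, \dots, n-1\}$, the shuffle maps
\[
    a\,n + b \;\mapsto\; b\,g + a,
\]
which corresponds to reshaping the channel axis to $g \times n$, transposing, and flattening.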

\subsection{Perceptual Loss}
The perceptual loss is used to compare similar images with minor differences. Here, we use it as a feature-level mean-squared error (MSE) loss between the two images, computed on a high-level feature map rather than in the original image space. The assumption made here is that if the first hourglass is made to `perceive' what the second hourglass `perceives' at that feature level, the overall performance of the network will improve. The total loss, presented in Equation~\ref{eq:loss}, consists of the perceptual loss and the original prediction losses, with a higher weight on the prediction losses.
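\begin{equation}
    \label{eq:loss}
    \mathcal{L} = \lambda \big( \alpha (\mathcal{L}_{HG1} + \mathcal{L}_{HG2}) + (1-\alpha) \mathcal{L}_{percep} \big)
\end{equation}
where $\mathcal{L}_{HG1}$ and $\mathcal{L}_{HG2}$ are the prediction losses of the first and second hourglass, $\alpha$ weights the prediction losses against the perceptual loss $\mathcal{L}_{percep}$, and $\lambda$ is a global scaling factor.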

\subsection{Residual connections}
We also replace the existing additive residual connections with concatenated residual connections followed by a pointwise convolution to obtain the required number of channels, referred to as $ResConcat$. In addition, we include a residual connection from the narrowest feature map of one hourglass (the neck) to the neck of the next hourglass, referred to as $NarrowRes$.

\input{Tables/results.tex} 

\section{Experiments and Results}
Here, we consider our baseline to be the 2-stack variant of the stacked hourglass architecture, which has a validation mean PCKh score of $59.76\%$. All subsequent experiments were carried out with a learning rate of $10^{-3}$, the RMSprop optimizer, and a batch size of 24 (unless otherwise mentioned) for 20 epochs on a Quadro RTX 6000 GPU (24 workers). The training and evaluation of the model are carried out using the MPII dataset \cite{mpii}, as is done in \cite{hourglass}. This dataset includes a diverse set of $\sim$25k labeled (joints, scale, and center) images of $\sim$40k people. The images are resized to $256 \times 256$ after having been cropped to exclusively include the person who is the target of the labeling.

The evaluation metric used to capture the performance of a model is the percentage of correct keypoints (PCKh@0.5), which counts a predicted joint as correct if it lies within 50\% of the head-bone link length of the true joint. In Table~\ref{tab:results}, the best model (last row) and the best value of each performance metric across all models are shown in bold.
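
As a sketch of how this metric is commonly computed (the notation below is ours and is assumed to follow the standard definition rather than taken from the evaluation code), let $\hat{p}_i$ and $p_i$ be the predicted and ground-truth locations of the $i$-th of $N$ evaluated joints, and let $h_i$ be the head-bone link length of the corresponding person. Then
\[
    \mathrm{PCKh@}0.5 = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[\, \| \hat{p}_i - p_i \|_2 \le 0.5\, h_i \,\right],
\]
where $\mathbf{1}[\cdot]$ is the indicator function. For a hypothetical person whose head-bone link measures $60$ pixels, a predicted joint is counted as correct if it falls within $30$ pixels of the annotation.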


\subsection{Alternative bottlenecks}
\textbf{Ghost Bottleneck:} The $3\times3$ convolution layer in the original bottleneck is replaced with a Ghost module, as shown in Figure~\ref{fig:ghost}. This reduces the number of parameters by more than half; however, the accuracy of the model also suffers. \par

\textbf{Shuffle Bottleneck:} Here, we replace the original bottleneck with the ShuffleNet bottleneck. This drastically reduces the number of parameters to less than 1M, with performance and MAdd count similar to those of the Ghost bottleneck, as seen in Table~\ref{tab:results}. \par

\textbf{DiCE Bottleneck:} Replacing the bottleneck with the DiCE bottleneck reduces the number of parameters considerably, to $3.2M$, with minimal loss in the PCKh score.
 
\subsection{Dilated convolutions}

\begin{figure}
    \centering
    \includegraphics[width=0.72\columnwidth]{Images/dilated-bottleneck.png}
    \caption{\centering Multi-dilated bottleneck}
    \label{fig:multidilated}
\end{figure}

Next, we experiment with dilated convolutions. We first use a bottleneck of two $3\times3$ dilated convolutions with a dilation factor of 2 and a residual connection. PCKh increases to $60.2\%$, which suggests that dilated convolutions can boost performance. We then repeat the same experiment with separable convolutions, which reduces compute from $15.29G$ to $6.78G$ while also reducing accuracy to $59.35\%$. To better balance accuracy and compute, we use a multi-dilated bottleneck (Figure~\ref{fig:multidilated}) like the one proposed in \cite{lightweight}, which uses three parallel $3\times3$ separable convolutions with dilation factors of 1, 2, and 3, respectively, sandwiched between two $1\times1$ convolutions. First, we use this bottleneck exclusively in the hourglasses while keeping the original bottleneck elsewhere, which results in a model that is both \textbf{lighter} than the baseline ($4.2M$ parameters and $7.63G$ multiplication and addition operations (MAdds)) and \textbf{better performing} $(59.89\%)$. In the final experiment, we use the multi-dilated bottleneck as a replacement for all bottlenecks in the network, which reduces both accuracy and compute. Note that we keep the number of channels in the different bottlenecks constant across these experiments.

\begin{figure*}[ht]
\centering
\includegraphics[width=0.75\textwidth]{Images/architecture-1.png}
    \caption{Architecture of the best model}
    \label{fig:architecture}
\end{figure*} 

\subsection{Modified baseline}
We then analyze the impact of different architectural and non-architectural changes on the baseline model. Specifically, we incorporate the following in a stagewise manner:
\begin{itemize}
    \item Perceptual loss $\mathcal{L}_{percep}$
    \item Reduced number of channels
    \item Separable convolutions
    \item $ResConcat$
    \item $NarrowRes$
\end{itemize}

We first add the perceptual loss to the baseline. However, this reduces the PCKh score considerably, by $5.2$. Since we want a lighter model, we then reduce the number of channels in the bottleneck layers, and find the best setting to be $168$--$84$ (down from $256$--$128$ in the baseline). This lowers the accuracy only marginally while requiring less than half the MAdds and parameters. We then use fully separable convolutions in the bottleneck, which further reduces the number of parameters and MAdds while also improving the PCKh score by $0.5$. \par
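
As a rough sanity check on the channel reduction, assume that the cost of the bottleneck convolutions is dominated by terms proportional to the product of the input and output channel counts (an approximation introduced here, which ignores layers whose channel counts are unchanged). Scaling both channel counts by the same factor then scales parameters and MAdds by
\[
    \left( \frac{168}{256} \right)^{2} = \left( \frac{84}{128} \right)^{2} \approx 0.43,
\]
which is consistent with the reduced model needing less than half the parameters and MAdds of the baseline.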


We then apply the $ResConcat$ and $NarrowRes$ residual connection changes to the model. This gives an improvement of $0.23$ with a minor increase in the number of parameters and MAdds, and yields our best model so far. We then include the perceptual loss, which decreases the PCKh score by $4.49$. Here, the perceptual loss is weighted by $1-\alpha=0.3$ (i.e., $\alpha=0.7$ in Equation~\ref{eq:loss}) and the weighted loss is scaled by a factor $\lambda=2$. We hypothesize that, since the loss is now smaller with weights of $0.7$ and $0.3$, a higher learning rate or a larger scaling factor on the loss is required. Tuning these parameters further might enable the perceptual loss to increase the accuracy.
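
To illustrate the effect of these settings, take the prediction losses to carry the weight $0.7$ and the perceptual loss the weight $0.3$ (our reading of the stated emphasis on the prediction losses), with the scaling factor $\lambda = 2$. Equation~\ref{eq:loss} then becomes
\[
    \mathcal{L} = 2 \big( 0.7\, (\mathcal{L}_{HG1} + \mathcal{L}_{HG2}) + 0.3\, \mathcal{L}_{percep} \big)
    = 1.4\, (\mathcal{L}_{HG1} + \mathcal{L}_{HG2}) + 0.6\, \mathcal{L}_{percep},
\]
so, under this reading, the prediction losses enter with an effective coefficient of $1.4$ and the perceptual loss with $0.6$.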

\section{Best Architecture}




After analyzing the results of the different experiments, the choice of the best architecture is still debatable. To determine the best architecture, we compute a metric that is a weighted average of three other metrics (last column of Table~\ref{tab:results}) that jointly reflect the tradeoff between accuracy and compute: the \% change in the number of parameters, the \% change in MAdd operations, and the \% change in mean PCKh with respect to the baseline. Using this metric, we conclude that the architecture in the final row of Table~\ref{tab:results} yields the best model, striking a balance between accuracy and compute towards the goal of a lightweight model.

Thus, the best architecture consists of two stacked hourglasses with a reduced number of channels ($168$--$84$) and depthwise separable convolutions. It also includes the $ResConcat$ and $NarrowRes$ skip connections, together with the scaled, weighted perceptual loss.

\section{Conclusion}
We design a lighter stacked hourglass network with minimal loss in model performance. The lightweight 2-stacked hourglass has a reduced number of channels with depthwise separable convolutions, residual connections with concatenation, and residual connections between the necks of the hourglasses. The final model has a marginal drop in performance with a 79\% reduction in the number of parameters and a similar drop in MAdds. Adding the proposed perceptual loss did not increase performance as originally expected; we hypothesize that using a better loss such as the Kullback--Leibler divergence loss, which is a less harsh criterion than MSE, would make this approach perform well, i.e., ``perceive'' better.

{\small
\bibliographystyle{ieee_fullname}
