% \documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
% \documentclass[accepted]{uai2024_former} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors
\usepackage{makecell}


\input{math_commands.tex}

\def\UrlBreaks{\do\A\do\B\do\C\do\D\do\E\do\F\do\G\do\H\do\I\do\J
\do\K\do\L\do\M\do\N\do\O\do\P\do\Q\do\R\do\S\do\T\do\U\do\V
\do\W\do\X\do\Y\do\Z\do\[\do\\\do\]\do\^\do\_\do\`\do\a\do\b
\do\c\do\d\do\e\do\f\do\g\do\h\do\i\do\j\do\k\do\l\do\m\do\n
\do\o\do\p\do\q\do\r\do\s\do\t\do\u\do\v\do\w\do\x\do\y\do\z
\do\.\do\@\do\\\do\/\do\!\do\_\do\|\do\;\do\>\do\]\do\)\do\,
\do\?\do\'\do+\do\=\do\#}
\usepackage{algorithm}
\usepackage[noend]{algorithmic}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{graphicx}       % graph
\usepackage{subfigure}
\usepackage{wrapfig, blindtext}
\usepackage{caption}        % caption
% \usepackage{subcaption}     % sub-caption
\usepackage[space]{grffile} % include graphics with 'spaced' name
\usepackage{float}
\usepackage{multirow}
\usepackage{comment}
\usepackage{enumitem}
\usepackage{colortbl}


% For theorems and such
\usepackage{amsthm}

% if you use cleveref..
\usepackage[capitalize,noabbrev]{cleveref}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THEOREMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{lma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}

% Todonotes is useful during development; simply uncomment the next line
%    and comment out the line below the next line to turn off comments
%\usepackage[disable,textsize=tiny]{todonotes}
\usepackage[textsize=tiny]{todonotes}


% Attempt to make hyperref and algorithmic work together better:
% \newcommand{\theHalgorithm}{\arabic{algorithm}}
\newcommand{\tabincell}[2]{\begin{tabular}{@{}#1@{}}#2\end{tabular}}  

\title{AutoDrop: Training Deep Learning Models with Automatic Learning Rate Drop}


% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors

% \author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2024 paper}{Jane~J.~von~O'L\'opez}{}}
\author[1]{\href{mailto:<jw5665@nyu.edu>?Subject=AutoDrop: Training Deep Learning Models with Automatic Learning Rate Drop}{Jing Wang}}
\author[1]{\href{mailto:<yt1208@nyu.edu>?Subject=AutoDrop: Training Deep Learning Models with Automatic Learning Rate Drop}{Yunfei Teng}}
\author[1]{\href{mailto:<ac5455@nyu.edu>?Subject=AutoDrop: Training Deep Learning Models with Automatic Learning Rate Drop}{Anna Choromanska}}
% \author[1]{Further~Coauthor}
% \author[3]{Further~Coauthor}
% \author[3,1]{Further~Coauthor}
% % Add affiliations after the authors
\affil[1]{%
Department of Electrical and Computer Engineering, New York University \vspace{-0.3in}
}
% \affil[2]{%
%      yt1208@nyu.edu\\
%     New York University\\
%  }
%  \affil[3]{%
%      ac5455@nyu.edu\\
%     New York University\\
%    }
  
\begin{document}
\maketitle

\begin{abstract}

   Modern deep learning (DL) architectures are trained using variants of the SGD algorithm and typically rely on the user to manually drop the learning rate when the training curve saturates. In this paper, we develop an algorithm, which we call AutoDrop, that performs the learning rate drop automatically and stems from the properties of the learning dynamics of DL systems. Specifically, it is motivated by the observation that the angular velocity of the model parameters, i.e., the velocity of the changes of the convergence direction, for a fixed learning rate initially increases rapidly and then progresses towards soft saturation. At saturation, the optimizer slows down; thus, the angular velocity saturation is a good indicator for dropping the learning rate. After the drop, the angular velocity ``resets'' and follows the pattern described above, increasing again until saturation. AutoDrop is built on this idea and drops the learning rate whenever the angular velocity saturates. The method is simple to implement, computationally cheap, and by design avoids the short-horizon bias problem. We show that AutoDrop achieves favorable performance compared to many different baseline manual and automatic learning rate schedulers, and matches the SOTA performance in all our experiments. On the theoretical front, we claim two contributions: we formulate the learning rate behavior based on the angular velocity and provide a general convergence theory for learning rate schedulers that decrease the learning rate step-wise, rather than continuously as is commonly analyzed.
\end{abstract}

\vspace{-0.15in}
\section{Introduction}\label{sec:intro}
% \vspace{-0.1in}
As data sets grow in size and complexity, it becomes more difficult to extract useful features from them using hand-crafted feature extractors. For this reason, DL frameworks \citep{Goodfellow-et-al-2016} are now widely popular. DL frameworks process input data using multi-layer networks and automatically find high-quality representations of complex data useful for a particular learning task. Today, DL approaches are generally recognized as superior to all alternatives for image \citep{NIPS2012_4824,He2016DeepRL}, speech \citep{DBLP:conf/icassp/Abdel-HamidMJP12}, and video \citep{KarpathyCVPR14} recognition, image segmentation \citep{chen2016deeplab}, and natural language processing \citep{DBLP:conf/emnlp/WestonCA14}. Furthermore, 
% DL is the leading artificial intelligence technology in major tech companies such as Facebook, Google, Microsoft, and IBM, as well as in countless start-ups, where it is used for a plethora of learning problems including content filtering, photo collection management, topic classification, search/ads ranking, video search and indexing, and copyrighted material detection.
DL is the primary AI technology at major tech companies like Facebook, Google, Microsoft, and IBM, as well as numerous startups, utilized for various learning tasks such as content filtering, photo management, topic classification, search/ads ranking, video indexing, and copyright detection.

Setting the values and schedules of the hyperparameters for training DL models is computationally expensive and time-consuming; e.g., a deep model with around ten billion parameters requires roughly $500$ GPUs to be trained in around two weeks \citep{MegatronLM}. Among all hyperparameters used when training DL models, the learning rate schedule is one of the most important \citep{jin2021autolrs}. For most SOTA DL architectures, the learning rate is dropped several times during training at epochs predefined by the user, typically when the training loss is expected to saturate. 
% With the growing sizes of modern architectures, however, performing any manual tuning of the hyperparameters will eventually become prohibitive. More efficient techniques that allow automatic and online setting of hyperparameters translate to substantial savings of resources, time, and money (today the cost of training a single SOTA DL model reaches up to hundreds of thousands of dollars \citep{Money}).   
As modern architectures grow larger, manual hyperparameter tuning becomes impractical. Efficient techniques enabling automatic and online hyperparameter adjustment yield significant resource, time, and cost savings (today the cost of training a single SOTA DL model reaches up to hundreds of thousands of dollars \citep{Money}).

The automatic learning rate schedule is an open and important problem -- having a simple and effective scheme would be very useful, conceptually and practically. This paper addresses the challenge of developing an automatic method for adjusting the learning rate that works in an online fashion during network training. 
Our approach looks at the problem of automatic learning rate schedulers from a different perspective than the prior works. Existing automatic learning rate schedulers \citep{donini2020marthe, yang2019228, gunes2018online, luca2017hyper, retsinas2022trainable} are gradient-based meta-optimization methods that treat the learning rate as a trainable parameter. They suffer from the short-horizon problem \citep{wu2018understanding}, which arises when the optimizer becomes overly greedy and focuses solely on minimizing the loss at the current state.
The basis for our approach is rooted in a novel concept. We ask: what are good descriptors of the learning dynamics of DL systems that can guide the automatic learning rate drop? We find that the angular velocity of the model parameters, defined at the end of this section, is an excellent indicator of the dynamics of the convergence of an optimizer and can be easily used to trigger the learning rate drop during network training. Our algorithm, AutoDrop, drops the learning rate whenever the angular velocity saturates. The resulting algorithm is extremely simple, can be used on top of any DL optimizer (SGD \citep{bottou-98x}, momentum SGD \citep{Polyak1964}, ADAM \citep{kingma2015}, etc.), and enjoys an elegant theoretical framework. 
Moreover, since AutoDrop decays the learning rate only if the optimizer starts to oscillate around the minima, it avoids the short-horizon bias problem that affects other automatic learning rate techniques.
% We empirically demonstrate that our method either matches or accelerates the training of DL models and leads to comparable or better generalization compared to SOTA techniques. 
We empirically demonstrate that our method matches SOTA techniques in training DL models and leads to comparable or better generalization. 

% At the same time, AutoDrop does not require any extra hyper-parameter tuning compared to vanilla optimization schemes, like SGD. Specifically, the additional hyper-parameters that the method introduces (namely the window size for computing the angular velocity, the learning rate drop factor, and the buffer size for Gaussian smoothing) are fixed across all our experiments (with different data sets and architectures) and we provide both a discussion justifying their setting as well as illustrative ablation study. This makes our method quite unique since, to the best of our knowledge, all other existing automatic learning rate schedulers introduce extra hyper-parameters that need to be tuned for every data set and architecture. Additionally, since AutoDrop decays the learning rate only if the optimizer starts to oscillate around the minima, it avoids the short-horizon bias problem that stigmatizes other techniques.

Finally, we claim two important theoretical contributions. 
% Firstly, we formulate the learning rate behavior based on the angular velocity model that we propose. Secondly, we develop a general convergence proof technique that not only supports 
Firstly, we formulate the learning rate behavior using our proposed angular velocity model. Secondly, we develop a general convergence proof technique that not only supports
AutoDrop (Theorem~\ref{thm:sgdm_conv}), but is also applicable to any learning rate scheduler that decreases the learning rate step-wise. Most proofs for gradient-based methods require the learning rate to decrease continuously. Our theorems instead support discrete learning rate drops.

This paper is organized as follows: Section \ref{sec:relate} discusses the related work, Section \ref{sec:ME} builds an intuition for understanding our algorithm based on simple examples, Section \ref{sec:Alg} shows our algorithm, Section \ref{sec:Theory} captures the theoretical guarantees, Section \ref{sec:ER} presents experimental results, and Section \ref{sec:Con} concludes the paper. 

% \vspace{-0.05in}
\begin{definition}[Angular velocity]\label{def:av}
Define the angular velocity of model parameters as:

\vspace{-0.05in}
\begin{equation}
\omega_i = \angle(s_i,s_{i-1}), \:\:\:\text{where}\:\:\:s_i = x_{i+1} - x_i
\label{eq:defAV}
\end{equation}
% \vspace{-0.3in}

and $x_i$ is the parameter vector at the end of the $i^{\text{th}}$ iteration. The operator $\angle(\cdot, \cdot)$ calculates the angle between two vectors and is defined as:

\vspace{-0.3in}
\begin{equation}
\angle(s_i,s_{i-1}) = \frac{180^{\circ}}{\pi} \cdot \arccos\left(\frac{s_i^\intercal s_{i-1}}{\|s_i\| \|s_{i-1}\| + \epsilon}\right),
\label{eq:DefAG}
%\vspace{-0.1in}
\end{equation}
% \vspace{-0.3in}
% \newpage
where $\epsilon$ is a small positive number preventing division by zero\footnote{$\epsilon$ is omitted in the theoretical derivations.}\footnote{An interpretation of angular velocity can be found in the Supplement (Section \ref{sec:int_av2}).}.
\label{def:defAV}
\end{definition}
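
\noindent As a small worked illustration of Definition~\ref{def:av} (a toy example with hypothetical two-dimensional steps, not taken from our experiments, and with $\epsilon$ ignored), let $s_{i-1}=(1,0)^\intercal$. Then $s_i=(1,0)^\intercal$ gives $\omega_i=0^{\circ}$, $s_i=(0,1)^\intercal$ gives $\omega_i=90^{\circ}$, and $s_i=(-1,\sqrt{3})^\intercal$ gives
\begin{equation*}
\omega_i = \frac{180^{\circ}}{\pi}\cdot\arccos\left(\frac{-1}{2\cdot 1}\right) = 120^{\circ},
\end{equation*}
since $s_i^\intercal s_{i-1}=-1$, $\|s_i\|=2$, and $\|s_{i-1}\|=1$. Angles above $90^{\circ}$ therefore indicate that consecutive steps partially cancel each other.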

\section{Related Work}\label{sec:relate}
\vspace{-0.1in}

In this section, we summarize different types of learning rate schedulers and divide them into four main categories. \textit{Scheduling-based methods} rely on carefully designed learning rate schedules that are tailored to the non-convex nature of deep learning optimization. 
Cyclical learning rate (CLR) \citep{smith2017cyclical} uses a cyclical learning rate pattern to train DL models and applies a triangular learning rate policy in each cycle (that is, it first increases and then decreases the learning rate linearly within the cycle) to potentially allow for a more rapid traversal of saddle point plateaus. \citet{smith2018super} extend CLR to the super-convergence policy OneCycle, which has only one triangular cycle. \citet{li2019exponential} exponentially decrease the learning rate and achieve better performance than with a constant learning rate. \citet{agarwal2021acceleration} utilize Chebyshev polynomials to construct the Chebyshev learning rate schedule, aimed at accelerating vanilla gradient descent. They illustrate that addressing instability issues results in a fractal ordering of step sizes.
% More specifically, it was proposed in \citep{smith2017cyclical} to use the cyclical learning rate pattern to train DL models and apply a triangular learning rate policy in each cycle (that is, first increase and then decrease the learning rate linearly in the cycle) to potentially allow for a more rapid traversal of saddle point plateaus. 
% This idea was further extended to the super-convergence policy \citep{smith2018super} where there is only one triangular cycle for the entire training process. 
% This concept was also applied to other hyperparameters, e.g.:, momentum coefficient  \citep{smith2018disciplined}. Cyclical learning rates were also used in \citep{loshchilov2016sgdr}, where the authors combine them with restart techniques when training deep neural networks. The authors decreased the learning rate from a maximum value to a minimum value using a cosine annealing scheme, and then periodically restarted the process. All these methods define the learning rate policy manually (this process heavily depends on the properties of model and data set), thus they constitute deterministic scheduling methods. 
Another approach~\citep{pesme2020convergence} builds on the Convergence-Diagnostic algorithm~\citep{pflug1990non,chee2018convergence}, which examines the running average of the inner products of successive gradients to develop a stopping criterion for the optimizer. The authors expand this idea to build an automatic learning rate adjustment mechanism that decreases the learning rate when a negative inner product is detected. In \citet{jin2021autolrs}, a Gaussian process surrogate model is employed to link the learning rate and the expected validation loss. The approach iteratively updates a posterior distribution of the validation loss and dynamically searches for the optimal learning rate based on this posterior. Careful design of an acquisition function and a forecasting model is necessary to ensure accurate prediction of the validation loss posterior.

Another group of techniques comprises \textit{hypergradient-based methods} \citep{donini2020marthe, yang2019228, gunes2018online, luca2017hyper}, which optimize both the model parameters and the learning rate simultaneously.
The authors of these methods typically introduce a hypergradient that is defined as the gradient of the validation error with respect to the learning rate schedule. The learning rate is optimized online via gradient descent.
These techniques, however, are quite sensitive to the choice of the hyperparameters. Recently, \citet{retsinas2022trainable} presented a second-order hypergradient method that removes extra hyperparameters from training. However, as indicated in \citep{wu2018understanding}, all hypergradient methods struggle to reach SOTA performance due to short-horizon bias: these methods naturally choose the step size that minimizes only the short-term loss, and thus the optimizer tends to ignore the flat regions of an ill-conditioned loss surface. A comprehensive discussion on this matter is included in \citep{wu2018understanding}. 
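
To make the short-horizon effect concrete, consider the following simplified one-step hypergradient rule. This is a sketch for illustration only: it uses a generic training loss $L$, parameters $x_t$, learning rate $\alpha$, and a hypothetical meta step size $\beta$, and it is not the exact formulation of any of the cited methods. If $x_t = x_{t-1}-\alpha\nabla L(x_{t-1})$, then
\begin{align*}
\frac{\partial L(x_t)}{\partial \alpha} &= \nabla L(x_t)^\intercal\,\frac{\partial x_t}{\partial \alpha} = -\nabla L(x_t)^\intercal\nabla L(x_{t-1}),\\
\alpha &\leftarrow \alpha - \beta\,\frac{\partial L(x_t)}{\partial \alpha} = \alpha + \beta\,\nabla L(x_t)^\intercal\nabla L(x_{t-1}).
\end{align*}
Such a rule looks only one step ahead: it shrinks $\alpha$ as soon as consecutive gradients point in opposing directions, irrespective of the progress a larger step would make over a longer horizon.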

\textit{Hyperparameter optimization methods} aim to automatically find a good set of hyperparameters offline. They either build explicit regression models to describe the dependence of
target algorithm performance on hyperparameter settings \citep{hutter2011}, or optimize hyperparameters by performing random search along with greedy sequential methods based on the expected improvement criterion \citep{bergstra2011}, or use a bandit-based approach for hyperparameter selection \citep{li2018hyperband}. These techniques can be combined with Bayesian optimization \citep{falkner2018, zela-automl18}. Recently, several parallel methods have also been proposed for hyperparameter tuning \citep{jaderberg2017population, li2020population, holder2020prov, li2020system}. The hyperparameter optimization methods are computationally expensive in practice. 

Popular \textit{adaptive learning rate optimizers} adjust the learning rate for each parameter individually based on gradient information from previous iterations. AdaGrad \citep{duchi2011} updates each parameter using a different learning rate that is proportional to the inverse square root of the accumulated past squared gradients of the parameter. Thus, the parameters associated with larger accumulated squared gradients have smaller step sizes. This enables the model to learn infrequently occurring features, which may be highly informative and discriminative. The major weakness of AdaGrad is that the learning rates continually decrease during training and eventually become too small for the model to learn. Later on, RMSprop \citep{tieleman2012lecture} and Adadelta \citep{zeiler2012adadelta} were proposed to resolve the issue of the diminishing learning rate in AdaGrad. Instead of directly summing the past squared gradients, both methods maintain an exponential moving average of the squared gradients, which is used to scale the learning rate of each parameter. The exponential average of the squared gradients can be considered an approximation to the second moment of the gradients. Going one step further, ADAM \citep{kingma2015} estimates both the first and second moments of the gradients and uses them together to update the parameters. To summarize, adaptive learning rate optimizers adjust the step size for each parameter independently based on the gradient information from past iterations in order to speed up convergence compared to vanilla SGD. These methods still require a universal learning rate to adjust the overall step sizes. AutoDrop is designed to automatically update this universal learning rate and thus can be applied on top of this class of optimizers.
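
For reference, the per-parameter updates described above can be written in their commonly used forms as follows. The symbols $g_t$ (stochastic gradient), $\rho,\beta_1,\beta_2$ (decay rates), and $\delta$ (a small stabilizing constant) are introduced here for illustration only, all operations are element-wise, and the exact placement of $\delta$ varies between implementations:
\begin{align*}
\text{AdaGrad:}\quad G_t &= G_{t-1} + g_t^2,\\
x_{t+1} &= x_t - \frac{\alpha}{\sqrt{G_t}+\delta}\, g_t,\\
\text{RMSprop:}\quad v_t &= \rho\, v_{t-1} + (1-\rho)\, g_t^2,\\
x_{t+1} &= x_t - \frac{\alpha}{\sqrt{v_t}+\delta}\, g_t,\\
\text{ADAM:}\quad m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t,\\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2,\\
x_{t+1} &= x_t - \alpha\,\frac{m_t/(1-\beta_1^t)}{\sqrt{v_t/(1-\beta_2^t)}+\delta}.
\end{align*}
In each case $\alpha$ plays the role of the universal learning rate, which is the quantity a learning rate scheduler adjusts.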

Finally, some works focus on novel strategies involving gradient computations in order to enhance the performance of optimizers, e.g.,~\citep{cohen2020gradient, zhang2019lookahead,izmailov2018averaging}. In our method, we do not change the computation of gradients but put forward a novel automatic learning rate scheduler. AutoDrop can therefore be applied on top of some of these techniques.

% \vspace{-0.07in}
\section{Motivating Example}
\label{sec:ME}
% \vspace{-0.13in}

We analyze the properties of the angular velocity for a noisy quadratic model. While simple, this model is used as a proxy for
analyzing neural network optimization \citep{schaul2013pesky,pmlr-v37-martens15,zhang2019lookahead}.
% \citep{schaul2013pesky,pmlr-v37-martens15,zhang2019lookahead,zhang2019algorithmic}.

\vspace{-0.02in}
\begin{definition}[Noisy Quadratic Model]
We use the same model as in \citep{zhang2019lookahead}. The model is represented by the following loss function
\vspace{-0.05in}
\begin{align}\label{eq:loss}
    L(x)=\frac{1}{2}(x-c)^\intercal A(x-c),
\end{align}
\vspace{-0.25in}

where $c\sim N(x^*,\Sigma)$ and both $A$ and $\Sigma$ are diagonal. Without loss of generality, we assume $x^*=0$. 
\label{def:NQM}
\end{definition}

\vspace{-0.1in}

The update formula for the gradient descent at the step $t+1$  is given as ($\alpha$ is the learning rate):
\begin{align*}
    x_{t+1}\!=x_t-\alpha\nabla L(x_t)=x_t-\alpha A(x_t-c_t),c_t\sim N(0,\Sigma).
    \vspace{-0.05in}
\end{align*}
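Because $A$ and $\Sigma$ are diagonal, this update decouples across coordinates. As a short worked step (a sketch, writing $x_{t,i}$ and $c_{t,i}$ for the $i$-th coordinates, $a_i$ for the $i$-th diagonal entry of $A$, and $\sigma_i^2$ for the variance of $c_{t,i}$, and assuming that $c_t$ is drawn independently of $x_t$ and that $0<\alpha a_i<2$):
\begin{align*}
x_{t+1,i} &= (1-\alpha a_i)\,x_{t,i} + \alpha a_i\, c_{t,i},\\
\mathbb{E}[x_{t+1,i}^2] &= (1-\alpha a_i)^2\,\mathbb{E}[x_{t,i}^2] + \alpha^2 a_i^2\sigma_i^2,
\end{align*}
so the second moment converges to the fixed point $\frac{\alpha a_i\sigma_i^2}{2-\alpha a_i}$. The iterates therefore do not settle at $x^*=0$ but fluctuate in a neighborhood of it whose size shrinks with $\alpha$.
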
We optimize the noisy quadratic model with $x\in\mathbb{R}^{200}$ and $A=\mathrm{diag}(\frac{1}{10},\frac{2}{10},\ldots,\frac{200}{10})$ using Gradient Descent (GD), where $\alpha = [0.06, 0.03, 0.01, 0.001]$. Figure~\ref{fig:NQM} reveals the following properties:

\begin{figure*} 
\begin{minipage}[t]{0.5\linewidth} 
\centering 
\includegraphics[trim={0 0 0 1.3cm},clip,width=.49\textwidth]{Figures/Loss_A3_main.jpg}
\includegraphics[trim={0 0 0 1.3cm},clip,width=.49\textwidth]{Figures/Angle_A3_main.jpg}
% \caption{The behavior of the loss and angular velocity for noisy quadratic model. An optimizer is run with different settings of the learning rate $\alpha=[0.06, 0.03, 0.01, 0,001]$. Angular velocity is averaged over $20$ iterations.} 
\vspace{-0.15in}
\caption{Loss and angular velocity with fixed learning rate for the noisy quadratic model.}
\label{fig:NQM} 
\end{minipage}% 
\hfill
\begin{minipage}[t]{0.48\linewidth} 
\centering 
\includegraphics[trim={0.25cm 0 0.3cm 1cm},clip,width=0.49\textwidth]{Figures/Loss_main3.jpg}
% \hspace{-0.05in}
\includegraphics[trim={0.2cm 0 0.3cm 1cm},clip,width=0.49\textwidth]{Figures/Angle_main3.jpg}
% \vspace{-0.14in}
% \caption{The behavior of the loss and angular velocity for the noisy quadratic model. An optimizer is using an automatic drop of the learning rate guided by the saturation of the angular velocity. Angular velocity is averaged over $20$ iterations.} 
\vspace{-0.15in}
\caption{Loss and angular velocity with dropped learning rate for the noisy quadratic model.}
\vspace{-0.55in}
\label{fig:NQM2} 
\end{minipage} 
\end{figure*}

\begin{itemize}[noitemsep,topsep=0pt,parsep=0pt,partopsep=0pt, leftmargin=0.3in]
    \item [(P1)] \textbf{Angular velocity saturation:} the angular velocity curves\footnote{For the noisy quadratic model, the angular velocity (given in Definition~\ref{def:defAV}) is computed with respect to one iteration, rather than an epoch, as for this model there is no notion of an epoch.} tend to saturate as the training proceeds, and furthermore, when the angular velocity enters the saturation phase, the optimizer slows down its convergence,
    \item [(P2)] \textbf{Angular velocity saturation levels:} i) if the learning rate is so large that the algorithm cannot converge to the optimum, the angular velocity saturates at a level between $90$ degrees and $120$ degrees; ii) as the learning rate decreases, and the algorithm systematically converges closer to the optimum, the angular velocity saturates at progressively lower levels; iii) a smaller learning rate leads to a slower saturation of the angular velocity; iv) when the learning rate is low enough that the algorithm can converge to the optimum, the angular velocity saturates at $90$ degrees.  
\end{itemize}

\begin{figure*}[t]
\vspace{-0.1in}
\centering
\includegraphics[width=.3\textwidth]{Figures/constant_lrs/cifar10_angle_velocities_s=1.jpg}
\includegraphics[width=.3\textwidth]{Figures/constant_lrs/cifar10_train_losses_s=1.jpg}
\includegraphics[width=.3\textwidth]{Figures/constant_lrs/cifar10_test_losses_s=1.jpg}
\vspace{-0.14in}
\caption{The behavior of the loss and angular velocity for an exemplary DL problem (training ResNet-18 on CIFAR-10). An optimizer is run with different settings of the learning rate $\alpha=[0.3, 0.1, 0.03, 0.01, 0.003]$. Angular velocity is calculated over a single epoch.}
\label{fig:DeepME}
\vspace{-0.2in}
\end{figure*}


These empirical properties can be theoretically justified, as shown in the next theorem. Note that we discuss the bound for the cosine value of the angular velocity in Theorem~\ref{thm_1} because the mapping between the cosine value and the angle is a bijection and the cosine value is more amenable to quantitative analysis.

\vspace{-0.05in}
\begin{theorem}\label{thm_1}
Let the $i$-th diagonal terms of matrices $A$ and $\Sigma$ in the noisy quadratic model be given as $a_i$ and $\sigma_i^2$, respectively. Then, the expected inner product $\langle s_{t},s_{t+1}\rangle$ converges to
\vspace{-0.15in}
\begin{align}
    I^*=\lim_{t\to\infty}\mathbb{E}[\langle s_{t},s_{t+1}\rangle]=-\alpha^3\sum\nolimits_{i=1}^n\frac{a_i^3\sigma_i^2}{2-\alpha a_i}.
\end{align}
\vspace{-0.3in}

Moreover, the cosine value of an angle between two consecutive steps $\cos\angle (s_t,s_{t+1})$ satisfies
% \begin{align}
%     C^*=\lim_{t\to\infty}\mathbb{E}[\cos\angle (s_t,s_{t+1})]\gtrapprox -\frac{\alpha\max_i a_i}{2}.
% \end{align}
\vspace{-0.17in}
\begin{align}
      C^*=&\lim_{t\to\infty}\mathbb{E}[\cos\angle(s_t,s_{t+1})]\approx-\frac{\alpha}{2}\frac{\sum_{i=1}^n\frac{a_i^3\sigma_i^2}{2-\alpha a_i}}{\sum_{i=1}^n\frac{a_i^2\sigma_i^2}{2-\alpha a_i}}\\
      &\geq-\frac{\alpha\max_i a_i}{2}
  \end{align}
  \vspace{-0.3in}

 $C^*\in [-\frac{1}{2}, 0]$ and thus $\lim_{t\to\infty}\angle (s_t,s_{t+1})\in[90^{\circ},120^{\circ}]$.
\end{theorem}

\vspace{-0.1in}
Theorem~\ref{thm_1} implies that as training proceeds, the angular velocity eventually saturates as stated in property P1.
Theorem~\ref{thm_1} furthermore shows that decreasing the learning rate causes the angle between $s_{t}$ and $s_{t+1}$ to converge to a smaller value. Also, from Theorem~\ref{thm_1}, $I^*=\lim_{t\to\infty}\mathbb{E}[\langle s_{t},s_{t+1}\rangle]=-\sum_{i=1}^n(\alpha a_i)^3\sigma_i^2\left[\frac{1}{2-\alpha a_i}\right]$. When $\alpha a_i$ ($i=1,\dots,n$) is small enough, $I^*$ is negligible, which implies that $s_t$ is approximately orthogonal to $s_{t+1}$. In other words, the angle between $s_{t}$ and $s_{t+1}$ converges to $90$ degrees for a small enough learning rate, whereas for larger learning rates it saturates above $90$ degrees. Furthermore, the limit $C^*$ of the cosine is approximately bounded below by $-\frac{1}{2}$, so the saturation level of the angular velocity should not exceed $120$ degrees. Together, these observations support property P2 (in particular points i), ii), and iv); point iii) remains an empirical observation).
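
The lower bound on $C^*$ in Theorem~\ref{thm_1} can be read as a weighted-average argument. The short sketch below is our own restatement rather than part of the formal proof; it assumes $0<\alpha a_i<2$ for all $i$, so that the weights $w_i$ (a symbol introduced here only for illustration) are non-negative and not all zero:
\begin{align*}
w_i:=\frac{a_i^2\sigma_i^2}{2-\alpha a_i},\qquad
\frac{\sum_{i=1}^n\frac{a_i^3\sigma_i^2}{2-\alpha a_i}}{\sum_{i=1}^n\frac{a_i^2\sigma_i^2}{2-\alpha a_i}}
=\frac{\sum_{i=1}^n w_i a_i}{\sum_{i=1}^n w_i}\leq\max_i a_i .
\end{align*}
Multiplying by $-\alpha/2$ reverses the inequality, so $C^*$ is approximately bounded below by $-\alpha\max_i a_i/2$. As a hypothetical numerical illustration, for $n=1$ the ratio equals $a_1$, so $\alpha a_1=0.1$ yields $C^*\approx-0.05$, i.e., a limiting angle of about $92.9^{\circ}$, only slightly above $90^{\circ}$.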


We next verify empirically whether these observations carry over to the non-convex DL setting in a simple experiment reported in Figure~\ref{fig:DeepME}. In stochastic optimization for deep learning, the empirical loss at each iteration is an unbiased estimate of the true objective with some variance. Therefore, the angular velocity defined above, when computed from the parameters at individual iterations, would suffer from excessive variance. To address this problem, we replace Definition~\ref{def:av} with Definition~\ref{def:av2}, which introduces a sliding window of size $k$ and uses the mean of the parameters within the window as $x_i$.

\vspace{-0.05in}
\begin{definition}[Batch angular velocity]
\label{def:av2}
Define the angular velocity of model parameters as:
\vspace{-0.1in}
\begin{equation}
\omega_i = \angle(s_i,s_{i-1}), \:\:\:\text{where}\:\:\:s_i = x_{i+1} - x_i
\label{eq:defAV}
\end{equation}
\vspace{-0.35in}

and $x_i$ is the mean of the parameter vector in $[ik, (i+1)k)$ iterations, where $k$ is the size of the sliding window. The operator $\angle(\cdot, \cdot)$ calculates the angle between two vectors and is defined as:
\vspace{-0.1in}
\begin{equation}
\angle(s_i,s_{i-1}) = \frac{180^{\circ}}{\pi} \cdot \arccos\left(\frac{s_i^\intercal s_{i-1}}{\|s_i\| \|s_{i-1}\| + \epsilon}\right).
\label{eq:DefAG}
\end{equation}
\vspace{-0.3in}
\end{definition}
\vspace{-0.05in}
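
As a quick sanity check of Equation~\ref{eq:DefAG}, consider two hypothetical steps in $\mathbb{R}^2$ (chosen by us purely for illustration) and neglect the small constant $\epsilon$:
\begin{align*}
&s_{i-1}=(1,0)^\intercal,\quad s_i=(-1,\sqrt{3})^\intercal\\
&\Rightarrow\;\;
\frac{s_i^\intercal s_{i-1}}{\|s_i\|\,\|s_{i-1}\|}=\frac{-1}{2\cdot 1}=-\frac{1}{2},
\quad \angle(s_i,s_{i-1})=120^{\circ}.
\end{align*}
Replacing $s_i$ by $(0,1)^\intercal$ gives a zero inner product and hence $\angle(s_i,s_{i-1})=90^{\circ}$. These two cases correspond to the endpoints of the range $[90^{\circ},120^{\circ}]$ stated in Theorem~\ref{thm_1}.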

For simplicity of analysis, we take the window size $k$ to be the number of iterations in one epoch. Clearly, property P1 holds, whereas property P2 is only partially satisfied. In particular, conclusion iii) is violated, as the angular velocity may not reach $90$ degrees.
% Also, in a DL setting one can observe that for lower learning rates the angular velocity curves become more noisy at saturation, which was not the case for a noisy quadratic model. 

% \begin{figure*}[t]
% % \vspace{-0.1in}
% \centering
% \includegraphics[width=.325\textwidth]{Figures/constant_lrs/cifar10_angle_velocities_s=1.jpg}
% \includegraphics[width=.325\textwidth]{Figures/constant_lrs/cifar10_train_losses_s=1.jpg}
% \includegraphics[width=.325\textwidth]{Figures/constant_lrs/cifar10_test_losses_s=1.jpg}
% % \vspace{-0.14in}
% \caption{The behavior of the loss and angular velocity for an exemplary DL problem (training ResNet-18 on CIFAR-10). An optimizer is run with different settings of the learning rate $\alpha=[0.3, 0.1, 0.03, 0.01, 0.003]$. Angular velocity is calculated over a single epoch.}
% \label{fig:DeepME}
% % \vspace{-0.1in}
% \end{figure*}


Property P1 is the key observation underlying our algorithm. The saturation of the angular velocity can guide the drop of the learning rate: given a lower bound on the learning rate, each time the angular velocity saturates, the learning algorithm should decrease the learning rate. Tracking the saturation of the angular velocity is more practical than tracking the saturation of the loss function since, as can be clearly seen in Figure~\ref{fig:NQM}, angular velocity curves follow a much sharper saturation pattern. Also, unlike the angular velocity, the loss function does not necessarily have a bounded range.
We describe the algorithm based on property P1 in Section~\ref{sec:Alg}. Property P2 is crucial for the theoretical analysis provided in Section~\ref{sec:Theory}.
% We found that property P1 is sufficient to design an optimization algorithm for training DL models. The algorithm is described in Section~\ref{sec:Alg}. Property P2 is crucial for the theoretical analysis provided in Section~\ref{sec:Theory}.



Before moving on to the algorithmic design, we briefly explain the mechanism behind the difference in behavior between the noisy quadratic model (NQM) and the DL model. The DL model does not approach the $90$-degree saturation level that the NQM can achieve because the loss surface of the NQM is convex quadratic, whereas DL models have highly non-convex loss surfaces, which makes it very difficult to find the global optimum with loss $0$. However, note that the saturation levels for the DL model, similarly to the NQM, still lie in the range $[90^{\circ},120^{\circ}]$.

Following the above intuition, we implement a simple algorithm for optimizing the noisy quadratic model. The algorithm drops the learning rate by a factor of $2$ when the angular velocity saturates (i.e., the change of the angular velocity, averaged over $20$ iterations, is smaller than $0.01$ degrees between $2$ consecutive iterations). The initial learning rate was set to $0.06$ and the minimal one to $0.001$. Figure~\ref{fig:NQM2} shows the results: the algorithm that uses the angular velocity to guide the drop of the learning rate indeed converges to the optimum.
% \begin{figure}[t]
% % \vspace{-0.2in}
% \centering
% \includegraphics[trim={0.25cm 0 0.3cm 1cm},clip,width=0.23\textwidth]{Figures/Loss_main3.jpg}
% % \hspace{-0.05in}
% \includegraphics[trim={0.2cm 0 0.3cm 1cm},clip,width=0.23\textwidth]{Figures/Angle_main3.jpg}
% % \vspace{-0.25in}
% \caption{The behavior of the loss and angular velocity for the noisy quadratic model. An optimizer is using an automatic drop of the learning rate guided by the saturation of the angular velocity. Angular velocity is averaged over $20$ iterations.}
% \label{fig:NQM2}
% % \vspace{-0.15in}
% \end{figure}
This simple algorithm led us to derive a method for optimizing DL models with an automatic learning rate drop, which we refer to as AutoDrop. The method is a straightforward extension of the above algorithm and is described in the next section. The extension accommodates the fundamental difference that we observed between the noisy quadratic model and the DL model: in the case of DL models, lower learning rates lead to larger noise in the angular velocity at saturation.
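
To make the schedule of the simple noisy-quadratic experiment concrete, the halving rule can be unrolled by hand. The sketch assumes that the minimal learning rate acts as a floor, i.e., $\alpha\leftarrow\max\{0.001,\alpha/2\}$, which mirrors the update used in Algorithm~\ref{alg:AD}:
\begin{align*}
0.06\to 0.03\to 0.015\to 0.0075\to 0.00375\to 0.001875\to 0.001 .
\end{align*}
Under this assumption the floor is reached at the sixth drop, since $0.06\cdot 2^{-6}=0.0009375<0.001$, and every later saturation event leaves the learning rate unchanged.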

\vspace{-0.1in}
\section{Algorithm}\label{sec:Alg}
\vspace{-0.12in}

\begin{algorithm}[t]
\caption{AutoDrop} 
\floatname{algorithm}{Algorithm}
\label{alg:AD}
\begin{algorithmic}
\REQUIRE 
\STATE $\alpha_0$ and $\underline{\alpha}$: initial learning rate of the optimizer and its lower bound\\
$\rho$: learning rate drop factor\\
$x_0$ : initial model parameter vector \\$Gau(\cdot;\sigma,m)$: Gaussian filter with smoothing factor $\sigma$ and smoothing buffer size $m$\\
$k$ : sliding window size for computing the batch angular velocity.
\STATE \vspace{-0.13in}
\STATE $\alpha \leftarrow \alpha_0$, $s_0 \leftarrow 0$, $t \leftarrow 0$; $i\gets0$
\STATE $\mathcal{B}\leftarrow\{\}$\:\:\:\textsf{//Create angular velocity buffer}
\FOR{$t=1,\dots,T$}
  \STATE Update the parameter $x_t$ with learning rate $\alpha$
  \IF{$t \text{ mod } k = 0$}
    \STATE $y_i\leftarrow\frac{1}{k}\sum_{j=ik}^{(i+1)k-1}x_j$; $s_i\leftarrow y_i-y_{i-1}$; $\omega_i\leftarrow\angle(s_i,s_{i-1})$
    \STATE $\mathcal{B}\leftarrow\mathcal{B}\cup\{\omega_i\}$
    \IF{$len(\mathcal{B})\geq 10$}
        \STATE $\sigma\leftarrow\min(\mathrm{std}(\mathcal{B}),m/2)$
        \STATE $\mathcal{C}_i=Gau(\mathcal{B};\sigma,m)$\:\:\:\textsf{//Smooth angular velocity with Gaussian filter}
        \STATE Drop the first element in buffer $\mathcal{B}$.
    \ENDIF
    \IF{$\mathcal{C}_i-\mathcal{C}_{i-1}<0.1$}
        \STATE $\alpha \leftarrow \max\{\underline{\alpha}, \rho \times \alpha\}$\:\:\:\textsf{//Drop $\alpha$}
        \STATE $\mathcal{B}\leftarrow\{\}$
    \ENDIF
    \STATE $i\gets i+1$
  \ENDIF
\ENDFOR
\end{algorithmic}
\end{algorithm}

In this section, we formulate an automatic learning rate scheduling algorithm, AutoDrop (Algorithm~\ref{alg:AD}), based on the properties of the angular velocity stated in Section~\ref{sec:ME}.
The motivation for our method is to drop the learning rate every time the angular velocity saturates. Even though the angular velocity behaves more consistently than the loss (Figure~\ref{fig:DeepME}), it still fluctuates, and its variance depends on the learning rate, which makes setting a hard threshold challenging. We therefore introduce a Gaussian filter to smooth the angular velocity:

\vspace{-0.3in}
\begin{align*}
    K(x^*,x_t;\sigma)=\exp{\left(-(x^*-x_t)^2/(2\sigma^2)\right)},
\end{align*}
\vspace{-0.3in}

where $\sigma$ is the standard deviation of the Gaussian distribution. We define the width of the smoothing buffer as $m$ and denote the buffer as $\mathcal{B}_t=\{x_{t+i}\}_{i=-m/2}^{m/2}$. The smoothed angular velocity is then

\vspace{-0.35in}
\begin{align*}
    y_t=Gau(\mathcal{B}_t;\sigma,m):=\frac{1}{Z(t)}\sum\nolimits_{i=-m/2}^{m/2}x_{t+i}K(x_t,x_{t+i};\sigma),
\end{align*}
\vspace{-0.2in}

where $Z(t)=\sum_{i=-m/2}^{m/2}K(x_t,x_{t+i};\sigma)$ is the normalization constant. The Gaussian smoothing factor $\sigma$ at each step is set automatically to the standard deviation of the current buffer $\mathcal{B}_t$: a large variance of the angular velocity within the current buffer implies that stronger smoothing is required. Motivated by the one-sigma rule in statistics (nearly 70\% of values lie within one standard deviation of the mean), we set an upper bound $\overline{\sigma}=m/2$ on the Gaussian smoothing factor $\sigma$ to avoid overly aggressive smoothing.
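
One way to read the filter is as a normalized kernel-weighted average of the buffer entries. The following sketch reflects our interpretation; the weights $w_i$ are notation introduced here only for illustration:
\begin{align*}
y_t&=\sum\nolimits_{i=-m/2}^{m/2}w_i\,x_{t+i},\qquad \sum\nolimits_{i=-m/2}^{m/2}w_i=1,\\
w_i&=\frac{\exp\left(-(x_t-x_{t+i})^2/(2\sigma^2)\right)}{\sum_{j=-m/2}^{m/2}\exp\left(-(x_t-x_{t+j})^2/(2\sigma^2)\right)}.
\end{align*}
Two limiting cases illustrate the role of $\sigma$. As $\sigma\to\infty$, every kernel value tends to $1$, so $w_i\to 1/(m+1)$ and $y_t$ becomes the plain mean of the $m+1$ buffer entries. As $\sigma\to 0$, the weight concentrates on entries equal to $x_t$, so $y_t\to x_t$ and no smoothing takes place. For a buffer width of $m=10$, the cap $\overline{\sigma}=m/2$ corresponds to $\sigma\leq 5$.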

The algorithm takes as input the initial model parameter vector $x_0$, the initial learning rate $\alpha_0$, the smallest permissible learning rate $\underline{\alpha}$, the sliding window size $k$ for computing the batch angular velocity defined in Definition~\ref{def:av2}, the width $m$ of the buffer $\mathcal{B}_t$ used for smoothing the angular velocity, and the learning rate drop factor $\rho$ ($\rho \in (0,1)$; each time the learning rate is dropped, it is multiplied by $\rho$).
The algorithm triggers the procedure for dropping the learning rate (i.e., multiplying it by $\rho$) each time the Gaussian-smoothed angular velocity changes by less than the threshold.

The rationale behind dropping the learning rate is not so much to directly accelerate convergence, but rather to help an optimizer that is stuck in a local optimum to escape it. Dropping the learning rate helps DL models escape from the current optimum and eventually converge to a better one. Note that popular manual learning rate schedules (linear, stepwise, cosine annealing, exponential, etc.) all decrease the learning rate, each using a different mechanism. Our mechanism is based on the angular velocity. We observed that the saturation of the angular velocity can guide the drop of the learning rate, since it directly indicates that the optimizer is slowing down, that is, that the loss function is entering saturation and the optimizer is getting stuck in a local optimum. Tracking the saturation of the angular velocity is preferable to tracking the saturation of the loss function for several reasons (see Figures~\ref{fig:NQM} and~\ref{fig:DeepME}): i) angular velocity curves follow a much sharper saturation pattern, ii) unlike the angular velocity, the loss function does not necessarily have a bounded range, and iii) the angular velocity typically enters saturation slightly earlier than the loss function, so tracking it enables earlier detection of the moment when the optimizer starts to get stuck in a local optimum.

Detailed pseudo-code for AutoDrop can be found in Algorithm~\ref{alg:AD}. We further comment on the two fixed conditions $len(\mathcal{B})\geq 10$ and $\mathcal{C}_i-\mathcal{C}_{i-1}<0.1$ in the algorithm. The condition $len(\mathcal{B})\geq 10$ means that we do not smooth the angular velocity at the very beginning of training or right after dropping the learning rate; this is a common-sense initial condition, since a few samples must be gathered before smoothing makes sense. Regarding the condition $\mathcal{C}_i-\mathcal{C}_{i-1}<0.1$, the threshold should intuitively match the standard deviation of the angular velocity. We found that this standard deviation is between $0.1$ and $0.25$ (see, for example, Table~\ref{tab:ablation_c} in the Supplementary for the ResNet experiment with different learning rates; we observed similar properties in the remaining experiments).

% The computational complexity for AutoDrop is the same as for other automatic learning rate scheduling techniques. For example, the HD method requires computing the inner product of two consecutive gradients 
% $\langle g_t, g_{t-1}\rangle$; 
% the TLR method requires computing the inner product of two consecutive gradients $\langle g_t, g_{t-1}\rangle$ and the inner product $\langle g_t, g_{t}-g_{t-1}\rangle$. All we need for AutoDrop is to compute the angle between two consecutive steps and perform smoothing of the angular velocity. Compared to vanilla methods, AutoDrop does not require any extra training time.

The AutoDrop algorithm can be thought of as a meta-scheme that can be placed on top of any optimization method for training deep learning models; thus, one can use any optimizer to update the model parameters. Next, we discuss the hyper-parameters used in AutoDrop.
% In practice we recommend using the following setting of the hyperparameters for our algorithm: $\underline{\alpha} = 0.0001$, $k=64$, $m=10$, and $\rho=0.95$. This setting was used in all our experiments (i.e., we were not adjusting this setting across data sets or architectures). As will be shown in the experimental section, this set of hyperparameters guarantees good performance for a wide range of model architectures and data sets.

\vspace{-0.2in}
\subsection{Hyper-parameters of AutoDrop}
\label{sec:HS}
\vspace{-0.1in}
% \textcolor{red}{Here we should have supporting discussion, table, and ablation study.}\\

Our method is not hyper-parameter free. Note that the term ``automatic'' in this paper refers to techniques that do not need manual adjustment of the learning rate during the optimization process. The other automatic learning rate schedulers that we compare with (TLR and HD) also have hyper-parameters, as do all manual learning rate techniques. We want to emphasize, however, that in the case of AutoDrop we keep the hyper-parameters fixed across different experiments, as opposed to, for example, the HD method, and we report ablation studies justifying the settings that we use. Finally, TLR also does not require hyper-parameters to be changed across experiments, but its performance is inferior to that of AutoDrop (as will be demonstrated experimentally), and furthermore its authors report no ablation studies of its hyper-parameters.

This section discusses the setting of all additional hyperparameters, over standard optimizers, that AutoDrop introduces: the learning rate drop factor $\rho$, the buffer size $m$ for Gaussian smoothing, and the window size $k$ for computing the batch angular velocity. 

The hyperparameters $\rho$ and $m$ are fixed across all our experiments ($\rho=0.95$, $m=10$), and we discuss them first. We also present ablation studies concerning them in the Supplement (Section~\ref{supp:hyper}). To ensure that the learning rate does not drop too quickly, $\rho$ should not be too small. Similarly, since an excessively large buffer size $m$ for Gaussian smoothing leads to over-smoothing and reduced performance, $m$ should not be set to a large value. The setting $\rho=0.95$, $m=10$ performed best in our ablation study on the CIFAR10/CIFAR100 tasks. As shown in Section~\ref{supp:hyper}, only extreme cases where $\rho$ or $m$ is set to a very high value ($\rho=0.99$, $m=50$) result in significant changes in the error. Across a wide range of settings of these two hyper-parameters, we found that the changes in model performance are not very large, i.e., of the order of $2.5\%$--$4\%$.
% Finally, note that in a wide range of settings of these two hyper-parameters we found that the changes of the model performance are mild, i.e., of the order $2.5\%-4\%$.

The sliding window size $k$ used for computing the batch angular velocity varies with the size of the training data $N$. Since $k$ determines the frequency of computing the batch angular velocity and we drop the learning rate every time the angular velocity saturates, the learning rate $\alpha_t$ at iteration $t$ for AutoDrop can be roughly expressed as $\alpha_t=\alpha_0\rho^{\mathcal{O}(N/k)}$, assuming $\rho$ and $m$ are fixed. Therefore, when the data set is large, e.g., the ImageNet data set has $\sim$1.2M images, the sliding window $k$ should be larger than for smaller data sets, such as CIFAR10 and CIFAR100, which have $\sim$50K training images each. We found that $k=64$ performs well for the CIFAR10 and CIFAR100 tasks, while $k=640$ performs much better for ImageNet. See Table~\ref{tab:ablation_k} for the ablation study.
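
To give a feel for the exponent in $\alpha_t=\alpha_0\rho^{\mathcal{O}(N/k)}$, the number of drops needed for a given reduction can be computed directly. The calculation below is a back-of-the-envelope illustration with $\rho=0.95$; the target reduction factor of $10$ is a hypothetical choice, $j$ denotes the number of drops performed so far, and the lower bound $\underline{\alpha}$ is ignored:
\begin{align*}
\alpha_0\rho^{j}\leq\frac{\alpha_0}{10}
\iff j\geq\frac{\ln 10}{-\ln 0.95}\approx\frac{2.303}{0.0513}\approx 44.9 ,
\end{align*}
so $45$ drops are required. Since the saturation test is evaluated once every $k$ iterations, a larger $k$ spreads these drops over more iterations, which is consistent with using a larger window for larger data sets.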

\begin{table}[H]
\vspace{-0.1in}
\begin{small}
  \begin{tabular}{|p{1.5cm}||p{1.1cm}|p{1.2cm}|p{1.1cm}|p{1.1cm}|}
    \hline
    \multirow{2}{*}{Model} &
    \multicolumn{4}{c|}{Window size $k$}\\
    \cline{2-5}
    & $k$=32 & $k$=64 & $k$=128 & $k$=256 \\
    \hline
    \tabincell{l}{ResNet18 \\ CIFAR10} &$5.65_{\pm.15}$&$\!\mathbf{4.79_{\pm.99}}$&$6.08_{\pm.11}$&$7.41_{\pm.24}$\\
    \hline
    \tabincell{l}{WRN28x10 \\ CIFAR10}&$4.30_{\pm.13}$&$\!\mathbf{3.73_{\pm.07}}$&$5.77_{\pm.13}$&$7.36_{\pm.15}$\\
    \hline
    \tabincell{l}{ResNet34 \\ CIFAR100}&$24.07_{\pm.44}$&$\!\mathbf{21.82_{\pm.14}}$&$23.11_{\pm1.3}$&$28.33_{\pm.20}$ \\
    \hline
    \tabincell{l}{WRN40x10 \\ CIFAR100}&$20.39_{\pm.08}$&$\!\mathbf{19.41_{\pm.10}}$&$24.49_{\pm.16}$&$28.79_{\pm.32}$\\
    \hline
    \hline
    Model&$k$=64 & $k$=256 & $k$=512 & $k$=640 \\
    \hline
    \tabincell{l}{ResNet18 \\ ImageNet}&39.22&\!31.04&29.70&\textbf{29.24} \\
    \hline
  \end{tabular}
  
  \vspace{-0.1in}
  \caption{Ablation study for $k$ conducted across different DL models and data sets.}
  \label{tab:ablation_k}
  \end{small}
  \vspace{-0.1in}
\end{table}


% \vspace{-0.25in}
\section{Theory}
\label{sec:Theory}

\vspace{-0.1in}
\makeatletter\def\@captype{figure}\makeatother

\begin{figure*}
\vspace{-0.3in}
\begin{minipage}{.35\textwidth}
\centering
\begin{figure}[H]
    \centering
    \vspace{0.1in}
    \includegraphics[trim={1cm 3cm 1cm 3cm},clip,width=\textwidth]{Figures/theory/angle_landscape_final.png}
    \vspace{-0.15in}
    \caption{Angular velocity model for a fixed learning rate $\alpha$.}
    \label{fig:flat}
\end{figure}
\vfill
\end{minipage}
\makeatletter\def\@captype{table}\makeatother
\begin{minipage}{.64\textwidth}
\centering
\begin{algorithm}[H]
    \centering
    \caption{AutoDrop (approximate)}\label{alg:LRdrop}
    \begin{algorithmic}
    \STATE \textbf{Inputs:} $x_0$: initial weight \\ 
    \STATE \textbf{Hyperparameters:} $\{\hat{\alpha}_i\}$: set of learning rates, $v_{\alpha}(t)$: ang. vel. model, $\tau_0$: init. threshold for the derivative of ang. vel. \\ 
    \STATE Initialize $i=0$, $t_0=0$, $t=0$\\
    \WHILE{$i<n$}
        \STATE Update $x_t$ via (\ref{eq:sgdm}) with learning rate $\alpha_t\!=\!\hat{\alpha}_i$.
        \IF{$v'_{\hat{\alpha}_i}(t-t_i)\leq\tau_i=\min\{\tau_0, \gamma \hat{\alpha}_i/2\}$}
          \STATE $i=i+1; t_i=t$
        \ENDIF
        \STATE $t=t+1, T=t$
    \ENDWHILE
    \RETURN $\{x_t\}_{t=0}^{T-1}$ (T: $\#$ iterations)
    \end{algorithmic}
\end{algorithm}
\end{minipage}
% \vspace{-0.3in}
\end{figure*}

This section shows theoretically that decreasing the learning rate when the angular velocity saturates guarantees a sub-linear convergence rate for SGD and SGD with momentum. Moreover, Section~\ref{sec:cov_dis} develops a general convergence proof technique that not only supports AutoDrop, but is also applicable to any learning rate scheduler that decreases the learning rate step-wise.

% \vspace{-0.15in}
\subsection{Unified convergence analysis with discrete learning rate drop}\label{sec:cov_dis}
% \vspace{-0.15in}
First, we present a unified theoretical framework that covers the update rules of both SGD and momentum SGD. We refer to these update rules jointly as the Unified Momentum (UM) method~\citep{yang2016unified}:

\vspace{-0.2in}
\begin{equation}
\text{UM}:\quad\left\{
\begin{aligned}\label{eq:sgdm}
      y_{t+1}&=x_t-\alpha_t\mathcal{G}(x_t;\xi_t)\\
      y_{t+1}^s&=x_t-s\alpha_t\mathcal{G}(x_t;\xi_t)\\
      x_{t+1}&=y_{t+1}+\beta(y_{t+1}^s-y_t^s)
\end{aligned}
\right.    
\end{equation}
\vspace{-0.18in}

where $t$ is the iteration index, $\beta$ is the momentum parameter, $\alpha_t$ is the learning rate at time $t$, $x_t$ is the parameter vector at time $t$, and $\mathcal{G}(x_t;\xi_t)$ is the gradient of the loss function at time $t$ computed for a data mini-batch $\xi_t$. The factor $s$ controls the type of optimization method. When $s=0$ and $s=1$, the UM method reduces to the heavy-ball and Nesterov (NAG) methods, respectively. When $s=1/(1-\beta)$, the UM method is the vanilla gradient descent method.
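
The three special cases can be checked by substituting $s$ into (\ref{eq:sgdm}). The derivation below is a sketch; the shorthand $g_t:=\mathcal{G}(x_t;\xi_t)$ is introduced here for brevity, and the initialization $y_0^s=x_0$ is our assumption for the last case:
\begin{align*}
s=0:&\;\; y_{t+1}^s=x_t\;\Rightarrow\; x_{t+1}=x_t-\alpha_t g_t+\beta(x_t-x_{t-1}),\\
s=1:&\;\; y_{t+1}^s=y_{t+1}\;\Rightarrow\; x_{t+1}=y_{t+1}+\beta(y_{t+1}-y_t),\\
s=\tfrac{1}{1-\beta}:&\;\; y_t^s=x_t\;\Rightarrow\; x_{t+1}=x_t-\tfrac{\alpha_t}{1-\beta}\,g_t.
\end{align*}
In the last line, $y_t^s=x_t$ holds by induction: if it holds at step $t$, then $x_{t+1}=y_{t+1}+\beta(y_{t+1}^s-x_t)=x_t-(1+\beta s)\alpha_t g_t$, and $1+\beta s=s$ for $s=1/(1-\beta)$, so $x_{t+1}=y_{t+1}^s$.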

% \vspace{-0.05in}
% \begin{itemize}[noitemsep,topsep=0pt,parsep=0pt,partopsep=0pt, leftmargin=0.3in]
%     \item $s=0$ Heavy-Ball (HB) method:
%     \vspace{-0.05in}
%     $$\text{HB:}\quad x_{t+1}=x_t-\alpha_t\mathcal{G}(x_t;\xi_t)+\beta(x_t-x_{t-1})$$
%     \vspace{-0.1in}
        
%     \item $s=1$ Nestrov (NAG) method:
%     \vspace{-0.1in}
%     $$y_{t+1}=x_t-\alpha_t\mathcal{G}(x_t;\xi_t),\quad x_{t+1}=y_{t+1}+\beta(y_{t+1}-y_t)$$
%     % $$\text{NAG:}\quad \quad\left\{
%     % \begin{aligned}  
%     %       y_{t+1}&=x_t-\alpha_t\mathcal{G}(x_t;\xi_t)\\
%     %       x_{t+1}&=y_{t+1}+\beta(y_{t+1}-y_t)
%     % \end{aligned}
%     % \right.$$
%     \vspace{-0.05in}
     
%     \item $s=1/(1-\beta)$ Gradient Descent (GD) method:
%     \vspace{-0.1in}
%     $$\text{GD:}\quad x_{t+1}=x_t-\alpha_t/(1-\beta)\mathcal{G}(x_t;\xi_t).$$
%     \vspace{-0.1in}
% \end{itemize}
% \vspace{-0.05in}

% The state-of-the-art convergence analysis for common machine learning optimizers only supports constant learning rate \citep{le2012stochastic, yang2016unified, schmidt2017minimizing, ramezani2018stability, zhang2019fast} or continuous learning rate drop schemes \citep{wu2018wngrad, wu2019global, gower2019sgd}. However, the learning rate is dropped in a discrete fashion in many practical cases, especially in DL. Theorem~\ref{thm:sgdm_conv} provides a theoretical convergence guarantee for optimization algorithms that use discrete learning rate drop. 

Next, we prove the convergence of UM methods (Theorem~\ref{thm:sgdm_conv}). The theorem requires some mild constraints on the drop gap $k_i$, i.e., the number of iterations between the $i^{\text{th}}$ and $(i+1)^{\text{st}}$ learning rate drops. The constraints capture the intuitive argument that extremely lazy changes to the learning rate would bring the scheme close to the constant learning rate method, essentially preventing convergence. Theorem~\ref{thm:sgdm_conv} accommodates learning settings relying on discrete learning rate drops.

% \begin{thm}\label{thm:sgdm_conv}
% Suppose $f(x)$ is a convex function, $\mathbb{E}\left[\norm{ \mathcal{G}(x;\xi)-\mathbb{E}[\mathcal{G}(x;\xi)]}\right]\leq\delta^2$ and $\norm{\partial f(x)}\leq G$ for any $x$ and some non-negative $G$. Given a sequence of decreasing learning rates $\{\hat{\alpha}_i\}_{i=-1}^{n-1}\subset (0,1)$ and a sequence of integers $\{k_i\}_{i=0}^{n-1}\subset \mathbb{N}$ ($n\gg 1$), there exits constants $\kappa_1,\kappa_2$ such that
% \vspace{-0.07in}
% \begin{align}\label{ieq:ki_con}
%     \hat{\alpha}_i\leq (i+2)^{-\frac{2}{3}}, \quad k_i\hat{\alpha}_i\geq \kappa_1(i+2)^{-\frac{1}{3}},\quad k_i\hat{\alpha}_i\hat{\alpha}_{i-1}\leq \kappa_2(i+1)^{-1},\quad \forall i=0,1,...,n-1.
% \end{align}
% \vspace{-0.23in}

% Define a partition $\Pi:0=t_0<t_1<...<t_n=T (T=\sum_{i=0}^{n-1}k_i)$ based on the integer sequence $\{k_i\}_{i=0}^{n-1}$ such that the gap between $t_i$ and $t_{i+1}$ is $k_i$ ($k_i = t_{i+1}-t_i$). Run UM update defined in Equation~\ref{eq:sgdm} for $T$ iterations by setting the learning rate $\alpha_t$ based on a sequence $\{\hat{\alpha}_i\}_{i=-1}^{n-1}$ as
% \vspace{-0.1in}
% \begin{align}
%     \alpha_t=\hat{\alpha}_i,\quad\text{where }t_i\leq t< t_{i+1}.
% \end{align}
% \vspace{-0.23in}

% Then the following holds:
% \vspace{-0.1in}
% \begin{align}
%     \min_{t=0,...,T-1}\{\mathbb{E}[f(x_t)-f(x^*)]\}\leq&\frac{2\beta(f(x_0)-f(x^*))[(n+1)^{\frac{1}{3}}-2^{\frac{1}{3}}]}{2\kappa_1(1-\beta)[(n+1)^{\frac{2}{3}}-2^{\frac{2}{3}}]}+\frac{(1-\beta)\norm{x_0-x^*}^2}{3\kappa_1[(n+1)^{\frac{2}{3}}-2^{\frac{2}{3}}]}\notag\\
%     &+\frac{(2s\beta+1)(G^2+\delta^2)\kappa_2\log n}{3(1-\beta)\kappa_1[(n+1)^{\frac{2}{3}}-2^{\frac{2}{3}}]}.
% \end{align}
% \vspace{-0.23in}
% \end{thm}

% \begin{figure*}[t]
%     \centering
%     \subfigure[\vspace{-0.3in}]
%     {\includegraphics[width=0.3\textwidth]{Figures/theory/angle_landscape_final.png}}
%     \subfigure[]{\includegraphics[width=0.3\textwidth]{Figures/theory/angle_landscape3_final.png}}
%     \subfigure[]{\includegraphics[width=0.3\textwidth]{Figures/theory/Learning_rate_trend2_final.png}}
%     \vspace{-0.1in}
%     \caption{(a) Angular velocity model for a fixed learning rate $\alpha$. (b) The behavior of the angular velocity for Algorithm~\ref{alg:LRdrop}. (c) The behavior of the learning rate for Algorithm~\ref{alg:LRdrop}.}
%     \vspace{-0.1in}
%     \label{fig:flat}
% \end{figure*}

\begin{theorem}\label{thm:sgdm_conv}
Suppose $f(x)$ is a convex function, $\mathbb{E}\left[\norm{ \mathcal{G}(x;\xi)-\mathbb{E}[\mathcal{G}(x;\xi)]}^2\right]\leq\delta^2$, and $\norm{\partial f(x)}\leq G$ for any $x$ and some non-negative $G$. Let $\{\hat{\alpha}_i\}_{i=-1}^{n-1}\subset (0,1)$ be a sequence of decreasing learning rates and $\{k_i\}_{i=0}^{n-1}\subset \mathbb{N}$ a sequence of integers ($n\gg 1$), and suppose there exist constants $\kappa_1,\kappa_2$ such that for all $i=0,\dots,n-1$
\vspace{-0.1in}
\begin{align}\label{ieq:ki_con2}
    \hat{\alpha}_i\!\leq\! (i\!+\!2)^{-1},\: k_i\hat{\alpha}_i\!\geq\! \kappa_1,\: k_i\hat{\alpha}_i\hat{\alpha}_{i-1}\!\leq\! \kappa_2(i\!+\!1)^{-1}.
\end{align}
\vspace{-0.35in}

Define a partition $\Pi:0=t_0<t_1<\dots<t_n=T$ with $t_{i+1}-t_i=k_i$ (so that $T=\sum_{i=0}^{n-1}k_i$). Run the UM update defined in Equation~\ref{eq:sgdm} for $T$ iterations, setting the learning rate $\alpha_t$ as
% Define a partition $\Pi:0=t_0<t_1<...<t_n=T (T=\sum_{i=0}^{n-1}k_i)$ based on the integer sequence $\{k_i\}_{i=0}^{n-1}$ such that the gap between $t_i$ and $t_{i+1}$ is $k_i$ ($k_i = t_{i+1}-t_i$). Run UM update defined in Equation~\ref{eq:sgdm} for the number of $T$ iterations by setting the learning rate $\alpha_t$ as
% based on a sequence $\{\hat{\alpha}_i\}_{i=-1}^{n-1}$ as
\vspace{-0.1in}
\begin{align}
    \alpha_t=\hat{\alpha}_i,\quad\text{where }t_i\leq t< t_{i+1}.
\end{align}
\vspace{-0.35in}

Then the following holds:
\vspace{-0.1in}
% \begin{align*}
%     &\min_{t=0,...,T\!-\!1}\{\mathbb{E}[f(x_t)\!-\!f(x^*)]\}\\
%     \leq&\quad\frac{\beta(f(x_0)\!-\!f(x^*))[\log(n+1)\!-\!\log 2]}{\kappa_1(1-\beta)n}\notag\\
%     &+\frac{(1\!-\!\beta)\norm{x_0\!-\!x^*}^2}{2\kappa_1n}+\frac{(2s\beta\!+\!1)(G^2\!+\!\delta^2)\kappa_2\log n}{2(1\!-\!\beta)\kappa_1n}.
% \end{align*}
\begin{align*}
    \min_{t=0,...,T\!-\!1}\{\mathbb{E}[f(x_t)\!-\!f(x^*)]\}\leq O\left(\log n/\sqrt{n}\right).
\end{align*}
\vspace{-0.23in}
\end{theorem}
\vspace{-0.15in}
% Note that all proofs for SGD-based methods require the learning rate to decrease continuously~\citep{wu2018wngrad, wu2019global, gower2019sgd,le2012stochastic, yang2016unified, schmidt2017minimizing, ramezani2018stability, zhang2019fast}. On the other hand, we know that SGD does not converge under a constant learning rate. Discrete learning rate policy (as in AutoDrop) covers the space between constant and continuous learning rate decays. It is non-trivial to see that moving away from a continuous learning rate scheme to a step-wise constant scenario will still sustain the rate of convergence the same as in the continuous learning rate schemes. AutoDrop is a discrete learning rate scheduler, which requires new proof techniques compared with the traditional SGD proof scheme. 
% We develop a general proof technique that not only supports AutoDrop, but is also applicable to any learning rate schedulers that decrease the learning rate step-wise. Theorem~\ref{thm:sgdm_conv} is therefore universal and of fundamental importance.
Note that even in the convex case our analysis is highly non-trivial. All proofs for SGD-based methods require the learning rate to decrease continuously~\citep{wu2018wngrad, wu2019global, gower2019sgd,le2012stochastic, yang2016unified, schmidt2017minimizing, ramezani2018stability, zhang2019fast}. On the other hand, SGD does not converge under a constant learning rate. A discrete learning rate policy (as in AutoDrop) covers the space between constant and continuously decaying learning rates. It is not obvious that moving from a continuous learning rate scheme to a step-wise constant scheme preserves the rate of convergence of the continuous schemes. We also provide technical conditions that capture the intuitive argument that extremely lazy changes to the learning rate bring the step-wise constant scheme close to the constant learning rate method, essentially preventing convergence. AutoDrop is a discrete learning rate scheduler and therefore requires new proof techniques compared with the traditional SGD proof scheme. We develop a general proof technique that not only supports AutoDrop, but is also applicable to any learning rate scheduler that decreases the learning rate step-wise. Theorem~\ref{thm:sgdm_conv} is therefore universal and of fundamental importance.
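
As a sanity check on condition~(\ref{ieq:ki_con2}), consider the following illustrative choice, which is a hypothetical example and not a schedule used in our experiments: $\hat{\alpha}_i=(i+3)^{-1}$ and $k_i=i+3$. Then $\{\hat{\alpha}_i\}_{i=-1}^{n-1}\subset(0,1)$ is decreasing, and with $\kappa_1=\kappa_2=1$ we obtain
\begin{align*}
    \hat{\alpha}_i&=\frac{1}{i+3}\leq\frac{1}{i+2},\\
    k_i\hat{\alpha}_i&=1\geq\kappa_1,\\
    k_i\hat{\alpha}_i\hat{\alpha}_{i-1}&=\frac{1}{i+2}\leq\frac{\kappa_2}{i+1},\\
    T&=\sum_{i=0}^{n-1}(i+3)=\frac{n(n+5)}{2}.
\end{align*}
In this example each learning rate is held for one more iteration than the previous one, so the total number of iterations $T$ grows quadratically in the number of stages $n$.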

In the next subsection, we extend this theorem to our AutoDrop approach.

% \vspace{-0.05in}
\subsection{Convergence Analysis of AutoDrop}\label{subsec:thm_auto_drop}
% \vspace{-0.1in}
For a fixed learning rate $\alpha$, we introduce a simplified mathematical model of the behavior of the angular velocity as a function of iterations. The model is defined below (and depicted in Figure~\ref{fig:flat}): 
\vspace{-0.1in}
% \begin{equation}
%     v_{\alpha}(t)=\frac{\pi}{2}(1+\epsilon\alpha)\left(1-\frac{1}{\gamma\alpha t}\right),
%     \label{eq:AVModel}
% \end{equation}
% \vspace{-0.05in}
\begin{equation}
    v_{\alpha}(t)=\frac{\pi}{2}(1+\epsilon\alpha)\left(1-\frac{1}{\gamma\alpha \left(t+\frac{1}{\gamma\alpha}\right)}\right),
    \label{eq:AVModel}
\end{equation}
% \vspace{-0.3in}

where $t$ is the number of iterations, and $\epsilon$ and $\gamma$ control the asymptote and the curvature of the velocity, respectively.

% \vspace{-0.05in}
% \makeatletter\def\@captype{figure}\makeatother

% \begin{figure*}
% \vspace{-0.3in}
% \begin{minipage}{.35\textwidth}
% \centering
% \begin{figure}[H]
%     \centering
%     \vspace{0.1in}
%     \includegraphics[trim={1cm 3cm 1cm 3cm},clip,width=\textwidth]{Figures/theory/angle_landscape_final.png}
%     \vspace{-0.15in}
%     \caption{Angular velocity model for a fixed learning rate $\alpha$.}
%     \label{fig:flat}
% \end{figure}
% \vfill
% \end{minipage}
% \makeatletter\def\@captype{table}\makeatother
% \begin{minipage}{.64\textwidth}
% \centering
% \begin{algorithm}[H]
%     \centering
%     \caption{AutoDrop (approximate)}\label{alg:LRdrop}
%     \begin{algorithmic}
%     \STATE \textbf{Inputs:} $x_0$: initial weight \\ 
%     \STATE \textbf{Hyperparameters:} $\{\hat{\alpha}_i\}$: set of learning rates, $v_{\alpha}(t)$: ang. vel. model, $\tau_0$: init. threshold for the derivative of ang. vel. \\ 
%     \STATE Initialize $i=0$, $t_0=0$, $t=0$\\
%     \WHILE{$i<n$}
%         \STATE Update $x_t$ via (\ref{eq:sgdm}) with learning rate $\alpha_t\!=\!\hat{\alpha}_i$.
%         \IF{$v'_{\hat{\alpha}_i}(t-t_i)\leq\tau_i=\min\{\tau_0, \gamma \hat{\alpha}_i/2\}$}
%           \STATE $i=i+1; t_i=t$
%         \ENDIF
%         \STATE $t=t+1, T=t$
%     \ENDWHILE
%     \RETURN $\{x_t\}_{t=0}^{T-1}$ (T: $\#$ iterations)
%     \end{algorithmic}
% \end{algorithm}
% \end{minipage}
% \vspace{-0.3in}
% \end{figure*}

% \begin{wrapfigure}{r}{0.61\textwidth} 
% \vspace{-0.25in}
% \begin{center}
% \includegraphics[trim={0.3cm 0 0.3cm 2cm},clip,width=.3\textwidth]{Figures/theory/angle_landscape3_final.png}
% \includegraphics[trim={0.45cm 0 0.3cm 2cm},clip,width=.3\textwidth]{Figures/theory/Learning_rate_trend2_final.png}
% \end{center}
% \vspace{-0.25in}
% \caption{The behavior of the angular velocity (\textbf{left}) and the learning rate (\textbf{right}) for Algorithm~\ref{alg:LRdrop}. The derivative threshold $\tau_i\!=\!\min\{\tau_0, \gamma \hat{\alpha}_i/2\}$.}
% \label{fig:AutoLR}
% \vspace{-0.1in}
% \end{wrapfigure}

% \begin{figure*}[t]
%     \centering
%     \subfigure[\vspace{-0.3in}]
%     {\includegraphics[width=0.3\textwidth]{Figures/theory/angle_landscape_final.png}}
%     \subfigure[]{\includegraphics[width=0.3\textwidth]{Figures/theory/angle_landscape3_final.png}}
%     \subfigure[]{\includegraphics[width=0.3\textwidth]{Figures/theory/Learning_rate_trend2_final.png}}
%     \vspace{-0.1in}
%     \caption{(a) Angular velocity model for a fixed learning rate $\alpha$. (b) The behavior of the angular velocity for Algorithm~\ref{alg:LRdrop}. (c) The behavior of the learning rate for Algorithm~\ref{alg:LRdrop}.}
%     \vspace{-0.1in}
%     \label{fig:flat}
% \end{figure*}

$v_\alpha(t)$ saturates at $\frac{\pi}{2}[1+\epsilon\alpha]$ as $t$ goes to infinity. Note that the given model complies with property P2, empirically observed and described in Section~\ref{sec:ME}: i) if the learning rate is large enough, the angular velocity saturates at a level larger than $\pi/2$ and smaller than $2\pi/3$; ii) as the learning rate decreases, the angular velocity saturates at progressively lower levels; iii) a smaller learning rate leads to a slower saturation of the angular velocity; iv) when the learning rate is low enough, the angular velocity saturates at $\pi/2$. We assume an upper bound $\alpha_{max}$ on the learning rate. Since the limit of the angular velocity should lie between $\pi/2$ and $2\pi/3$, the range of the factor $\epsilon$ is set to $(0,\frac{1}{3\alpha_{max}})$. Finally, Equation~(\ref{eq:AVModel}) is universal in the sense that it accommodates any saturation level between $90$ and $120$ degrees; thus the behavior of the DL model from Figure~\ref{fig:DeepME} can be well represented by this equation.
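
To make these properties explicit, Equation~(\ref{eq:AVModel}) can be rewritten in an equivalent form. The following is a short worked derivation under the model above, where $q\in(0,1)$ is an auxiliary symbol introduced here only for illustration:
\begin{align*}
    v_{\alpha}(t)&=\frac{\pi}{2}(1+\epsilon\alpha)\,\frac{\gamma\alpha t}{\gamma\alpha t+1},\\
    v_{\alpha}(0)&=0,\qquad \lim_{t\to\infty}v_{\alpha}(t)=\frac{\pi}{2}(1+\epsilon\alpha),\\
    v_{\alpha}(t)&=q\cdot\frac{\pi}{2}(1+\epsilon\alpha)\iff t=\frac{q}{(1-q)\gamma\alpha}.
\end{align*}
The last line shows that the time needed to reach a fixed fraction $q$ of the saturation level scales as $1/\alpha$, which corresponds to property iii). Moreover, the saturation level stays below $2\pi/3$ exactly when $\epsilon\alpha<1/3$, which holds for all $\alpha\leq\alpha_{max}$ if $\epsilon<\frac{1}{3\alpha_{max}}$.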

For the purpose of the theoretical analysis, we drop the learning rate every time the derivative of the angular velocity decreases to a threshold $\tau_i$ (Algorithm~\ref{alg:LRdrop}) instead of detecting whether the change of the angular velocity is small enough (Algorithm~\ref{alg:AD}). Intuitively, when the derivative of the angular velocity is close to zero, we expect the angular velocity to have saturated. Algorithm~\ref{alg:LRdrop} is thus an approximate version of Algorithm~\ref{alg:AD}. The behavior of the angular velocity and the learning rate for Algorithm~\ref{alg:LRdrop} is depicted in Figure~\ref{fig:flat2} in Supplementary~\ref{supp:flat2}.
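
The drop time implied by this rule can be worked out in closed form. The following derivation is a sketch under the model of Equation~(\ref{eq:AVModel}); here $s=t-t_i$ denotes the number of iterations since the last drop, and the specific choice $\tau_i=\gamma\hat{\alpha}_i/2$ in the last line is an assumption made only for illustration:
\begin{align*}
    v'_{\hat{\alpha}_i}(s)&=\frac{\pi}{2}(1+\epsilon\hat{\alpha}_i)\,\frac{\gamma\hat{\alpha}_i}{(\gamma\hat{\alpha}_i s+1)^2},\\
    v'_{\hat{\alpha}_i}(s)\leq\tau_i&\iff(\gamma\hat{\alpha}_i s+1)^2\geq\frac{\pi(1+\epsilon\hat{\alpha}_i)\gamma\hat{\alpha}_i}{2\tau_i},\\
    \tau_i=\frac{\gamma\hat{\alpha}_i}{2}&\implies s\,\hat{\alpha}_i\geq\frac{\sqrt{\pi(1+\epsilon\hat{\alpha}_i)}-1}{\gamma}\geq\frac{\sqrt{\pi}-1}{\gamma}.
\end{align*}
Under this illustrative threshold, the holding time therefore scales as $1/\hat{\alpha}_i$, and the product of the holding time and the learning rate is bounded below by a constant of the same form as $\kappa_1$ in Theorem~\ref{thm:conv}.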
% \begin{algorithm}[H]
%     \centering
%     \caption{AutoDrop (approximate)}\label{alg:LRdrop}
%     \begin{algorithmic}
%     \STATE \textbf{Inputs:} $x_0$: initial weight \\ 
%     \STATE \textbf{Hyperparameters:} $\{\hat{\alpha}_i\}$: set of learning rates, $v_{\alpha}(t)$: ang. vel. model, $\tau_0$: init. threshold for the derivative of ang. vel. \\ 
%     \STATE Initialize $i=0$, $t_0=0$, $t=0$\\
%     \WHILE{$i<n$}
%         \STATE Update $x_t$ via (\ref{eq:sgdm}) with learning rate $\alpha_t\!=\!\hat{\alpha}_i$.
%         \IF{$v'_{\hat{\alpha}_i}(t-t_i)\leq\tau_i=\min\{\tau_0, \gamma \hat{\alpha}_i/2\}$}
%           \STATE $i=i+1; t_i=t$
%         \ENDIF
%         \STATE $t=t+1, T=t$
%     \ENDWHILE
%     \RETURN $\{x_t\}_{t=0}^{T-1}$ (T: $\#$ iterations)
%     \end{algorithmic}
% \end{algorithm}\\
% \vspace{-0.2in}
% version of Algorithm~\ref{alg:AD}. The behavior of the angular velocity and the learning rate for Algorithm~\ref{alg:LRdrop} is depicted in Figure~\ref{fig:flat}(b)\&(c).

% \makeatletter\def\@captype{table}\makeatother
% \begin{minipage}{.61\textwidth}
% \centering
% \begin{algorithm}[t]
%     \centering
%     \caption{AutoDrop (approximate)}\label{alg:LRdrop}
%     \begin{algorithmic}
%     \STATE \textbf{Inputs:} $x_0$: initial weight \\ 
%     \STATE \textbf{Hyperparameters:} $\{\hat{\alpha}_i\}$: set of learning rates, $v_{\alpha}(t)$: ang. vel. model, $\tau_0$: init. threshold for the derivative of ang. vel. \\ 
%     \STATE Initialize $i=0$, $t_0=0$, $t=0$\\
%     \WHILE{$i<n$}
%         \STATE Update $x_t$ via (\ref{eq:sgdm}) with learning rate $\alpha_t\!=\!\hat{\alpha}_i$.
%         \IF{$v'_{\hat{\alpha}_i}(t-t_i)\leq\tau_i=\min\{\tau_0, \gamma \hat{\alpha}_i/2\}$}
%           \STATE $i=i+1; t_i=t$
%         \ENDIF
%         \STATE $t=t+1, T=t$
%     \ENDWHILE
%     \RETURN $\{x_t\}_{t=0}^{T-1}$ (T: $\#$ iterations)
%     \end{algorithmic}
% \end{algorithm}\\
% \end{minipage}

\vspace{-0.05in}
\begin{theorem}\label{thm:conv}
Suppose $f(x)$ is a convex function, $\mathbb{E}\left[\norm{ \mathcal{G}(x;\xi)-\mathbb{E}[\mathcal{G}(x;\xi)]}\right]\leq\delta^2$ and $\norm{\partial f(x)}\leq G$ for any $x$ and some non-negative $G$. Given the sequence of the learning rates $\{\hat{\alpha}_i\}_{i=-1}^{n-1}$ such that $\hat{\alpha}_i=(i+1)^{-\frac{2}{3}}$, parameters $\epsilon\in(0,\frac{1}{3\hat{\alpha}_0})$ and $\gamma$ defining the angular velocity model $v_{\alpha}(t)$ (Equation~\ref{eq:AVModel}), and the initial threshold $\tau_0$ ($\tau_0<2$) for the derivative of the angular velocity, the sequence of weights $\{x_t\}_{t=0}^{T-1}$ generated by Algorithm~\ref{alg:LRdrop} satisfies
\vspace{-0.12in}
\begin{align*}
    \min_{t=0,...,T-1}\{\mathbb{E}[f(x_t)-f(x^*)]\}\leq&O\left(\log T/\sqrt{T}\right),
\end{align*}
\vspace{-0.12in}
% \begin{align}
%     \min_{t=0,...,T-1}\{\mathbb{E}[f(x_t)-f(x^*)]\}\leq&\frac{\beta(f(x_0)\!-\!f(x^*))[\log\left(\sqrt{\frac{2T}{\kappa_1}}\!+\!1\right)\!-\!\log 2]}{\kappa_1(1-\beta)\left[\sqrt{\frac{2T}{\kappa_2}}-3\right]}+\frac{(1-\beta)\norm{x_0-x^*}^2}{2\kappa_1\left[\sqrt{\frac{2T}{\kappa_2}}-3\right]}\notag\\
%     &+\frac{(2s\beta+1)(G^2+\delta^2)\kappa_2\log \left(\sqrt{\frac{2T}{\kappa_1}}\right)}{2(1-\beta)\kappa_1\left[\sqrt{\frac{2T}{\kappa_2}}-3\right]}\\
%     =&O\left(\log T/\sqrt{T}\right),
% \end{align}
\vspace{-0.15in}

where $\kappa_1=\frac{\sqrt{\pi}-1}{\gamma}$ and $\kappa_2=\frac{1}{\gamma}\sqrt{\frac{2\pi}{3\tau_0}}$.
\end{theorem}
% \vspace{-0.15in}

Theorem~\ref{thm:conv}, obtained by extending Theorem~\ref{thm:sgdm_conv} to the setting that accommodates the angular velocity model from Equation~\ref{eq:AVModel}, guarantees a sub-linear convergence rate for Algorithm~\ref{alg:LRdrop}.

\vspace{-0.15in}
\section{Experimental Results}
\label{sec:ER}
\vspace{-0.05in}
 % \textcolor{red}{In this section, we compare the performance of our method, AutoDrop, that automatically adjusts the learning rate, with the SOTA optimization approaches for training DL models that instead manually drop the learning rate (HD, TLR, CLR, OneCycle, and ExpLR - correct this sentence since not all these methods are manual and make sure we know which methods these shortcuts refer to)}. 
 
%  \textcolor{blue}{In this section, we compare the performance of our method, AutoDrop, that automatically adjusts the learning rate, with the SOTA optimization approaches for training DL models that manually drop the learning rate, e.g. cyclic learning rate (CLR) \citep{loshchilov2016sgdr}, OneCycle leanring rate (Onecycle) \citep{smith2019super} and exponential learning rate (ExpLR) \citep{li2019exponential}, and automatic drop the learning rate, e.g. hypergradient decent(HD) \citep{gunes2018online} and Trainable learning rate (HD) \citep{retsinas2022trainable}.}
%  We choose the same experimental setup as the competitors \textcolor{blue}{on ResNet18/CIFAR10, ResNet34/CIFAR100, WRN28x10/CIFAR10 and WRN40x10/CIFAR100.}
%  (\textcolor{red}{name data sets and architectures}) and \textcolor{red}{additionally we present experimental results on the the NLP task, and finally we even consider a distributed optimization setting.}
% The baselines that we compare with are SOTA approaches taken from the referenced papers that rely on different variants of SGD.  

% \begin{figure*}[t]
%     \centering
%     % \includegraphics[width=0.24\textwidth]{Figures/resnet18_cifar10_2/lr_all_0.05.jpg}
%     % % \includegraphics[width=0.3\textwidth]{Figures/resnet18_cifar10_2/test_loss_all_0.05.jpg}
%     % \includegraphics[width=0.24\textwidth]{Figures/resnet18_cifar10_2/test_error_all_0.05.jpg}
%     % \includegraphics[width=0.24\textwidth]{Figures/wrn28x10_cifar10_2/lr_all_0.1.jpg}
%     % % \includegraphics[width=0.3\textwidth]{Figures/wrn28x10_cifar10_2/test_loss_all_0.05.jpg}
%     % \includegraphics[width=0.24\textwidth]{Figures/wrn28x10_cifar10_2/test_error_all_0.05.jpg}
%     \subfigure[ResNet18/CIFAR10]{\includegraphics[width=0.23\textwidth]{Figures/resnet18_cifar10_3/lr_all_0.05.jpg}
%     \includegraphics[width=0.23\textwidth]{Figures/resnet18_cifar10_3/test_error_all_0.05.jpg}}
%     \subfigure[WRN28x10/CIFAR10]{\includegraphics[width=0.23\textwidth]{Figures/wrn28x10_cifar10_3/lr_all_0.1.jpg}
%     \includegraphics[width=0.23\textwidth]{Figures/wrn28x10_cifar10_3/test_error_all_0.05.jpg}}
%     \vspace{-0.2in}
%     \caption{The behavior of the test error and learning rate for the experiments with ResNet18 and WRN28x10 models and CIFAR10 data set.}
%     \label{fig:cifar10}
%     \vspace{-0.1in}
% \end{figure*}

\begin{table*}[t]
% \vspace{-0.1in}
\centering
% \vspace{0.1in}
\begin{tabular}{|p{1.7cm}||p{1.5cm}|p{1.5cm}|p{1.5cm}|p{1.5cm}|p{1.5cm}|p{2.3cm}|p{1.8cm}|}
\hline
Model&HD&TLR&CLR&OneCycle&ExpLR&SOTA Baseline& AutoDrop\\
\hline
\tabincell{l}{ResNet18 \\ CIFAR10} & \tabincell{l}{$6.78_{\pm.23}$} & \tabincell{l}{$5.70_{\pm.19}$} &\tabincell{l}{$5.14_{\pm.11}$} &\tabincell{l}{$4.86_{\pm.12}$} &\tabincell{l}{$5.82_{\pm.10}$} &\tabincell{l}{$\mathbf{4.79_{\pm.17}}$$^{\dagger}$} &\tabincell{l}{$\mathbf{4.79_{\pm.10}}$} \\

\hline

\tabincell{l}{WRN28x10 \\ CIFAR10} & \tabincell{l}{$ 9.12_{\pm.60}$} & \tabincell{l}{$ 16.70_{\pm2.2}$} &\tabincell{l}{$ 5.48_{\pm.11}$} &\tabincell{l}{$ 4.78_{\pm.16}$} &\tabincell{l}{$ 6.80_{\pm.15}$} &\tabincell{l}{$ \mathbf{3.77_{\pm.05}}$}$^{\ddagger}$ &\tabincell{l}{$ \mathbf{3.73_{\pm.07}}$} \\

\hline

\tabincell{l}{ResNet34 \\ CIFAR100} & $ 26.89_{\pm1.5}$ & \tabincell{l}{$ 23.91_{\pm.35}$} &\tabincell{l}{$ 22.69_{\pm.30}$} &\tabincell{l}{$ 22.29_{\pm.09}$} &\tabincell{l}{$ 24.29_{\pm.47}$} &\tabincell{l}{$ \mathbf{21.92_{\pm.34}}$$^{\dagger}$} &\tabincell{l}{$\mathbf{21.82_{\pm.14}}$} \\

\hline

\tabincell{l}{WRN40x10 \\ CIFAR100} & \tabincell{l}{$29.32 _{\pm.46}$} & \tabincell{l}{$ 39.54_{\pm.48}$} &\tabincell{l}{$23.61 _{\pm.38}$} &\tabincell{l}{$ 22.60_{\pm.66}$} &\tabincell{l}{$23.32 _{\pm.24}$} &\tabincell{l}{$ \mathbf{18.96_{\pm.05}}$}$^{\ddagger}$  &\tabincell{l}{$19.41 _{\pm.10}$} \\

\hline

% \tabincell{l}{ResNet18 \\ ImageNet} & 30.43 & 29.81 &30.48 &30.67& &\tabincell{c}{$\mathbf{29.74}$$^{*}$} &\tabincell{c}{$\mathbf{29.246}$}\\
\tabincell{l}{ResNet18 \\ ImageNet} & 30.43 & 29.81&30.48 &30.67& 30.10&$\mathbf{29.74}$$^{*}$&$\mathbf{29.246}$\\
\hline
\tabincell{l}{ResNet50 \\ ImageNet} & 25.35 & 26.51&24.15 &27.84& 24.57&$\mathbf{23.76}$$^{*}$&$\mathbf{23.92}$\\
\hline
\end{tabular}
\vspace{-0.1in}
\caption{Test errors (in \%) of AutoDrop, SOTA baselines reported in the literature, and baseline manual (CLR, OneCycle, ExpLR) and automatic (HD and TLR) learning rate adjustment algorithms. We run each experiment four times with different random seeds and report the mean and standard deviation of the minimal test error (at the $200^{\text{th}}$ epoch for CIFAR10/CIFAR100 and $100^{\text{th}}$ epoch for ImageNet). $^{\dagger}$, $^{\ddagger}$, and $^{*}$ follow \citep{zhang2019lookahead}, \citep{Zagoruyko2016WRN}, and \citep{He2016DeepRL}, respectively.}
\vspace{-0.2in}
\label{tab:cifar}
\end{table*}

% \vspace{-0.1in}
In this section, we compare the performance of AutoDrop, which automatically adjusts the learning rate, with SOTA learning rate schedulers for training DL models on image classification and NLP tasks.
In the selection of SOTA baselines, we always choose the best performing strategy reported in the literature for a given data set and architecture. Note that the best performing strategies reported by others rely on manual learning rate drops. For vision tasks, the best performing strategy is referred to as SOTA Baseline; for NLP tasks, it is either ReduceLR or LinearLR in our tables (the references to the relevant papers are provided in the text).

We want to emphasize that our goal in this paper is to design an automatic learning rate scheduler that can reach SOTA performance. We do not intend to outperform the SOTA, but rather to show that it is possible to design an automatic learning rate scheduler that matches the manual schemes that the SOTA relies on. As will be demonstrated, our method matches or outperforms the SOTA approach and outperforms all other learning rate schedulers, manual and automatic. For example, the existing automatic learning rate schedulers, HD and TLR, fall short of the SOTA since they suffer from the short-horizon problem, which our method avoids by design.

% In this section, we compare the performance of our method AutoDrop, that automatically adjusts the learning rate, with the SOTA optimization approaches for training DL models that manually drop the learning rate, e.g. cyclic learning rate (CLR) \citep{loshchilov2016sgdr}, OneCycle leanring rate (Onecycle) \citep{smith2019super} and exponential learning rate (ExpLR) \citep{li2019exponential}, and automatic drop the learning rate, e.g. hypergradient decent(HD) \citep{gunes2018online} and Trainable learning rate (HD) \citep{retsinas2022trainable} for the image classification tasks. Additionally, we present experimental results on the NLP task. Our AutoDrop performs comparable with SOTA manual learning rate schedulers and outperforms SOTA automatic learning rate schedulers.

\vspace{-0.1in}
% \subsection{Hyper-parameters of AutoDrop}
% \label{sec:HS}
% \vspace{-0.1in}
% % \textcolor{red}{Here we should have supporting discussion, table, and ablation study.}\\
% This section explains the settings of all additional hyperparameters, over standard SGD, that AutoDrop introduces: the window size $k$ for computing the batch angular velocity, the learning rate drop factor $\rho$, and the buffer size $m$ for Gaussian smoothing. Note that \textit{these hyper-parameters are kept fixed across all our experiments, thus they are never tuned}. Here, we present ablation studies concerning these parameters. We experimented with the window size $k$ in the range of $[16, 32, 64, 128, 256]$. 
% % The drop factor $\rho$ between 0 and 1 controls the dropping rate of AutoDrop. 
% To ensure that the learning rate does not drop too quickly, $\rho$ could not be too small and we chose values from $[0.5, 0.8, 0.9, 0.95, 0.99]$. 
% Since excessively large buffer sizes $m$ for Gaussian smoothing leads to over-smoothing and reduced performance,
% we select values in the range of $[5,10,20,30,50]$. 
% The ablation study is captured in Figure \ref{fig:ablation} in the Supplementary material~\ref{supp:hyper} and allows us to determine the best setting: $k=64$, $\rho=0.95$, and $m=10$. In a wide range of hyper-parameter settings that we explore, the changes of the model performance are of the order $2.5\%-4\%$, which is mild and attest to little sensitivity to the actual choices of these hyper-parameters.
% \begin{figure}[t]
%     \centering
%     \vspace{-0.1in}
%     \includegraphics[width=0.3\textwidth]{Figures/ablation/ablation_k.jpg}
%     \includegraphics[width=0.3\textwidth]{Figures/ablation/ablation_m.jpg}
%     \includegraphics[width=0.3\textwidth]{Figures/ablation/ablation_rho.jpg}
%     \vspace{-0.1in}
%     \caption{Ablation study for hyper-parameters $k$, $m$, and $\rho$. ResNet18 model and CIFAR10 data set.}
%     \label{fig:ablation}
%     \vspace{-0.2in}
% \end{figure}

\vspace{-0.05in}
\subsection{Image Classification}
\label{sec:ED}
\vspace{-0.1in}
\textbf{The CIFAR-$\mathbf{10}$} and \textbf{CIFAR-$\mathbf{100}$} data sets~\citep{cifar} consist of $50$K training images, with $10$ and $100$ different classes, respectively. For the CIFAR-$10$ experiments we used ResNet-$18$~\citep{He2016DeepRL} and WRN-$28$x$10$~\citep{Zagoruyko2016WRN} models. For the CIFAR-$100$ experiments we used ResNet-$34$~\citep{He2016DeepRL} and WRN-$40$x$10$~\citep{Zagoruyko2016WRN} models. We do not use dropout~\citep{srivastava2014dropout} layers for the WRN models in our experiments since this led to better performance. The implementation involving the WRN architecture and the CIFAR data sets relies on publicly available code\footnote{\url{https://github.com/meliketoy/wide-resnet.pytorch}}. For the above experiments, we follow \citep{zhang2019lookahead} and \citep{Zagoruyko2016WRN} for the ResNet and WRN models, respectively. \textbf{The ImageNet (ILSVRC-$\mathbf{2012}$)} data set~\citep{imagenet_cvpr09} consists of $1.2$M images divided into $1$K categories. We train ResNet-$18$ and ResNet-$50$~\citep{He2016DeepRL} models on this data set. We use the official model implementations from PyTorch\footnote{\url{https://pytorch.org/vision/stable/models.html}}.
\\
\vspace{-0.15in}
\\
In our experiments, for the SOTA baseline (the method achieving the best performance on the given data set and model, as reported in the literature) we use the same hyperparameter settings (including the learning rate schedule) as recommended in the referenced literature. For CLR~\citep{smith2017cyclical} we test the \textit{triangular2} learning rate policy, adjusting the $stepsize$ (the number of iterations in half a cycle) for different models as recommended by the authors, and the \textit{OneCycle} policy with only one triangular cycle. For ExpLR~\citep{li2019exponential}, we grid search the decay factor $\gamma$ over $[0.8, 0.9, 0.95, 0.99, 0.999]$. For HD~\citep{gunes2018online} we grid search the hypergradient learning rate $\beta$ over $[10^{-3},10^{-4},10^{-5}]$, as suggested in the reference paper. For TLR~\citep{retsinas2022trainable} we set the gap $p$ for updating the learning rate to $0.33$ epoch and the bound to $c\!=\!1/4$, as recommended by the authors. For AutoDrop, we fix $\rho\!=\!0.95$ and $m\!=\!10$, and search for the best $k$ as described in Section~\ref{sec:HS}.
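
For reference, the \textit{triangular2} policy can be written explicitly. The formula below follows our reading of \citep{smith2017cyclical}; the symbols $s$ (the $stepsize$), $\alpha_{\text{base}}$, and $\alpha_{\text{top}}$ (the lower and upper learning rate bounds) are notation introduced here for illustration only. At iteration $t$,
\begin{align*}
    c&=\left\lfloor 1+\frac{t}{2s}\right\rfloor,\qquad u=\left|\frac{t}{s}-2c+1\right|,\\
    \alpha_t&=\alpha_{\text{base}}+\frac{(\alpha_{\text{top}}-\alpha_{\text{base}})\max(0,1-u)}{2^{c-1}},
\end{align*}
so that each cycle spans $2s$ iterations and the amplitude of the triangle is halved after every cycle. For instance, $\alpha_0=\alpha_{\text{base}}$, $\alpha_s=\alpha_{\text{top}}$, and $\alpha_{3s}=\alpha_{\text{base}}+(\alpha_{\text{top}}-\alpha_{\text{base}})/2$.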
% \begin{figure*}[t]
%     \centering
%     \vspace{-0.1in}
%     \includegraphics[width=0.3\textwidth]{Figures/tran_wmt14/lr_all.jpg}
%     \includegraphics[width=0.3\textwidth]{Figures/tran_wmt14/val_loss_all.jpg}\includegraphics[width=0.3\textwidth]{Figures/tran_wmt14/test_bleu_score_all.jpg}
%     \vspace{-0.1in}
%     \caption{Experimental curves for Transformer and WMT14 data set: learning rate, validation loss, test BLEU score, and zoomed subplots.}
%     \label{fig:tran_wmt14}
%     \vspace{-0.2in}
% \end{figure*}

% \begin{table}[htbp]
% % \vspace{-0.2in}
% \centering
% % \vspace{0.1in}
% \begin{tabular}{|p{1.8cm}||p{2cm}|p{2.2cm}|}
% \hline
% Model&Method &Test Error [\%]\\
% \hline
% \multirow{7}{8em}{ResNet-$18$ CIFAR-$10$}
% &HD &$6.78\pm0.225$ \\
% &TLR &$5.70\pm0.193$ \\
% &CLR &$5.14\pm0.105$ \\
% &\textcolor{blue}{OneCycle} &\textcolor{blue}{$4.860 \pm 0.123$} \\
% &\textcolor{blue}{ExpLR} &\textcolor{blue}{$5.820 \pm 0.093$} \\
% &Baseline$^{\dagger}$  &$\mathbf{4.79\pm0.169}$\\
% &AutoDrop &$\mathbf{4.79\pm0.099}$\\
% \hline

% \multirow{7}{8em}{WRN-$28$x$10$ CIFAR-$10$}
% &HD &$9.12 \pm0.60$ \\
% &TLR &$16.70\pm2.20$ \\
% &CLR &$5.48 \pm0.113$ \\
% &\textcolor{blue}{OneCycle} &\textcolor{blue}{$ \pm $} \\
% &\textcolor{blue}{ExpLR} &\textcolor{blue}{$\pm$} \\
% &Baseline$^{\ddagger}$ & $\mathbf{3.77 \pm 0.05}$\\
% &AutoDrop & $\mathbf{3.73\pm0.07}$\\
% \hline

% \multirow{7}{8em}{ResNet-$34$ CIFAR-$100$}
% &HD  &$26.89\pm1.57$ \\
% &TLR  &$23.91\pm0.35$ \\
% &CLR   &$22.69\pm0.30$ \\
% &\textcolor{blue}{OneCycle} &\textcolor{blue}{$ \pm $} \\
% &\textcolor{blue}{ExpLR} &\textcolor{blue}{$\pm$} \\
% &Baseline$^{\dagger}$ & $\mathbf{21.92\pm0.34}$ \\
% &AutoDrop  &  $\mathbf{21.82\pm0.14}$\\
% \hline

% \multirow{7}{8em}{WRN-$40$x$10$ CIFAR-$100$}
% &HD &$29.32\pm0.46$ \\
% &TLR &$39.54\pm0.48$ \\
% &CLR &$23.61\pm0.38$ \\
% &\textcolor{blue}{OneCycle} &\textcolor{blue}{$ \pm $} \\
% &\textcolor{blue}{ExpLR} &\textcolor{blue}{$\pm$} \\
% &Baseline$^{\ddagger}$ & $\mathbf{18.96 \pm 0.052}$\\
% &AutoDrop & $19.41\pm0.10$ \\
% \hline 
% \end{tabular}
% \caption{Test errors of AutoDrop, baselines reported in the literature, and SOTA manual (CLR) and automatic (HD and TLR) learning rate adjustment algorithms. For CIFAR-$10$ and CIFAR-$100$ we ran each experiment four times with different random seeds. We report the mean and standard deviation of the final test error (at the $200^{\text{th}}$ epoch). $^{\dagger}$ follows the the setup of \citep{zhang2019lookahead}. $^{\ddagger}$ follows the the setup of \citep{Zagoruyko2016WRN}.}
% % \vspace{-0.2in}
% \label{tab:cifar}
% \end{table}

% \begin{figure*}[t]
%     \centering
%     % \includegraphics[width=0.24\textwidth]{Figures/resnet18_cifar10_2/lr_all_0.05.jpg}
%     % % \includegraphics[width=0.3\textwidth]{Figures/resnet18_cifar10_2/test_loss_all_0.05.jpg}
%     % \includegraphics[width=0.24\textwidth]{Figures/resnet18_cifar10_2/test_error_all_0.05.jpg}
%     % \includegraphics[width=0.24\textwidth]{Figures/wrn28x10_cifar10_2/lr_all_0.1.jpg}
%     % % \includegraphics[width=0.3\textwidth]{Figures/wrn28x10_cifar10_2/test_loss_all_0.05.jpg}
%     % \includegraphics[width=0.24\textwidth]{Figures/wrn28x10_cifar10_2/test_error_all_0.05.jpg}
%     \subfigure[ResNet18/CIFAR10]{\includegraphics[width=0.24\textwidth]{Figures/resnet18_cifar10_3/lr_all_0.05.jpg}
%     \includegraphics[width=0.24\textwidth]{Figures/resnet18_cifar10_3/test_error_all_0.05.jpg}}
%     \subfigure[WRN28x10/CIFAR10]{\includegraphics[width=0.24\textwidth]{Figures/wrn28x10_cifar10_3/lr_all_0.1.jpg}
%     \includegraphics[width=0.24\textwidth]{Figures/wrn28x10_cifar10_3/test_error_all_0.05.jpg}}
%     \vspace{-0.2in}
%     \caption{The behavior of the test error and learning rate for the experiments with ResNet18 and WRN28x10 models and CIFAR10 data set.}
%     \label{fig:cifar10}
%     \vspace{-0.2in}
% \end{figure*}
% \begin{table*}[t]
% % \vspace{-0.1in}
% \centering
% % \vspace{0.1in}
% \begin{tabular}{|p{1.8cm}||p{1.7cm}|p{1.7cm}|p{1.7cm}|p{1.7cm}|p{1.7cm}|p{1.8cm}|p{1.8cm}|}
% \hline
% Model&HD&TLR&CLR&OneCycle&ExpLR&Baseline$^{\dagger}$& AutoDrop\\
% \hline
% \tabincell{l}{ResNet18 \\ CIFAR10} & \tabincell{l}{$6.78\pm$ $0.23$} & \tabincell{l}{$5.70\pm$ $  0.19$} &\tabincell{l}{$5.14\pm$ $  0.11$} &\tabincell{l}{$4.86\pm$ $  0.12$} &\tabincell{l}{$5.82\pm$ $  0.10$} &\tabincell{l}{$\mathbf{4.79\pm}$ $  \mathbf{0.17}$} &\tabincell{l}{$\mathbf{4.79\pm}$ $  \mathbf{0.99}$} \\

% \hline

% \tabincell{l}{WRN28x10 \\ CIFAR10} & $ 9.12\pm$ $ 0.60$ & \tabincell{l}{$ 16.70\pm$ $ 2.20$} &\tabincell{l}{$ 5.48\pm$ $ 0.11$} &\tabincell{l}{$ 4.78\pm$ $ 0.16$} &\tabincell{l}{$ 6.80\pm$ $ 0.15$} &\tabincell{l}{$ \mathbf{3.77\pm}$ $ \mathbf{0.05}$} &\tabincell{l}{$ \mathbf{3.73\pm}$ $ \mathbf{0.07}$} \\

% \hline

% \tabincell{l}{ResNet34 \\ CIFAR100} & $ 26.89\pm1.57$ & \tabincell{l}{$ 23.91\pm0.35$} &\tabincell{l}{$ 22.69\pm0.30$} &\tabincell{l}{$ 22.29\pm0.09$} &\tabincell{l}{$ 24.29\pm0.47$} &\tabincell{l}{$ \mathbf{21.92\pm}\mathbf{0.34}$} &\tabincell{l}{$\mathbf{21.82 \pm}\mathbf{0.14}$} \\

% \hline

% \tabincell{l}{WRN40x10 \\ CIFAR100} & \tabincell{l}{$29.32 \pm0.46$} & \tabincell{l}{$ 39.54\pm0.48$} &\tabincell{l}{$23.61 \pm0.38$} &\tabincell{l}{$ 22.60\pm0.66$} &\tabincell{l}{$23.32 \pm0.24$} &\tabincell{l}{$ \mathbf{18.96\pm}\mathbf{0.05}$} &\tabincell{l}{$19.41 \pm0.10$} \\

% \hline

% \tabincell{l}{ResNet18 \\ ImageNet} & 30.43 & 29.81 &30.48 &30.67& &\tabincell{c}{$\mathbf{29.74}$} &\tabincell{c}{$\mathbf{29.246}$}\\

% \hline
% \end{tabular}
% \vspace{-0.1in}
% \caption{Test errors of AutoDrop, baselines reported in the literature, and SOTA manual (CLR, OneCycle, ExpLR) and automatic (HD and TLR) learning rate adjustment algorithms. We run each experiment four times with different random seeds and report the mean and standard deviation of the minimal test error (at the $200^{\text{th}}$ epoch). $^{\dagger}$ and $^{\ddagger}$ follows\citep{zhang2019lookahead} and \citep{Zagoruyko2016WRN}, respectively.}
% \vspace{-0.2in}
% \label{tab:cifar}
% \end{table*}

% Table \ref{tab:cifar} shows the final test error performance obtained on CIFAR-$10$, CIFAR-$100$ and ImageNet datasets and Figure \ref{fig:cifar10} shows the behavior of the test error and learning rate with epochs (the behavior of the train and test errors/losses and learning rate with epochs for all our experiments is deferred to the Supplement, Section~\ref{supp:image}). 

Table~\ref{tab:cifar} shows the final test error obtained on the CIFAR-$10$, CIFAR-$100$, and ImageNet data sets. The behavior of the train and test errors/losses and of the learning rate over epochs for all our experiments is deferred to the Supplement, Section~\ref{supp:image}.
Our method achieves test errors comparable to the manually tuned SOTA Baseline approaches while automatically selecting the iterations at which the learning rate is dropped. At the same time, AutoDrop outperforms the manual (CLR, OneCycle, and ExpLR) and automatic (HD and TLR) learning rate adjustment algorithms, none of which matches the performance of the SOTA baseline.
% All figures related to Table \ref{tab:cifar} are deferred to the Supplementary material \ref{supp:image}.
% Across all the experiments on CIFAR data sets, AutoDrop run with the learning drop factor $\rho = 0.95$ and $m=10$ (the size of buffer used for Gaussian smoothing of angular velocity), was always among the winning AutoDrop strategies.  Our method does not require extra hyperparameter tuning since all parameters introduced from our paper are fixed across different experiments.

In Table \ref{tab:time}, we also report the computational time of a single iteration of HD, TLR, SOTA Baseline, and AutoDrop, all run on the same machine (NVIDIA GeForce GTX 1080 Ti), for different models and data sets. For a fair comparison, we use the same batch size of 64 for all methods. The per-iteration training time is practically the same for all methods: AutoDrop is at most $0.02$s slower per iteration than the SOTA Baseline and is never slower than HD or TLR. Therefore, our method introduces no significant additional computation compared to the existing optimization methods.
\vspace{-0.15in}
\begin{table}[H]
\centering
\begin{small}
\begin{tabular}{|p{1.4cm}||p{0.8cm}|p{0.8cm}|p{1.2cm}|p{1.2cm}|}
\hline
\tabincell{l}{Model$\backslash$Opt}&HD&TLR&SOTA Baseline&AutoDrop\\
\hline
\tabincell{l}{WRN28x10\\CIFAR10}&0.21s&0.23s&0.20s&0.20s\\
\hline
\tabincell{l}{WRN40x10\\CIFAR100}&0.31s&0.31s&0.29s&0.30s\\
\hline
\tabincell{l}{ResNet50\\ImageNet}&0.42s&0.43s&0.38s&0.40s\\
\hline
\end{tabular}
\vspace{-0.1in}
\caption{Computational time for a single iteration of HD, TLR, SOTA Baseline, and AutoDrop.}
\vspace{-0.2in}
\label{tab:time}
\end{small}
\end{table}
% \vspace{-0.5in}

Finally, regarding convergence, recall that the theoretical convergence of our method is established in the paper and that its rate matches that of traditional optimizers such as SGD. The convergence curves are deferred to the Supplement (Section \ref{sec:ED}). They show that AutoDrop reaches SOTA performance, unlike the other learning rate adjustment methods. Furthermore, the test errors of the different methods at epochs $50$, $100$, $150$, and $200$ for the exemplary ResNet18/CIFAR10 task (see Table \ref{tab:convergence_epoch} in the Supplement) show that AutoDrop attains performance comparable to the SOTA Baseline with a slightly faster convergence rate, which the other methods do not attain.


% \begin{table*}[t]
% \vspace{-0.1in}
% \centering
% % \vspace{0.1in}
% \begin{tabular}{|p{1.8cm}||p{1.7cm}|p{1.7cm}|p{1.7cm}|p{1.7cm}|p{1.7cm}|p{1.8cm}|p{1.8cm}|}
% \hline
% Model&HD&TLR&CLR&OneCycle&ExpLR&Baseline$^{\dagger}$& AutoDrop\\
% \hline
% \tabincell{l}{ResNet18 \\ CIFAR10} & \tabincell{l}{$6.78\pm$ $0.23$} & \tabincell{l}{$5.70\pm$ $  0.19$} &\tabincell{l}{$5.14\pm$ $  0.11$} &\tabincell{l}{$4.86\pm$ $  0.12$} &\tabincell{l}{$5.82\pm$ $  0.10$} &\tabincell{l}{$\mathbf{4.79\pm}$ $  \mathbf{0.17}$} &\tabincell{l}{$\mathbf{4.79\pm}$ $  \mathbf{0.99}$} \\

% \hline

% \tabincell{l}{WRN28x10 \\ CIFAR10} & $ 9.12\pm$ $ 0.60$ & \tabincell{l}{$ 16.70\pm$ $ 2.20$} &\tabincell{l}{$ 5.48\pm$ $ 0.11$} &\tabincell{l}{$ 4.78\pm$ $ 0.16$} &\tabincell{l}{$ 6.80\pm$ $ 0.15$} &\tabincell{l}{$ \mathbf{3.77\pm}$ $ \mathbf{0.05}$} &\tabincell{l}{$ \mathbf{3.73\pm}$ $ \mathbf{0.07}$} \\

% \hline

% \tabincell{l}{ResNet34 \\ CIFAR100} & $ 26.89\pm1.57$ & \tabincell{l}{$ 23.91\pm0.35$} &\tabincell{l}{$ 22.69\pm0.30$} &\tabincell{l}{$ 22.29\pm0.09$} &\tabincell{l}{$ 24.29\pm0.47$} &\tabincell{l}{$ \mathbf{21.92\pm}\mathbf{0.34}$} &\tabincell{l}{$\mathbf{21.82 \pm}\mathbf{0.14}$} \\

% \hline

% \tabincell{l}{WRN40x10 \\ CIFAR100} & \tabincell{l}{$29.32 \pm0.46$} & \tabincell{l}{$ 39.54\pm0.48$} &\tabincell{l}{$23.61 \pm0.38$} &\tabincell{l}{$ 22.60\pm0.66$} &\tabincell{l}{$23.32 \pm0.24$} &\tabincell{l}{$ \mathbf{18.96\pm}\mathbf{0.05}$} &\tabincell{l}{$19.41 \pm0.10$} \\

% \hline

% \tabincell{l}{ResNet18 \\ ImageNet} & 30.43 & 29.81 &30.48 &30.67& &\tabincell{c}{$\mathbf{29.74}$} &\tabincell{c}{$\mathbf{29.246}$}\\

% \hline
% \end{tabular}
% \vspace{-0.1in}
% \caption{Test errors of AutoDrop, baselines reported in the literature, and SOTA manual (CLR, OneCycle, ExpLR) and automatic (HD and TLR) learning rate adjustment algorithms. We run each experiment four times with different random seeds and report the mean and standard deviation of the minimal test error (at the $200^{\text{th}}$ epoch). $^{\dagger}$ and $^{\ddagger}$ follows\citep{zhang2019lookahead} and \citep{Zagoruyko2016WRN}, respectively.}
% % \vspace{-0.2in}
% \label{tab:cifar}
% \end{table*}


% \begin{table}[H]
% % \vspace{-0.2in}
% \centering
% % \vspace{0.1in}
% \begin{tabular}{|p{1.8cm}||p{2cm}|p{2.2cm}|}
% \hline
% Model&Method &Test Error [\%]\\
% \hline
% \multirow{7}{8em}{ResNet-$18$ CIFAR-$10$}
% &HD &$6.78\pm0.225$ \\
% &TLR &$5.70\pm0.193$ \\
% &CLR &$5.14\pm0.105$ \\
% &\textcolor{blue}{OneCycle} &\textcolor{blue}{$4.860 \pm 0.123$} \\
% &\textcolor{blue}{ExpLR} &\textcolor{blue}{$5.820 \pm 0.093$} \\
% &Baseline$^{\dagger}$  &$\mathbf{4.79\pm0.169}$\\
% &AutoDrop &$\mathbf{4.79\pm0.099}$\\
% \hline

% \multirow{7}{8em}{WRN-$28$x$10$ CIFAR-$10$}
% &HD &$9.12 \pm0.60$ \\
% &TLR &$16.70\pm2.20$ \\
% &CLR &$5.48 \pm0.113$ \\
% &\textcolor{blue}{OneCycle} &\textcolor{blue}{$ 4.783\pm 0.158$} \\
% &\textcolor{blue}{ExpLR} &\textcolor{blue}{$6.807\pm0.151$} \\
% &Baseline$^{\ddagger}$ & $\mathbf{3.77 \pm 0.05}$\\
% &AutoDrop & $\mathbf{3.73\pm0.07}$\\
% \hline

% \multirow{7}{8em}{ResNet-$34$ CIFAR-$100$}
% &HD  &$26.89\pm1.57$ \\
% &TLR  &$23.91\pm0.35$ \\
% &CLR   &$22.69\pm0.30$ \\
% &\textcolor{blue}{OneCycle} &\textcolor{blue}{$22.287\pm 0.09$} \\
% &\textcolor{blue}{ExpLR} &\textcolor{blue}{$24.290\pm0.466$} \\
% &Baseline$^{\dagger}$ & $\mathbf{21.92\pm0.34}$ \\
% &AutoDrop  &  $\mathbf{21.82\pm0.14}$\\
% \hline

% \multirow{7}{8em}{WRN-$40$x$10$ CIFAR-$100$}
% &HD &$29.32\pm0.46$ \\
% &TLR &$39.54\pm0.48$ \\
% &CLR &$23.61\pm0.38$ \\
% &\textcolor{blue}{OneCycle} &\textcolor{blue}{$ 22.595\pm 0.655$} \\
% &\textcolor{blue}{ExpLR} &\textcolor{blue}{$23.320\pm0.240$} \\
% &Baseline$^{\ddagger}$ & $\mathbf{18.96 \pm 0.052}$\\
% &AutoDrop & $19.41\pm0.10$ \\
% \hline 
% \end{tabular}
% \caption{Test errors of AutoDrop, baselines reported in the literature, and SOTA manual (CLR) and automatic (HD and TLR) learning rate adjustment algorithms. For CIFAR-$10$ and CIFAR-$100$ we ran each experiment four times with different random seeds. We report the mean and standard deviation of the final test error (at the $200^{\text{th}}$ epoch). $^{\dagger}$ follows the the setup of \citep{zhang2019lookahead}. $^{\ddagger}$ follows the the setup of \citep{Zagoruyko2016WRN}.}
% % \vspace{-0.2in}
% \label{tab:cifar}
% \end{table}

\vspace{-0.1in}
\subsection{NLP tasks}
\vspace{-0.05in}
% \textcolor{red}{There are missing citations below.}

\textbf{Machine Translation.} A transformer model based on \citep{vaswani2017attention} was trained to translate German to English on the WMT2014 data set \citep{bojar2014findings} using the ADAM optimizer \citep{kingma2015}. We compare AutoDrop with ReduceLROnPlateau \citep{ReduceLROnPlateau}, HD, and TLR. We train the model for 10K iterations. Table~\ref{tab:mt} reports the BLEU score obtained on the test data set. AutoDrop achieves the highest score on the machine translation task. Figure \ref{fig:tran_wmt14} in the Supplement (Section~\ref{supp:mt}) displays the training curve and shows that AutoDrop also converges faster.
% \vspace{-0.2in}
\begin{table}[H]
\centering
\vspace{-0.1in}
\begin{tabular}{|p{1.4cm}||p{.8cm}|p{.8cm}|p{1.4cm}|p{1.4cm}|}
\hline
Model&HD&TLR&ReduceLR&AutoDrop\\
\hline
\tabincell{l}{Trans \\ WMT14} & $19.07$ & $19.48$ &$19.96$ & $\mathbf{20.37}$\\
\hline
\end{tabular}
\vspace{-0.1in}
\caption{BLEU scores of AutoDrop, the manual learning rate scheme (ReduceLROnPlateau), and the automatic learning rate adjustment algorithms (HD and TLR) for the transformer model on the WMT2014 data set.}
% \vspace{-0.3in}
\label{tab:mt}
\end{table}

\textbf{GLUE Benchmark.} We apply the large language model BERT \citep{devlin2018bert} to the GLUE benchmark \citep{wang2018glue}, using the ADAM optimizer \citep{kingma2015}. It is well known that an initial increase of the learning rate during training, the so-called ``warm-up'' phase, plays an important role in the training of large language models. We compare AutoDrop with two manual learning rate schemes, the constant learning rate (ConstLR), which keeps the learning rate fixed, and the linear learning rate (LinearLR), which reduces it linearly during training, as well as with the automatic learning rate schemes TLR and HD. We run all methods with and without warm-up and choose the best performer. Adding warm-up improved the performance of AutoDrop, LinearLR, and ConstLR, whereas it deteriorated the performance of HD and TLR. For ConstLR and LinearLR, we grid search the learning rate (respectively, the peak learning rate) $\alpha$ over $\{10^{-7}, 10^{-6}, 10^{-5}\}$ and choose the best performer. Table \ref{tab:GLUE} reports the best performance of each method. AutoDrop outperforms the automatic learning rate schedulers (HD and TLR) on every task, by $0.57$ to $2.71$ points over the better of the two, and is comparable to the manual learning rate schedulers (LinearLR and ConstLR): it is the best method on three of the five tasks and within $0.6$ points of the best manual scheduler on the other two.
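
For concreteness, one common way to write the ConstLR and LinearLR schedules with warm-up is sketched below. The warm-up length $T_w$, the total number of iterations $T$, and the linear shape of the ramps are illustrative assumptions and are not necessarily the exact settings used in our experiments; $\alpha$ denotes the (peak) learning rate and $t$ the iteration index.
\begin{gather*}
\alpha_t^{\text{ConstLR}} = \alpha \min\Bigl(1, \frac{t}{T_w}\Bigr), \\
\alpha_t^{\text{LinearLR}} =
\begin{cases}
\alpha\, \dfrac{t}{T_w}, & 0 \le t \le T_w, \\[1.5ex]
\alpha\, \dfrac{T - t}{T - T_w}, & T_w < t \le T.
\end{cases}
\end{gather*}
Under the same assumptions, the variants without warm-up would read $\alpha_t^{\text{ConstLR}} = \alpha$ and $\alpha_t^{\text{LinearLR}} = \alpha\,(1 - t/T)$ for all $0 \le t \le T$.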

\vspace{-0.15in}
\begin{table}[H]
   \small
    \centering
    \begin{tabular}{|p{0.8cm}||p{0.6cm}|p{0.6cm}|p{1.1cm}|p{1.1cm}|p{1.1cm}|}
    \hline
    GLUE&HD & TLR & LinearLR & ConstLR & AutoDrop \\
    \hline
    CoLA&80.44&78.90&82.07&\textbf{83.41}&82.83\\
    \hline
    MNLI&78.92&81.43&83.71&83.21&\textbf{83.76}\\
    \hline
    QNLI&90.46&91.17&91.54&91.32&\textbf{91.74}\\
    \hline
    QQP&86.42&87.33&\textbf{90.51}&90.48&90.04\\
    \hline
    % RTE&61.01&60.64&66.06&65.70&\textbf{69.31}\\
    % \hline
    SST-2&91.51&91.49&92.66&91.97&\textbf{92.74}\\
    \hline
    % STS-B&30.13&&\textbf{39.33}&36.93&38.13\\
    % \hline
    
  \end{tabular}
  \vspace{-0.05in}
  \caption{Scores of BERT on the GLUE benchmark tasks for each learning rate scheme. The best result per task is in bold.}
  \label{tab:GLUE}
\end{table}


% \begin{table}[H]
% \vspace{-0.1in}
% \centering
% % \vspace{0.1in}
% \begin{tabular}{|p{1.8cm}||p{2cm}|p{2.2cm}|}
% \hline
% Model&Method &Test Bleu Scores [\%]\\
% \hline
% \multirow{4}{8em}{Transformer\\WMT14}
% &HD &$19.07$ \\
% &TLR &$19.48$ \\
% &ReduceLR &$19.96$\\
% &AutoDrop &$\mathbf{20.37}$\\
% \hline
% \end{tabular}
% \caption{Test errors of AutoDrop, baselines reported in the literature, and SOTA manual (CLR) and automatic (HD and TLR) learning rate adjustment algorithms. For CIFAR-$10$ and CIFAR-$100$ we ran each experiment four times with different random seeds. We report the mean and standard deviation of the final test error (at the $200^{\text{th}}$ epoch). $^{\dagger}$ follows the the setup of \citep{zhang2019lookahead}. $^{\ddagger}$ follows the the setup of \citep{Zagoruyko2016WRN}.}
% % \vspace{-0.2in}
% \label{tab:mt}
% \end{table}

% \begin{table}[htbp!]
% % \vspace{-0.2in}
% \centering
% \caption{Test errors of AutoDrop, baselines reported in the literature, and SOTA manual (CLR) and automatic (HD and TLR) learning rate adjustment algorithms. For CIFAR-$10$ and CIFAR-$100$ we ran each experiment four times with different random seeds. We report the mean and standard deviation of the final test error (at the $200^{\text{th}}$ epoch). $^{\dagger}$ follows the the setup of \citep{zhang2019lookahead}. $^{\ddagger}$ follows the the setup of \citep{Zagoruyko2016WRN}.} 
% \begin{tabular}{|p{1.8cm}||p{2cm}|p{2.2cm}|}
% \hline
% Model&Method &Test Error [\%]\\
% \hline
% \multirow{7}{8em}{ResNet-$18$ CIFAR-$10$}
% &HD &$6.78\pm0.225$ \\
% &TLR &$5.70\pm0.193$ \\
% &CLR &$5.14\pm0.105$ \\
% &Baseline$^{\dagger}$  &$\mathbf{4.79\pm0.169}$\\
% &AutoDrop &$\mathbf{4.79\pm0.099}$\\
% \hline

% \multirow{7}{8em}{WRN-$28$x$10$ CIFAR-$10$}
% &HD &$9.12 \pm0.60$ \\
% &TLR &$16.70\pm2.20$ \\
% &CLR &$5.48 \pm0.113$ \\
% &Baseline$^{\ddagger}$ & $\mathbf{3.77 \pm 0.05}$\\
% &AutoDrop & $\mathbf{3.73\pm0.07}$\\
% \hline

% \multirow{7}{8em}{ResNet-$34$ CIFAR-$100$}
% &HD  &$26.89\pm1.57$ \\
% &TLR  &$23.91\pm0.35$ \\
% &CLR   &$22.69\pm0.30$ \\
% &Baseline$^{\dagger}$ & $\mathbf{21.92\pm0.34}$ \\
% &AutoDrop  &  $\mathbf{21.82\pm0.14}$\\
% \hline

% \multirow{7}{8em}{WRN-$40$x$10$ CIFAR-$100$}
% &HD &$29.32\pm0.46$ \\
% &TLR &$39.54\pm0.48$ \\
% &CLR &$23.61\pm0.38$ \\
% &Baseline$^{\ddagger}$ & $\mathbf{18.96 \pm 0.052}$\\
% &AutoDrop & $19.41\pm0.10$ \\
% \hline 

% \end{tabular}
% \label{tab:cifar}
% \end{table}

\vspace{-0.2in}
\section{Conclusion}\label{sec:Con}
\vspace{-0.15in}
This paper addresses the question of how to relieve the laborious task of tuning the learning rate when training DL models. Our work is motivated by the growing need for more automated DL optimization techniques, which would increase their scalability and make DL technology accessible to a wider range of participants. The selection of hyperparameters for training DL models, and especially learning rate scheduling, is a very hard problem that remains largely unsolved in the literature. We provide a new algorithm, AutoDrop, for adjusting the learning rate drop during the training of DL models; it works online and can be run on top of any DL optimization scheme. AutoDrop has a compelling list of features: it is simple to implement and use, it is theoretically well-grounded, it compares favorably to a large cohort of baseline training approaches, and by design it avoids the short-horizon problem. In future work, we intend to generalize our approach to automatically schedule hyperparameters other than the learning rate, such as the momentum term.


\begin{acknowledgements} % will be removed in pdf for initial submission,
						 % (without ‘accepted’ option in \documentclass)
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
    % Briefly acknowledge people and organizations here.
    % \emph{All} 
    % acknowledgements go in this section.
    The authors acknowledge that the NSF Award $\#$2041872 sponsored the research in this paper. This work was also supported in part by the NYUAD Center for Artificial Intelligence and Robotics, funded by Tamkeen under the NYUAD Research Institute Award CG010. 
\end{acknowledgements}

% References
\bibliography{uai2024-template}

\newpage
\onecolumn
\input{supp.tex}


\end{document}
