
\section{Introduction}

We consider the following sparse optimization problem:
\begin{equation}\label{eq:optimizationproblem}
(\text{P}):  
\quad
\begin{array}{l}
\min f(\mathbf{x},\Xi):=\frac{1}{N}\sum_{i=1}^{N}f^{(i)}(\mathbf{x}, \xi^{(i)}) \\
\text{s.t. }
\mathbf{x} \in C_s
\end{array}
\end{equation}
where $f^{(i)}: \mathbb{R}^n\times \Xi \rightarrow \mathbb{R}$ for $i=1,\dots, N$, $\Xi=\{\xi^{(1)}, \dots, \xi^{(N)}\}$, and the sparsity constraint set $C_s=\{\mathbf{x} \in \mathbb{R}^n \mid \|\mathbf{x}\|_0 \leq s\}$ is the union of finitely many subspaces whose dimension is less than or equal to the sparsity level $s$, with $1 \leq s<n$. Problem (P) is important because finding a sparse network whose accuracy is on par with that of a dense network amounts to solving a bi-level, constrained, stochastic, nonconvex, and non-smooth sparse optimization problem \cite{damadi2022amenable}. Thus, efficient algorithms for Problem (P) can be beneficial for compressing deep neural networks.

Among algorithms for sparse optimization, the Iterative Hard Thresholding (IHT) algorithm has been very successful due to the simplicity of its implementation. The IHT algorithm is not only efficient in practice but also shows theoretically promising results. It was originally devised for solving compressed sensing problems in 2008 \cite{blumensath2008iterative,blumensath2009iterative}. Since then, a large body of literature has studied it from different perspectives.
For example, \cite{beck2013sparsity,lu2014iterative, Lu2015OptimizationOS,pan2017convergent,zhou2021global} consider convergence of the iterates, \cite{jain2014iterative, liu2020between} study the limit of the objective function value sequence, \cite{liu2017dual,zhu2018lagrange} address duality, \cite{zhou2020subspace, zhao2021lagrange} extend it to Newton-type IHT, \cite{blumensath2012accelerated,khanna2018iht,vu2019accelerating,wu2020accelerated} address accelerated IHT, and \cite{wang2019fast, bahmani2013greedy} solve logistic regression problems using the IHT algorithm.
Recently, \cite{damadi2022gradient} introduced the concept of HT-unstable stationary points (saddle points in the sense of sparse optimization) and showed that the escapability of HT-unstable stationary points is one of the crucial properties of the IHT algorithm. They also showed Q-linear convergence of the IHT algorithm towards strictly HT-stable stationary points. However, these desirable properties require computing the batch (full) gradient at each iteration, which is computationally expensive or impractical with current GPUs.

On the other hand, almost all training of deep neural networks is done using the mini-batch stochastic gradient, which combines stochastic approximation \cite{robbins1951stochastic} with the backpropagation algorithm \cite{rumelhart1986learning}. By taking the mini-batch stochastic approximation, we consider solving Problem (P) using the mini-batch Stochastic Iterative Hard Thresholding (SIHT) algorithm outlined in Algorithm \ref{alg:siht}. As in practice, where the mini-batch size is fixed beforehand, we fix the mini-batch size at the beginning, which differs from previous work \cite{zhou2018efficient} in this area. Moreover, we derive our theoretical results directly from the mini-batch stochastic gradient, unlike previous works \cite{chen2016accelerated, li2016nonconvex}, where the batch (full) gradient is used to show the theoretical results. In contrast to other works where restricted strong convexity is necessary for deriving convergence results \cite{liang2020effective, zhou2018efficient}, the only assumption we make is restricted strong smoothness of the objective function, not of each individual function. We also assume that the objective function is bounded below, which is the case for objective functions used in machine learning applications.

\subsection*{Summary of Contributions}
By considering the mini-batch SIHT Algorithm \ref{alg:siht} for Problem (P), we develop the following results:
\begin{itemize}
\item
We establish a new critical sparse stochastic gradient descent property of the hard
thresholding (HT) operator that has not been found in the literature.
\item
For a given step-size $0 <\gamma < \frac{1}{L_s}$, we find a lower bound on the mini-batch size that guarantees expected descent of the objective function value after hard thresholding.
\item
Using the sparse stochastic gradient descent property, we show that the sequence of function values generated by the mini-batch SIHT algorithm is a supermartingale and converges with probability one.
\item
We show that for a certain class of functions in Problem (P), where $f(\mathbf{x},\xi^{(i)}):=f^{(i)}(\mathbf{V}_{i\bullet}\mathbf{x})$ with
$f^{(i)}: \mathbb{R}^n \rightarrow \mathbb{R}$, the sum of the squared norms of the individual gradients restricted to an index set $\mathcal{J}$, i.e., $\sum_{i=1}^N \|\nabla_{\mathcal{J}} f^{(i)}\|_2^2$, evaluated at any point is bounded by a constant multiple of the squared norm of the batch gradient $\|\nabla_{\mathcal{J}}
f\|_2^2$, where the constant depends only on the data. Moreover, the constant depends on the data only through the entries indexed by $\mathcal{J}$, not on the entire data.
\end{itemize}

\input{siht}

\section{Related work}
In order to improve the computational efficiency of the IHT algorithm, algorithms based on stochastic hard thresholding exploit the finite-sum structure of Problem (P) \cite{nguyen2017linear, li2016nonconvex, shen2017tight}. The StoIHT algorithm is introduced in \cite{nguyen2017linear}, where at each iteration a random element of the sum in Problem (P) is drawn and the associated gradient is calculated. In other words, the gradient is approximated by a mini-batch stochastic gradient of size one. The StoIHT algorithm defines a sparse subspace and then projects the updated vector onto it. The theoretical results in \cite{nguyen2017linear} require the restricted strong smoothness condition for each individual function in Problem (P) as well as restricted strong convexity of the objective function. In addition, the StoIHT algorithm needs the restricted condition number to be at most $4/3$, which is hard to meet in practice.
The stochastic variance reduced gradient hard thresholding (SVRG-HT) algorithm \cite{li2016nonconvex, shen2017tight} tries to mitigate the variance at the cost of calculating the batch (full) gradient at some stages. This batch-gradient information is the key to reducing the variance. Similar to the StoIHT algorithm, the SVRG-HT algorithm requires the restricted strong smoothness condition for each individual function in Problem (P) as well as restricted strong convexity of the objective function. The Accelerated Stochastic Block Coordinate Gradient Descent with Hard Thresholding (ASBCDHT) algorithm in \cite{chen2016accelerated} is a randomized version of the StoIHT algorithm that suffers from the drawbacks of the StoIHT algorithm, i.e., calculating the full gradient and requiring the restricted strong conditions.
The Hybrid Stochastic Gradient Hard Thresholding (HSG-HT) algorithm in \cite{zhou2018efficient} is a variant of stochastic IHT algorithms that uses a mini-batch stochastic gradient at each step. However, from the theoretical perspective, the size of a mini-batch has to increase as the algorithm progresses. This makes the algorithm almost deterministic in calculating the gradient and defeats the purpose of using the mini-batch stochastic gradient.
The stochastically
controlled stochastic gradients (SCSG-HT) algorithm in \cite{liang2020effective}
uses mini-batch stochastic gradients with large batch size as opposed to the SVRG-HT and the ASBCDHT algorithms to reduce the variance with less computation, i.e., not calculating the batch gradient at some steps.
We present the mini-batch stochastic IHT algorithm and show that the stochastic sequence of function values is a supermartingale that converges with probability one. To show our result, we assume the objective function has the restricted strong smoothness property and is bounded below, which is the case for objective functions used in machine learning applications. Also, to the best of our knowledge, this is the first time in the sparse optimization literature that the sequence of stochastic function values is shown to converge with probability one with the mini-batch size fixed for all steps.



\section{Definitions}
We provide some definitions that will be used throughout the paper.
\begin{definition}[Restricted Strong Smoothness (RSS)]\label{def:rss}
A differentiable function $f: \mathbb{R}^n \to \mathbb{R}$ is said to be restricted strongly smooth with modulus $L_s>0$ or is $L_s$-RSS if
\begin{equation}\label{eq:rss}
f(\mathbf{y}) \leq f(\mathbf{x}) + \langle \nabla f(\mathbf{x}) , \mathbf{y}-\mathbf{x} \rangle + \frac{L_{s}}{2}\|\mathbf{y}-\mathbf{x}\|_2^2 \quad \forall \mathbf{x},\mathbf{y} \in \mathbb{R}^n \text{ such that } \|\mathbf{x}\|_0 \leq s,\|\mathbf{y}\|_0\leq s.
\end{equation}
\end{definition}
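\begin{remark}
As an illustration of Definition \ref{def:rss} (a sketch that is not part of the original development; the matrix $\mathbf{A}$ and the vector $\mathbf{b}$ are hypothetical), consider the least-squares function $f(\mathbf{x})=\frac{1}{2}\|\mathbf{A}\mathbf{x}-\mathbf{b}\|_2^2$ with $\mathbf{A} \in \mathbb{R}^{m \times n}$. Since $f$ is quadratic, its second-order expansion is exact:
\begin{equation*}
f(\mathbf{y}) = f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x} \rangle + \frac{1}{2}\|\mathbf{A}(\mathbf{y}-\mathbf{x})\|_2^2 .
\end{equation*}
When $\|\mathbf{x}\|_0 \leq s$ and $\|\mathbf{y}\|_0 \leq s$, the difference $\mathbf{y}-\mathbf{x}$ has at most $2s$ nonzero entries, so (\ref{eq:rss}) holds with
\begin{equation*}
L_s = \max\big\{ \|\mathbf{A}\mathbf{v}\|_2^2 : \|\mathbf{v}\|_2 = 1,\ \|\mathbf{v}\|_0 \leq 2s \big\},
\end{equation*}
which is never larger than the global smoothness constant $\|\mathbf{A}\|_2^2$ and can be considerably smaller.
\end{remark}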
\begin{definition}[The HT operator]\label{def:hardthresholding}
The HT operator $H_s(\cdot)$ denotes the orthogonal projection onto multiple subspaces of $\mathbb{R}^n$ with dimension $1 \leq s<n$, that is,
\begin{equation}\label{eq:hardthreshold}
    H_s(\mathbf{x}) \in \arg\min_{\|\mathbf{z}\|_0\leq s }\|\mathbf{z}-\mathbf{x}\|_2.
\end{equation}
\end{definition}
\begin{claim}\label{claim:tops}
The HT operator keeps the $s$ largest entries of its input in absolute value.
\end{claim}
For a vector $\mathbf{x} \in \mathbb{R}^n$, $\mathcal{I}^{\mathbf{x}}_s \subset \{1,\dots, n\}$ denotes a set of indices corresponding to the $s$ largest elements of $\mathbf{x}$ in absolute value. For example, $\mathbf{y} \in H_2([1,-3,1]^{\top})$ is either $[0,-3,1]^{\top}$ or $[1,-3,0]^{\top}$, with $\mathcal{I}^{\mathbf{y}}_2=\{2,3\}$ and $\mathcal{I}^{\mathbf{y}}_2=\{1,2\}$, respectively. Therefore, the output of the HT operator may not be unique. This shows why the HT operator is not a convex operator and why there is an inclusion in (\ref{eq:hardthreshold}) rather than an equality.
\begin{definition}[Convergence with probability one]
A random sequence $(\mathbf{x}^k)$ with $\mathbf{x}^k \in \mathbb{R}^n$, defined on a sample space $\Omega$,
 converges to a random variable $\mathbf{x}^*$
with probability one if 
$$\mathbb{P}\Big[\omega \in \Omega: \displaystyle{\lim_{k \to \infty}}\|\mathbf{x}^k(\omega) - \mathbf{x}^*(\omega)\|=0\Big]=1.$$
\end{definition}

\section{Results}
We consider solving Problem (\ref{eq:optimizationproblem}) using the mini-batch SIHT Algorithm \ref{alg:siht} and develop results that guarantee the convergence of the sequence of function values generated by the SIHT Algorithm. To do so, we present our results in two separate subsections. The first part provides stochastic results characterizing expectation of functions involving the sample average of given vectors. Then, in the subsequent subsection we use the aforementioned results to show Theorem \ref{theorem:stochasticdescent} which establishes a stochastic gradient result that is the foundation for the convergence of the function value sequence.
\subsection{Stochastic results for sample average}
In this subsection, we consider a sample average whose elements are drawn uniformly and without replacement.
Then, we prove Lemma \ref{lemma:expectionofsampleaverage}, which expresses the expected squared norm of the sample average in terms of the covariance matrix of a random vector whose elements are Bernoulli random variables indicating which vectors enter the sample average. Next, in Corollary \ref{cor:distancetomean}, we use Lemma \ref{lemma:expectionofsampleaverage} to calculate the expected squared distance between the sample average and the overall average. This result is extended in Theorem \ref{theorem:distancetoeach}, where the same expectation is expressed in terms of the individual vectors and the overall average.
We start with the following well-known lemma. 

\begin{lemma}[\cite{mathai1992quadratic}]\label{lemma:randomquadratic}
Let $\mathbf{\Lambda} \in  \mathbb{R}^{n \times n}$ be a deterministic matrix and $\bm{\xi} \in \mathbb{R}^n$ be a random vector that is distributed according to some probability distribution $\mathcal{P}$. Then,
\begin{equation}\label{eq:randomquadratic}
\mathbb{E}_{\bm{\xi}}\Big[
\bm{\xi}^{\top} \mathbf{\Lambda} \bm{\xi}
\Big]
=\text{trace}(\mathbf{\Lambda} \text{Cov}(\bm{\xi})) + \mathbb{E}_{\bm{\xi}}\Big[\bm{\xi}\Big]^{\top}\mathbf{\Lambda} \mathbb{E}_{\bm{\xi}}\Big[\bm{\xi}\Big]. 
\end{equation}
\end{lemma}
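\begin{remark}
For the reader's convenience, the identity (\ref{eq:randomquadratic}) can be recovered in a few steps. The following is only a sketch, assuming $\bm{\xi}$ has finite second moments and writing $\bm{\mu}:=\mathbb{E}_{\bm{\xi}}[\bm{\xi}]$ (a symbol introduced here for illustration). Using the cyclic property and linearity of the trace,
\begin{equation*}
\begin{aligned}
\mathbb{E}_{\bm{\xi}}\big[\bm{\xi}^{\top}\mathbf{\Lambda}\bm{\xi}\big]
&= \mathbb{E}_{\bm{\xi}}\big[\text{trace}(\mathbf{\Lambda}\bm{\xi}\bm{\xi}^{\top})\big]
= \text{trace}\big(\mathbf{\Lambda}\,\mathbb{E}_{\bm{\xi}}[\bm{\xi}\bm{\xi}^{\top}]\big) \\
&= \text{trace}\big(\mathbf{\Lambda}(\text{Cov}(\bm{\xi})+\bm{\mu}\bm{\mu}^{\top})\big)
= \text{trace}\big(\mathbf{\Lambda}\text{Cov}(\bm{\xi})\big)+\bm{\mu}^{\top}\mathbf{\Lambda}\bm{\mu}.
\end{aligned}
\end{equation*}
\end{remark}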
To invoke the above lemma, notice that one can define a random vector whose elements are Bernoulli random variables determining whether the associated vector is in the sample average or not. Thus we prove the following lemma.
\begin{lemma}\label{lemma:expectionofsampleaverage}
Let $\mathbf{g}^{(1)}, \dots, \mathbf{g}^{(N)} \in \mathbb{R}^n$ be $N$ deterministic vectors and $\text{B} \subseteq \{1, \dots, N\}$ be a random set. Let $\bar{\mathbf{g}}:=\frac{1}{N}\sum_{i=1}^N \mathbf{g}^{(i)}$, $\mathcal{G}(\text{B}):=\frac{1}{|\text{B}|}\sum_{i \in \text{B}} \mathbf{g}^{(i)}$, $\mathbf{G}:=\Big[ \mathbf{g}^{(1)} \quad \dots \quad \mathbf{g}^{(N)}  \Big] \in \mathbb{R}^{n \times N}$, and $\mathbf{z}(\text{B})=[z_1(\text{B}), \dots, z_N(\text{B})]^{\top}$, where $z_i(\text{B})$ is a Bernoulli random variable such that $z_i(\text{B})=1$ if $i \in \text{B}$ and $z_i(\text{B})=0$ otherwise, for $i=1, \dots, N$. Assume $\mathbb{E}_{\text{B}}\big[ \mathcal{G}(\text{B}) \big]=\bar{\mathbf{g}}$. Then for any random set $\text{B}$ with fixed size $|\text{B}|$, the following holds:
\begin{equation}\label{eq:expectionofsampleaverage}
\mathbb{E}_{\text{B}}\big[ \| \mathcal{G}(\text{B})\|^2 \big] = \frac{1}{|\text{B}|^2}\text{trace}\Big(\mathbf{G}^{\top}\mathbf{G} \text{Cov}\big(\mathbf{z}(\text{B})\big)\Big) + \| \bar{\mathbf{g}}\|^2.
\end{equation}
\end{lemma}
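\begin{remark}
A brief sketch of how Lemma \ref{lemma:randomquadratic} yields (\ref{eq:expectionofsampleaverage}), given here only as an illustration of the mechanism: the sample average can be written as $\mathcal{G}(\text{B})=\frac{1}{|\text{B}|}\mathbf{G}\mathbf{z}(\text{B})$, and since $|\text{B}|$ is fixed it can be pulled out of the expectation, so
\begin{equation*}
\mathbb{E}_{\text{B}}\big[ \| \mathcal{G}(\text{B})\|^2 \big]
=\frac{1}{|\text{B}|^2}\mathbb{E}_{\text{B}}\big[ \mathbf{z}(\text{B})^{\top}\mathbf{G}^{\top}\mathbf{G}\,\mathbf{z}(\text{B}) \big]
=\frac{1}{|\text{B}|^2}\text{trace}\Big(\mathbf{G}^{\top}\mathbf{G}\, \text{Cov}\big(\mathbf{z}(\text{B})\big)\Big)
+\Big\|\frac{1}{|\text{B}|}\mathbf{G}\,\mathbb{E}_{\text{B}}\big[\mathbf{z}(\text{B})\big]\Big\|^2 .
\end{equation*}
The last term equals $\|\mathbb{E}_{\text{B}}[\mathcal{G}(\text{B})]\|^2=\|\bar{\mathbf{g}}\|^2$ by the unbiasedness assumption.
\end{remark}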
Once the above result is established, it is straightforward to show the following by observing the fact that the sample average is an unbiased estimator of the overall average, i.e., $\mathbb{E}_{\text{B}}\big[ \mathcal{G}(\text{B}) \big]=\bar{\mathbf{g}}$.
\begin{corollary}\label{cor:distancetomean}
Assume all the assumptions in Lemma \ref{lemma:expectionofsampleaverage} hold. Then for any random set $\text{B}$ with fixed size $|\text{B}|$, the following holds:
\begin{equation}\label{eq:distancetomean}
\mathbb{E}_{\text{B}}\big[ \| \mathcal{G}(\text{B}) - \bar{\mathbf{g}}\|^2 \big] = \frac{1}{|\text{B}|^2}\text{trace}\Big(\mathbf{G}^{\top}\mathbf{G} \text{Cov}\big(\mathbf{z}(\text{B})\big)\Big).
\end{equation}
\end{corollary}
Finally, we use the above results to prove the following which calculates the expected squared distance between the sample average and the overall average based on individual vectors and the overall average. The following result is critical because later we will see that Equation (\ref{eq:distancetoeach}) connects the mini-batch stochastic gradient, the batch gradient, and individual gradients in Problem (P).
\begin{theorem}\label{theorem:distancetoeach}
Assume all the assumptions in Lemma \ref{lemma:expectionofsampleaverage} hold. If the elements of the random set $\text{B}$ are drawn uniformly and without replacement, then
\begin{equation}\label{eq:distancetoeach}
\mathbb{E}_{\text{B}}\big[ \| \mathcal{G}(\text{B}) - \bar{\mathbf{g}}\|^2 \big]
=
\frac{N-|\text{B}|}{|\text{B}|N(N-1)}
\Big(
\sum_{i=1}^N \|\mathbf{g}^{(i)}\|_2^2 - N \|\bar{\mathbf{g}}\|^2\Big)
=
\frac{N-|\text{B}|}{|\text{B}|N}
\frac{1}{N-1}
\sum_{i=1}^N \|\mathbf{g}^{(i)}-\bar{\mathbf{g}}\|_2^2.
\end{equation}
\end{theorem}
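\begin{remark}
As a small sanity check of (\ref{eq:distancetoeach}), with hypothetical numbers chosen only for illustration, take $n=1$, $N=3$, $\mathbf{g}^{(1)}=1$, $\mathbf{g}^{(2)}=2$, $\mathbf{g}^{(3)}=3$, and $|\text{B}|=2$, so that $\bar{\mathbf{g}}=2$. The three equally likely mini-batches $\{1,2\}$, $\{1,3\}$, and $\{2,3\}$ give the sample averages $\frac{3}{2}$, $2$, and $\frac{5}{2}$, hence
\begin{equation*}
\mathbb{E}_{\text{B}}\big[ \| \mathcal{G}(\text{B}) - \bar{\mathbf{g}}\|^2 \big]
=\frac{1}{3}\Big(\frac{1}{4}+0+\frac{1}{4}\Big)=\frac{1}{6}.
\end{equation*}
The right-hand side of (\ref{eq:distancetoeach}) gives the same value:
\begin{equation*}
\frac{N-|\text{B}|}{|\text{B}|N(N-1)}\sum_{i=1}^N \|\mathbf{g}^{(i)}-\bar{\mathbf{g}}\|_2^2
=\frac{3-2}{2\cdot 3\cdot 2}\big(1+0+1\big)=\frac{1}{6}.
\end{equation*}
\end{remark}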
\subsection{Stochastic results for Hard Thresholding operator}
The goal of this subsection is to show that the random sequence $\big(f(\mathbf{x}^k)\big)_{k \geq 1}$ generated by the mini-batch SIHT algorithm converges with probability one. To this end, we prove that the random sequence of function values is a supermartingale, so its expected value is nonincreasing. As a first step, we prove the following lemma, which provides an upper bound on the function value at a thresholded vector. Notice that the following result does not require the input to be a vector updated by the gradient.

\begin{lemma}\label{lemma:rsswithdelta}
Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ be in $C^1$ and $L_s$-RSS. Then for a fixed $\mathbf{x} \in C_s$ with any $\mathcal{I}_s^{\mathbf{x}}$, any $0 < \gamma \leq \frac{1}{L_s}$, and any given vector $\mathbf{g} \in \mathbb{R}^n$, the following holds for any $\mathbf{y} \in H_s(\mathbf{x}-\gamma \mathbf{g})$ with any $\mathcal{I}_s^{\mathbf{y}}$:
\begin{equation}
f(\mathbf{y}) 
\leq 
f(\mathbf{x}) 
-
\frac{\gamma}{2}(1-L_s\gamma)
\| \mathbf{g}_{\mathcal{I}_s^{\mathbf{y}}}\|^2_2
-
\frac{\gamma}{2}
\| \mathbf{g}_{\mathcal{I}_s^{\mathbf{x}} }\|^2_2
+
\gamma
\langle 
\mathbf{\delta}_{\mathcal{I}_s^{\mathbf{y}}}, \mathbf{g}_{\mathcal{I}_s^{\mathbf{y}}} \rangle
+
\gamma
\langle 
\mathbf{\delta} _{\mathcal{I}\backslash \mathcal{I}_s^{\mathbf{y}}}, \mathbf{x}_{\mathcal{I}\backslash \mathcal{I}_s^{\mathbf{y}}} \rangle \label{eq:rsswithdeltaxandy}
\end{equation}
where $\mathcal{I} = \mathcal{I}_s^{\mathbf{x}} \cup \mathcal{I}_s^{\mathbf{y}}$ and $\mathbf{\delta}=\mathbf{g} - \nabla f(\mathbf{x})$.
\end{lemma}
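\begin{remark}
To see what (\ref{eq:rsswithdeltaxandy}) says in the simplest case, suppose, purely as an illustration, that $\mathbf{g}=\nabla f(\mathbf{x})$, so that $\mathbf{\delta}=\mathbf{0}$. Both inner products then vanish and the bound reduces to
\begin{equation*}
f(\mathbf{y})
\leq
f(\mathbf{x})
-
\frac{\gamma}{2}(1-L_s\gamma)
\| \nabla_{\mathcal{I}_s^{\mathbf{y}}} f(\mathbf{x})\|^2_2
-
\frac{\gamma}{2}
\| \nabla_{\mathcal{I}_s^{\mathbf{x}}} f(\mathbf{x})\|^2_2 ,
\end{equation*}
which is a deterministic descent inequality for one IHT step whenever $0<\gamma\leq \frac{1}{L_s}$. The two inner-product terms in (\ref{eq:rsswithdeltaxandy}) can therefore be read as the price of replacing the gradient by an arbitrary vector $\mathbf{g}$.
\end{remark}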
Observe that in the above lemma the vector $\mathbf{g}$ can be any vector in $\mathbb{R}^n$. It need not be the gradient nor the mini-batch gradient. However, in the following lemma we prove that if $\mathbf{g}$ is designated to be an unbiased stochastic approximation of the gradient at an arbitrary point, then the following result holds.
\begin{lemma}\label{lemma:unbiasedapproximation}
Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ be in $C^1$ and $L_s$-RSS. Assume $\mathbf{g}(\mathbf{x}, \omega)$ is an unbiased stochastic approximation of the gradient at $\mathbf{x} \in \mathbb{R}^n$, where $\omega \sim D$ for some distribution $D$, i.e., $\mathbb{E}_{\omega}[\mathbf{g}(\mathbf{x}, \omega)]=\nabla f(\mathbf{x})$. Then for a fixed $\mathbf{x} \in C_s$ with any $\mathcal{I}_s^{\mathbf{x}}$ and $0 < \gamma \leq \frac{1}{L_s}$, the following holds for any $\mathbb{Y}(\omega) \in H_s(\mathbf{x}-\gamma \mathbf{g}(\mathbf{x}, \omega))$ with any $\mathcal{I}_s^{\mathbb{Y}(\omega)}$:
\begin{equation}
\mathbb{E}_{\omega}[
f(\mathbb{Y}(\omega))
]
\leq 
f(\mathbf{x}) 
-
\frac{\gamma}{2}(1-L_s\gamma)
\mathbb{E}_{\omega}[
\| \mathbf{g}_{\mathcal{I}_s^{\mathbb{Y}(\omega)}}(\mathbf{x}, \omega)\|^2_2
]
-
\frac{\gamma}{2}
\| \nabla_{\mathcal{I}_s^{\mathbf{x}}} f(\mathbf{x})\|^2_2
+
\gamma
\mathbb{E}_{\omega}[
\|\mathbf{\delta}_{\mathcal{I}_s^{\mathbb{Y}(\omega)}}(\omega)\|_2^2
] \label{eq:unbiasedwithxandy}
\end{equation}
where $\mathcal{I}(\omega) = \mathcal{I}_s^{\mathbf{x}} \cup \mathcal{I}_s^{\mathbb{Y}(\omega)}$ and $\mathbf{\delta}(\omega)=\mathbf{g}(\mathbf{x}, \omega) - \nabla f(\mathbf{x})$.
\end{lemma}
The following theorem is the core of our technical results because it establishes a stochastic gradient descent property for the expectation of the function value. Later we will see how Inequality (\ref{eq:generalminibatch}) is used in Theorem \ref{theorem:stochasticdescent} to show that the sequence of function values generated by the mini-batch SIHT algorithm is a supermartingale.

\begin{theorem}\label{theorem:sparseminibatch}
Let $f^{(i)}: \mathbb{R}^n\times \Xi \rightarrow \mathbb{R}$ be in $C^1$
\footnote{The class consisting of all differentiable functions whose derivative is continuous.}
for $i=1,\dots, N$ and $\Xi=\{\xi^{(1)}, \dots, \xi^{(N)}\}$ be a given set such that $f(\mathbf{x},\Xi)=\frac{1}{N}\sum_{i=1}^{N}f^{(i)}(\mathbf{x}, \xi^{(i)})$
is an $L_s$-RSS function.
Assume there exists a $c>0$
\footnote{In Remark \ref{remark:1}, we explain why such a $c$ always exists for widespread objective functions in machine learning applications.}
such that 
\begin{equation}\label{eq:expectationindividualvsgradient}
\mathbb{E}_{\mathcal{J}} \Big [\sum_{i=1}^N \|\nabla_{\mathcal{J}} f^{(i)}(\mathbf{x}, \xi^{(i)})\|_2^2 \Big ]\leq c \mathbb{E}_{\mathcal{J}} \Big [\|\nabla_{\mathcal{J}} f(\mathbf{x},\Xi)\|_2^2 \Big ]
\end{equation}
for all $\mathbf{x} \in \mathbb{R}^n$ and any random index set $\mathcal{J} \subseteq \{1, \dots, n\}$ with $|\mathcal{J}| \leq s$.
Let $\mathcal{G}(\mathbf{x}, \Xi, B)=\frac{1}{|B|}\sum_{i \in B}\nabla f^{(i)}(\mathbf{x},\xi^{(i)})$ be the mini-batch stochastic gradient at any $\mathbf{x}\in \mathbb{R}^n$, where $B \subseteq \{1, \dots, N\}$ is a random set of size $|B|$ whose elements are drawn uniformly at random from $\{1, \dots, N\}$ without replacement.
For a fixed $0<\gamma < \frac{1}{L_s}$, assume the size of $B$ is fixed such that $|B| \geq N/\Big(1+\frac{1-L_s\gamma}{1+L_s\gamma}\frac{N-1}{\frac{c}{N}-1}\Big)$ and let $\zeta := \frac{N-|B|}{|B|(N-1)}$ for $N \geq 2$. Then for a fixed $\mathbf{x} \in C_s$ with any $\mathcal{I}_s^{\mathbf{x}}$ the following holds for any $\mathbb{Y}(B) \in H_s(\mathbf{x}-\gamma \mathcal{G}(\mathbf{x}, \Xi, B))$ with any $\mathcal{I}_s^{\mathbb{Y}(B)}$:

\begin{equation}\label{eq:generalminibatch}
\begin{aligned}
\mathbb{E}_{B} \Big [ f(\mathbb{Y}(B),\Xi) \Big ]
&\leq
f(\mathbf{x}, \Xi) 
-
\frac{\gamma}{2}
\| \nabla_{\mathcal{I}_s^{\mathbf{x}}} f(\mathbf{x}, \Xi)\|^2_2
\\
&-
\frac{\gamma}{2}(1+L_s\gamma)\zeta
\Big(
1-
\frac{c}{N}
+\frac{1-L_s\gamma}{1+L_s\gamma}\frac{1}{\zeta}
\Big)
\mathbb{E}_{\mathcal{I}_s^{\mathbb{Y}(B)}}\Big[
\|\nabla_{\mathcal{I}_s^{\mathbb{Y}(B)}} f(\mathbf{x}, \Xi)\|^2
\Big]
\end{aligned}
\end{equation}
where $1-
\frac{c}{N}
+\frac{1-L_s\gamma}{1+L_s\gamma}\frac{1}{\zeta} \geq 0$.
\end{theorem}
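\begin{remark}
The lower bound on $|B|$ in Theorem \ref{theorem:sparseminibatch} is exactly the condition that makes the bracketed factor in (\ref{eq:generalminibatch}) nonnegative. The following sketch assumes $c>N$ and $|B|<N$, and uses the shorthand $a:=\frac{1-L_s\gamma}{1+L_s\gamma}\in(0,1)$, which is introduced here only for illustration. Substituting $\zeta = \frac{N-|B|}{|B|(N-1)}$,
\begin{equation*}
\begin{aligned}
1-\frac{c}{N}+\frac{a}{\zeta} \geq 0
&\iff
a\,|B|(N-1) \geq \Big(\frac{c}{N}-1\Big)(N-|B|) \\
&\iff
|B|\Big(a(N-1)+\frac{c}{N}-1\Big) \geq N\Big(\frac{c}{N}-1\Big) \\
&\iff
|B| \geq N\Big/\Big(1+a\,\frac{N-1}{\frac{c}{N}-1}\Big).
\end{aligned}
\end{equation*}
For a hypothetical instance with $N=1000$, $c=2N$, and $L_s\gamma=\frac{1}{2}$, one has $a=\frac{1}{3}$ and the bound reads $|B| \geq 1000/334 \approx 2.99$, so a mini-batch of size $3$ already suffices. Note also that, by convexity of the squared norm, $\sum_{i=1}^N \|\nabla_{\mathcal{J}} f^{(i)}(\mathbf{x}, \xi^{(i)})\|_2^2 \geq N\|\nabla_{\mathcal{J}} f(\mathbf{x},\Xi)\|_2^2$, so any constant $c$ satisfying (\ref{eq:expectationindividualvsgradient}) with a nonvanishing right-hand side must satisfy $c \geq N$.
\end{remark}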
A crucial assumption for proving Theorem \ref{theorem:sparseminibatch} is Inequality (\ref{eq:expectationindividualvsgradient}). In the following claim we show that for a certain class of functions such a $c>0$ always exists and does not depend on the function. We will prove that for this class of functions the value of $c$ depends only on the data.


\begin{claim}\label{claim:individualvsgradient}
Let the given set $\Xi$ in Problem (P) be defined such that $\Xi:=\{\mathbf{V}_{1\bullet}, \dots, \mathbf{V}_{N\bullet}\}$ where each $\mathbf{V}_{i\bullet}$ is the $i$-th row of a given matrix $\mathbf{V} \in \mathbb{R}^{N \times n}$. Then the objective function in Problem (P) can be defined as $f(\mathbf{x},\Xi):=\frac{1}{N}\sum_{i=1}^{N}f^{(i)}(\mathbf{V}_{i\bullet}\mathbf{x})$ with
$f^{(i)}: \mathbb{R}^n\times \Xi \rightarrow \mathbb{R}$, and the following holds:
\begin{equation}\label{eq:individualvsgradient}
\sum_{i=1}^N \|\nabla_{\mathcal{J}} f^{(i)}(\mathbf{V}_{i\bullet}\mathbf{x})\|_2^2
\leq  
\frac{N^2}{
\sigma_{min}^2(\mathbf{V}\mathbf{I}^{\top}_{\mathcal{J}\bullet}\mathbf{I}_{\mathcal{J}\bullet}\mathbf{V}^{\top})
}
\Big(
\max_{r=1, \dots, N}
\Big\{
\|(\mathbf{V}_{r\bullet}^{\top})_{\mathcal{J}}\|_2^2
\Big\}
\Big)
\|
\nabla_{\mathcal{J}}
f(\mathbf{x},\mathbf{V})\|_2^2   
\end{equation}
where $\mathcal{J} \subseteq \{1, \dots, n\}$ with $|\mathcal{J}| \leq s$, $\mathbf{I}_{\mathcal{J}\bullet} \in \mathbb{R}^{|\mathcal{J}| \times n}$ is the restriction of the identity matrix to the rows with indices in $\mathcal{J}$, $\mathbf{V}\mathbf{I}^{\top}_{\mathcal{J}\bullet}\mathbf{I}_{\mathcal{J}\bullet}\mathbf{V}^{\top}=\sum_{i \in \mathcal{J}} \mathbf{V}_{\bullet i}\mathbf{V}_{\bullet i}^{\top}$, $\sigma_{min}(\cdot)$ is the smallest singular value, $\mathbf{V}_{\bullet i}$ is the $i$-th column of $\mathbf{V}$, and $(\cdot)_{\mathcal{J}}$ denotes a vector restricted to the indices in $\mathcal{J}$.
\end{claim}
\begin{remark}\label{remark:1}
The above claim shows that for the class of functions $f(\mathbf{x},\Xi):=\frac{1}{N}\sum_{i=1}^{N}f^{(i)}(\mathbf{V}_{i\bullet}\mathbf{x})$ the constant $c>0$ in Theorem \ref{theorem:sparseminibatch} always exists and does not depend on the value of $\mathbf{x}$ or on the gradient, whether batch (full) or individual. As an example of a function belonging to this class, consider the mean squared error loss used for linear regression:
$$
f(\mathbf{x}, \mathbf{V})=\frac{1}{N}\|\mathbf{V}\mathbf{x}-\mathbf{y}\|^2=\frac{1}{N}\sum_{i=1}^N(\mathbf{V}_{i\bullet}\mathbf{x}-y_i)^2
$$
where $\mathbf{V} \in \mathbb{R}^{N\times n}$, $\mathbf{V}_{i\bullet}$ is the $i$-th row of $\mathbf{V}$, $\mathbf{x} \in \mathbb{R}^n$ is the optimization variable, and $\mathbf{y} \in \mathbb{R}^N$ is the target. Also, the logistic regression loss (binary cross entropy) is a function for which $c>0$ in Inequality (\ref{eq:expectationindividualvsgradient}) always exists since it can be written as follows:
$$
f(\mathbf{x}, \mathbf{V})=\frac{1}{N}
\sum_{i=1}^{N}\Big( -y^{(i)}(\mathbf{V}_{i\bullet}\mathbf{x})+\log\big(1+e^{\mathbf{V}_{i\bullet}\mathbf{x}}\big)\Big)
$$
where $\mathbf{V} \in \mathbb{R}^{N\times n}$ is a matrix whose last column is all ones, $\mathbf{V}_{i\bullet}$ is the $i$-th row of $\mathbf{V}$, $\mathbb{R}^n\ni \mathbf{x}=[\mathbf{w}^{\top}, b]^{\top}$ such that $\mathbf{w} \in \mathbb{R}^{n-1}$ and $b \in \mathbb{R}$ are the optimization variables, and $y^{(i)} \in \{0, 1\}$ for $i=1, \dots, N$.
\end{remark}
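\begin{remark}
The structure behind Claim \ref{claim:individualvsgradient} can be made explicit by the chain rule. The following is a sketch, assuming each $f^{(i)}$ is viewed as a differentiable function of the scalar $t_i:=\mathbf{V}_{i\bullet}\mathbf{x}$ (the symbol $t_i$ is introduced here only for illustration):
\begin{equation*}
\nabla_{\mathbf{x}} f^{(i)}(\mathbf{V}_{i\bullet}\mathbf{x}) = \big(f^{(i)}\big)'(t_i)\,\mathbf{V}_{i\bullet}^{\top},
\qquad
\|\nabla_{\mathcal{J}} f^{(i)}(\mathbf{V}_{i\bullet}\mathbf{x})\|_2^2 = \big|\big(f^{(i)}\big)'(t_i)\big|^2\,\|(\mathbf{V}_{i\bullet}^{\top})_{\mathcal{J}}\|_2^2 ,
\end{equation*}
so every individual gradient is a scalar multiple of the corresponding data row. For the two losses above, the scalar factors are
\begin{equation*}
\big(f^{(i)}\big)'(t_i) = 2\,(t_i-y_i)
\quad \text{(mean squared error)},
\qquad
\big(f^{(i)}\big)'(t_i) = \frac{1}{1+e^{-t_i}}-y^{(i)}
\quad \text{(logistic regression)}.
\end{equation*}
This suggests why the constant in (\ref{eq:individualvsgradient}) involves only the rows of $\mathbf{V}$ restricted to $\mathcal{J}$.
\end{remark}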
Now we can provide a result showing that, at a fixed sparse point, one can use the mini-batch stochastic gradient with the fixed mini-batch size determined in Theorem \ref{theorem:sparseminibatch} and decrease the function value in expectation.

\begin{theorem}\label{theorem:stochasticdescent}
Assume all the assumptions in Theorem \ref{theorem:sparseminibatch} hold.
Then for a fixed $\mathbf{x} \in C_s$ with any $\mathcal{I}_s^{\mathbf{x}}$ the following holds for any $\mathbb{Y}(B) \in H_s(\mathbf{x}-\gamma \mathcal{G}(\mathbf{x}, \Xi, B))$:

\begin{equation}\label{eq:generalminibatchwithx}
\mathbb{E}_{B} \Big [ f(\mathbb{Y}(B),\Xi) \bigg\vert \mathbf{x}  \Big ]
\leq
f(\mathbf{x}, \Xi) 
-
\frac{\gamma}{2}
\| \nabla_{\mathcal{I}_s^{\mathbf{x}}} f(\mathbf{x}, \Xi)\|^2_2
.
\end{equation}
\end{theorem}

The above result is the analogue of \cite[Corollary 1]{damadi2022gradient}.

\begin{theorem}\label{theorem:functionconvergence}
Assume all the assumptions in Theorem \ref{theorem:sparseminibatch} hold.
Let $f$ be a bounded below differentiable function and $\big(\mathbb{X}^k \big\vert \mathbb{X}^{k-1}\big)_{k\geq 0}$ be the stochastic IHT sequence.
Then, $\Big(f(\mathbb{X}^{k}, \Xi, B) \bigg\vert \mathbb{X}^{k} \Big)_{k\geq 1}$ is a supermartingale sequence and converges to a random variable $f^*$ with probability one. 
\end{theorem}
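\begin{remark}
A sketch of why Theorem \ref{theorem:stochasticdescent} leads to Theorem \ref{theorem:functionconvergence}, under the following assumptions made here for illustration: the iterates satisfy $\mathbb{X}^{k} \in H_s\big(\mathbb{X}^{k-1}-\gamma\,\mathcal{G}(\mathbb{X}^{k-1},\Xi,B_k)\big)$ with $\mathbb{X}^{0} \in C_s$ and independently drawn mini-batches $B_k$ (Algorithm \ref{alg:siht} remains the authoritative description), and $f(\mathbf{x},\Xi)\geq f_{\text{low}}$ for all $\mathbf{x}$, where $f_{\text{low}}$ is a hypothetical lower bound. Applying (\ref{eq:generalminibatchwithx}) with $\mathbf{x}=\mathbb{X}^{k-1}$ gives
\begin{equation*}
\mathbb{E}_{B_k} \Big [ f(\mathbb{X}^{k},\Xi) \Big\vert \mathbb{X}^{k-1} \Big ]
\leq
f(\mathbb{X}^{k-1}, \Xi)
-
\frac{\gamma}{2}
\| \nabla_{\mathcal{I}_s^{\mathbb{X}^{k-1}}} f(\mathbb{X}^{k-1},\Xi)\|^2_2
\leq
f(\mathbb{X}^{k-1}, \Xi),
\end{equation*}
so $f(\mathbb{X}^{k},\Xi)-f_{\text{low}}$ is a nonnegative supermartingale and converges with probability one by Doob's supermartingale convergence theorem. Taking total expectations and summing over $k=1,\dots,K$ additionally yields
\begin{equation*}
\frac{\gamma}{2}\sum_{k=1}^{K}\mathbb{E}\Big[\| \nabla_{\mathcal{I}_s^{\mathbb{X}^{k-1}}} f(\mathbb{X}^{k-1},\Xi)\|^2_2\Big]
\leq
f(\mathbb{X}^{0},\Xi)-f_{\text{low}} .
\end{equation*}
\end{remark}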

\section{Conclusion}

We showed that the sequence of function values generated by the mini-batch stochastic IHT algorithm is a supermartingale converging with probability one. To show this result, we used the stochastic gradient descent property that we derived, which exploits the fact that the mini-batch stochastic gradient is a sample average of the terms of a finite sum.

\newpage