% !TEX root = main.tex

\section{Experimental Results} \label{sec:exp}



In this section, we conduct numerical experiments to validate our \alg algorithm family. We focus on non-convex MOO problems here; results for the strongly convex and 8-objective MOO experiments are in the appendix.




%\subsection{Non-convex Optimization Problem}

\textbf{1) Two-Objective Experiments on the MultiMNIST Dataset:}
First, we test the convergence performance of our \alg algorithms using the ``MultiMNIST'' dataset~\citep{sabour2017dynamic}, which is a multi-task learning version of the MNIST dataset~\citep{lecun2010mnist} from the LIBSVM repository.
Specifically, MultiMNIST converts the handwritten digit classification problem in MNIST into a two-task problem, where the two tasks are task ``L'' (to classify the top-left digit) and task ``R'' (to classify the bottom-right digit).
The goal is to classify each image correctly under both tasks.
We compare our \alg algorithms with MGD, SMGD, CR-MOGM, and MOCO.
All algorithms use the same randomly generated initial point. 
We set the learning rates to $\eta=0.3$ and $\alpha=0.5$, the constants to $c=c_{\gamma}=c_{\epsilon}=32$, and the solution accuracy to $\epsilon=10^{-3}$.
The batch size for MOCO, CR-MOGM, and SMGD is $96$.
The full batch size for MGD is $1024$, and the inner-loop batch size $|\mathcal{N}_s|$ for \algns, \algmns, \algpns, and \algmpns is $96$.
As shown in Fig.~\ref{fig:compare_mnist}(a), SMGD exhibits the slowest convergence, while MOCO converges slightly faster.
%due to its incorporation of momentum. Th
MGD and our \alg algorithms achieve comparable performance.
%due to their use of variance reduction techniques. 
The \algm/\algmp algorithms converge faster than MGD, \alg, and \algp, primarily due to the use of momentum. 
Fig.~\ref{fig:compare_mnist}(b) highlights differences in sample complexity. 
MGD incurs the highest sample complexity, while \algp and \algmp use samples more efficiently than \alg and \algmns.
These results are consistent with our theoretical analyses as outlined in Theorems~\ref{thm:STIMULUS_nonC}, \ref{STIMULUSM_NonC}, and \ref{thm:STIMULUSmp_nonC}.









\textbf{2) 40-Objective Experiments with the CelebA Dataset:}
Lastly, we conduct large-scale 40-objective experiments with the CelebA dataset \citep{liu2015deep}, which contains 200K facial images annotated with 40 attributes.
Each attribute corresponds to a binary classification task, resulting in a 40-objective problem.


\begin{wrapfigure}{r}{0.27\textwidth}
  \includegraphics[width=0.28\textwidth]{40tasks.pdf}
\caption{Training loss convergence comparison (40-task).}
\label{fig_compare_40tasks}
\end{wrapfigure}

%To create a shared representation function, For a fair comparison, 
We use a ResNet-18 model~\citep{he2016deep} with its final layer removed, and we attach a separate linear layer for each attribute for classification.
In this experiment, we set $\eta=0.0005$ and $\alpha=0.01$. The full batch size for MGD is $1024$, while the batch size for SMGD, CR-MOGM, and MOCO and the inner-loop batch size $|\mathcal{N}_s|$ for \algns, \algmns, \algpns, and \algmpns are all $32$.
As shown in Fig.~\ref{fig_compare_40tasks}, MGD, \algns, \algmns, \algpns, and \algmpns significantly outperform SMGD, CR-MOGM and MOCO in terms of training loss. 
Also, we note that \algp and \algmp consume fewer samples (approximately 11,000) compared to \alg and \algm, which consume approximately 13,120 samples, and MGD, which consumes roughly 102,400 samples.
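To put these counts in perspective, a quick calculation that takes the approximate figures above at face value gives the implied ratios
\[
\frac{11{,}000}{13{,}120} \approx 0.84, \qquad
\frac{13{,}120}{102{,}400} \approx 0.13, \qquad
\frac{11{,}000}{102{,}400} \approx 0.11,
\]
that is, \algp and \algmp use roughly $16\%$ fewer samples than \alg and \algm, while all four variants use roughly $11$--$13\%$ of the samples consumed by MGD. Since the underlying counts are approximate, these ratios are indicative only.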
These results are consistent with our theoretical results in Theorems~\ref{thm:STIMULUS_nonC}, \ref{STIMULUSM_NonC}, and \ref{thm:STIMULUSmp_nonC}.












\iffalse
\textbf{3) MLLTR experiments}

\textbf{Dummy Data}
The generation of dummy data is conducted through a specified procedure that is designed to synthesize representative datasets suitable for ranking tasks or similar machine learning applications. The process is parameterized by the following elements: A feature matrix X is created by drawing values from a standard normal distribution. The matrix has dimensions (num of queries $*$ number of results  associated with each individual query, num of features). Based on the feature matrix, a corresponding label array y is derived. The values in X are normalized to the range [0, 1], and the mean of each row is calculated. The mean values are then scaled by the num labels parameter and truncated to integers to form the labels, resulting in values ranging from 0 to num labels $- 1$.


\textbf{Microsoft Learning to Rank Dataset}
In the experiment we conducted, the focus was on implementing the Multi-label learning to rank model within the context of the Microsoft Learning to Rank web search dataset. This particular dataset is widely known as MSLR-WEB30K\cite{DBLP}. Structurally, it is divided into five unique data folds, and our investigation was specifically concentrated on the first of these folds, aptly named Fold1.

Each document within this extensive dataset is characterized by a set of 136 numerical features. Alongside these features, a label that signifies relevance on a scale ranging from 0 to 4 is assigned to every document. This provides a tiered insight into the relevance of the data.

To develop a more intricate and nuanced method of labeling, we took inspiration from the approach outlined in a previous study. Within the broad spectrum of 136 features, we chose to adopt four specific ones that we deemed to add an extra layer of relevance labeling. These were Query-URL Click Count (Click), URL Dwell Time (Dwell), Quality Score (QS), and Quality Score2 (QS2). It's worth noting that during the training phase, these labels were deliberately omitted to ward off any unintended infiltration of target data, thereby ensuring the integrity of the experiment.
\fi






