\section{Additional Experiments}
This section provides additional details of our experiments. We first describe the
experimental setup in Section \ref{appdx:exp_setup}.
% We then provide additional experimental setup details from the main paper in Section \ref{details} as well as details for the influence experiments. 
We then present additional experimental results in Section \ref{results}.

\subsection{Additional experimental setup}
\label{appdx:exp_setup}
We first describe the two applications used to evaluate our algorithms: noisy data summarization (Section \ref{sec:data_summarization}) and influence maximization (Section \ref{influence}).

\subsubsection{Noisy data summarization}
\label{sec:data_summarization}
In data summarization, $U$ is a dataset that we wish to summarize by choosing a subset of $U$ of cardinality at most $\kappa$. The objective function $f:2^U\to\mathbb{R}_{\geq 0}$ takes a subset $X\subseteq U$ to a measure of how well $X$ summarizes the entire dataset $U$, and in many cases is monotone and submodular \cite{tschiatschek2014learning}. However, in real instances of data summarization, we may not have access to an exact measure $f$ of the quality of a summary, but instead, we may have authentic human feedback which is modeled as noisy queries to some underlying monotone and submodular function \cite{singla2016noisy}.

Motivated by this, we run our experiments on instances of noisy data summarization. Our underlying monotone submodular function $f$ is defined as follows: $U$ is assumed to be a labeled dataset, e.g., images tagged with descriptive words, and for any $X\subseteq U$, $f$ takes $X$ to the total number of tags represented by at least one element in $X$ \cite{crawford2023scalable}. Notice that this is essentially an instance of set cover.
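
As an illustration, the objective above can be written explicitly. The notation $T(x)$ for the set of tags attached to an element $x\in U$ is introduced here only for this sketch and is not used elsewhere in the paper.
\[
f(X) = \Big|\bigcup_{x\in X} T(x)\Big|, \qquad \Delta f(X,u) = f(X\cup\{u\}) - f(X) = \Big|T(u)\setminus\bigcup_{x\in X}T(x)\Big|.
\]
For example, if $T(a)=\{\mathrm{sky},\mathrm{sea}\}$ and $T(b)=\{\mathrm{sea},\mathrm{boat}\}$, then $f(\{a\})=2$, $f(\{a,b\})=3$, and $\Delta f(\{a\},b)=1$, since only the tag ``boat'' is new.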

%We use the data summarization setup as in \citealp{crawford2023scalable} but introduce noise to the marginal gains. In particular, $U$ is assumed to be a labeled dataset, e.g. images tagged with descriptive words, and for any $X\subseteq U$, $f$ takes $X$ to the total number of tags represented by at least one element in $X$. The noisy marginal gain is obtained by adding a zero-mean Gaussian random variable with $\sigma=1.0$ ($\sigma$ is the standard deviation) to the exact value of marginal gain. Therefore, parameter $R=2.0$. Notice that this is essentially a noisy set cover. 


%The submodular maximization problem with human feedback is a natural application of our setting where the access to function $f$ is noisy. Consider in the example of a recommender system with a total of $n$ products. For each subset $S$ of the ground set $[n]$, a submodular utility function is defined to represent the user's preference for the products in set $S$. Since the function is evaluated by humans, it is assumed to be noisy. The objective is to find a subset of $k$ products that maximizes the user's experience through getting noisy feedback from humans. The noisy feedback of adding new elements can be assumed to be bounded. For example, it could be Bernoulli r.v. if the human assessment is "like" and "dislike". 
%\subsection{Applications}
%\subsubsection{Data Summarization}
%The first application that we consider is data summarization via a facility location objective. Suppose that we have a universe $U$ of data points that we wish to summarize by selecting a subset $S\subseteq U$. Let $s:U\times U\to [0,1]$ be a measure of similarity between two data points, e.g. the cosine similarity between vectors with nonnegative entries. The monotone submodular function $f$ defined as follows measures the effectiveness of the summary.
%\begin{align*}
%    f(X) = \frac{1}{n}\sum_{u\in U}\max_{x\in X}s(x,u).
%\end{align*}
%This application falls under the setting considered in this paper because, given $X\subseteq U$ and $v\in U$, we have an estimation of $\Delta f(X,v)$ by uniformly randomly selecting an element $u$ of $U$ and computing
%\begin{align*}
%    \max\{s(u,v)-\max_{x\in X}s(x,u), 0\}.
%\end{align*}
%If we assume that computing $\Delta f(X,v)$ exactly can be done in $O(n)$ time, then one question is whether the number of samples required by our algorithm in the worst case where we sample down to a multiplicative $\epsilon$-approximation is less than computing the objective exactly. Whether this is the case or not depends on the ratio $R/\mu$ in Hoeffding's Lemma, which depends on the particular instance. However, we can see that it is true in at least some cases by the following example. Consider computing an estimate of $\Delta f(\emptyset, u)$ where $s(u,x)\approx\ell$ for all $x\in U$. Then in this case $R\approx\mu\approx\ell$. Then by Hoeffding's Lemma, we need at most about $\ln(1/\delta)/\epsilon^2$ samples to get a multiplicative $\epsilon$ approximation for this marginal gain with probability at least $1-\delta$.
%\subsubsection{Submodular Maximization with Human Assessment}
%The submodular maximization problem with human feedback is a natural application of our setting where the access to function $f$ is noisy. Consider in the example of a recommender system with a total of $n$ products. For each subset $S$ of the ground set $[n]$, a submodular utility function is defined to represent the user's preference for the products in set $S$. Since the function is evaluated by humans, it is assumed to be noisy. The objective is to find a subset of $k$ products that maximizes the user's experience through getting noisy feedback from humans. The noisy feedback of adding new elements can be assumed to be bounded. For example, it could be Bernoulli r.v. if the human assessment is "like" and "dislike". 
\subsubsection{Influence maximization}
\label{influence}
Another application is the influence maximization problem in large-scale networks \cite{kempe2003maximizing}. In this application, the universe is the set of users in the social network, and the objective is to choose a subset of users to seed with a product to advertise in order to maximize the spread throughout the network. The marginal gain of adding an element $s$ to a set $S$ is defined as $\Delta f(S,s):=\mathbb{E}_{\mathbf{w}\sim \mathcal{D}(\bar{\mathbf{w}})}\Delta f (S,s;\mathbf{w})$, where $\mathbf{w}$ is a noisy realization of the graph drawn from some unknown distribution $\mathcal{D}(\bar{\mathbf{w}})$, and $\Delta f (S,s;\mathbf{w}) = f(S\cup\{s\};\mathbf{w})- f(S;\mathbf{w})$. In a noisy graph realization with parameter $\mathbf{w}$, $f(S;\mathbf{w})$ is the number of elements influenced by the set $S$ under some influence cascade model. It is \#P-hard to evaluate the objective in influence maximization \cite{chen2010scalable}. Many previous works \cite{chen2009efficient} assume that the entire graph can be stored by the algorithm and that the influence cascade model is known. These algorithms first sample some graph realizations to approximate the true objective and then run submodular maximization algorithms on the sampled graphs. In contrast, our setting and algorithm do not assume that the graph is stored or that the model of influence is explicitly known, only that we can simulate the influence of a given subset. Therefore, our approach can apply in more general influence maximization settings than the sampled realization approach.
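
To make the simulation-based access concrete, one natural estimator of the marginal gain averages independent simulated realizations. This is only a sketch, under the assumption that realizations $\mathbf{w}_1,\dots,\mathbf{w}_N$ can be drawn independently from $\mathcal{D}(\bar{\mathbf{w}})$; the number of samples $N$ is a placeholder here.
\[
\widehat{\Delta f_{N}}(S,s) = \frac{1}{N}\sum_{i=1}^{N}\Big(f(S\cup\{s\};\mathbf{w}_i) - f(S;\mathbf{w}_i)\Big).
\]
By linearity of expectation, this average is an unbiased estimate of $\Delta f(S,s)$.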

Next, we describe the three algorithms that we compare against:
(i) The fixed $\epsilon$-approximation (``\texttt{EPS-AP}'') algorithm. This algorithm essentially runs \alg, except that instead of using the subroutine \samp to sample adaptively and thereby reduce the number of samples, it simply samples down to an $\epsilon$-approximation of every marginal gain. This takes $N_1$ samples for every marginal gain computation; see the definition of $N_1$ in Algorithm \ref{alg:samp}. The element $u$ is added to $S$ if and only if the empirical estimate $\widehat{\Delta f_{N_1}}(S,u)\geq w$; (ii)
The special case of the algorithm \singla of \citet{singla2016noisy} that yields a roughly $(1-1/e)$-approximate solution with high probability (``\texttt{EXP-GREEDY}''), which is described in Section \ref{sec:relatedwork} and in the appendix. In the detailed description of \singla in the appendix in the supplementary material, this corresponds to setting $k'=1$; (iii) The randomized version of the algorithm \singla (``\texttt{EXP-GREEDY-K}''), which yields a roughly $(1-1/e)$-approximation guarantee in expectation and corresponds to setting $k'=\kappa$. Since \texttt{EXP-GREEDY-K} is a randomized algorithm, we average its results over $10$ trials.
%\texttt{EPS-AP} has the same main algorithm (Algorithm \ref{alg:ATG}) with our approach but with a different subroutine sampling algorithm \samp. The subroutine algorithm for \texttt{EPS-AP} first samples $N_1$ noisy queries to $\Delta f(S,s)$, The third and fourth algorithms ("\texttt{EXP-GREEDY}" and "\texttt{EXP-GREEDY-K}") are instances of EXP-GREEDY in \cite{singla2016noisy} with $k'$ setting to be $1$ and $\kappa$ separately. 
 % The subroutine algorithm TOPX of \texttt{EXP-GREEDY} selects the item with the highest marginal gain. \texttt{EXP-GREEDY-K} algorithm uses TOPX to perform top-$l$ subset selection for $l=\{1,2,3,...,\kappa\}$ and then randomly chooses one element from the output subset of TOPX to add to the solution set.


% \subsubsection{Additional Experimental Setup}
% \label{details}
We now provide additional details for the experiments on instances of data summarization.
The parameter $\delta$ is set to $0.2$ in all experiments, and the approximation precision parameter $\alpha$ is set to $0.2$ for both \alg and \texttt{EPS-AP}.
In the experiments that vary $\kappa$, the value of $\epsilon$ is $0.1$, $0.2$, $0.1$, and $0.1$ on corel\_60, delicious\_300, delicious, and corel, respectively. In the experiments that vary $\epsilon$, the value of $\kappa$ is $10$, $80$, $200$, and $100$ on corel\_60, delicious\_300, delicious, and corel, respectively.

Finally, we describe the experimental setup for influence maximization. We run the four algorithms described above in experiments with different values of $\kappa$ and $\epsilon$. The dataset is a sub-graph with $n=29$ nodes extracted from the EuAll dataset \cite{leskovec2016snap}. The underlying weight of each edge is sampled uniformly from $[0,1]$; we refer to the resulting instance as ``euall''. In our experiments, we simulate influence maximization under the influence cascade model. We further use reverse influence sampling (RIS) \cite{borgs2014maximizing} to improve the computational efficiency of our algorithm. Here $R$ is the number of nodes in the graph and is thus $29$. In the experiments that vary $\epsilon$, $\kappa$ is fixed to $8$, and in the experiments that vary $\kappa$, $\epsilon$ is fixed to $0.15$. The parameters $\delta$ and $\alpha$ are set to $0.2$ in both sets of experiments. Since \texttt{EXP-GREEDY-K} is a randomized algorithm, its results are averaged over $4$ trials in the experiments that vary $\epsilon$ and over $8$ trials in the experiments that vary $\kappa$.

\subsection{Additional experimental results}
\label{results}
We first analyze the experiments in which we vary $\epsilon$. Figures \ref{fig:cover2500_300_eps_q}, \ref{fig:cover2500_300_eps_average-q}, \ref{fig:corel_60_eps_q}, and \ref{fig:corel_60_eps_ave_q} show that both the total number of samples and the average number of samples of our algorithm \alg increase more slowly than those of \texttt{EPS-AP} and \texttt{EXP-GREEDY} as $\epsilon$ decreases.
% The total and average query required by \alg is significantly better than that by \texttt{EPS-AP} and the samples increases rapidly when $\epsilon$ decreases. 
This is not surprising, because the theoretical guarantee on the number of samples taken per marginal gain computation in \texttt{EPS-AP} is $O\left(\frac{1}{\epsilon^2}\right)$,
which increases rapidly as $\epsilon$ decreases. The same holds for \texttt{EXP-GREEDY}, since the theoretical guarantee on the number of queries in each iteration is $O\left(\frac{nR^2}{\epsilon^2}\log\left(\frac{R^2kn}{\delta\epsilon^2}\right)\right)$ when the differences between the marginal gains of the elements are very small.
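
The $O(1/\epsilon^2)$ dependence can be seen from a standard concentration argument. The following is a sketch based on Hoeffding's inequality, assuming for illustration that the noisy samples of a marginal gain are independent and lie in an interval of width $R$; it is not the exact definition of $N_1$ used in Algorithm \ref{alg:samp}. If $\widehat{\mu}_N$ is the average of $N$ such samples with mean $\mu$, then
\[
\Pr\big(|\widehat{\mu}_N-\mu|\geq \epsilon\big)\leq 2\exp\Big(-\frac{2N\epsilon^2}{R^2}\Big),
\]
so $N\geq \frac{R^2}{2\epsilon^2}\ln\frac{2}{\delta}$ samples suffice for an additive $\epsilon$-approximation with probability at least $1-\delta$. In particular, halving $\epsilon$ multiplies this sample count by four.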



We next present additional experimental results on the function value $f$ for the instances of data summarization in the main paper. The results are shown in Figure \ref{fig:exp_results_of_f}. The results of $f$ for different $\kappa$ are in Figures \ref{fig:cover2500_300_k_f}, \ref{fig:cover_k_f}, \ref{fig:corel_60_k_f}, and \ref{fig:corel_k_f}. The $f$ values of the different algorithms are almost the same in most cases. However, when $\kappa$ becomes large, the $f$ value of \texttt{EXP-GREEDY-K} is smaller than that of the other algorithms. This is because a larger $\kappa$ allows for more randomness in \texttt{EXP-GREEDY-K}, which makes it less accurate.


Next, we present the experimental results on the instance of influence maximization. The results are plotted in Figure \ref{fig:exp_results_of_influmax}. Our proposed algorithm \alg outperforms the other three algorithms in terms of the total number of samples (see Figures \ref{fig:influ_eps_q} and \ref{fig:influ_k_q}). When $\kappa$ increases, the average number of samples decreases quickly for \alg. This is because the marginal gain on this instance decreases rapidly as $\kappa$ increases, while the threshold value decreases only by a factor of $1-\alpha$ at the end of each iteration. As a result, in many iterations the threshold value $w$ is much higher than the marginal gain, and thus the gap function $\phi(S,s)$ is large. According to the sample complexity result in Theorem \ref{mainthm}, the number of required samples therefore decreases quickly as $\kappa$ increases. This is also why the average number of samples of \alg is much smaller than that of \texttt{EXP-GREEDY} and \texttt{EXP-GREEDY-K}, as shown in Figures \ref{fig:influ_eps_ave_q} and \ref{fig:influ_k_ave_q}.
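
The geometric decrease of the threshold can be made explicit. As a sketch, write $w_0$ for the initial threshold and $w_j$ for the threshold after $j$ iterations; this indexing is introduced here only for illustration. Then
\[
w_j = (1-\alpha)^j\, w_0 ,
\]
so with $\alpha=0.2$ the threshold is still about $0.8^{10}\approx 0.107$ of its initial value after ten iterations. A marginal gain that has dropped well below this level therefore stays far from the threshold for many consecutive iterations.
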
\begin{figure*}[t!]
    \centering
    \hspace{-0.5em}
     \subfigure[delicious\_300 $f$]
{\label{fig:cover2500_300_eps_f}\includegraphics[width=0.24\textwidth]{figures/cover_n2500_300_sigma_1.0_atg_eps-f.pdf}} 
\hspace{-0.5em}
     \subfigure[delicious\_300 $f$]
{\label{fig:cover2500_300_k_f}\includegraphics[width=0.24\textwidth]{figures/cover_n2500_300_sigma_1.0_atg_k-f.pdf}} 
    \hspace{-0.5em}
     \subfigure[corel\_60 $f$]
{\label{fig:corel_60_eps_f}\includegraphics[width=0.24\textwidth]{figures/corel_60_sigma_1.0_atg_eps-f.pdf}} 
\hspace{-0.5em}
     \subfigure[corel\_60 $f$]
{\label{fig:corel_60_k_f}\includegraphics[width=0.24\textwidth]{figures/corel_60_sigma_1.0_atg_k-f.pdf}} 
\hspace{-0.5em}
     \subfigure[corel $f$]
{\label{fig:corel_eps_f}\includegraphics[width=0.24\textwidth]{figures/corel_sigma_1.0_atg_eps-f.pdf}} 
\hspace{-0.5em}
     \subfigure[corel $f$]
{\label{fig:corel_k_f}\includegraphics[width=0.24\textwidth]{figures/corel_sigma_1.0_atg_k-f.pdf}} 
\hspace{-0.5em}
     \subfigure[delicious $f$]
{\label{fig:cover_eps_f}\includegraphics[width=0.24\textwidth]{figures/cover_n5000_sigma_1.0_atg_eps-f.pdf}} 
\hspace{-0.5em}
     \subfigure[delicious $f$]
{\label{fig:cover_k_f}\includegraphics[width=0.24\textwidth]{figures/cover_n5000_sigma_1.0_atg_k-f.pdf}} 
\caption{The experimental results of $f$ of running different algorithms on instances of data summarization on the delicious URL dataset (``delicious'', ``delicious\_300'') and Corel5k dataset (``corel'', ``corel\_60'').}
\label{fig:exp_results_of_f}
\end{figure*}

\begin{figure*}[t!]
    \centering
    \hspace{-0.5em}
     \subfigure[euall samples]
{\label{fig:influ_eps_q}\includegraphics[width=0.24\textwidth]{figures/euallweighted_k_[8]_atg_eps-q.pdf}} 
\hspace{-0.5em}
     \subfigure[euall average samples]
{\label{fig:influ_eps_ave_q}\includegraphics[width=0.24\textwidth]{figures/euallweighted_k_[8]_atg_eps-ave_q.pdf}} 
    \hspace{-0.5em}
     \subfigure[euall $f$]
{\label{fig:influ_eps_f}\includegraphics[width=0.24\textwidth]{figures/euallweighted_k_[8]_atg_eps-f.pdf}} 
\hspace{-0.5em}
     \subfigure[euall samples]
{\label{fig:influ_k_q}\includegraphics[width=0.24\textwidth]{figures/euallweighted_sigma_1.0_atg_k-q.pdf}} 
\hspace{-0.5em}
     \subfigure[euall average samples]
{\label{fig:influ_k_ave_q}\includegraphics[width=0.24\textwidth]{figures/euallweighted_sigma_1.0_atg_k-ave_q.pdf}} 
\hspace{-0.5em}
     \subfigure[euall $f$]
{\label{fig:influ_k_f}\includegraphics[width=0.24\textwidth]{figures/euallweighted_sigma_1.0_atg_k-f.pdf}} 
\caption{The experimental results of running different algorithms on the instance of influence maximization on the EuAll dataset (``euall'').}
\label{fig:exp_results_of_influmax}
\end{figure*}


% \section{Limitations}\label{apdx:limit}

% While we demonstrate the effectiveness of our proposed sample strategy, \samp{}, by integrating it with four existing algorithms, this is just an early step towards bridging noisy submodular maximization and multi-armed bandits. A wider range of submodular maximization algorithms could potentially benefit from \samplong{}, leading to more empirically valuable methods for real-world applications. We view this work as a springboard for further exploration. Beyond submodular maximization, applications like submodular cover could also be investigated in combination with our strategy, under scenarios with noisy query access limitations. 

% % \section{Licenses}\label{apdx:license}

% % The datasets Corel5k~\citep{duygulu2002object} and EuAll~\citep{leskovec2016snap} are released under CC BY 4.0 license. 

% % \section{Broader Impact}\label{apdx:braoder-impact}

% This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.