\section{Appendix}
\label{sec:appendix}

This appendix presents the proof of Proposition \ref{prop3} and the proof that the proposed policies are consistent. It also discusses the results of additional experiments that support our findings.

\section{Proof of Proposition \ref{prop3}}
\label{proof_of_proposition}

From Eq. (\ref{eq9}), we obtain the following value function.
\begin{align}
    V(S^{T}) & \dot=  \underset{\pi}{\mathrm{argmin}} \ \mathbb{E}^{\pi} \left[ \mathbb{E} \left( H^{T}(V) + H^{T}(E) \right) \right]. \nonumber
\end{align}
We define $V(S^{0}) = H^{0}(V) + H^{0}(E)$. From Eq. (\ref{eq10}), the reward function is defined as
\begin{align}
    R(S^{t}, v_{t}) = &\mathbb{E} (( H^{t}(V) +H^{t}(E)) \nonumber \\
 &- (H^{t+1}(V) + H^{t+1}(E)) | \mathbf{S^{t}}, v_{t}). \nonumber
\end{align}
Substituting $t=0$, we get
\begin{align}
R(S^{0}, v_{0}) = \mathbb{E} (( H^{0}(V) +H^{0}(E)) - (H^{1}(V) + H^{1}(E)) | \mathbf{S^{0}}, v_{0}), \nonumber
\end{align}
and substituting $t=1$, we get
\begin{equation}
R(S^1, v_1) = \mathbb{E} \left[ \big(H^1(V) + H^1(E)\big) - \big(H^2(V) + H^2(E)\big) \,\middle|\, \mathbf{S^{1}}, v_1 \right].
\end{equation}
When the rewards for these two time steps are added, the first term in $R(S^{1}, v_{1})$ cancels the second term in $R(S^{0}, v_{0})$. The same cancellation occurs between every pair of consecutive time steps, so the sum telescopes. Therefore, we get
\begin{align}
 \sum_{t=0}^{T-1} R(S^{t}, v_{t}) = &\mathbb{E} (\left( H^{0}(V) + H^{0}(E)\right) \nonumber\\
 &- \left(H^{T}(V) + H^{T}(E)\right)). \nonumber
\end{align}
Substituting $V(S^{T})$ and $V(S^{0})$ into the equation, we get
\begin{align}
    \underset{\pi}{\mathrm{sup}} \ \mathbb{E}^{\pi} \left( \sum_{t=0}^{T-1} R(S^{t}, v_{t}) \right) =  V(S^{0}) - V(S^{T}). \nonumber
\end{align}
Therefore,
\begin{align}
V(S^T) = V(S^{0}) - \underset{\pi}{\mathrm{sup}} \ \mathbb{E}^{\pi} \left( \sum_{t=0}^{T-1} R(S^{t}, v_{t}) \right). \nonumber
\end{align}
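
As an illustrative check of the telescoping step, consider a hypothetical budget of $T = 3$ and write $U^{t} = H^{t}(V) + H^{t}(E)$, a shorthand introduced only for this example. Dropping the expectation for readability, the sum of the per-step entropy reductions collapses as
\begin{align}
\sum_{t=0}^{2} \left( U^{t} - U^{t+1} \right) &= (U^{0} - U^{1}) + (U^{1} - U^{2}) + (U^{2} - U^{3}) \nonumber \\
&= U^{0} - U^{3}, \nonumber
\end{align}
so only the initial and final total entropies remain, in agreement with the general expression for $\sum_{t=0}^{T-1} R(S^{t}, v_{t})$ above.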

\section{Proposed Policies are Consistent}
\label{consistency_proof}

\noindent\textbf{Consistency of OPTUENT-OPT} In OPTUENT-OPT, we select the vertex $v_t$ in each iteration as follows:
\begin{align}
    v_t = \underset{v}{\mathrm{argmax}} \left( R^{+}(\mathbf{S^{t}}, v)\ \dot=\ \max(R_{1}(\mathbf{S^{t}}, v), R_{2}(\mathbf{S^{t}}, v))\right). \nonumber
\end{align}
The expected reward $R^{+}(\mathbf{S^{t}}, v_{t})$ depends solely on changes to the marginal probability of vertex $v_t$ due to the obtained label. Since the reward, as specified in Eq. (\ref{eq10}), considers the change in entropy of vertices and edges between two timestamps, we have:
\begin{align}
    R(S^{t}, v_{t}) = &\mathbb{E} (( H^{t}(V) +H^{t}(E)) \nonumber\\ 
    &- (H^{t+1}(V) + H^{t+1}(E)) | \mathbf{S^{t}}, v_{t}). \nonumber
\end{align}
Since the marginal probability of each vertex is updated based only on its own posterior probability and those of the leaf vertices in the factor graph, changes in edge entropy become negligible as $T \to \infty$, as evidenced by the empirical experiments in Section \ref{sec:rfr}. We therefore focus solely on the entropy of the vertex labeling, and the reward function becomes:
\begin{align}
    R(S^{t}, v_{t}) = \mathbb{E} ( H^{t}(V) - H^{t+1}(V) | \mathbf{S^{t}}, v_{t}), \nonumber
\end{align}
where $H^{t}(V)$ is given by:
\begin{align}
    H^{t}(V) = &-\sum_{v \in V} \Big( \big(1 - h(P^{t}_{v})\big) \log\big(1 - h(P^{t}_{v})\big) \nonumber\\
    &+ h(P^{t}_{v}) \log h(P^{t}_{v}) \Big),
\end{align}
with $h(x) = \max(x, 1-x)$. The posterior probability $P^{t}_{v}(+1)$ can be calculated using Eq. (\ref{eq7}), with $P^{t}_{v}(-1) = 1 - P^{t}_{v}(+1)$.
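
As a small numerical illustration, suppose a single vertex has a hypothetical posterior $P^{t}_{v}(+1) = 0.8$, and assume that $\log$ denotes the natural logarithm (the base is an assumption made only for this example). Then $h(P^{t}_{v}) = \max(0.8, 0.2) = 0.8$, and the contribution of this vertex to the vertex entropy is
\begin{align}
-\big( 0.2 \log 0.2 + 0.8 \log 0.8 \big) \approx 0.3219 + 0.1785 = 0.5004. \nonumber
\end{align}
Because $h$ is symmetric, a posterior of $0.2$ yields the same contribution, and the largest possible contribution, $\log 2 \approx 0.693$, is attained at $P^{t}_{v}(+1) = 0.5$.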

The reward function $R^{+}(S^{t}, v_t)$ remains positive for all $t$ because entropy is a submodular function, so each labeling action contributes additional information and prevents entropy from increasing as long as uncertainty remains. The Beta posterior updates further support this: each labeling action increases either $a_v^t$ or $b_v^t$, thereby reducing node entropy and guaranteeing a nonzero expected reward. Additionally, because the policy greedily selects the vertex that maximizes the expected entropy reduction, there is always at least one vertex with a positive expected reward. While $R^{+}(S^{t}, v_t)$ diminishes over time as nodes become more certain, it never reaches zero at finite $t$, since labeling continues until full certainty is achieved. The condition $\lim_{a_v^t + b_v^t \to \infty} R^{+}(S^{t}, v_t) = 0$ ensures eventual convergence but does not imply that rewards vanish during the process.

Furthermore, since $R^{+}(S^{t}, v_t) > 0$ for all $t$, labeling continues and $a_v^t + b_v^t$ keeps growing. From the properties of Beta distributions, as $a_v^t + b_v^t \to \infty$, the variance of the posterior,
\begin{align}
    \text{Var}(P^{t}_{v}) = \frac{(\alpha + a^{t}_{v})(\beta + b^{t}_{v})}{(\alpha + a^{t}_{v} + \beta + b^{t}_{v})^2(\alpha + a^{t}_{v} + \beta + b^{t}_{v} + 1)}, \nonumber
\end{align}
tends to zero, ensuring convergence of the posterior to a deterministic value.
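
To illustrate how quickly the posterior variance shrinks, take a hypothetical uniform prior $\alpha = \beta = 1$; these values and the label counts below are chosen only for this example. With $a^{t}_{v} = 3$ and $b^{t}_{v} = 1$, the posterior is $\mathrm{Beta}(4, 2)$, and with ten times as many labels, $a^{t}_{v} = 30$ and $b^{t}_{v} = 10$, it is $\mathrm{Beta}(31, 11)$. The corresponding variances are
\begin{align}
\frac{4 \cdot 2}{6^{2} \cdot 7} = \frac{8}{252} \approx 0.0317
\quad \text{and} \quad
\frac{31 \cdot 11}{42^{2} \cdot 43} = \frac{341}{75852} \approx 0.0045, \nonumber
\end{align}
so the variance drops by roughly a factor of seven in this case, consistent with the limiting behavior described above.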

Consequently, the posterior probability update magnitude decreases:
\begin{align}
    \lim_{a^{t}_{v} + b^{t}_{v} \to \infty} \left(h(P^{t+1}_{v}(+1)) - h(P^{t}_{v}(+1)) \right) = 0.
\end{align}
Thus, the reward function satisfies:
\begin{align}
    &\lim_{a^{t}_{v} + b^{t}_{v} \to \infty} R(S^{t}, v_t) = 0, \quad \text{and hence,}\nonumber\\
    \quad &\lim_{a^{t}_{v} + b^{t}_{v} \to \infty} R^{+}(S^{t}, v_t) = 0.
\end{align}

This implies that OPTUENT-OPT labels each instance infinitely often as $T \to \infty$. Given that we assume workers are reliable, this leads to convergence on $\theta_{v_i}$ for each $v_i \in V$ and $\omega_{e_k}$ for every edge $e_k \in E$. Thus, the overall entropy for vertices and edges converges to a constant value, confirming that OPTUENT-OPT is a consistent policy.

\noindent\textbf{Consistency of OPTUENT-EXP} In the OPTUENT-EXP policy, each iteration selects the vertex $v_t$ as follows:
\begin{align}
    v_t = \underset{v}{\mathrm{argmax}} \left( R(\mathbf{S^{t}}, v)\ \dot=\ p_1 R_{1}(\mathbf{S^{t}}, v) + p_2 R_{2}(\mathbf{S^{t}}, v)\right). \nonumber
\end{align}
While the initial changes in marginal probabilities for $v_t$ may be similar due to all vertices starting with a Beta prior distribution $\mathrm{Beta}(\alpha, \beta)$, their impact on the graph varies based on instance correlations and vertex degrees, leading to different rewards. If the label probability $\theta_{v}$ of vertex $v \in V$ differs from $0.5$, then $R_{1}(\mathbf{S^{t}}, v)$ may not equal $R_{2}(\mathbf{S^{t}}, v)$ if $a_{v}^{t} \neq b_{v}^{t}$. Even when $\theta_{v} = 0.5$, rewards can still differ based on worker labels, ensuring $R(\mathbf{S^{t}}, v_t) \neq 0$ whenever $a_{v}^{t} \neq b_{v}^{t}$. As the budget increases, changes in instance correlations become negligible, yet the difference between $a_{v}^{t}$ and $b_{v}^{t}$ ensures $R(\mathbf{S^{t}}, v_t) \neq 0$.

In OPTUENT-OPT, the policy selects the node with the highest optimistic reward, ensuring that the most uncertain and informative node is labeled at every step. However, in OPTUENT-EXP, the policy selects the node based on the expected reward, which takes into account the probabilities of both possible labeling outcomes. This means that rather than always picking the node with the highest potential entropy reduction, OPTUENT-EXP chooses nodes that, on average, significantly reduce entropy.
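
The following toy comparison sketches how the two selection rules can disagree. All numbers are hypothetical, and we assume $p_1 + p_2 = 1$ are the probabilities of the two labeling outcomes. Consider two candidate vertices $u$ and $w$ with $(R_1, R_2, p_1, p_2) = (0.30, 0.02, 0.2, 0.8)$ for $u$ and $(0.12, 0.10, 0.5, 0.5)$ for $w$. Then
\begin{align}
R^{+}(u) &= \max(0.30, 0.02) = 0.30, & R(u) &= 0.2 \cdot 0.30 + 0.8 \cdot 0.02 = 0.076, \nonumber \\
R^{+}(w) &= \max(0.12, 0.10) = 0.12, & R(w) &= 0.5 \cdot 0.12 + 0.5 \cdot 0.10 = 0.110. \nonumber
\end{align}
Under these assumed values, the optimistic rule would select $u$ because of its large best-case reward, whereas the expected-reward rule would select $w$, whose entropy reduction is more reliable across both outcomes.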

Despite this difference, OPTUENT-EXP still ensures that $R(S^{t}, v_t) > 0$ for all $t$ because the expected entropy reduction remains positive as long as there are remaining uncertain nodes. While the selection process is more balanced, it does not lead to the premature selection of fully certain nodes. Instead, it systematically reduces uncertainty across the graph, ensuring that every node is labeled sufficiently over time.

Thus, with $R(\mathbf{S^{t}}, v_t) > 0$ for any positive integers $a_{v}^{t}$ and $b_{v}^{t}$, each vertex continues to be labeled indefinitely as $T \to \infty$. As labeling continues, entropy minimization ensures that the posterior probabilities stabilize, leading to accurate label estimation. This guarantees that the estimated labels converge to the true values, proving that OPTUENT-EXP is consistent.

\section{Ablation Studies}

\subsection{Influence of Sample Size}
\label{sample_size}

\begin{figure}[h]
    \centering
    \includegraphics[width=0.5\textwidth]{Figures/different_sample_sizes.pdf}
    \caption{Performance of OPTUENT-EXP with different sample sizes on the WebKB dataset for a fixed $\theta_{v}=0.65$.}
    \label{fig:dss}
\end{figure}

The experiments presented in Figure \ref{fig:main_plot} and the additional analyses in the Appendix use a sample size of 10. It is intuitive to assume that a larger sample size would yield better performance by providing a larger pool of candidate vertices. To test this assumption, we conducted experiments with sample sizes of 10, 20, and 30 using the OPTUENT-EXP policy, as illustrated in Figure \ref{fig:dss}. The results indicate that a sample size of 10 already performs well, and that the gains from larger sample sizes diminish as the budget increases. A sample size of 10 is therefore sufficient for our experiments and conserves resources while maintaining competitive performance.

\subsection{Performance of Random Forest Regressor}
\label{sec:rfr}

\begin{figure}[h]
    \centering
    \includegraphics[width=0.5\textwidth]{Figures/RFR.pdf}
    \caption{Performance of Random Forest Regressor for Cora and Pubmed datasets.}
    \label{fig:rfr}
\end{figure}

To evaluate the effectiveness of the Random Forest Regressor for instance correlation estimation, we present performance plots for the Cora and Pubmed datasets. In these experiments, we assume reliable workers and use Equation (\ref{eq4}) to compute the marginal probabilities for labeled edges. The regressor is trained exclusively on these labeled edges, and the trained model is then used to predict the correlations of the remaining unlabeled edges.

As illustrated in Figure \ref{fig:rfr}, the Random Forest Regressor performs well even when fewer than 5\% of the edges are labeled. Performance improves further as the proportion of labeled edges increases, exceeding 90\% accuracy with just 15\% of the edges labeled for both datasets. These findings indicate that the Random Forest Regressor effectively estimates instance correlations, making it a suitable model for our proposed task.

\section{Performance Comparison for Scenarios 1 and 2}
\label{sec:scenario_1_and_2}

\begin{figure*}[h]
    \centering
    \includegraphics[width=\textwidth]{Figures/appendix_main_plot_1.pdf}
    \caption{Performance comparison on four graph datasets. The top four plots show the performance comparison of OPTUENT-OPT and OPTUENT-EXP with the baselines following scenario 1 for a fixed $\theta_{v} = 0.65$, and the bottom four plots show the performance comparison for $\theta_{v}$ sampled from the uniform distribution $\mathcal{U}(0.7, 0.85)$.}
    \label{fig:appendix_main_plot_1}
\end{figure*}

\begin{figure*}[t]
    \centering
    \includegraphics[width=\textwidth]{Figures/appendix_main_plot_2.pdf}
    \caption{Performance comparison on four graph datasets. The top four plots show the performance comparison of OPTUENT-OPT and OPTUENT-EXP with the baselines following scenario 2 for a fixed $\theta_{v} = 0.65$, and the bottom four plots show the performance comparison for $\theta_{v}$ sampled from the uniform distribution $\mathcal{U}(0.7, 0.85)$.}
    \label{fig:appendix_main_plot_2}
\end{figure*}

In Figures \ref{fig:appendix_main_plot_1} and \ref{fig:appendix_main_plot_2}, we present a comparative analysis of the baseline methods under scenarios 1 and 2 using the WebKB, Cora, Citeseer, and Pubmed datasets. In scenario 1, where $\theta_{v}$ is fixed at 0.65, and no belief propagation (BP) or random forest regression (RFR) is employed, we assess the Uniform and OPTKG baselines that treat instances as independent and identically distributed (i.i.d.). In scenario 2, which incorporates BP but excludes RFR, we evaluate the performance of GraphOBA-EXP and GraphOBA-OPT alongside the Uniform and OPTKG baselines, leveraging BP to enhance the propagation of labeling information. The results in Figure \ref{fig:appendix_main_plot_1} clearly demonstrate that our proposed policies, OPTUENT-OPT and OPTUENT-EXP, substantially outperform the Uniform and OPTKG baselines in the absence of BP and RFR, highlighting the importance of effective instance selection even without label propagation. In contrast, the performance shown in Figure \ref{fig:appendix_main_plot_2} illustrates a marked improvement when BP is utilized, confirming the findings of \cite{pmlr-v216-kulkarni23a} that propagating labeling information significantly enhances performance, even within constrained budgets.

\section{Performance Comparison for Setting with Fixed $\theta_{v}$}
\label{sec:different_theta}

Figure \ref{fig:webkb_cora} presents a performance comparison of the OPTUENT-OPT and OPTUENT-EXP policies against baseline methods for the WebKB and Cora datasets, and Figure \ref{fig:citeseer_pubmed} presents the same comparison for the Citeseer and Pubmed datasets, under a fixed $\theta_{v}$ setting with values of 0.7, 0.75, 0.8, and 0.85. The results show a clear advantage for baselines employing belief propagation, which consistently outperform those treating instances as independent and identically distributed (i.i.d.). Furthermore, integrating random forest regression significantly enhances the performance of these baseline methods. Across the different $\theta_{v}$ values, our proposed policies, OPTUENT-OPT and OPTUENT-EXP, outperform the baselines, particularly at lower $\theta_{v}$ values where worker labels tend to be of poorer quality. This underscores the effectiveness of our policies in selecting instances for labeling at each time step. Additionally, the ability to estimate instance correlations contributes to improved performance over individual workers across all $\theta_{v}$ values. As $\theta_{v}$ increases and the quality of worker labels improves, the performance of baselines using both random forest regression and belief propagation improves correspondingly, emphasizing the role that label quality plays in the efficacy of these models.

\begin{figure*}[h]
    \centering
    \includegraphics[width=\textwidth]{Figures/webkb_cora.pdf}
    \caption{Performance comparison on the WebKB and Cora datasets. The top four and bottom four plots show the performance comparison of the proposed OPTUENT with baselines for the WebKB and Cora datasets, respectively, where the value of $\theta_{v}$ is set to $0.7$, $0.75$, $0.8$, and $0.85$.}
    \label{fig:webkb_cora}
\end{figure*}

\begin{figure*}[t]
    \centering
    \includegraphics[width=\textwidth]{Figures/citeseer_pubmed.pdf}
    \caption{Performance comparison on the Citeseer and Pubmed datasets. The top four and bottom four plots show the performance comparison of the proposed OPTUENT with baselines for the Citeseer and Pubmed datasets, respectively, where the value of $\theta_{v}$ is set to $0.7$, $0.75$, $0.8$, and $0.85$.}
    \label{fig:citeseer_pubmed}
\end{figure*}

\begin{figure*}[t]
    \centering
    \includegraphics[width=\textwidth]{Figures/main_plot_stdev.pdf}
    \caption{Performance comparison on four graph datasets. The top four plots show the performance comparison between OPTUENT-EXP and GraphOBA-EXP+RFR following scenario 3 for a fixed $\theta_{v} = 0.65$, and the bottom four plots show the performance comparison for $\theta_{v}$ sampled from the uniform distribution $\mathcal{U}(0.7, 0.85)$. We plot the means and standard deviations for experiments obtained from different seed values of 11, 42, and 111.}
    \label{fig:main_plot_stdev}
\end{figure*}

\section{Adapting Proposed Approach}
\label{adapting}
The proposed approach is highly adaptable, effectively addressing both binary and multi-class labeling tasks in homogeneous and heterogeneous graphs. For multi-class tasks, we can seamlessly convert them into binary problems using a one-vs-all strategy. While our current framework infers edge labels from node pair labels, transitioning to heterogeneous graphs will require direct edge label annotations, which can be achieved through a Bayesian framework similar to that used for nodes. With these annotations in place, random forests can be employed to estimate edge labels and their associated uncertainties. Additionally, adapting Belief Propagation techniques for heterogeneous networks, such as those proposed by \cite{eswaran2017zoobp}, will further enhance the model's robustness.
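
One way to write down the one-vs-all reduction mentioned above, as a sketch in notation that is not used elsewhere in this paper, is to assume each vertex $v$ has a class $c_{v} \in \{1, \dots, K\}$ and to define, for each class $k$, the binary label
\begin{align}
z^{(k)}_{v} =
\begin{cases}
+1, & \text{if } c_{v} = k, \\
-1, & \text{otherwise}.
\end{cases} \nonumber
\end{align}
Each of the $K$ binary tasks could then be handled by the binary framework, and a final class could be chosen as $\arg\max_{k} P^{(k)}_{v}(+1)$, where $P^{(k)}_{v}(+1)$ denotes the hypothetical posterior probability obtained for task $k$. This is only one possible construction, and other aggregation rules may be preferable in practice.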

\section{Limitations}
\label{sec:limitations}

Theoretically, the proposed approach can be applied to both binary and multi-class labeling tasks and to both homogeneous and heterogeneous graphs. This work, however, focuses on binary labeling in homogeneous graphs. In addition, although the method is tailored for real-world crowdsourcing scenarios, our experiments use simulated worker behavior because our datasets do not include actual crowd worker labels. Details on adapting the method for multi-class labeling and heterogeneous graphs are provided in Appendix \ref{adapting}.