\section{Introduction}
\label{sec:introduction}
Hiring expert annotators to label a graph can be both time-consuming and costly. A more budget-friendly alternative is to engage non-expert crowd workers. Since these workers lack specialized expertise, it is often recommended to conduct multiple rounds of labeling with different workers to enhance the overall quality. However, compensating crowd workers for each label can quickly escalate costs, especially when repeated labeling is required for every instance.

When operating within a limited data labeling budget, leveraging instance correlations can significantly improve the selection of instances for crowd worker labeling. If two instances are correlated, labeling one provides information about the other, allowing the labeling information to propagate across the graph. Instead of labeling every instance, one can therefore strategically select a smaller subset to label and reduce costs. However, a key challenge arises: instance correlations are typically unknown when the graph lacks annotations. Estimating these correlations while simultaneously identifying the optimal subset of instances for labeling is a complex task.

Previous research on budget allocation has largely neglected the challenge of simultaneously estimating instance labels and their correlations within a graph. Most studies treat instances as independent and identically distributed (i.i.d.), overlooking the correlations that may exist among them~\citep{frazier2008knowledge, chen2013optimistic, li2016crowdsourcing}. Although \citet{pmlr-v216-kulkarni23a} address correlations between adjacent nodes, their approach assumes that these correlations are predetermined, which is problematic, especially in non-homophily graphs. Moreover, their methodology does not extend naturally to estimating instance correlations, as workers cannot directly annotate edges, and the accuracy estimates it relies on do not apply to edge labels. In contrast, our study introduces a method that estimates instance correlations dynamically from the labels that workers provide for adjacent nodes. This allows the labeling budget to be allocated with both the correlations among instances and the estimation of their labels taken into account, improving the effectiveness and efficiency of the labeling process.

% Our experiments show that OPTUENT-EXP and OPTUENT-OPT outperform competitive baselines across four datasets in mid and high budget scenarios, while maintaining stable performance under low budgets.

Our goal is to reduce the uncertainties surrounding both instance labeling and correlation estimation. Since worker-provided labels do not directly annotate instance correlations, the accuracy metrics used in previous studies~\citep{pmlr-v216-kulkarni23a, chen2013optimistic} are inadequate for assessing annotation utility. Instead, we propose measuring the uncertainty of the labeling results. If a Graph Neural Network (GNN) model could accurately estimate the uncertainty of the graph, it could serve as a budget allocator. However, GNN models are known to struggle with the cold-start problem and require a sufficient budget to perform effectively~\citep{wu2020comprehensive}, which conflicts with our goal of leveraging label correlations to significantly reduce labeling costs. We therefore adopt a Bayesian framework and formulate the budget allocation problem as an entropy optimization problem that aims to minimize uncertainty in both instance labeling and correlation estimation.

To tackle this optimization problem, we decompose the expected uncertainty into a sum of \textit{stage-wise rewards}, inspired by the technique of \citet{xie2012sequential}. Our reward function captures the aggregated change in uncertainty over the labels of all instances and the correlation estimates across the graph. The reward is larger when a worker's label leads to a greater reduction in uncertainty.
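
As an informal illustration of this decomposition (a sketch only; the symbols $H_t$, $R_t$, and $T$ are introduced here for exposition and are not the formal definitions used later in the paper), let $H_t$ denote the total uncertainty over instance labels and correlations after $t$ worker labels have been collected, and let $T$ be the labeling budget. The total reduction in uncertainty then telescopes into stage-wise terms:
\begin{equation*}
H_0 - H_T \;=\; \sum_{t=0}^{T-1} \left( H_t - H_{t+1} \right) \;=\; \sum_{t=0}^{T-1} R_t, \qquad R_t := H_t - H_{t+1}.
\end{equation*}
Because $H_0$ is fixed before any label is requested, minimizing the expected final uncertainty $\mathrm{E}[H_T]$ is, under this notation, equivalent to maximizing the expected sum of the stage-wise rewards $\mathrm{E}\bigl[\sum_{t=0}^{T-1} R_t\bigr]$.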

To propagate labeling information throughout the graph, we first need to estimate the instance correlations for all edges. However, edges whose endpoints have not yet received worker labels provide no direct evidence about their correlations. To address this issue, we leverage the intuition that pairs of adjacent instances with similar features are likely to exhibit similar correlations. We propose training a random forest regression (RFR) model~\citep{breiman2001random} on labeled pairs of adjacent instances to infer correlations for unlabeled pairs.

With the estimated instance correlations, we use belief propagation (BP)~\citep{pearl2022reverend} to disseminate labeling information across the graph. To minimize uncertainty, we introduce two policies for selecting instances: OPTUENT-EXP, which selects the instance with the highest expected reward at each stage, and OPTUENT-OPT, which selects the instance with the highest optimistic reward at each stage. Both policies target the labeling budget at the instances whose worker labels are expected to be most informative.

Although our problem setting superficially resembles active learning, it differs significantly in assumptions and goals. Unlike active learning, we assume access only to noisy, non-expert crowd workers, require repeated labeling to infer true label distributions, and jointly model uncertainty over both instance labels and their correlations. These distinctions render classical active learning methods unsuitable for our setting.

In summary, this paper makes the following contributions:
\begin{enumerate}
\item We are the first to estimate instance correlations between adjacent nodes and leverage these correlations to significantly reduce data labeling costs.
\item We introduce an entropy optimization framework that models the uncertainties involved in both instance labeling and correlation estimation.
\item We design a reward function that measures the aggregated change in uncertainty over the label estimates of all instances and the correlation estimates across the entire graph.
\item We employ a random forest regression model to infer correlations for unlabeled pairs of adjacent nodes and use belief propagation to disseminate labeling information throughout the graph.
\item Through extensive experiments on four real-world datasets, we empirically demonstrate the effectiveness of our proposed approach\footnote{https://github.com/kulkarniadithya/OPTUENT}.
\end{enumerate}
