\documentclass[accepted]{uai2023} 
\usepackage[american]{babel}

\usepackage{natbib} 
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} 
\usepackage{booktabs} 
\usepackage{tikz} 


\usepackage{xr}

\makeatletter
\newcommand*{\addFileDependency}[1]{
  \typeout{(#1)}
  \@addtofilelist{#1}
  \IfFileExists{#1}{}{\typeout{No file #1.}}
}
\makeatother

\newcommand*{\myexternaldocument}[1]{%
    \externaldocument{#1}%
    \addFileDependency{#1.tex}%
    \addFileDependency{#1.aux}%
}
\myexternaldocument{kulkarni_590-supp}

\newcommand{\swap}[3][-]{#3#1#2} 
\usepackage{times}
\usepackage{latexsym}
\usepackage{xspace, subfigure}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\newtheorem{prop}{Proposition}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\usepackage{lipsum}
\usepackage{hyperref}
\usepackage{adjustbox}
\usepackage{array}
\usepackage{balance}
\usepackage{multirow}
\usepackage{babel, caption}
\newcommand{\our}{\text{GraphOBA}\xspace}
\newcommand\numberthis{\addtocounter{equation}{1}\tag{\theequation}}
\usepackage[belowskip=-15pt,aboveskip=0pt]{caption}

\setlength{\intextsep}{10pt plus 2pt minus 2pt}

\title{Optimal Budget Allocation for Crowdsourcing Labels for Graphs}

\author[1]{\href{mailto:<aditkulk@iastate.edu>?Subject=Your UAI 2023 paper}{Adithya Kulkarni}{}}
\author[2]{\href{mailto:<mohnac@iastate.edu>?Subject=Your UAI 2023 paper}{Mohna Chakraborty}{}}
\author[3]{Sihong Xie}
\author[4]{Qi Li}

\affil[1,2,4]{%
    Computer Science Dept.\\
    Iowa State University\\
    Ames, Iowa, USA
}
\affil[3]{%
    Computer Science \& Engineering Dept.\\
    Lehigh University\\
    Bethlehem, Pennsylvania, USA
}

  
\begin{document}
\maketitle

\begin{abstract}

Crowdsourcing is an effective and efficient paradigm for obtaining labels for an unlabeled corpus by employing crowd workers. This work considers the budget allocation problem for a generalized setting on a graph of instances to be labeled, where edges encode instance dependencies. Specifically, given a graph and a labeling budget, we propose an optimal policy to allocate the budget among the instances to maximize the overall labeling accuracy. We formulate the problem as a Bayesian Markov Decision Process (MDP), where we define our task as an optimization problem that maximizes the overall label accuracy under budget constraints. Then, we propose a novel stage-wise reward function that considers the effect of worker labels on the whole graph at each timestamp. This reward function is utilized to find an optimal policy for the optimization problem. Theoretically, we show that our proposed policies are consistent when the budget is infinite. We conduct extensive experiments on five real-world graph datasets and demonstrate the effectiveness of the proposed policies in achieving higher label accuracy under budget constraints.

\end{abstract}
\section{Introduction}
\label{sec:intro}

Recently, crowdsourcing platforms like Amazon Mechanical Turk (mTurk)\footnote{\url{https://www.mturk.com/mturk/}} have provided convenient and affordable ways to obtain labels for instances by employing a less expensive crowd of non-expert workers. For labeling data instances, each worker is incentivized with monetary rewards. Each instance can have a different labeling difficulty. Therefore, to properly learn the underlying true label of a hard instance, more workers may be needed than for an easy instance. Given a fixed budget, it is challenging to optimally allocate the labeling budget to a set of instances, as the allocation decisions have to be made in an online manner to gauge the labeling difficulty of the instances while spending the budget\footnote{The code can be found at \url{https://github.com/kulkarniadithya/GraphOBA}}.

Previous methods tackle the challenge from different directions, including how to assign instances to proper workers \citep{zheng2015qasca, zhang2015task}, how to set the price for each worker label \citep{miao2022dynamically, dizaji2020robust}, and how to select an instance for which to query a worker label \citep{sheng2008get, frazier2008knowledge, chen2013optimistic, li2016crowdsourcing}. Specifically, \cite{frazier2008knowledge, chen2013optimistic, li2016crowdsourcing} tackle the budget allocation challenge by proposing policies to choose the instances to label. The budget allocation problems are formulated as optimization problems where the objective is to either maximize the overall label accuracy \citep{frazier2008knowledge, chen2013optimistic} or maximize the labeling quality \citep{li2016crowdsourcing}. These studies treat the instances as i.i.d., assuming no dependencies among them, and solve the optimization problem using a Bayesian Markov Decision Process (MDP) \citep{bellman1957markovian}.

However, instances may be related, and the dependencies can be utilized for budget allocation optimization. Specifically, if two instances are dependent, knowing the label of one instance should help infer the label of the other instance. For example, consider citation networks \citep{bojchevski2017deep}, where vertices are publications and edges indicate citation relationships between publications. The connected publications are thus dependent and likely to belong to the same research area since publications generally cite publications from peers in the same field. Similar observations can be made for social networks \citep{leskovec2012learning}, trust networks \citep{kumar2016edge, kumar2018rev2}, etc.


In this work, we tackle the unique challenge of budget allocation on graphs where each edge connects two dependent instances. The instances (i.e., vertices in the graph) cannot be considered i.i.d.\ since the vertices are dependent. The vertices connected by an edge can have a positive or negative pairwise vertex dependency. Due to this dependency and the graph structure, allocating a unit of labeling budget to a vertex needs to be considered carefully, as obtaining a noisy label can influence the estimation of the labels of the connected vertices in the graph, especially for high-degree vertices that have more influence than low-degree vertices.

To find an optimal budget allocation policy for graph datasets, we adopt the Bayesian setting and formulate an optimization problem for online budget allocation for labeling instances connected in a graph. The objective is to maximize the overall label accuracy under budget constraints. The final expected accuracy is decomposed as a sum of \textit{stage-wise rewards} following the technique proposed in \cite{xie2012sequential}. The problem discussed by \cite{xie2012sequential} is an \textit{infinite-horizon} one that optimizes the stopping time, but \cite{chen2013optimistic} show that the technique can also be applied to \textit{finite-horizon} MDP problems. Our proposed stage-wise expected reward computation considers the aggregated change in the label distributions of all vertices in the graph. To estimate the stage-wise expected reward of annotating a specific vertex, we infer the label distributions of all vertices given the current noisy vertex labels at stage $t$ using belief propagation (BP) \citep{pearl2022reverend}. Specifically, we treat the noisy label(s) of each labeled vertex as the parameters of the prior Beta distribution. Modulated by the dependencies among the vertices, these noisy priors are then propagated to all the vertices in the graph to infer the posterior vertex distributions. We propose two approximate policies using the proposed stage-wise expected reward and prove that the policies are consistent; that is, when the budget $T$ goes to infinity, the accuracy converges to $100\%$ almost surely.


In summary, we make the following main contributions:
\begin{enumerate}
    \item We are the first to address the budget allocation problem in crowdsourcing tasks on a graph of instances where each edge connects dependent instances.
    \item We propose a novel stage-wise reward function that considers the effect of worker labels on the whole graph at each stage. Using this reward function, we propose two approximate policies and theoretically prove that the policies are consistent. To propagate labeling information to other vertices, we model the dependency between vertices as a factor graph and adopt belief propagation.
  \item We conduct extensive experiments and ablation studies on benchmark datasets and empirically validate the effectiveness of the proposed method.
\end{enumerate}
\section{Related Works}
\label{sec:related_works}
The convenient accessibility and affordability of crowdsourcing platforms have motivated many research studies to develop new algorithms and designs for crowdsourcing tasks. 

Motivated by the labeling cost concern, some studies \citep{karger2014budget, zheng2015qasca, zhang2015task, wang2017obtaining, sameki2019buoca, liu2020budget, tu2020crowdwt, yu2020active} focus on instance assignments to workers. These studies jointly learn the worker-instance distribution or the difficulty level of the instances to make an informed decision on which worker to assign.
Another line of studies \citep{zhang2015incentivize, gan2017incentivize, miao2022dynamically, dizaji2020robust, yin2015bonus} focuses on pricing for each worker label. Specifically, \cite{zhang2015incentivize, gan2017incentivize} design an online platform as a reverse auction where workers can bid on a task. \cite{zhang2015incentivize} consider a binary labeling task, whereas \cite{gan2017incentivize} consider a multi-class labeling task. \cite{miao2022dynamically, yin2015bonus} propose dynamic pricing mechanisms to incentivize workers to perform well.

Focusing on how to select instances to query worker labels, \cite{zhou2014optimal} propose to use the aggregate regret to select $K$ arms with the highest expected rewards in a stochastic $n$-armed bandit game. However, this approach does not perform sequential instance selection. More related to our work, \cite{sheng2008get, li2016crowdsourcing, frazier2008knowledge, chen2013optimistic, raykar2014sequential} aim to learn a budget allocation policy for sequential instance selection. \cite{sheng2008get, li2016crowdsourcing} aim to maximize the number of labeled instances while maintaining the quality requirements under the given budget. \cite{sheng2008get} assume that data quality is the same for all instances, whereas \cite{li2016crowdsourcing} assume that data quality is higher for easy instances. \cite{raykar2014sequential} aim to maximize a utility function with consideration of the pull market (i.e., workers may not accept jobs from requesters). \cite{frazier2008knowledge, chen2013optimistic} have goals similar to ours but treat the instances as i.i.d. \cite{frazier2008knowledge} propose a knowledge gradient policy to sequentially select instances to label. \cite{chen2013optimistic} show that the knowledge gradient policy is not consistent and propose an optimistic knowledge gradient policy, which they show to be consistent when the budget is infinite.

Unlike prior works, our goal is to obtain an optimal budget allocation policy for a graph of instances where each edge connects dependent instances. Our proposed policy considers the dependency between instances and the influence of each instance on other instances of the graph to maximize the overall label accuracy under budget constraints. To the best of our knowledge, we are the first to consider a budget allocation policy for a graph of instances.

\section{Preliminaries}
\label{sec:preliminaries}
\subsection{Problem Formulation}
\label{problem_formulation}

Consider a graph $G = (V, E)$. Each vertex $v_i \in V$, for $1 \leq i \leq N$, represents an instance whose true label is unknown, and the edge set $E$ contains edges connecting dependent instances. Since we consider binary labeling tasks, for each edge $e = (v_i, v_j) \in E$, let $C_{ij}$ be a $2 \times 2$ matrix representing the pairwise vertex dependency between the vertices $v_{i}$ and $v_{j}$.
Each vertex $v_i \in V$ is associated with a true label $l_i \in \{+1, -1\}$ and is characterized by its probability of being in class $+1$, denoted by $\theta_{v_i} \in [0, 1]$.
The label provided by a worker for a vertex $v \in V$ at timestamp $t$, denoted by $y_{v_{t}}$, is drawn from the underlying label distribution, $y_{v_{t}} \sim \mathrm{Bernoulli}(\theta_v)$. A label costs one unit of budget. Given a budget $T$, the goal is to maximize the overall label accuracy, which is measured on the inferred true labels of the vertices given the worker labels. Intuitively, with a larger budget, we can estimate $\theta_v$ more accurately for each vertex $v \in V$ and thus achieve better overall label accuracy.


\subsection{Belief Propagation}
\label{belief_propagation}
Belief propagation (BP)  \citep{pearl2022reverend} is a message-passing algorithm. For each edge $(v_i, v_j)$, two messages, $\underset{v_i \rightarrow v_j}{\mu}$ and $\underset{v_j \rightarrow v_i}{\mu}$, are propagated, one in each direction. A message from vertex $v_i$ to vertex $v_j$ essentially contains all the information from the subtree rooted at $v_i$. 

We convert the input graph $G$ into a factor graph $FG$ to apply belief propagation. A factor graph $FG=(V \cup F, E')$ is a bipartite graph with variables $V$ and factors $F$ as vertices and edges $E'$ connecting variables and factors. The variables $V$ of the factor graph are the $N$ instances of $G$, and for each edge $e=(v_i, v_j) \in E$, we add a factor vertex $F_e$. Each factor vertex $f \in F$ has a function $\phi_{f}$ that models the pairwise vertex dependency matrix $C_{ij}$. The factor vertex $F_e$ is connected to the variable vertices $v_i$ and $v_j$ using undirected edges.

For a factor graph $FG$, messages $\mu$ are passed between variable vertex $v \in V$ and factor vertex $f \in F$. The messages are computed differently depending on whether the vertex receiving the message is a variable vertex or a factor vertex.
\begin{align*}
    \underset{v \rightarrow f}{\mu}(x_v) = \prod_{f^* \in \mathcal{N}(v) \setminus \{f\}} \underset{f^* \rightarrow v}{\mu}(x_v), \numberthis 
    \label{eq1}
\end{align*}

\resizebox{.96\linewidth}{!}{
\begin{minipage}{\linewidth}
\begin{align*}
    \underset{f \rightarrow v}{\mu}(x_v) = \sum_{x_{f}'= x_v, x_{v}'= x_v } \left( \phi_{f}(x_{f}') \prod_{v^* \in \mathcal{N}(f) \setminus \{v\}} \underset{v^* \rightarrow f}{\mu}(x_{v^*}')\right), \numberthis
    \label{eq2}
\end{align*}
\end{minipage}
}
where $\forall$ $v \in V$, $x_v \in \{+1, -1\}$ represents the labels that variable vertex $v$ can take, $\mathcal{N}(v)$ and $\mathcal{N}(f)$ represent the sets of neighboring vertices of $v$ and $f$, respectively.
 
In each iteration, an arbitrary vertex is chosen as a \textit{root}, and then messages are passed from leaf vertices in the graph $FG$ to the root (\textit{forward propagation}) and then back to the leaf vertices (\textit{backward propagation}). 
In both \textit{forward} and \textit{backward} propagations, the messages are initiated from variable vertices $v \in V$. The message from each vertex $v \in V$ is initialized with its prior/posterior probability $\omega_{v}$.
Following \textit{forward} and \textit{backward} propagation, the messages between adjacent vertices are updated iteratively as per Eq. (\ref{eq1}) and Eq. (\ref{eq2}) until convergence. Furthermore, the messages are normalized in each step to avoid underflow. Upon convergence, the marginal probability of each variable vertex $v \in V$ is:
\begin{align*}
    P_{v}(x_v) \propto \omega_{v}(x_v) \prod_{j \in \mathcal{N}(v)} \underset{j \rightarrow v}{\mu}(x_v), \numberthis
    \label{eq3}
\end{align*}
where $\omega_{v}(x_v)$ is the prior/posterior probability of $x_v$.
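
As a small illustration of Eqs. (\ref{eq1})--(\ref{eq3}), consider a single edge $(v_1, v_2)$ with one factor vertex $f$. The numbers below are hypothetical and chosen only for illustration, and we assume the usual sum-product reading of Eq. (\ref{eq2}), in which the sum runs over the label of the other endpoint. Let $\phi_{f}(x_{v_1}, x_{v_2}) = 0.9$ if $x_{v_1} = x_{v_2}$ and $0.1$ otherwise, and let $\omega_{v_1} = (0.8, 0.2)$ and $\omega_{v_2} = (0.5, 0.5)$ over the labels $(+1, -1)$. Initializing the message from $v_1$ with $\omega_{v_1}$ gives
\begin{align*}
    \underset{f \rightarrow v_2}{\mu}(+1) &= 0.9 \cdot 0.8 + 0.1 \cdot 0.2 = 0.74, \\
    \underset{f \rightarrow v_2}{\mu}(-1) &= 0.1 \cdot 0.8 + 0.9 \cdot 0.2 = 0.26,
\end{align*}
so that, by Eq. (\ref{eq3}), $P_{v_2}(+1) \propto 0.5 \cdot 0.74$ and $P_{v_2}(-1) \propto 0.5 \cdot 0.26$, i.e., $P_{v_2}(+1) = 0.74$ after normalization. In this example, a confident prior on $v_1$ shifts the belief about its neighbor $v_2$ toward the same label.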

\subsection{KG and OPTKG Policy}
\label{existing_approximate_policies}

Knowledge Gradient (KG)~\citep{frazier2008knowledge} and Optimistic Knowledge Gradient (OPTKG)~\citep{chen2013optimistic} provide policies to sequentially select instances to obtain worker labels. These methods treat the instances as i.i.d.\ and formulate the budget allocation problem as an optimization problem. To find an optimal policy, these methods define a stage-wise reward function. At each timestamp, they select the next instance that maximizes the reward. Specifically,
KG is a single-step look-ahead policy that greedily selects the next instance with the largest expected reward:

\resizebox{0.88\linewidth}{!}{
\begin{minipage}{\linewidth}

\begin{align*}
    v_t = \underset{v}{\mathrm{argmax}} \left( R(S^{t}, v)\ \dot=\ p_1*R_{1}(a_{v}^{t}, b_{v}^{t}) + p_2*R_{2}(a_{v}^{t}, b_{v}^{t})\right), \numberthis
    \label{eq4}
\end{align*}

\end{minipage}
}

where $a_{v}^{t}$ and $b_{v}^{t}$ represent the number of labels belonging to positive and negative classes, respectively, of vertex $v$ at timestamp $t$. $p_1 = \frac{a_{v}^{t}}{a_{v}^{t} + b_{v}^{t}}$ and $p_2 = \frac{b_{v}^{t}}{a_{v}^{t} + b_{v}^{t}}$ are posterior probabilities of $v_{t}$, and $R_{1}(a_{v}^{t}, b_{v}^{t})$, $R_{2}(a_{v}^{t}, b_{v}^{t})$ are the rewards of getting label $+1$ and $-1$, respectively, for vertex $v$.

OPTKG selects the next instance based on the optimistic outcome of the reward:

\resizebox{0.92\linewidth}{!}{
\begin{minipage}{\linewidth}
\begin{align*}
    v_t = \underset{v}{\mathrm{argmax}} \left( R^{+}(S^{t}, v)\ \dot=\ \max(R_{1}(a_{v}^{t}, b_{v}^{t}), R_{2}(a_{v}^{t}, b_{v}^{t}))\right). \numberthis
    \label{eq5}
\end{align*}
\end{minipage}
}

Computationally, both KG and OPTKG have the time complexity $O(NT)$ and space complexity $O(N)$. However, their reward estimation considers each instance separately since instances are considered i.i.d.
\section{Methodology}\label{sec:methodology}

Our goal is to find an optimal budget allocation policy for graph datasets. The policy should properly estimate the underlying true label of each vertex, considering the dependency between vertices and the influence of each vertex on other vertices in the graph. We formulate our problem in the Bayesian setting, as detailed in Section \ref{setup}. Under this Bayesian setup, we define our task as an optimization problem that maximizes the overall label accuracy under the given budget $T$, which we discuss in Section \ref{objective_function}. The optimization problem is formulated as a Markov Decision Process to find the optimal policy. In Section \ref{optimal_policy}, we define our stage-wise expected reward, which considers the probability distribution of every vertex in the graph after each iteration. Finally, we define two approximate policies for the problem and theoretically prove that they are consistent, as detailed in Section \ref{approximate_policy}.


\subsection{Bayesian Setup}
\label{setup}

Since the true labels of the vertices in the graph are unknown, we initialize $\theta_v$ with a Beta prior distribution Beta($\alpha, \beta$). Specifically, $\alpha$ and $\beta$ values are set to $0.1$. This initialization can be interpreted as having $\alpha$ positive and $\beta$ negative pseudo-labels for the vertex $v$ at the initial stage. 

We aim to model each worker label's effect on the whole graph. A worker label can update the marginal probabilities of vertices in the graph. Therefore, we define the state matrix $S^{t}$ as an $N \times 2$ matrix representing the marginal probabilities of the vertices in the graph. At each timestamp, depending on the choice of vertex and the label obtained, the marginal probabilities of vertices are updated, and we transition to the new state $S^{t+1}$.

We can observe that $\{S^t\}$ is a Markov process because $S^{t+1}$ is completely determined by the current state $S^t$, the action $v_t$, and the obtained label $y_{v_{t}}$. Specifically, the change in the marginal probabilities of the vertices in graph $FG$ between timestamps is only due to the action $v_t$ and the obtained label $y_{v_{t}}$. Moreover, suppose we choose $v_t$ to obtain a worker label in the current state $S^t$. In that case, we can calculate the state transition probability $Pr(y_{v_{t}} | S^t, v_t)$, which is the posterior probability of moving to the next state $S^{t+1}$, since each worker label at any given timestamp $t$ is drawn from the underlying label distribution:
\begin{align*}
    Pr(y_{v_{t}} = +1 | S^t, v_t) = \mathbb{E}(\theta_{v_t} | S^{t}) = \frac{\alpha + a_{v}^{t}}{\alpha + a_{v}^{t} + \beta + b_{v}^{t}}, \numberthis
\end{align*}
where $a_{v}^{t}$ and $b_{v}^{t}$ are the numbers of positive and negative worker labels obtained for the chosen vertex $v = v_t$ up to timestamp $t$, and $Pr(y_{v_{t}} = -1 | S^t, v_t) = 1 - Pr(y_{v_{t}} = +1 | S^t, v_t)$.
Following the above labeling process, a filtration $\{\mathcal{F}_t\}_{t=0}^{T}$ is defined, where $\mathcal{F}_t$ is the $\sigma$-algebra generated by the sample path ($v_0, y_{v_{0}}, \ldots, v_{t-1}, y_{v_{t-1}}$). The next vertex to label, $v_t$, is chosen at timestamp $t$ after observing the historical labeling results up to timestamp $t-1$. Therefore, $v_t$ is $\mathcal{F}_t$-measurable. Hence, the process of budget allocation is defined as a sequence of choices $\pi = (v_0, \ldots, v_{T-1})$.
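
As a quick numerical check of the transition probability above, take the prior $\alpha = \beta = 0.1$ and hypothetical counts $a_{v}^{t} = 2$ and $b_{v}^{t} = 1$ for the chosen vertex (illustrative values, not taken from the experiments). Then
\begin{align*}
    Pr(y_{v_{t}} = +1 | S^t, v_t) &= \frac{0.1 + 2}{0.1 + 2 + 0.1 + 1} \\
    &= \frac{2.1}{3.2} \approx 0.656,
\end{align*}
whereas a vertex with no worker labels yet has $Pr(y_{v_{t}} = +1 | S^t, v_t) = \frac{0.1}{0.2} = 0.5$.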


\subsection{Objective Function}
\label{objective_function}
Our goal is to maximize the overall prediction accuracy once the budget is exhausted at timestamp $T$. The true label of each variable vertex $v \in V$ is inferred based on its marginal probability at timestamp $T$. Since the task is binary, we need to determine the positive set $H_{T}$ that maximizes the \textit{conditional} expected accuracy conditioning on $\mathcal{F}_T$:

\resizebox{0.92\linewidth}{!}{
\begin{minipage}{\linewidth}
\begin{align*}
    H_{T} = \underset{H \subset \{1, \ldots, N\}}{\mathrm{argmax}} \mathbb{E} \left( \sum_{v \in H} \mathbf{1}(v \in H^{*}) + \sum_{v \notin H} \mathbf{1}(v \notin H^{*})| \mathcal{F}_T \right), \numberthis
    \label{eq6}
\end{align*}

\end{minipage}
}
where $H^{*}$ refers to the set of vertices with ground-truth label $+1$, $H$ refers to the set of vertices with estimated label $+1$, $H_T$ is the set $H$ that maximizes Eq. (\ref{eq6}), and $\mathbf{1}(\cdot)$ is the indicator function. For $0 \leq t \leq T$, the conditional distribution $\theta_v|\mathcal{F}_t$ is exactly the marginal probability calculated using Eq. (\ref{eq3}), which depends on the historical sampling results only through $S^{t}$. Therefore, we define
\begin{align*}
    P_{v}^{t}(+1) = Pr(v \in H^{*} | \mathcal{F}_t) = Pr(\theta_{v} \geq 0.5 | S^{t}). \numberthis
    \label{eq7}
\end{align*}
\cite{xie2012sequential} show that the final positive set $H_{T}$ can be determined by the Bayes decision rule.

Similar to \cite{chen2013optimistic}, we state the following proposition, which solves Eq. (\ref{eq6}).
\begin{prop}
$H_{T} = \{v : P_{v}^{T}(+1) \geq 0.5\}$ solves Eq. (\ref{eq6}), and the expected accuracy on the right-hand side of Eq. (\ref{eq6}) can be written as $\sum_{v=1}^{N} h(P_{v}^{T}(+1))$, where $h(z) = \max(z, 1-z)$.
\label{prop1}
\end{prop}
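
To illustrate Proposition \ref{prop1} with hypothetical values, suppose $N = 3$ and the final marginals are $P_{1}^{T}(+1) = 0.9$, $P_{2}^{T}(+1) = 0.4$, and $P_{3}^{T}(+1) = 0.5$. Then $H_{T} = \{1, 3\}$, and the expected accuracy on the right-hand side of Eq. (\ref{eq6}) is
\begin{align*}
    \sum_{v=1}^{3} h(P_{v}^{T}(+1)) = 0.9 + 0.6 + 0.5 = 2.0 .
\end{align*}
Vertex $3$ contributes the least because its marginal is maximally uncertain.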

In order to find the optimal policy that maximizes the expected accuracy, the following optimization problem should be solved:

\resizebox{\linewidth}{!}{
\begin{minipage}{\linewidth}
\begin{flalign*}
    V(S^{0}) & \dot= \underset{\pi}{\mathrm{sup}} \ \mathbb{E}^{\pi} \left[ \mathbb{E} \left( \sum_{v \in H_{T}} \mathbf{1}(v \in H^{*}) + \sum_{v \notin H_{T}} \mathbf{1}(v \notin H^{*})| \mathcal{F}_T \right) \right] &\\
    & = \underset{\pi}{\mathrm{sup}} \ \mathbb{E}^{\pi} \left( \sum_{v=1}^{N} h(P_{v}^{T}(+1)) \right), \numberthis
    \label{eq8}
\end{flalign*}
\end{minipage}
}
where $\mathbb{E}^{\pi}$ represents the expectation taken over the sample paths ($v_0, y_{v_{0}}, \ldots, v_{T-1}, y_{v_{T-1}}$) generated by a policy $\pi$. The second equality is due to Proposition \ref{prop1}, and $V(S^{0})$ is the value function at the initial state $S^{0}$. Any policy $\pi$ that attains the supremum in Eq. (\ref{eq8}) is an optimal policy $\pi^{*}$.

\subsection{Optimal Policy}
\label{optimal_policy}
To obtain the optimal policy $\pi^{*}$, we formulate the optimization problem in Eq. (\ref{eq8}) as a Markov Decision Process (MDP). The final expected accuracy is decomposed as a sum of \textit{stage-wise rewards} following the technique proposed in \cite{xie2012sequential}. The problem discussed by \cite{xie2012sequential} is an \textit{infinite-horizon} one that optimizes the stopping time, but \cite{chen2013optimistic} show that the technique can also be applied to \textit{finite-horizon} problems. We account for every vertex in the graph by taking the sum of $h(P_{k}^{t}(+1))$ over all vertices at each timestamp. Then, we define the reward function as the change in this sum between two consecutive timestamps.

\begin{prop}
If the stage-wise expected reward is defined as

\resizebox{.96\linewidth}{!}{
\begin{minipage}{\linewidth}
\begin{align*}
    R(S^{t}, v_{t}) = \mathbb{E} \left( \sum_{k=1}^{N}h(P_{k}^{t+1}(+1)) - \sum_{k=1}^{N}h(P_{k}^{t}(+1)) \,\middle|\, S^{t}, v_{t} \right), \numberthis
    \label{eq9}
\end{align*}
\end{minipage}
}
then the value function in Eq. (\ref{eq8}) becomes:
\begin{align*}
    V(S^0) = G_{0}(S^{0}) + \underset{\pi}{\mathrm{sup}} \ \mathbb{E}^{\pi} \left( \sum_{t=0}^{T-1} R(S^{t}, v_{t}) \right), \numberthis
    \label{eq10}
\end{align*}
where $G_{0}(S^{0}) = \sum_{k=1}^{N} h(P_{k}^{0}(+1))$ and any policy $\pi$ that attains the supremum is the optimal policy $\pi^{*}$.
\label{prop2}
\end{prop}
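
As a brief sketch of why Eq. (\ref{eq10}) holds (the shorthand $G_{t}$ for $t > 0$ is introduced here only for illustration), write $G_{t}(S^{t}) = \sum_{k=1}^{N} h(P_{k}^{t}(+1))$. The realized stage-wise differences telescope along any sample path:
\begin{align*}
    \sum_{t=0}^{T-1} & \left( G_{t+1}(S^{t+1}) - G_{t}(S^{t}) \right) \\
    & = G_{T}(S^{T}) - G_{0}(S^{0}).
\end{align*}
Taking $\mathbb{E}^{\pi}$ on both sides and applying the tower property of conditional expectation turns the left-hand side into $\mathbb{E}^{\pi} \left( \sum_{t=0}^{T-1} R(S^{t}, v_{t}) \right)$ and the right-hand side into the objective of Eq. (\ref{eq8}) minus the constant $G_{0}(S^{0})$.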

The detailed steps of the derivation are provided in Appendix \ref{proof_of_preposition}. Using Proposition \ref{prop2}, the maximization problem in Eq. (\ref{eq8}) is formulated as a $T$-stage MDP (Eq. (\ref{eq10})), which is associated with a tuple $\{T, \{\mathcal{S}^{t}\}, \mathcal{A}, \mathcal{P}^{t}, R(S^{t}, v_{t}) \}$. Here, $\mathcal{S}^{t}$, the state space at stage $t$, is the set of all states that can be reached at stage $t$. Once a label $y_{v_{t}}$ is obtained for a variable vertex $v$ at timestamp $t$, the marginal probabilities of several variable vertices $v' \in V$ can change. Therefore, we have
\begin{align*}
    \mathcal{S}^{t} = \left\{ \{p_{1_v}^{t}, p_{2_v}^{t}\}_{v=1}^{N}: p_{1_v}^{t}, p_{2_v}^{t} \in [0, 1],  p_{1_v}^{t} + p_{2_v}^{t} = 1 \right\}. \numberthis
    \label{eq12}
\end{align*}
The action space $\mathcal{A} = \{1, 2, \ldots, N\}$ is the set of instances that could be labeled next. $\mathcal{P}^{t} = \{P_{1}^{t}, P_{2}^{t}, \ldots, P_{N}^{t}\}$ is the set of marginal probabilities at timestamp $t$ of the variable vertices $v \in V$, and $R(S^{t}, v_{t})$ is the expected reward defined in Eq. (\ref{eq11}). Moreover, due to the Markovian property of $\{S^{t}\}$, it is sufficient to consider a Markovian policy \citep{powell2007approximate}, where $v_t$ is chosen only based on the state $S^{t}$.


\subsection{Efficient Approximate Policy}
\label{approximate_policy}

Our goal is to choose the next vertex for which to obtain a worker label. Since we model the optimization problem as a $T$-stage MDP (Eq. (\ref{eq10})), at each timestamp, we select the vertex with the maximum stage-wise expected reward as the next vertex. At any given state $S^{t}$ at timestamp $t$, if a vertex $v \in V$ is chosen to obtain a worker label, it can receive a label of $+1$ or $-1$. Therefore, to compute the stage-wise expected reward, we need to consider both possibilities. Let $R_{1}(S^{t}, v_{t})$ and $R_{2}(S^{t}, v_{t})$ be the rewards of getting label $+1$ and $-1$, respectively. Then, the expected reward is:
\begin{align*}
    R(S^{t}, v_{t}) = p_1*R_{1}(S^{t}, v_{t}) + p_2*R_{2}(S^{t}, v_{t}), \numberthis
    \label{eq11}
\end{align*}
 where $p_1 = \frac{\alpha + a_{v}^{t}}{\alpha + a_{v}^{t} + \beta + b_{v}^{t}}$ and $p_2 = \frac{\beta + b_{v}^{t}}{\alpha + a_{v}^{t} + \beta + b_{v}^{t}}$ are posterior probabilities of $v_{t}$. Therefore, the next vertex is the one that has the maximum expected reward:

\resizebox{.92\linewidth}{!}{
\begin{minipage}{\linewidth}

\begin{align*}
    v_t = \underset{v}{\mathrm{argmax}} \left( R(S^{t}, v)\ \dot=\ p_1*R_{1}(S^{t}, v) + p_2*R_{2}(S^{t}, v)\right). \numberthis
    \label{eq13}
\end{align*}

\end{minipage}
}

Following Eq. (\ref{eq13}), we can find the next vertex at each timestamp $0 \leq t < T$ and obtain the policy $\hat{\pi} = (v_0, \ldots, v_{T-1})$, which we call \our-EXP. Furthermore, similar to Eq. (\ref{eq5}), we can also choose the next vertex based on the optimistic outcome of the reward:
\resizebox{.96\linewidth}{!}{
\begin{minipage}{\linewidth}

\begin{align*}
    v_t = \underset{v}{\mathrm{argmax}} \left( R^{+}(S^{t}, v)\ \dot=\ \max(R_{1}(S^{t}, v), R_{2}(S^{t}, v))\right). \numberthis
    \label{eq14}
\end{align*}

\end{minipage}
}

We call the policy $\pi^{o} = (v_0, \ldots, v_{T-1})$ obtained following Eq. (\ref{eq14}) \our-OPT.
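
The two selection rules can disagree. As a hypothetical example (the probabilities and rewards below are made up for illustration), suppose vertex $u$ has $p_1 = p_2 = 0.5$, $R_1 = 0.3$, and $R_2 = -0.2$, while vertex $w$ has $p_1 = 0.8$, $p_2 = 0.2$, $R_1 = 0.1$, and $R_2 = 0.05$. Then
\begin{align*}
    R(S^{t}, u) &= 0.5 \cdot 0.3 + 0.5 \cdot (-0.2) = 0.05, \\
    R(S^{t}, w) &= 0.8 \cdot 0.1 + 0.2 \cdot 0.05 = 0.09,
\end{align*}
so \our-EXP selects $w$, whereas $R^{+}(S^{t}, u) = 0.3 > R^{+}(S^{t}, w) = 0.1$ and \our-OPT selects $u$.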

\our-EXP and \our-OPT require computation of $R_{1}(S^{t}, v_{t})$ and $R_{2}(S^{t}, v_{t})$. Therefore, we need to compute the change in the sum of marginal probabilities due to a new label $+1$ and $-1$, respectively. 

For the computation, we utilize the belief propagation (BP) algorithm to propagate labeling information throughout the graph. Each factor vertex $f \in F$ in graph $FG$ has a function $\phi_{f}$ that models the provided pairwise vertex dependency $C_{ij}$. To compute $R_{1}(S^{t}, v_{t})$\footnote{To compute the expected reward of each vertex $v \in V$, temporary parallel copies of the main environment are created so that the main environment is not affected.}, assuming that BP has converged at timestamp $t$, we first compute the marginal probability of each variable vertex $v \in V$ following Eq. (\ref{eq3}). Then, $\sum_{v=1}^{N}h(P_{v}^{t}(+1))$ is computed using the marginal probabilities.

Following \textit{forward propagation}, the messages are passed from the leaf vertices of the factor graph $FG$ to the variable vertex $v_t$. Then, $v_{t}$ is assigned label $+1$ and the current posterior distribution Beta($\alpha + a_{v}^t, \beta + b_{v}^t$) of the variable vertex $v_t$ is updated. Since the Beta distribution is the conjugate prior of the Bernoulli distribution, the posterior of $\theta_{v_t}$ (i.e., $\omega_{v}$) at timestamp $t+1$ is updated to Beta($\alpha + a_{v}^{t+1}, \beta + b_{v}^{t+1}$) = Beta($\alpha + a_{v}^{t} + 1, \beta + b_{v}^{t}$). Once $\omega_{v}$ is updated, the messages are \textit{backward propagated} from the variable vertex $v_t$ to the leaf vertices of $FG$. BP may not converge in one iteration; therefore, the \textit{forward} and \textit{backward} propagation steps are run multiple times without assigning any new label to $v_{t}$. Upon convergence, Eq. (\ref{eq3}) is used to compute the updated marginal probability of each variable vertex $v \in V$, from which $\sum_{v=1}^{N}h(P_{v}^{t+1}(+1))$ is computed. The difference between $\sum_{v=1}^{N}h(P_{v}^{t+1}(+1))$ and $\sum_{v=1}^{N}h(P_{v}^{t}(+1))$ is the reward $R_{1}(S^{t}, v_{t})$. Similarly, $R_{2}(S^{t}, v_{t})$ is computed with the assigned label $-1$.
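
For a concrete (hypothetical) instance of this update, take $\alpha = \beta = 0.1$ and a vertex with $a_{v}^{t} = 2$ and $b_{v}^{t} = 1$, so that the current posterior is Beta($2.1, 1.1$) with mean $2.1/3.2 \approx 0.656$. A simulated label $+1$ gives Beta($3.1, 1.1$) with mean $3.1/4.2 \approx 0.738$, whereas a simulated label $-1$ gives Beta($2.1, 2.1$) with mean $0.5$. These two updated posteriors are the ones propagated through $FG$ when computing $R_{1}$ and $R_{2}$, respectively.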

$R_{1}$ and $R_{2}$ are computed for all vertices $v \in V$ and the next vertex is chosen following Eq. (\ref{eq13}) if \our-EXP is followed and Eq. (\ref{eq14}) if \our-OPT is followed. Once the vertex is chosen, a worker label is obtained for the vertex. The marginal probabilities of each vertex of the main environment are updated by propagating labeling information using belief propagation.

Given the pairwise vertex dependency ($C_{ij}$) among all pairs of adjacent variable vertices $v_{i}$ and $v_{j}$, the next theorem shows that the policies $\hat{\pi}$ and $\pi^{o}$ are consistent for the problem.
\begin{theorem}
Given the pairwise vertex dependency ($C_{ij}$) among all pairs of adjacent variable vertices $v_{i}$ and $v_{j}$ and $\alpha, \beta > 0$, the policies $\hat{\pi}$ and $\pi^{o}$ are consistent, i.e., as the budget $T$ goes to infinity, the accuracy will be $100\%$ almost surely (i.e., $H_T = H^{*}$).
\label{theorem1}
\end{theorem}
To prove the theorem, we show that the marginal probability of each vertex $v \in V$ is updated only due to its posterior probability and the posterior probabilities of leaf vertices in the factor graph $FG$. Then, we show that the reward function is proportional to the change in the posterior probability of the chosen vertex $v_t$, and that both \our-EXP and \our-OPT will label each vertex infinitely many times as the budget goes to infinity. Since we consider workers reliable, if we label each vertex infinitely many times, we will converge to $\theta_{v}$ for each $v \in V$. Therefore, the accuracy will be $100\%$ almost surely, implying that $\hat{\pi}$ and $\pi^{o}$ are consistent policies. The theorem is proved in Appendix \ref{proof_of_theorem}. 
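
The last step of this argument can be sketched as follows, under the simplifying assumption (made here only for illustration) that we track the Beta posterior mean of a single vertex and ignore messages from neighboring vertices. Let $n_v = a_{v}^{t} + b_{v}^{t}$ be the number of labels collected for $v$, a symbol introduced only for this sketch. Then
\begin{equation*}
\frac{\alpha + a_{v}^{t}}{\alpha + \beta + n_v} \rightarrow \theta_{v} \quad \text{almost surely as } n_v \rightarrow \infty,
\end{equation*}
by the strong law of large numbers, since $a_{v}^{t}/n_v \rightarrow \theta_{v}$ and the fixed prior terms $\alpha$ and $\beta$ are dominated by the counts. This is a sketch of the intuition only and does not replace the proof in the appendix.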

\section{Experiments}
\label{sec:experiments}

\begin{table}[t]
\caption{ Statistics of the Datasets} 
\begin{tabular}{c|c|c|c|c|c}
\hline
\textbf{Dataset} & \textbf{\#Vertex} & \textbf{\#Pos} & \textbf{\#Neg} & \textbf{\#Train} & \textbf{\#Test} \\ \hline

Cora & 2708  & 1296  & 1412  & 2166 & 542  \\ \hline
Citeseer & 3312  & 1618  & 1694  & 2650 & 662  \\ \hline
Pubmed & 19717 & 7875 & 11842 & 15774 & 3943 \\ \hline
WebKB & 877 & 415 & 462 & 702 & 175 \\ \hline
Bitcoin & 5881 & 2914 & 2967 & 4705 & 1176 \\ \hline
\end{tabular}
\label{table: Datasets Statistics}
\end{table}

In this section, we evaluate two versions of our proposed approach: \our-EXP, which chooses the next vertex to label following Eq. (\ref{eq13}), and \our-OPT, which chooses the next vertex to label following Eq. (\ref{eq14}). We compare our proposed approaches with baselines on five benchmark graph datasets with different statistics and from different domains. More studies can be found in Appendix \ref{ablation_studies}.


\subsection{Dataset}
The performance of \our is evaluated across five graph datasets. Three of the datasets, Cora, Citeseer, and Pubmed \cite{bojchevski2017deep}, are citation networks, Bitcoin \citep{kumar2016edge, kumar2018rev2} is a trust network between Bitcoin users, and WebKB \citep{craven1998learning} is a dataset that includes web pages from computer science departments of various universities. The datasets are multi-class, so we combine the classes to convert the datasets into binary-class datasets. Each dataset is split randomly into train and test sets in an 8:2 ratio. All policies can only label vertices in the train set. The labeling information of vertices in the train set is propagated to vertices in the test set using belief propagation. The statistics of the datasets can be found in Table \ref{table: Datasets Statistics}, and the pre-processing steps can be found in Appendix \ref{dataset_preprocessing}.


\subsection{Evaluation Metrics}
Since the goal of the proposed method is to maximize accuracy under budget constraints, we compare with the baselines using \textit{accuracy} as the evaluation metric. 
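
One way to write this metric is sketched below. The notation is introduced only for this illustration: $U \subseteq V$ is the evaluated vertex set (train or test), $y_{v}^{*}$ is the ground truth label, and the predicted label $\hat{y}_{v}$ is assumed to be $+1$ when the final marginal probability of class $+1$ is at least $0.5$ and $-1$ otherwise. Under these assumptions,
\begin{equation*}
\text{Accuracy} = \frac{1}{|U|} \sum_{v \in U} \mathbf{1}\big[\hat{y}_{v} = y_{v}^{*}\big].
\end{equation*}
For example, a hypothetical run that correctly labels $500$ of the $542$ Cora test vertices would have an accuracy of $500/542 \approx 92.3\%$.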


\begin{figure*}[t]
    \centering
    \includegraphics[width=\textwidth]{kulkarni_590/Figures/cora.pdf}
    \caption{Performance comparison on datasets that follow the homophily setting. The top three plots show the performance on the train set, and the bottom three plots show the performance on the test set due to the train set labeling information propagation using BP.}
    \label{fig:cora}
\end{figure*}

\subsection{Baseline Methods} 

We obtain instances to label following the baseline policies. Once the policies are obtained, the baselines are compared under two settings: (1) without BP and (2) with BP.

When the budget is lower than two times the number of instances, the KG and OPTKG policies follow a round-robin policy and are equivalent. Since the budget for all our experiments is lower than two times the number of instances, we only compare with OPTKG. We compare with the following baselines\footnote{Though \cite{gittins2011multi} and \cite{nino2011computing} can be used for the problem, they are computationally expensive. The calibration method of \cite{gittins2011multi} and \cite{nino2011computing} and the state-of-the-art exact method of \cite{nino2011computing} require $O(T^3)$ and $O(T^6)$ time and space complexity, respectively. Therefore, we do not compare with them in our experiments.}:

\begin{enumerate}
    \item \textit{Uniform:} Randomly sample one vertex from the train set of the graph.
    \item \textit{OPTKG}: Optimistic Knowledge Gradient \citep{chen2013optimistic} policy that follows Eq. (\ref{eq5}). 
    \item \textit{Uniform+BP}: Uniform policy and then apply belief propagation to propagate labeling information.
    \item \textit{OPTKG+BP}: Optimistic Knowledge Gradient \citep{chen2013optimistic} policy that follows Eq. (\ref{eq5}). Belief propagation is applied to propagate labeling information.
\end{enumerate}

\begin{figure}[t]
    \centering
    \includegraphics[width=0.48\textwidth]{kulkarni_590/Figures/figure2.pdf}
    \caption{Performance comparison on WebKB and Bitcoin datasets. The top two plots show the performance on the train set, and the bottom two plots show the performance on the test set due to the train set labeling information propagation using BP.}
    \label{fig:webkb}
\end{figure}

\subsection{Experimental Settings}

For each dataset, we simulate reliable workers for the experiments. Since the Cora, Citeseer, Pubmed, and WebKB datasets have features for vertices, a logistic regression model is trained on the whole dataset with the vertex features as input. Then, for each vertex $v$ in the dataset, the trained model provides a probability of being in class $+1$, which is used as $\theta_{v}$. The Bitcoin dataset does not have features for vertices. Therefore, we use the ground truth to decide $\theta_{v}$ for each vertex $v$. If the ground truth label of the vertex is $+1$, $\theta_{v}$ is set to $0.93$; otherwise, it is set to $0.07$. Finally, to obtain labels provided by reliable workers for each vertex $v$, random samples are drawn from Bernoulli($\theta_v$). 
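
To illustrate what reliability means in this simulation, consider a Bitcoin vertex with ground truth label $+1$, so that $\theta_{v} = 0.93$. This is a worked example only: majority voting is used here for illustration and is not the aggregation rule of the proposed policies. A single simulated label is correct with probability $0.93$, while the majority of three independent labels is correct with probability
\begin{equation*}
0.93^{3} + 3 \cdot 0.93^{2} \cdot 0.07 \approx 0.986.
\end{equation*}
Repeated labels from such workers therefore concentrate quickly on the correct class.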

The pairwise vertex dependency among adjacent vertices follows a homophily setting in the Cora, Citeseer, and Pubmed datasets. Therefore, each factor vertex for these datasets is initialized with the same value following the homophily setting (the probability of connected vertices having the same label is $0.51$, and the probability of them having different labels is $0.49$).

The WebKB and Bitcoin datasets do not follow the homophily setting. Therefore, ground truth labels are used to infer the pairwise vertex dependency. For a pair of adjacent vertices, the probability of both vertices having the same label is set to $0.9$ if the ground truth labels match; otherwise, it is set to $0.1$. The experiments on these datasets represent the scenario where we know the entire factor graph and the workers are reliable.
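
The factor initializations described above can be written in matrix form. The symbol $\Psi_{ij}$ and the row and column ordering $(+1, -1)$ are notation introduced only for this illustration, and we assume the different-label probability is the complement of the same-label probability. The homophily factor is
\begin{equation*}
\Psi_{ij} = \begin{pmatrix} 0.51 & 0.49 \\ 0.49 & 0.51 \end{pmatrix},
\end{equation*}
while for WebKB and Bitcoin the factor is
\begin{equation*}
\begin{pmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{pmatrix}
\quad \text{or} \quad
\begin{pmatrix} 0.1 & 0.9 \\ 0.9 & 0.1 \end{pmatrix},
\end{equation*}
depending on whether the ground truth labels of $v_i$ and $v_j$ match or differ. As a hypothetical example under the standard sum-product update, a neighbor with belief $(0.8, 0.2)$ sends the message
\begin{equation*}
(0.51 \cdot 0.8 + 0.49 \cdot 0.2,\; 0.49 \cdot 0.8 + 0.51 \cdot 0.2) = (0.506, 0.494)
\end{equation*}
through the homophily factor, but $(0.9 \cdot 0.8 + 0.1 \cdot 0.2,\; 0.1 \cdot 0.8 + 0.9 \cdot 0.2) = (0.74, 0.26)$ through the matching-label factor. The more informative factor thus passes on far more of the neighbor's labeling information.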

Ideally, as per Eq. (\ref{eq13}) and Eq. (\ref{eq14}), we should compute the expected reward for all vertices in the train set at each timestamp and choose the vertex with the maximum expected reward. However, belief propagation (BP) is computationally expensive since each iteration of BP has a time complexity of $O(|V \cup F|^2)$ on the factor graph $FG$, and BP can take several iterations to converge. Therefore, at each timestamp, we uniformly sample $10$ vertices from the train set, compute the expected reward for each of them, and choose the vertex based on the policy. All the experiments are conducted with a random seed value of $11$, and the values of $\alpha$ and $\beta$ in the Beta prior distribution Beta($\alpha, \beta$) are both set to $0.1$. We provide pseudocode for the computation of the optimal policy $\pi^{*}$ for \our-OPT and \our-EXP in Algorithm \ref{alg:optimal_policy} in Appendix \ref{ablation_studies}.
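
To give a rough sense of the savings, consider the following back-of-the-envelope count. It assumes one simulated BP run per candidate vertex and per candidate label, as in the computation of $R_{1}$ and $R_{2}$. Evaluating every train vertex of Cora would need
\begin{equation*}
2 \times 2166 = 4332
\end{equation*}
simulated BP runs per timestamp, whereas sampling $10$ candidates needs only $2 \times 10 = 20$, a reduction by a factor of about $217$.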

\subsection{Results and Discussion}
\label{results_and_discussion}

In Figure \ref{fig:cora}, we compare the two versions of our proposed approach with the baselines on datasets that follow a homophily setting. Considering the results on the train set for different datasets, we can observe that since OPTKG follows a round-robin policy, its performance grows linearly with the budget until all the vertices in the train set are labeled, whereas the performance growth of the Uniform policy follows a near-logarithmic curve.

We can also observe that applying belief propagation significantly improves the performance of both OPTKG and Uniform policies for all three datasets. The results suggest that propagating labeling information by considering dependency among adjacent vertices can help achieve significantly higher performance when the budget is low.

Comparing baselines with the proposed approaches, we can observe that \our-EXP outperforms the baselines for all three datasets, whereas \our-OPT comes in second. The performance improvement of \our-EXP and \our-OPT is significant when the budget is low. The results suggest that the vertices chosen by the proposed reward function are influential in the graph.

Considering the results on the test set for different datasets, since the OPTKG and Uniform policies do not propagate labeling information, their results correspond to the initial prior distribution. Comparing with the remaining baselines, we can observe that \our-EXP outperforms all the baselines for all three datasets, whereas \our-OPT comes second. The observation is similar to our observation for the train set, suggesting the importance of choosing the right vertex at each timestamp when propagating labeling information. Choosing the right vertex can result in a larger change in the marginal probabilities of vertices in the graph, resulting in faster convergence to the true label distribution.

Figure \ref{fig:webkb} compares our proposed approaches with the baselines on WebKB and Bitcoin datasets. The results show that the proposed setup can achieve near $100\%$ accuracy with very little budget. Empirically, this suggests that when the ground truth factor graph is known, applying belief propagation can achieve nearly $100\%$ accuracy. The results suggest that knowing the dependency among adjacent vertices in the graph is important. From the figure, we can observe that the proposed approaches outperform the baselines for the WebKB dataset and achieve comparable performance on the Bitcoin dataset for both train and test sets.


We also conduct experiments without splitting the dataset into train and test sets. The results, along with the discussion, can be found in Appendix \ref{ablation_studies}.

\section{Ablation Studies}

We conduct ablation studies to explore the importance of various hyperparameters. All the experiments are conducted on the entire Cora dataset without splitting it into train and test sets. More studies can be found in Appendix \ref{ablation_studies}. 

\begin{figure}[t]
    \centering
    \includegraphics[width=0.48\textwidth]{kulkarni_590/Figures/figure3.pdf}
    \caption{Ablation study results of experiments with different seed values (left plot) and sample sizes (right plot) on the Cora dataset. We plot the means and standard deviations for experiments obtained from different seed values (left plot), and for experiments with different sample sizes (right plot), we report the performance of \our-EXP.}
    \label{fig:ablation}
\end{figure}

The experiments in Figures \ref{fig:cora} and \ref{fig:webkb} are conducted with a random seed value of $11$. However, different seed values can lead to different results. Therefore, we conduct experiments with different seed values ($1, 11, 42, 78, 96, 111$), and the mean and standard deviation are shown in Figure \ref{fig:ablation} (left plot). We observe from the results that the proposed approaches, as well as the baselines Uniform+BP and OPTKG+BP, have a larger variance when the budget is low, and the variance gradually reduces as the budget increases. However, compared to the baselines, the proposed approaches have lower variance, suggesting that the proposed approaches are more robust. The plot of average results across seed initializations is similar to Figure \ref{fig:cora}, suggesting that \our-EXP and \our-OPT outperform the baselines across seed initializations.

The experiments in Figures \ref{fig:cora} and \ref{fig:webkb} are conducted with a sample size of $10$. Intuitively, one may expect to achieve better performance with a larger sample size since there are more candidate vertices to choose from. Therefore, we conduct experiments with different sample sizes ($10, 20, 30, 40, 50, 60$) using \our-EXP, and the results are shown in Figure \ref{fig:ablation} (right plot). The results confirm that policies with larger sample sizes tend to perform better, but all the policies converge to similar performance when the budget is sufficient. 

We also conduct ablation studies with different initialization of $\alpha$ and $\beta$ and different initialization for factors on the Cora dataset. Results and detailed discussions can be found in Appendix \ref{ablation_studies}.

\section{Conclusion}
\label{sec:conclusion}

This work addresses the budget allocation problem in crowdsourcing tasks on a graph of instances where each edge connects dependent instances. We formulate the problem as an MDP and define the task as an optimization problem that maximizes the overall label accuracy under budget constraints. We propose a novel stage-wise reward function to take advantage of the graph structure and the dependency among vertices. We propose two optimal policies using this reward function and theoretically prove that the policies are consistent when the budget is infinite. To propagate labeling information throughout the graph, we convert the input graph into a factor graph and apply belief propagation. The results on five real-world graph datasets demonstrate the effectiveness of the proposed approach.
\section{Acknowledgements}\label{sec:acknowledgements}
Adithya, Mohna, and Qi were supported in part by the National Science Foundation under NSF grant IIS-2007941.
Sihong was supported in part by the National Science Foundation under NSF Grants IIS-1909879, CNS-1931042, IIS-2008155, and IIS-2145922. 

% References
\bibliography{kulkarni_590}
\end{document}
