%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{abbrvnat}%plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{algorithm}
%\usepackage{algorithmic}
\usepackage[noend]{algpseudocode}
% \usepackage[ruled]{algorithm2e} % For algorithms
\usepackage{amsfonts}
\usepackage{amsthm}
% \newtheorem{proof}{\bf Proof}[section]
\newtheorem{property}{\bf Property}[section]
\newtheorem{theorem}{\bf{Theorem}}[section]
\newtheorem{lemma}{\bf{Lemma}}[section]
\newtheorem{claim}{\bf{Claim}}[section]
\newtheorem{corollary}{\bf Corollary}[section]
\newtheorem{proposition}[theorem]{\bf{Proposition}}
\newtheorem{assumption}[theorem]{\bf{Assumption}}
\newtheorem{Definition}[theorem]{\bf{Definition}}
\newtheorem{remark}[theorem]{\bf{Remark}}
%\newmdtheoremenv{definition}[theorem]{\bf{Definition}}
\def \bx {\mathbf{x}}
\def \by {\mathbf{y}}

\renewcommand{\labelitemi}{$\triangleright$}
\newcommand{\rd}{\color{red}}
\newcommand{\squishlist}{
\begin{list}{$\bullet$}
  { \setlength{\itemsep}{0pt}
     \setlength{\parsep}{0pt}
     \setlength{\topsep}{0pt}
     \setlength{\partopsep}{0pt}
     \setlength{\leftmargin}{0em}
     \setlength{\labelwidth}{0em}
     \setlength{\labelsep}{0.2em} } }

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Content Sharing Design for Social Welfare in Networked Disclosure Game}
% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
% example
\author[1]{\href{mailto:<feiran.jia@psu.edu>}{Feiran Jia}{}}
\author[2]{Chenxi Qiu}
\author[1]{Sarah Rajtmajer}
\author[1]{Anna Squicciarini}
% Add affiliations after the authors
\affil[1]{%
    Information Sciences and Technology,
    Pennsylvania State University,
    Pennsylvania, USA
}
\affil[2]{%
    Computer Science and Engineering,
    University of North Texas,
    Texas, USA
} 
  \begin{document}
\maketitle


\begin{abstract}
This work models the costs and benefits of personal information sharing, or self-disclosure, in online social networks as a networked disclosure game. In a networked population where edges represent visibility amongst users, we assume a leader can influence network structure through content promotion, and we seek to optimize social welfare through network design. 
Our approach treats user interactions non-homogeneously: pairwise engagement amongst users may or may not involve sharing personal information. We prove that this problem is NP-hard. 
As a solution, we develop a Mixed-integer Linear Programming algorithm, which can achieve an exact solution, and also develop a time-efficient heuristic algorithm that can be used at scale.
We conduct numerical experiments to demonstrate the properties of the algorithms and map theoretical results to a dataset of posts and comments in 2020 and 2021 in a COVID-related Subreddit community where privacy risks and sharing tradeoffs were particularly pronounced.
\end{abstract}

\section{Introduction}
\label{sec:intro}
Online social engagement allows users to connect with peers and other contributors through discussion \cite{brake2012, zafarani2013, golbeck2007dynamics}.  
Frequently, online discussions entail \emph{self-disclosure}, the voluntary sharing of 
personal information with others, which can include identifying or sensitive details such as location, age, gender, race, political affiliation, religious beliefs, and cognitive or emotional vulnerabilities~\cite{choi2015}. Despite apparent privacy risks, acts of self-disclosure 
can enhance social rewards \cite{hallam2017} by building trust, promoting empathy, increasing legitimacy and likeability, and deriving social support \cite{de2014mental,seiter2021social}. These trade-offs have become particularly evident over the past three years, during the Covid-19 pandemic \cite{nabity2020inside,blose2020privacy,umar2021self,amosun2021wechat}.

This work leverages a game-theoretic model to formalize tradeoffs between privacy risks and social rewards in an online social network. We define utility at the individual user level, and in doing so, we enable a notion of global \textit{social welfare}. We study social welfare optimization through strategic network design, operationalized through content promotion, as a game-theoretic problem.
 
Our use of game theory is motivated by the view of self-disclosure (SD) as inherently social and strategic behavior. We posit that, implicitly or explicitly, users experience a payoff for sharing behaviors, which is mediated by gains and (privacy) losses. In parallel, a social network provider benefits from engaged communities, where users' gratification and sense of community are maximized. 
In particular, we structure this problem as a Stackelberg game \cite{li2017review} in which a leader moves first, and followers move sequentially thereafter. The OSN platform is the game's leader; it can significantly influence the pairwise interactions defining the social network graph. OSN users are followers in the game; each decides whether and what to disclose about themselves, given their network of peers and perceived costs and benefits of sharing.


We formalize this problem as a bi-level programming problem and prove that this problem is NP-hard. As a solution, we linearize the problem to a mixed integer linear programming problem, which can obtain the optimal solution for a small network (e.g., 32 users). 
In addition, we propose a naive greedy solution and a time-efficient heuristic algorithm for deployment at scale. 
The heuristic dynamically maintains a ranking of users based on their estimated utilities and removes the nodes, together with their associated edges, that contribute least to social welfare.

Finally, we put the networked disclosure game (NDG) into practice through simulations using traces collected from a real dataset of online conversations in a subreddit for COVID-positive users. We first evaluate the performance and computational efficiency of our algorithmic solutions on synthetic networks. Subsequently, we extend our experiments to real-world data, identifying parameter values and network criteria conducive to maximizing social welfare. 

\section{Related Work}
\label{sec:Related}
%\input{related_work.tex}
\noindent \textbf{Self-disclosure.}
Self-disclosure has been studied as an intentional and influenced behavior and has both intrinsic and extrinsic rewards \cite{Abramova2017,huang2016,wang2016modeling}. 
A growing body of work looks at SD both in online communications and particularly in the context of social networks. 
Work has shown that SD can support psychological well-being \cite{wingate2020influence,amosun2021wechat} and plays a role in building and maintaining relationships, social connectedness, and emotional support \cite{barak2007degree,aharony2016relationships,shih2015self,wingate2020influence,amosun2021wechat}. Increasingly, studies have explored self-disclosure in OSNs where users interact with others also seeking these benefits \cite{lee2013lonely,andalibi2016understanding,huang2016,pan2018you,trepte2018mutual,wingate2020influence,Abramova2017,hallam2017}. While several contributions have studied the impacts of self-disclosure in social networks, this has been mostly qualitative work or has relied on small datasets for quantitative observations \cite{huang2016,hallam2017}. To our knowledge, few studies have proposed computational measurement of the social benefits of personal information sharing. 
In \cite{griffin2019} authors proposed a public goods framework to measure self-disclosure benefits as a shared resource, where voluntary disclosure is framed as contributing to the richness of dialogue from which all participants benefit. This framing neglects individual privacy risks in the conceptualization of public good and is insufficient to capture the heterogeneity of individual disclosure decisions as they are influenced by peers.
Our work attempts to fill this important gap by addressing the group-level social benefits of self-disclosure.

Of note, recent work also has focused on automated models to scale self-disclosure labels in small manually-annotated data to larger samples \cite{Bak:2012,bak-etal-2014-self-disclosure,houghton,Umar:2019}. These models employ highly curated dictionaries and extensive feature engineering, which limit the inference process and performance on unseen data. Other recent studies have used transfer-learning techniques on NLP models for labeling self-disclosure in text \cite{pant2020}. We employ state-of-the-art BERT-based models to label our dataset for analysis in this work. Our emphasis is not on detecting utterances of self-disclosure in online conversations but rather on the impact these messages have on the welfare of their broader community.  

\noindent\textbf{Social Influence}. Social networks are extensively used for information diffusion, with concrete examples such as viral marketing~\cite{Domingos:2001:MNV:502512.502525} and targeted advertisement~\cite{target}. These applications have garnered significant attention from the social network analysis community, specifically, the topic of influence maximization \cite{Kempe:2003:MSI:956750.956769}. At a high level, influence maximization aims to select a small subset of \textit{seeds} acting as the source of influence with which the influence is maximized under some information diffusion model, e.g., Independent Cascade \cite{wang2012scalable}, Linear Threshold \cite{goyal2011simpath}, where the influence is measured by the number of affected nodes in the network. 
In contrast to a diffusion setting, \cite{irfan2014influence} propose the Linear Influence Game (LIG) as a simultaneous move game. Our inner networked disclosure game features analogous utility functions to LIG, albeit with binary actions confined to $\{0,1\}$, positive weights, and undirected network structure. 
This enables us to obtain a polynomial time solution for identifying a Pure Strategy Nash Equilibrium (PSNE). 
In contrast to work on influence maximization, our primary goal is to promote content and design the network structure, which is closer to link recommendation~\cite{coro2019recommending, coro2021link}.


\section{Problem Formulation}
\label{sec:problem}
%\input{problem.tex}
We assume the conversational context of an OSN. Namely, users interact and share ``with'' one another at a given time. These interactions can be represented as a network graph. We assume that all users within that context receive a (measurable) reward for participation. In practice, the measurement of reward is likely to be platform-dependent. Here, we consider measures of social engagement, including likes, shares, the sentiment of replies, and similar signals. We assume that the act of self-disclosing comes at a privacy cost to the individual user. This cost can be instantiated in various ways, e.g., proportional to the number of neighbors and other agents, and can be set to any value in $[0,1]$ \cite{Liu:2010}. In our experiments, as discussed in Section \ref{subsec:realexp}, this cost is assumed constant and learned from users' interactions with peers. 

From these assumptions, we cast the problem as a binary networked game.
Specifically, users who share personal information incur an individual cost to privacy but receive social or emotional rewards.  

Central to this formulation is the notion of social welfare \cite{zheleva2009}, defined as 
the total utility of all participating users in the game, 
where utility is measured as reward minus cost. From this perspective, we explore interventions, e.g., content ranking and recommendation, that OSN platforms might incorporate to support greater social welfare. 


% \subsection{Model} \label{sec:model}
In the following, we introduce the mathematical models of (1) the \emph{Networked Disclosure Game} and (2) the \emph{Content Sharing Network Design}. 


\noindent \textbf{Networked Disclosure Game (NDG).} 
The \emph{networked disclosure game (NDG)} is defined on a content promotion network $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{1, ..., n\}$ are the users, and $\mathcal{E} \subseteq \mathcal{V}\times \mathcal{V}$ is a reflection of whether two users can see each other's content (i.e. the interdependencies among the players' utilities).
We use a binary decision variable $x_i \in \{0, 1\}$ ($i = 1, ..., n$) to indicate user $i$'s strategy to self-disclose, i.e.,  $x_i = 1$ if %average self-disclosure of user $i$ is higher than 1; 
user $i$ posts content containing self-disclosure; otherwise $x_i = 0$. We let $\mathbf{x} = (x_1, ..., x_n)$ denote the full action profile. 

As noted above, users engage in self-disclosure to obtain benefits, e.g., emotional or informational support, broadly represented as social connectivity. Our model invokes the concepts proposed by social penetration theory, which suggests that mutual information sharing is a cornerstone for building relationships. In essence, users can reap benefits from their neighbors only when they themselves engage in self-disclosure.
Specifically, given a pair of users $i$ and $j$ sharing information with each other ($x_i = x_j = 1$ and $(i,j) \in \mathcal{E}$), the benefit that user $i$ receives from user $j$ is denoted by a weight $w_{j,i}$ from $j$ to $i$\footnote{For simplicity, we disregard the impact of negative social interactions and assume that interactions generally provide some benefit. Note that since trust and influence are not always symmetric, $w_{j,i}$ and $w_{i,j}$ can take different values; they enter the benefit functions $g_i$ and $g_j$, respectively.}. 
As such, we calculate the total benefit that user $i$ receives from their neighbors $\mathcal{N}_i = \{j \mid (j,i)\in \mathcal{E}\}$
as 
$g_i(x_i, \mathbf{x}_{-i}) = x_i \sum_{j\in \mathcal{N}_i} w_{j,i}x_{j}$, 
where $w_{j,i}$ denotes the potential impact user $j$ has on user $i$, and $\mathbf{x}_{-i}$ denotes the strategies of all users other than $i$.


On the other hand, self-disclosure comes at  
a potential privacy cost for a user, i.e., %the user may disclose his/her
personal information that is voluntarily shared may result in increased privacy risks\footnote{Note that for the purposes of this model, risks may not be immediately tangible or easy to quantify. This is irrelevant since we model users' own decision process and %privacy inclinations, 
perceived rather than objective risks.}. Therefore, we define the utility for each user $i$ as 
\begin{equation}
\label{eq:overallSW}
\textstyle   U_i(x_i, \mathbf{x}_{-i}|\mathcal{G}) = 
%x_i\sum_{j} w_{j,i}x_{j}p_{j,i}}_{\mbox{benefit}}
g_i(x_i, \mathbf{x}_{-i}|\mathcal{G})
-  c_ix_i
\end{equation}
where $c_i >0$ is the cost to user $i$ of disclosing (e.g., perceived privacy loss). % and $w_{j,i}$ denotes the influence from node $j$ to $i$. 
To normalize the weights and the costs, we assume $\sum_{j} w_{j,i} \leq 1$ and $c_i \in (0,1]$.
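
\noindent \textbf{Illustrative example.} As a minimal illustration of Eq.~(\ref{eq:overallSW}), with hypothetical values chosen only for exposition, consider three users on an undirected path with $\mathcal{E} = \{(1,2), (2,3)\}$, weights $w_{2,1} = 0.6$, $w_{1,2} = 0.3$, $w_{3,2} = 0.4$, $w_{2,3} = 0.5$, and costs $c_1 = c_2 = 0.5$, $c_3 = 0.6$. Under the profile $\mathbf{x} = (1,1,0)$,
\begin{align*}
U_1 &= 1\cdot(0.6 \cdot 1) - 0.5 \cdot 1 = 0.1, \\
U_2 &= 1\cdot(0.3 \cdot 1 + 0.4 \cdot 0) - 0.5 \cdot 1 = -0.2, \\
U_3 &= 0\cdot(0.5 \cdot 1) - 0.6 \cdot 0 = 0 .
\end{align*}
If user 3 also discloses, i.e., $\mathbf{x} = (1,1,1)$, then $U_2 = 0.7 - 0.5 = 0.2$ while $U_3 = 0.5 - 0.6 = -0.1$: user 2 gains from the additional disclosing neighbor, but user 3's own benefit does not cover the disclosure cost.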

In this paper, we adopt the \textit{pure strategy Nash equilibrium (PSNE)} as our solution concept, i.e., a profile $\mathbf{x}$ in which no user can gain by unilaterally deviating: $U_i(x_i, \mathbf{x}_{-i}| \mathcal{G}) \geq U_i(1- x_i, \mathbf{x}_{-i}| \mathcal{G})$, $\forall i$. Moreover, we assume that each user $i$ breaks ties in favor of disclosing, i.e., user $i$ chooses to disclose when  
\begin{equation}
\label{eq:threshold}
\textstyle U_i(1, \mathbf{x}_{-i}| \mathcal{G}) \geq U_i(0, \mathbf{x}_{-i}| \mathcal{G}), \mbox{ or simply } \sum_{j\in \mathcal{N}_i} w_{j,i} x_j \geq c_i,
\end{equation}
where Eq.~(\ref{eq:threshold}) is called the \textbf{threshold condition}. 
\begin{theorem}
\label{clm:tc}
The strategy profile $\mathbf{x}$ is a PSNE if and only if (1) every user who discloses satisfies the \textbf{threshold condition} and (2) every user who does not disclose violates it, i.e., $\sum_{j\in \mathcal{N}_i} w_{j,i} x_j < c_i$. (A detailed proof can be found in the supplementary material.)
\end{theorem}
We define social welfare as the sum of the utilities of all users:
\begin{eqnarray}
\textstyle SW(\mathbf{x}|\mathcal{G}) = \sum_{i\in \mathcal{V}}U_i(x_i, \mathbf{x}_{-i}|\mathcal{G})% \\
% &=& \sum_{i\in \mathcal{V}}\left(g_i(x_i + \mathcal{N}_i) - c_ix_i\right)\\
% &=& \sum_{i\in \mathcal{V}}g_i(x_i + \mathcal{N}_i) - \sum_{i\in \mathcal{V}}c_ix_i
\end{eqnarray}
%which is the contribution to social welfare made by user $i$. 
This formulation follows that of \cite{yu2020computing}, which similarly seeks a social-welfare-optimal equilibrium in the context of a binary networked public goods game. \looseness = -1
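
\noindent \textbf{Illustrative example.} As a small sanity check of the threshold condition and the social welfare definition (hypothetical values, for exposition only), consider two connected users with $w_{1,2} = w_{2,1} = 0.8$ and $c_1 = c_2 = 0.5$. The profile $\mathbf{x} = (0,0)$ is a PSNE, since $\sum_{j\in \mathcal{N}_i} w_{j,i}x_j = 0 < c_i$ for both users, and it yields $SW = 0$. The profile $\mathbf{x} = (1,1)$ is also a PSNE, since $0.8 \geq 0.5$ for both users, and it yields
\begin{equation*}
SW(\mathbf{x}|\mathcal{G}) = (0.8 - 0.5) + (0.8 - 0.5) = 0.6 .
\end{equation*}
The profiles $(1,0)$ and $(0,1)$ are not PSNEs, because the disclosing user receives benefit $0 < 0.5$ and would prefer not to disclose. This toy instance therefore has two PSNEs with different social welfare.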

\noindent \textbf{Content Sharing Network Design.} 
OSN platform providers control whose content to show (or promote) among a set of users $\mathcal{V} = \{1,...,n\}$. 
The decision to connect or promote connections between two users $i$ and $j$ can be considered a recommendation. We let $\mathcal{E} \subseteq \mathcal{V}\times \mathcal{V}$ denote the set of recommended connections and $\lambda_{i,j}$ denote the cost of recommending each $e_{i,j} \in \mathcal{E}$. 

Additionally, we assume the platform has a finite set of resources or a budget $B$ to promote users' interactions, i.e., $\sum_{e_{i,j} \in \mathcal{E}}\lambda_{i,j} \leq B$, where the recommended connections are limited to an \emph{input edge set} $\mathcal{E}^{\text{in}}$, i.e., $\mathcal{E}\subseteq \mathcal{E}^{\text{in}}$.
For instance, in practice, the OSN's action space is restricted by its infrastructure, which might only have the capability to promote a limited set of users and their content. %, as well as user time and attention. % This is denoted as a budget
Our goal is to maximize social welfare by solving the \emph{Optimal Self-Disclosure 
 Sharing Problem (OSDSP)}, defined as follows.
 

\begin{Definition}
(OSDSP) Given an input graph $\mathcal{G}^{\text{in}} = (\mathcal{V}, \mathcal{E}^{\text{in}})$, the users' utility functions $U_i(\cdot|\cdot), ~\forall i$, and a budget $B \geq 0$, OSDSP in one shot can be defined as finding an optimal edge set 
%{\rd 
$\mathcal{E} \subseteq \mathcal{E}^{\text{in}}, ~\sum_{e_{i,j}\in \mathcal{E}} \lambda_{i,j}\leq B$, 
%$\mathcal{E} \subseteq \mathcal{E}^{\text{in}}, ~ |\mathcal{E}| \leq B$, 
such that the social welfare of the PSNE of the NDG defined on $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is maximized. When there are multiple PSNEs, only the one with the highest social welfare is considered.



\end{Definition}


\begin{theorem}
\label{thm:NP}
The OSDSP problem is NP-hard.  
\end{theorem}
\begin{proof}
We prove this theorem by constructing a polynomial-time reduction from the NP-complete \emph{subset sum} problem \cite{Algorithm} to OSDSP. Before the proof, we introduce the decision versions of both subset sum and OSDSP: 

\textbf{Definition}: The decision problem of subset sum.
\newline \emph{Instance}: Given a set of $M$ positive integers $\mathcal{A} = \{a_1, ..., a_M\}$ and a target sum value $A$. 
\newline \emph{Question}: Does there exist a subset $\mathcal{A}'\subseteq \mathcal{A}$ such that the sum of the elements in $\mathcal{A}'$ is equal to $A$, i.e., $\sum_{a_i \in \mathcal{A}'} a_i = A$? 

\textbf{Definition}: The decision problem of OSDSP. 
\newline   \emph{Instance}: Given an input network $\mathcal{G}^{\text{in}} = (\mathcal{V}, \mathcal{E}^{\text{in}})$, impact coefficients $\left\{w_{i,j}\right\}_{e_{i,j} \in \mathcal{E}^{\text{in}}}$, recommendation costs (edge costs) $\left\{\lambda_{i,j}\right\}_{e_{i,j} \in \mathcal{E}^{\text{in}}}$, cost coefficients $\left\{c_i\right\}_{i\in \mathcal{V}}$, a constant $U$, and a budget limit $B$.
\newline \emph{Question}: Does there exist a solution % $\left(\mathbf{x}, \mathbf{P}\right)$ 
$\left(\mathbf{x}, \mathcal{G}\right)$ 
such that $SW(\mathbf{x}|\mathcal{G})\geq U$ and $\sum_{e_{i,j} \in \mathcal{E}}\lambda_{i,j} \leq B$?


\begin{figure}[t]
    \centering
    \begin{minipage}{.50\textwidth}
        \centering
      \includegraphics[width = 0.80\linewidth]{./images/NPhard.pdf}
        \caption{The OSDSP instance used in the NP-hardness proof}
        \label{fig:NPhard}
    \end{minipage}
\end{figure}

Given any subset sum problem instance, we can construct the following OSDSP instance: 
\newline 1) There are $2M$ users (nodes) in the network;
\newline 2) As Fig. \ref{fig:NPhard} shows, $\mathcal{G}^{\text{in}}$ is composed of $M$ disjoint sub-graphs $\mathcal{G}^{\text{in}}_1, ..., \mathcal{G}^{\text{in}}_M$, where each $\mathcal{G}^{\text{in}}_i$ is composed of two nodes $v_{2i-1}$ and $v_{2i}$ connected by an edge $e_{2i-1, 2i}$ with edge cost $\lambda_{2i-1, 2i} = a_i$; 
\newline 3) In each $\mathcal{G}^{\text{in}}_i$, the two nodes $v_{2i-1}$ and $v_{2i}$ have their costs $c_{2i-1} = c_{2i} = 1.5a_i$, and  $w_{2i-1,2i} = w_{2i,2i-1} = 2a_i$; 
\newline 4) $U = A$ and $B = A$. 

This reduction process from subset sum to OSDSP is performed in polynomial time. Before showing the correctness of this reduction, we first give Lemma \ref{lem:} as a preparation (the detailed proof can be found in the supplementary material):
\begin{lemma} 
\label{lem:}
For each sub-graph $\mathcal{G}^{\text{in}}_i$, there are only two possible PSNEs:  (1) both nodes $v_{2i-1}$ and $v_{2i}$ self-disclose, or  (2) neither node discloses. 
\end{lemma}
We now show the correctness of the polynomial reduction, i.e., \emph{a solution exists for the subset sum instance if and only if there exists a feasible solution for the OSDSP instance}. Note that in Case (1) of Lemma \ref{lem:}, each node obtains utility $2a_i - 1.5a_i = \frac{a_i}{2}$, so the total social welfare of $\mathcal{G}^{\text{in}}_i$ is 
$\frac{a_i}{2}+\frac{a_i}{2} = a_i$, and the total cost of the promoted edge is $a_i$. In Case (2), the total social welfare of $\mathcal{G}^{\text{in}}_i$ is $0$ (no node discloses), and the total promoted edge cost is $0$.  
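
\noindent \textbf{Illustrative instance.} To make the construction concrete, consider a toy instance with hypothetical numbers: $\mathcal{A} = \{2, 3, 4\}$ and $A = 6$. The corresponding OSDSP instance has $6$ nodes and three disjoint edges with costs $\lambda_{1,2} = 2$, $\lambda_{3,4} = 3$, $\lambda_{5,6} = 4$, node costs $c_1 = c_2 = 3$, $c_3 = c_4 = 4.5$, $c_5 = c_6 = 6$, weights $w_{1,2} = w_{2,1} = 4$, $w_{3,4} = w_{4,3} = 6$, $w_{5,6} = w_{6,5} = 8$, and $U = B = 6$. Promoting $e_{1,2}$ and $e_{5,6}$ and letting their endpoints disclose gives
\begin{align*}
SW &= 2(4 - 3) + 2(8 - 6) = 6 \geq U, \\
\sum_{e_{i,j}\in \mathcal{E}} \lambda_{i,j} &= 2 + 4 = 6 \leq B,
\end{align*}
which corresponds to the subset $\{2, 4\}$ with sum $6$. In contrast, promoting only $e_{3,4}$ yields welfare $2(6 - 4.5) = 3 < U$, whereas promoting all three edges costs $9 > B$.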



% The proof of Lemma \ref{lem:} can be found in Appendix A. 

\noindent $\Rightarrow$: Assume there exists a solution $\mathcal{A}' = \{a_{i_1}, a_{i_2}, ..., a_{i_m}\}$ for the subset sum instance, i.e., 
$\sum_{l=1}^m a_{i_l} = A$. We can then construct a feasible solution of the OSDSP instance: for each sub-graph $\mathcal{G}^{\text{in}}_{i_l}$ ($l = 1,..., m$), we promote the edge $e_{2i_l-1, 2i_l}$ and let both of its nodes disclose, i.e., $x_{2i_l-1} = x_{2i_l}= 1$; all other nodes do not disclose. The total social welfare over these sub-graphs is equal to $\sum_{l=1}^m a_{i_l}  = A \geq U$ and the total edge cost is $\sum_{l=1}^m a_{i_l}  = A \leq B$. 


\noindent $\Leftarrow$: Assume there exists a solution $\left(\mathbf{x}, \mathcal{G}\right)$ of the OSDSP instance in which the edges (resp. nodes) of the sub-graphs $\mathcal{G}^{\text{in}}_{i_1}, \mathcal{G}^{\text{in}}_{i_2}, ..., \mathcal{G}^{\text{in}}_{i_m}$ are promoted (resp. disclose). Hence,  
\begin{align}
SW(\mathbf{x}|\mathcal{G}) &= \sum_{l=1}^m a_{i_l} \geq U = A, \\
\sum_{e_{i,j}\in \mathcal{E}} \lambda_{i,j} &= \sum_{l=1}^m a_{i_l} \leq B = A,
\end{align}
which together imply that $\sum_{l=1}^m a_{i_l} = A$. Hence $\mathcal{A}' = \{a_{i_1}, ..., a_{i_m}\}$ is a feasible solution of the subset sum instance. 
\end{proof}


\section{Algorithm Design}\label{sec:algorithms}
%\input{algorithm.tex}

Due to the hardness of OSDSP (Theorem \ref{thm:NP}), in this section, we aim to design algorithms that achieve near-optimal solutions with high time efficiency.  

OSDSP can be decomposed into two subproblems:
\newline \textbf{Inner problem} - Given a promotion network $\mathcal{G}$, how to find an optimal PSNE $\mathbf{x}$ of NDG to maximize the social welfare. The inner problem establishes a relationship between the promotion network $\mathcal{G}$ and its maximum social welfare, denoted by $\sigma(\mathcal{G})$ (formally defined in Definition \ref{def:sigmafunc}). 
\newline \textbf{Outer problem} - Given $\sigma(\mathcal{G})$, how to find the optimal promotion network $\mathcal{G}^*$ to maximize the social welfare while satisfying the budget limit.
As in \cite{coro2021link}, we assume for simplicity that promoting each piece of content has an equal cost, since in practice each piece of content occupies a similar amount of online space when displayed.
Then the outer problem can be formulated as $\mathcal{G}^* = \arg \max_{|\mathcal{E}|\leq B} \sigma(\mathcal{G})$. 

To solve the inner problem, we design an iterative algorithm in Section \ref{subsec:inner} and prove that it finds the optimal PSNE. For the outer problem, we prove in Section \ref{subsec:outer} that it is supermodular under certain conditions ($\frac{c_i}{w_{j,i}} \leq 1, \forall (i, j) \in \mathcal{E}^{\text{in}}$). Based on the insights obtained in Section \ref{subsec:inner} and Section \ref{subsec:outer}, we design two time-efficient heuristic algorithms to solve OSDSP in Section \ref{subsec:heuristic}. Finally, to evaluate how close the heuristics come to the optimum of OSDSP, in Section \ref{subsec:milp} we design a \emph{mixed integer linear programming (MILP)}-based method that obtains the optimal solution at a relatively small scale and serves as a benchmark for the heuristics in Section \ref{sec:exp}. 

\subsection{Inner Problem: Solving the NDG}
\label{subsec:inner}

Given a promotion network, the NDG may have multiple PSNEs. For example, $\bx = \mathbf{0}$ is always a PSNE, since every $c_i > 0$; we call it the trivial PSNE. We are interested in the optimal PSNE that maximizes social welfare in the inner problem.

To find the optimal PSNE, we design an iterative algorithm called the \emph{MaxInvest} algorithm. The pseudo-code is shown in Algorithm \ref{alg:MaxInvest}. Here, we use the superscript $^{(k)}$ to denote the values set/derived in iteration $k$. 


The algorithm starts by assuming that every node $i$ ($i = 1, ..., n$) discloses, i.e., $x^{(0)}_i = 1$, and initializing a \emph{potential benefit value} $h^{(0)}_i = \sum_{j\in \mathcal{N}_i} w_{j,i}x^{(0)}_j$ for each node (line 2). It then checks the threshold condition $h^{(0)}_i \geq c_i$ of each node $i$ and pushes every node that violates it onto a queue $Q$ (line 3). Since deactivating a node also decreases its neighbors' potential benefit values, in the next part (lines 4--12) we iteratively check the threshold conditions of the deactivated nodes' neighbors: in each iteration $k$, we pop the front node off $Q$ and deactivate it by setting its strategy to $0$ (lines 6--7). We then update and check its neighbors' threshold conditions (lines 9--10) and push the ones violating the conditions onto $Q$ (line 11). This process is repeated until the queue is empty. 

\begin{algorithm}[h]
	\caption{MaxInvest}\label{alg:MaxInvest}
	\begin{algorithmic}[1]
	\Procedure{MaxInvest}{$\mathcal{E}$}       
		\State \textbf{Initialization:} an empty queue $Q$; a strategy profile $\mathbf{x}^{(0)} = \left(x_1^{(0)}, ..., x_n^{(0)}\right)$, where $x^{(0)}_i = 1$, $\forall i \in \mathcal{V}$; a potential benefit array $\left\{h^{(0)}_1,...,h^{(0)}_n\right\}$, where $h^{(0)}_i = \sum_{j\in \mathcal{N}_i} w_{j,i}x^{(0)}_j$.
        % \State Calculate $h_i = \sum_{j\in \mathcal{N}_i} w_{j,i}x_j$ for each agent. 
        \State Check the \emph{threshold condition}, i.e. $h^{(0)}_i \geq c_i$ for each node and push those who violate the condition onto $Q$;
        \State Iteration index $k \leftarrow 1$; 
        \While{$Q$ is not empty}
        \State Pop the front node $i$ off $Q$; 
        \State Deactivate node $i$ by setting $x^{(k)}_i = 0$; 
        
        \For{each neighbor $j$ of node $i$}%  \fj{remove with $x_j = 1$}}
        \State Update the potential benefit $h^{(k)}_j = h^{(k-1)}_j - w_{i,j}$; 
        \If{$h^{(k)}_j < c_j$, $x_j^{(k-1)} = 1$, and $j \notin Q$}
            \State Push node $j$ onto $Q$;
        \EndIf 
        \EndFor
        \State $k \leftarrow k +1$; 
        % \State Check the updated \emph{threshold condition} for each neighbour of node $i$ and push those who disclose but violate the condition onto $Q$.
    \EndWhile
	\EndProcedure
	\State \textbf{Return:} $\textbf{x}^{(k)}$
	\end{algorithmic}
\end{algorithm}
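
\noindent \textbf{Illustrative trace.} To illustrate Algorithm~\ref{alg:MaxInvest}, consider the following toy instance (hypothetical values, for exposition only): a path $1 - 2 - 3 - 4$ with symmetric weights $w_{1,2} = w_{2,1} = 0.3$, $w_{2,3} = w_{3,2} = 0.4$, $w_{3,4} = w_{4,3} = 0.5$, and costs $c_1 = 0.4$, $c_2 = 0.5$, $c_3 = 0.45$, $c_4 = 0.3$.
\begin{itemize}
\item Initialization: $(h^{(0)}_1, h^{(0)}_2, h^{(0)}_3, h^{(0)}_4) = (0.3, 0.7, 0.9, 0.5)$. Only node 1 violates its threshold condition ($0.3 < 0.4$), so $Q = [1]$.
\item $k = 1$: node 1 is popped and deactivated. Its neighbor, node 2, is updated to $h_2 = 0.7 - 0.3 = 0.4 < c_2$, so node 2 is pushed onto $Q$.
\item $k = 2$: node 2 is popped and deactivated. Node 1 is updated to $h_1 = 0$ but is already deactivated; node 3 is updated to $h_3 = 0.9 - 0.4 = 0.5 \geq c_3$. Nothing is pushed, and $Q$ is empty.
\end{itemize}
The returned profile is $\mathbf{x} = (0,0,1,1)$, with social welfare $(0.5 - 0.45) + (0.5 - 0.3) = 0.25$. Users 3 and 4 satisfy the threshold condition ($0.5 \geq 0.45$ and $0.5 \geq 0.3$), while users 1 and 2 do not ($0 < 0.4$ and $0.4 < 0.5$).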

Next, we prove that the users' action profile $\mathbf{x}$ returned by MaxInvest is both \emph{feasible} (in Theorem \ref{thm:PSNE}), i.e., $\mathbf{x}$ is a PSNE, and \emph{optimal} (in Theorem \ref{thm:optimal}), i.e., it achieves the maximum social welfare. \looseness = -1


\begin{lemma}
\label{lem:mono}
Let $h_i^{(k)}$ denote $h_i$ in the $k$th iteration in MaxInvest. For any iteration $k_1 < k_2$, we have $h_i^{(k_1)} \geq h_i^{(k_2)}$, i.e., the potential benefit of each node $i$ is non-increasing over iterations in MaxInvest. 
\end{lemma}
\begin{proof}
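MaxInvest only deactivates nodes and never reactivates them, so $x^{(k_1)}_j \geq x^{(k_2)}_j$ for every node $j$ whenever $k_1 < k_2$. Since all weights are non-negative, $h_i^{(k_1)} - h_i^{(k_2)} = \sum_{j\in \mathcal{N}_i} w_{j,i}\left(x^{(k_1)}_j - x^{(k_2)}_j\right) \geq 0$.  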
\end{proof}

\begin{theorem} \label{thm:PSNE}
$\bx$ returned by MaxInvest is a PSNE. 
\end{theorem}

\begin{proof}
Let $K$ be the total number of iterations of MaxInvest. We need to prove that (1) for any $x_i = 1$, $h_i^{(K)} \geq c_i$, and (2) for any $x_i = 0$, $h_i^{(K)} < c_i$. 

\noindent (1) For the sake of contradiction, assume that there exists a node $i$ with $x_i = 1$ such that $h_i^{(K)} < c_i$. First, $h_i^{(0)} \geq c_i$; otherwise node $i$ would have been pushed onto $Q$ at initialization and subsequently deactivated. According to Lemma \ref{lem:mono}, there must exist an iteration $k$ ($k \leq K-1$) such that $h_i^{(k)} \geq c_i$ and $h_i^{(k+1)} < c_i$. The decrease of $h_i$ at iteration $k+1$ must be caused by the deactivation of one of $i$'s neighbors, say node $j$. When $j$ is deactivated, the updated threshold condition of its neighbor $i$ is checked; since $h_i^{(k+1)} < c_i$, node $i$ should have been added to $Q$ and deactivated before the algorithm terminates, which contradicts the assumption that $x_i = 1$. 

\noindent (2) For each $x_i = 0$, we let $k_i$ denote the iteration when node $i$ is removed, which indicates that $h_i^{(k_i)} < c_i$. According to Lemma \ref{lem:mono}, $h_i^{(K)} \leq h_i^{(k_i)} < c_i$. The proof is completed. 
\end{proof}


\begin{lemma} \label{lem:maxInvestRel}
Suppose $\bx$ is the profile returned by MaxInvest. For any PSNE profile $\bx'$, we have $\bx' \leq \bx$ component-wise.%$I(x') \subseteq I(x)$.
\end{lemma}


\begin{theorem}
\label{thm:optimal}
(Optimality) Given $\mathcal{G}$, the strategy profile $\mathbf{x}$ returned by Algorithm~\ref{alg:MaxInvest} is a PSNE that maximizes the social welfare.
\end{theorem}


\subsection{Analysis of the Outer Problem}
\label{subsec:outer}

In this part, we analyze the properties of the outer problem under certain conditions. %even though the problem is NP-hard in general. 
Note that the MaxInvest algorithm for the inner problem has established the relationship between any promotion network $\mathcal{G}$ and the maximum social welfare it can achieve. Such a relationship can be described by the optimal social welfare function (Definition \ref{def:sigmafunc}): 
\begin{Definition}
\label{def:sigmafunc}
We define the optimal social welfare function $\sigma$ as the map from a given edge set $\mathcal{E} \subseteq \mathcal{E}^{in}$ to the social welfare of the profile $\bx$ returned by MaxInvest: 
$\sigma(\mathcal{E}) = SW(\bx|(\mathcal{V}, \mathcal{E}))$, where $\bx = \text{MaxInvest}(\mathcal{E})$. 
\end{Definition}

Accordingly, the outer problem of OSDSP can be rewritten as: 
\begin{eqnarray}
\max~ \sigma(\mathcal{E}) & \mathrm{s.t.}~ |\mathcal{E}| \leq B. 
\end{eqnarray}
Note that we do not have a closed form for the function $\sigma\left(\cdot\right)$; it can only be evaluated by running the MaxInvest algorithm (Algorithm \ref{alg:MaxInvest}). Next, we prove that $\sigma(\mathcal{E})$ is \emph{monotonic} (Theorem \ref{thm:monotonicity}) and that it is super-modular when $\frac{c_i}{w_{j,i}} \leq 1, \forall (i, j) \in \mathcal{E}^{in}$ (Theorem \ref{thm:sup}). \looseness = -1% , which paves the way for us to design the heuristic algorithm (in Section \ref{subsec:heuristic}).  

%\fj{We many add some discussion of the trivial solution}
\begin{Definition}
We define the optimal investment function of a given edge set $\mathcal{E} \subseteq \mathcal{E}^{in}$ as the number of investing agents in the profile $\bx$ returned by MaxInvest: $I(\mathcal{E}) = \sum_{i\in \mathcal{V}} x_i$, where $\bx = \text{MaxInvest}(\mathcal{E})$.
\end{Definition}


\begin{theorem}
\label{thm:monotonicity}
(Monotonicity) 
For all pairs of edge sets $\mathcal{S}$ and $\mathcal{T}$ such that $\mathcal{S}\subseteq \mathcal{T} \subseteq \mathcal{E}^{in}$, we have (1) $\text{MaxInvest}(\mathcal{S}) \leq \text{MaxInvest}(\mathcal{T})$, (2) $I(\mathcal{S}) \leq I(\mathcal{T})$, and (3) $\sigma(\mathcal{S}) \leq \sigma(\mathcal{T})$.
\end{theorem}



\begin{theorem}
\label{thm:sup} 
The optimal social welfare function $\sigma(\mathcal{E})$ is super-modular when $\frac{c_i}{w_{j,i}} \leq 1, \forall (i, j) \in \mathcal{E}^{in}$.  
\end{theorem}

\begin{proof}
We prove that for any edge $e^* \in \mathcal{E}^{in}$ and all pairs of sets $\mathcal{S}\subseteq \mathcal{T} \subseteq \mathcal{E}^{in}$, $\sigma(\cdot)$ satisfies 
$\sigma(\mathcal{S}\cup \{e^*\}) - \sigma(\mathcal{S}) \leq \sigma(\mathcal{T}\cup \{e^*\}) - \sigma(\mathcal{T})$. The detailed proof, together with the general cases for both the optimal investment function and the social welfare function, is presented in the supplementary material.
\end{proof}

\subsection{Heuristic Algorithms of the Outer Problem}
\label{subsec:heuristic}


\begin{algorithm}[H] 
\caption{Greedy($\mathcal{G}^{in} = (\mathcal{V},\mathcal{E}^{in}),B$)} \label{alg:greedy}
\begin{algorithmic}[1] 
\State Initialize $\mathcal{E} = \emptyset$;
\While{$|\mathcal{E}| < B$}
\ForAll{edge $e\in\{\mathcal{E}^{in}\backslash \mathcal{E}\}$} 
\State $\Delta_e = \sigma(\mathcal{E}\cup \{e\}) - \sigma(\mathcal{E})$;  
\EndFor 
\State $e_* = \arg\max_e \Delta_e$; $\mathcal{E} = \mathcal{E} \cup \{e_*\}$;
\EndWhile
\State \textbf{Output} $\mathcal{E}$
\end{algorithmic} 
\end{algorithm}

Based on the insights obtained from Section \ref{subsec:inner} and Section \ref{subsec:outer}, we provide two heuristic algorithms to solve OSDSP. 
\newline \textbf{Greedy}. The first, naive heuristic is the Greedy algorithm, whose pseudo-code is shown in Algorithm \ref{alg:greedy}. The algorithm initializes the promoted edge set $\mathcal{E}$ as the empty set (line 1), and then repeatedly selects the edge with the highest marginal gain in SW and adds it to $\mathcal{E}$ (lines 2--5) until the number of promoted edges reaches the budget. 
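As a rough cost estimate (a sketch, assuming that the loop adds exactly $B$ edges, that every candidate edge is evaluated with one call to MaxInvest, and that no evaluations are cached), the number of MaxInvest calls made by Greedy is
\begin{equation}
\nonumber \sum_{t=0}^{B-1} \left(|\mathcal{E}^{in}| - t\right) = B\,|\mathcal{E}^{in}| - \frac{B(B-1)}{2}.
\end{equation}
For instance, with $|\mathcal{E}^{in}| = 1024$ and $B = 96$ (the setting of Figure~\ref{fig:sw_complete}), this gives $96 \cdot 1024 - 4560 = 93{,}744$ calls, which suggests why Greedy becomes expensive as the candidate edge set grows.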




\noindent \textbf{RankHeuristic (Rank)}.  
The pseudo-code of the second heuristic, RankHeuristic, is shown in Algorithm \ref{alg:Rank}. The basic idea of the algorithm is to first initialize the promoted edge set $\mathcal{E}$ as $\mathcal{E}^{in}$ (line 2), use MaxInvest to obtain the optimal action profile $\mathbf{x}$ (line 3), and rank the active nodes by their utility $U_i$ (line 4). After that, the algorithm iteratively deactivates the node with the least potential contribution to social welfare (lines 6--7) and removes the edges directed from that node (lines 8--9) until the budget is satisfied, i.e., $|\mathcal{E}| \leq B$. 
\begin{algorithm}
	\caption{RankHeuristic}\label{alg:Rank}
	\begin{algorithmic}[1]
	\Procedure{Rank}{$\mathcal{G} = (\mathcal{V}, \mathcal{E}^{\text{in}}), B$}       
		\State \textbf{Initialization: } $\mathcal{E} =   \mathcal{E}^{\text{in}}$; 
		\State $\bx = \text{MaxInvest}(\mathcal{E})$;  
		%\State Calculate a benefit array $\left\{h_1,...,h_n\right\}$, where $h_i = \sum_{j\in \mathcal{V}_i} w_{j,i}x_j$\
        \State Sort the active nodes (with $x_i = 1$, $U_i>0$) in ascending order of $U_i$ (the ordered list of users is denoted by $Q$); 
%        \State Remove the nodes with $x_i = 0$ from $Q$;
        \State Remove from $\mathcal{E}$ the edges incident to the inactive nodes ($x_i = 0$); 
        \While{$|\mathcal{E}| > B$ and $Q$ is not empty}
            \State Remove the node $i$ with least utility in $Q$;
            %\State Set $x_i = 0$; %Remove the node $i$ from the graph $G$
            \For{$i$'s neighbor $j$}
                \State Remove $e_{i,j}$ from $\mathcal{E}$; 
                %\State Update $h_j = h_j - w_{i,j}$, and $U_j = h_j - c_j$
                %\If{$U_i < 0$}
                %\State Set $x_j = 0$, and $U_j = 0$
                %\EndIf
            \EndFor
            \State $\bx = \text{MaxInvest}(\mathcal{E})$; 
            \State Compute the utility of the remaining users; 
            %\State  Update $U_i$ of each user $i$
            \State Re-sort the nodes by utility and update $Q$; 
            %\State Remove nodes with $x_i = 0$ or $U_i =0$ from $Q$;
    \EndWhile
	\EndProcedure
	\State \textbf{Return:} $\mathcal{E}$
	\end{algorithmic}
\end{algorithm}

\subsection{A Mixed Integer Linear Programming Formulation} 
\label{subsec:milp}

We also give an exact formulation, solvable at a relatively small scale, as a mixed integer linear program (MILP). We let the indicator variable $p_{i,j}$ denote whether the connection between users $i$ and $j$ is promoted, i.e., if $e_{i,j} \in \mathcal{E}$, $p_{i,j} = 1$; otherwise $p_{i,j} = 0$. As we assume visibility to be symmetric (i.e., $\mathcal{E}$ is undirected), we have the constraints $p_{i,j} = p_{j,i}, ~ \forall i,j \in \mathcal{V}$. We let $\mathbf{P} = \left\{p_{i,j}\right\}_{n \times n}$. The mixed integer programming (MIP) version of OSDSP can be formulated as 
\normalsize
\begin{eqnarray}
\label{eq:obj1}
\max_{\mathbf{P}} && SW(\mathbf{x},\mathbf{P}) = \sum_{i\in \mathcal{V}}x_i \sum_{e_{j,i} \in \mathcal{E}^{\text{in}}} w_{j,i}x_{j}p_{j,i}\\ \label{eq:obj2}
\mathrm{s.t.}
\label{eq:constraint1}
&& x_i \in \arg \max_{x_i \in \{0, 1\}} x_i \left(\sum_{e_{j,i} \in \mathcal{E}^{\text{in}}} w_{j,i}x_{j}p_{j,i} - c_i\right), \forall i  \\
\label{eq:constraintbudget}
&& \sum_{e_{i,j} \in \mathcal{E}^{\text{in}}} \lambda_{i,j} p_{i,j}  \leq B\\ \label{eq:constraint3}
&& x_i \in \{0, 1\},~\forall i \in \mathcal{V}, 
p_{i,j} \in \{0, 1\},~\forall e_{i,j} \in \mathcal{E}^{\text{in}}
\end{eqnarray}
\normalsize
The constraint Equ. (\ref{eq:obj2}) can be replaced by Equ. (\ref{eq:obj22}): 
\begin{equation}
\label{eq:obj22}
\textstyle x_i \sum_{j} x_{j}w_{j,i}p_{j,i}-c_ix_i \geq \left(1 - x_i\right) \sum_{j} x_{j}w_{j,i}p_{j,i}-c_i\left(1-x_i\right)
\end{equation}
i.e., user $i$ achieves at least as high a utility by choosing $x_i$ as by choosing $1-x_i$, which reduces the bi-level structure of OSDSP to a single-level MIP problem. Here, we can linearize Equ. (\ref{eq:obj22}) by introducing intermediate variables $m_{j,i}$ for all $i,j \in \mathcal{V}$, such that: 
\begin{equation}
    0 \leq m_{j,i} \leq Mx_j \mbox{ and } w_{j,i}p_{j,i} - M(1-x_j) \leq m_{j,i} \leq w_{j,i}p_{j,i}. \label{lin1}
\end{equation}
Accordingly, each product $w_{j,i}p_{j,i}x_j$ in Equ. (\ref{eq:obj22}) can be replaced by $m_{j,i}$, since 
\begin{equation}
\nonumber    m_{j,i} = \left\{\begin{array}{ll} 
    0 & \mbox{if $x_j = 0$} \\ 
    w_{j,i}p_{j,i} & \mbox{if $x_j  = 1$}
    \end{array}\right. \Rightarrow m_{j,i} = w_{j,i}p_{j,i}x_j.
\end{equation}
Then we linearize the non-linear term $x_i \sum_{j} m_{j,i}$ by introducing variables $v_i = x_i (\sum_{j}m_{j,i}) \geq 0,\ \forall i \in \mathcal{V}$ and a sufficiently large positive constant $M$. The following constraints should be satisfied:
\begin{eqnarray}
\textstyle    \sum_j  m_{j,i} - M (1-x_i) \leq v_i \leq \sum_{j} m _{j,i}, \forall i\\ \text{ and } 0 \leq v_i \leq M x_i, \forall i% \label{lin2}\\\  \forall i 
    \label{lin3}
\end{eqnarray}
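As a sanity check on these constraints (a sketch under the assumption, left implicit above, that $M \geq \sum_{j} w_{j,i}$ for every $i$), consider the two possible values of $x_i$:
\begin{equation}
\nonumber    v_i = \left\{\begin{array}{ll} 
    0 & \mbox{if $x_i = 0$} \\ 
    \sum_{j} m_{j,i} & \mbox{if $x_i  = 1$}
    \end{array}\right. \Rightarrow v_i = x_i \sum_{j} m_{j,i}.
\end{equation}
When $x_i = 0$, the bound $0 \leq v_i \leq M x_i$ forces $v_i = 0$, and the remaining lower bound $\sum_j m_{j,i} - M \leq 0$ holds because $m_{j,i} \leq w_{j,i}p_{j,i} \leq w_{j,i}$. When $x_i = 1$, the two-sided bound $\sum_j m_{j,i} \leq v_i \leq \sum_j m_{j,i}$ pins $v_i$ to the sum, and $v_i \leq M$ is again implied by the assumed size of $M$.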
Consequently, OSDSP can be formulated as a MILP: 
% In constraint  (\ref{eq:constraintbudget}), $B$ denotes the server's budget. 
\begin{eqnarray}
\label{eq:milp}
\textstyle \max && SW(\mathbf{x}|\mathcal{G}) = \sum_{i\in \mathcal{V}}SW_i\left(\mathbf{x},\mathbf{P}\right) \\ \label{eq:milpobj2}
~\mathrm{s.t.}&& 
\text{Constraints} (\ref{eq:constraintbudget}) - 
%(\ref{eq:constraint3}), (\ref{eq:constraint4}), (\ref{eq:obj22}), (\ref{lin1}), (\ref{lin2}), 
(\ref{lin3}) % \\ \label{eq:milpconstraints}
% && x_i \in \{0, 1\},~\forall i,~p_{i,j} \in [0, 1],~\forall i,j. 
\end{eqnarray}


\section{Experiments}\label{sec:exp}
%\input{exp.tex}

\begin{figure*}[t]
    \centering
    \hfill
    \begin{minipage}{.23\textwidth}
        \centering
       \includegraphics[width = 1.0\linewidth]{./Complete/sw_N32_bf3.0.pdf}
        \caption{SW vs. $\eta$ in Complete networks, $B = 96$ out of $|\mathcal{E}^{in}| = 1024$ candidate edges} %Performance of 32-node Complete Network.
        \label{fig:sw_complete}
    \end{minipage}
    \hfill
    \begin{minipage}{.23\textwidth}
        \centering
       \includegraphics[width = 1.0\linewidth]{./BA/ba_N32_bf1.0.pdf}
        \caption{SW vs. $\eta$ in BA networks, $B = 32$ out of $|\mathcal{E}^{in}| = 87$ candidate edges} 
        \label{fig:sw_ba}
    \end{minipage}
    \hfill
    \begin{minipage}{.23\textwidth}
        \centering
        \includegraphics[width = 1.0\linewidth]{./Complete/bf_sw_N32_p0.4.pdf}
        \caption{SW vs. budget factor in Complete networks, $\eta = 0.5$}
        \label{fig:sw_b}
    \end{minipage}
    \hfill
    \begin{minipage}{.23\textwidth}
        \centering
       \includegraphics[width = 1.0\linewidth]{./ER/sw_creationP_N32_p0.5_bf3.0.pdf}
        \caption{SW vs. edge creation probability in ER networks, $\eta = 0.5$, $b = 3.0$}
        \label{fig:sw_pe}
    \end{minipage}
\end{figure*}

\begin{figure*}[t]
    \centering
    \hfill
    \begin{minipage}{.23\textwidth}
        \centering
       \includegraphics[width = 1.0\linewidth]{./Complete/Runtime/T_B3.0_p1.0.pdf}
        \caption{Running time (mean) vs. $N$ in Complete networks.} 
        \label{fig:T_N}
    \end{minipage}
    \hfill
    \begin{minipage}{.23\textwidth}
        \centering
       \includegraphics[width = 1.0\linewidth]{./Complete/Runtime/T_B3.0_N32.pdf}
        \caption{Running time (mean) vs. $\eta$ in 32-node Complete networks.}%Performance of 32-node Complete Network.
        \label{fig:T_eta}
    \end{minipage}
    \hfill
    \begin{minipage}{.23\textwidth}
        \centering
        \includegraphics[width = 1.0\linewidth]{./Complete/Runtime/T_MILP_B3.0_N32.pdf}
        \caption{Running time of the MILP solutions vs. $\eta$ in Complete networks.}
        \label{fig:T_MILP}
    \end{minipage}
    \hfill
    \begin{minipage}{.23\textwidth}
        \centering
       \includegraphics[width = 1.0\linewidth]{./Complete/Runtime/T_creationP_N32_p1.0_bf3.0.pdf}
        \caption{Running time (mean) vs. edge creation probability in ER networks.}
        \label{fig:T_pe}
    \end{minipage}
\end{figure*}

We conduct extensive experiments on both (1) synthetic networks and (2) real-world data representing users' online conversations during the Covid-19 crisis. For the latter set of experiments, we collected and labeled a rich dataset from the Reddit community, and we discuss our results in the context of that dataset below.
Our experiments are implemented in Python. All experiments were performed on an Intel(R) Core(TM) i9-9820X CPU @ 3.30GHz.\footnote{The code is available at \url{https://github.com/jtongxin/CSD-NDG}}
\subsection{Experiments on Synthetic Data}
This section presents our results for the following algorithmic solutions to the OSDSP problem.
(1) \textbf{Betweenness:} As the simplest baseline, we consider an algorithm that selects edges in descending order of betweenness centrality. A node's betweenness is the number of shortest paths between pairs of nodes that pass through that node. 
(2) \textbf{Greedy:} The greedy algorithm described in Algorithm \ref{alg:greedy}.
(3) \textbf{MILP:} We use the optimization tool CPLEX \cite{CPLEX} to compute the exact solution of the MILP formulated in Eq.~(\ref{eq:milp}). 
(4) \textbf{Rank:} The RankHeuristic algorithm described in Algorithm~\ref{alg:Rank}.



We evaluate the performance of each algorithmic solution on the following three types of synthetic networks, $\mathcal{G} = (\mathcal{V}, \mathcal{E}^{in})$: 
(1) Complete networks, in which the network operator may promote content (a link) between any pair of nodes; (2) Erdős-Rényi (ER) networks~\cite{erdos59a}, in which each possible edge is included with a given probability;
(3) Barab\'asi-Albert (BA) networks~\cite{Barabsi509}, which are scale-free networks with a power-law degree distribution and are common in the online world. In our network generation process, each new node is attached to 3 existing nodes. % (3) Erdős-Rényi networks~\cite{erdos59a}, which choose each possible edge with a given probability.
We report both the optimality of the solutions and their run-time. \looseness = -1
% \subsubsection{Performance Analysis} \label{sec:performance}


\noindent \textbf{Performance Analysis}. We set the influence weight between agents as $w_{i,j} = w = 1/N, \forall i,j \in \mathcal{V}$. We denote the user's cost-weight ratio as $\gamma = \frac{c}{w}$. Each node has $\gamma \leq 1$ with probability $\eta$ (its private cost is drawn as $c \sim U(0,w)$) and $\gamma > 1$ with probability $1-\eta$ (its private cost is drawn as $c \sim U(w,N)$).
The budget is set to $B = b  N$, where $b$ is the budget factor and $N$ is the number of nodes in the network.


We show social welfare results for varying $\eta$ in the 32-node BA and Complete networks with 20 random seeds. In Figure~\ref{fig:sw_complete} and Figure~\ref{fig:sw_ba}, the mean social welfare increases as $\eta$ increases from $0$ to $1$. 
%High $\eta$ implies more nodes with low costs. 
We can see that the Rank algorithm approximates the optimal solution well when $\eta$ is low, whereas the Greedy algorithm performs better when $\eta$ is high. This is because (1) when $\eta$ is low, the Rank algorithm efficiently removes the edges connected to nodes that are expensive and hard to incentivize; and (2) when $\eta$ is high, the degree of supermodularity of $\sigma(\cdot)$ is bounded, which benefits the Greedy algorithm. \looseness = -1
%\fj{maybe can analyze this theoretically}{\color{red} no time, this is fine.}

Naturally, a larger promotion budget yields higher social welfare. Figure~\ref{fig:sw_b} shows the overall social welfare as the budget factor $b$ varies from $1.0$ ($B = 32$) to $4.0$ ($B = 128$). The optimality gap between the Greedy algorithm and the MILP optimal solution generally widens as the budget increases, whereas the Rank heuristic shows stable and good performance. %\fj{Add more reasoning}

Finally, we vary the edge creation probability ($0.3, 0.5, 0.7$, $1.0$) of ER networks and discuss the effect of the search space (i.e., the size of $\mathcal{E}^{in}$). Figure~\ref{fig:sw_pe} shows that the MILP and Greedy solutions achieve higher social welfare when $\mathcal{E}^{in}$ is larger. However, the Rank heuristic may have a larger optimality gap when the edge design space is enlarged. 

% \subsubsection{Run time Analysis} \label{sec:run-time}
\noindent \textbf{Run-time Analysis}. 
To test the scalability of these methods, we generate networks (Complete, BA, ER) with different numbers of nodes and generate the ER networks with different edge creation probabilities.

Figure~\ref{fig:T_N} shows the average running time (in log scale) of the algorithms in Complete networks with $b = 3$ for network sizes $N = 16, 20, 24, 32$. The average running time of MILP exceeds $10^4$ seconds at $32$ nodes.

We also examine the running time with respect to $\eta$ in Figure~\ref{fig:T_eta}. When $\eta$ increases, the average running time of MILP is significantly affected.
Note that the run-times of Greedy, Betweenness, and Rank are stable, as expected (their variances are tiny), so we do not plot error bars for them. However, the running time of MILP has a huge variance: some cases with $\eta = 1.0$ have extremely long running times, as shown in Figure~\ref{fig:T_MILP}.

% with different budget factor $b = 1.0, 2.0,3.0,4.0$...When the budget increases, ... 
Although the MILP method provides an exact solution, it does not scale to networks with hundreds of nodes (or more), and neither does the Greedy algorithm. One idea to reduce the running time is to shrink the input edge space $\mathcal{E}^{in}$. Figure~\ref{fig:T_pe} shows the running time when $\mathcal{E}^{in}$ is created using a smaller probability, which comes at the cost of performance (Figure~\ref{fig:sw_pe}).

\subsection{Experiments on Real Data}
\label{subsec:realexp}

\noindent \textbf{Data Set}. Our dataset represents posts and comments collected from the Reddit online conversation platform. We collected data from the subreddit community ``CovidPositive'' at three distinct moments during the pandemic. We consider 5,629 posts and 50,526 associated comments from 15,172 unique users. For each comment or post, we collect the timestamp, message text, author id, and the post or comment it replies to (if any). \looseness = -1

The three-part dataset represents selected posts and related comments collected from August 2020, September 2020, and April 2021. These periods were selected to capture snapshots of pandemic-centered conversations at different stages. In particular, April 2021 is seen as an important counterpoint to the August and September 2020 data, coinciding with the initial widespread availability of the COVID-19 vaccine. 
For each of the three periods, we create a representative social network graph (shown in Supplementary Material) where nodes are unique users and weighted, directed edges represent pairwise interactions between users in the form of a reply to a post or comment. 
In each of the three months, high-SD users make up more than half of all users in the network. This is exceptional relative to prior literature, and we suggest it is connected with the particularly sensitive nature of the subreddit. 
Descriptive statistics for each month are given in Table \ref{tab:datastat}. The subreddit is most active in terms of user participation in August 2020.

\begin{table*}[!ht] 
    \centering
    \begin{tabular}{|c|c|c|c|c|c|c|}
    \hline 
    \hline
  \textbf{Month}    &  Nodes & Edges & Weighted degree &  Number of SCC &  Size of largest SCC & Modularity\\ \hline
       August 2020 & 6786 & 18688  & 3.3 & 3476 & 3139 & 0.528\\ 
       September 2020  & 3305 & 7950 & 2.9 & 1657 & 1518 & 0.531\\
       April 2021  & 4533 & 13013 & 3.5 & 2273 & 2164 & 0.605\\ \hline
    \end{tabular}
    \caption{Network statistics for the three months of subreddit data. SCC denotes a strongly connected component. Modularity is calculated using the Louvain algorithm on the undirected network.}
    \label{tab:datastat}
\end{table*}

% \subsubsection{Label Generation}\label{sec:label_generation}
\noindent \textbf{Label Generation}. We leverage a BERT-based approach (BERT is pre-trained using bidirectional transformers \cite{bert}) to identify self-disclosure instances in our dataset. BERT's contextualized word representations and high labeling accuracy make it a suitable choice for this task.

Our training dataset is the OffMyChest conversation dataset used for a self-disclosure detection task  developed as part of the AFFCON 2020 Shared Task \cite{jaidka2020report}. 
The original dataset consists of 12,860 labeled sentences and 5,000 unlabeled sentences sampled from comments on subreddits within the OffMyChest community.
We use the labels for informational disclosure and emotional disclosure, and we do not distinguish between them in our analysis.

We then fine-tune the uncased pre-trained BERT model for annotation, as proposed in \cite{bert}. 
Our loss function for this task is binary cross-entropy.
Specifically, we use 80\% of the AFFCON training data for training and the remaining 20\% for validation. 
We use a batch size of 16, a maximum token length of 200, a learning rate of 2e-5, and at most 10 epochs. 
We use an early stopping strategy: we stop training when the validation loss has not decreased for 2 epochs, and we choose the model with the smallest loss on the validation set. 
Our overall F1-score is 0.76 (precision 0.90, recall 0.66). We obtain an F1-score of 0.77 for informational disclosure (precision 0.92, recall 0.66) and 0.75 for emotional disclosure (precision 0.87, recall 0.65). This is consistent with the best-known performing model for emotional and informational disclosure annotations on this dataset \cite{pant2020}. 
We obtain 27,155 self-disclosed (either emotional or informational) sentences out of a total of 50,526.
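As an arithmetic check of these scores (using the standard definition $F_1 = 2PR/(P+R)$, with precision $P$ and recall $R$, applied to the rounded values reported above), we have
\begin{eqnarray}
\nonumber \frac{2(0.90)(0.66)}{0.90+0.66} &\approx& 0.76, \\
\nonumber \frac{2(0.92)(0.66)}{0.92+0.66} &\approx& 0.77, \\
\nonumber \frac{2(0.87)(0.65)}{0.87+0.65} &\approx& 0.74.
\end{eqnarray}
The first two values agree with the reported overall and informational F1-scores. The third is within $0.01$ of the reported $0.75$, a gap that could be explained by precision and recall having been rounded to two decimals before this recomputation.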



% \subsubsection{Algorithmic Results} \label{sec:real_trace}

\begin{figure}[t]
    %\centering
    
    \begin{minipage}{.23\textwidth}
    
        \centering
      \includegraphics[width = 1\linewidth]{./Real/real_sd_sw.pdf}
        \caption{SW vs. budget ratio in real data.}
        \label{fig:real_sw}
    \end{minipage}
    \hfill
    \begin{minipage}{.23\textwidth}
    
        \centering
      \includegraphics[width = 1\linewidth]{./Real/real_sd_ratio.pdf}
        \caption{SD ratio vs. budget ratio in real data.}
        \label{fig:real_sd}
    \end{minipage}
\end{figure}

%We simulate our heuristic algorithm 
%using a subset of our Reddit dataset.  Specifically, we consider a subgraph of the larger 
%on the reply networks of August 2020, September 2020, and April 2021. 
\noindent \textbf{Algorithmic Results}. Taking the reply network structures and Reddit data of August 2020, September 2020, and April 2021 as a starting point, 
we build the network $\mathcal{G}^{in} = (\mathcal{V}, \mathcal{E}^{in})$ as the undirected version of the reply network, and define the users' utility functions $U_i, \forall i \in \mathcal{V}$ as follows.
%Figure \ref{fig:XXX} shows the average values of the coefficients in the OSDSP formulation (Equ. (\ref{eq:obj1})-(\ref{eq:constraint4})), including  
    (1) $w_{i,j}$: the influence weight from agent $i$ to agent $j$, proportional to the number of replies $r_{j,i}$ from user $j$ to user $i$ in the reply network. We normalize $w_{i,j}$ by dividing by $\max_{v \in \mathcal{V}} \sum_{u} r_{v,u}$, which guarantees $\sum_i w_{i,j} \leq 1$.
    (2) $c_i$: the privacy cost coefficient of each user $i$. If the agent disclosed in the real data ($x^{\text{real}}_i = 1$), we sample the cost $c_i$ from $U(0, h_i)$, where $h_i = \sum_j w_{j,i} x^{\text{real}}_j$. Otherwise, $c_i \sim U(h_i,1)$. 
We determine $x^{\text{real}}_i$ from actual self-disclosure in both $i$'s posts and comments: if the node has disclosed in any post or comment during the period, we set $x^{\text{real}}_i = 1$. 
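To see why the normalization bounds the weights (a short check, assuming $r_{v,u} \geq 0$ and that the proportionality constant is exactly $1/R$ with $R = \max_{v} \sum_{u} r_{v,u} > 0$), write $w_{i,j} = r_{j,i}/R$. Then, for every $j$,
\begin{equation}
\nonumber \sum_{i} w_{i,j} = \frac{\sum_{i} r_{j,i}}{R} \leq \frac{\max_{v}\sum_{u} r_{v,u}}{R} = 1,
\end{equation}
because $\sum_i r_{j,i}$ is one of the row sums over which the maximum is taken.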



We use the RankHeuristic algorithm to simulate the content promotion and NDG over the networks, varying the budget ratio $b = B/N$ from $0.2$ to $3.0$. Each experiment is averaged over 20 random seeds. Figure \ref{fig:real_sw} compares the overall social welfare over the three months. In all three months, the overall social welfare increases monotonically with the budget when $b<1.0$. 
When considering overall self-disclosure, many users are associated with low privacy cost coefficients. Thus, RankHeuristic can find an edge set that can induce saturated social welfare within a relatively small budget.

As reported in Figure \ref{fig:real_sd}, when the budget is low, the self-disclosure ratio achieved by our algorithm is high in September and low in April. As we increase the budget, the gap between them shrinks. 

Due to limited space, the computation time of RankHeuristic for the three months is listed in Table 1 of the supplementary file.



\section{Conclusions and Discussions}
\label{sec:Conclusion}
%\input{conclusion.tex}


In this work, we have developed a  framework to rigorously model the impact of users' decisions in sharing personal information in online communities, wherein these individuals seek social reward, e.g., emotional support during a crisis, and in doing so, incur a cost to privacy. The presented theoretical results enable modeling social welfare on a directed network of interacting users. 
Critically, our approach allows us to find social network structures that optimize social welfare within platform constraints while respecting individual users' heterogeneous privacy preferences.    
Our research can guide the development of effective practices for ranking and recommending content on platforms, particularly in online communities that prioritize social support and connections despite potential privacy risks.

 
Despite the merits of this work, we acknowledge its limitations.
One limitation is that our model does not precisely capture the complexity of real-world scenarios and behaviors. Relatedly, it is known to be difficult to quantify users' privacy preferences or individual benefits. Hence, fully validating this work is an important next step.
In addition, the impact of fairness on content sharing, specifically the potential for biased treatment that results in unequal access to opportunities for content promotion and shared information, has not been thoroughly examined when maximizing social welfare. Addressing fairness issues is a crucial aspect of future work.

Additional work should explore dependencies between social contexts where self-disclosure and support behaviors may be differently aligned, for instance, considering users' selfish behavior and negative/non-supportive interactions.  
Follow-on studies might also consider inducing the desired equilibrium~\cite{yu2021altruism} to incentivize social welfare convergence at a reasonable pace. 
Finally, future work could fit this model to other datasets, varying in size, membership, audience, and topical focus. 




\begin{acknowledgements} 
This work was partially funded by the National Science Foundation under grant 2027757.
\end{acknowledgements}

% References 
\bibliography{disGame}


\end{document}
