On Dyadic Fairness: Exploring and Mitigating Bias in Graph Connections
01 Dec 2021 | graphs fairness representation learning
This blog post discusses the ICLR 2021 paper “On Dyadic Fairness: Exploring and Mitigating Bias in Graph Connections” by Li et al., highlighting the importance of its theoretical results while critiquing the notions and applications of dyadic fairness presented. This blog post assumes basic familiarity with graph representation learning using message-passing GNNs and fairness based on observed characteristics.
Motivation
Link prediction is the task of predicting unobserved connections between nodes in a graph. For example, as shown in the social network in Figure 1, a link prediction algorithm may leverage the observed edges (solid lines) to predict that the node representing Sophia is also connected to the nodes representing Adam and David (dashed lines).
Link prediction is ubiquitous, with applications ranging from predicting interactions between protein molecules to predicting if a paper in a citation network should cite another paper. Furthermore, social media sites may use link prediction, as part of viral marketing, to show ads to users who are predicted to be connected to other users who have interacted with the ads, because they are assumed to be similar.
However, ads can influence users’ actions, and if a link prediction algorithm is tainted by social biases or exhibits disparate performance for different groups, this can have negative societal consequences. For instance, a social media site may only spread ads for STEM jobs within overrepresented groups like white men, rather than to women and gender minorities of color, because a link prediction algorithm disproportionately predicts connections between members of overrepresented groups, and not between members of different groups. This can reduce job applications from marginalized communities, exacerbating already-existing disparities and reducing diverse perspectives in STEM.
In the running social network example, suppose Sophia is a woman and that Adam and David are men. Furthermore, suppose David interacts with an ad for a software engineering position. In Figure 2, the left panel illustrates a binary gender-biased link prediction algorithm that only predicts a connection between David and Adam, and not between David and Sophia; this would result in only Adam, and not Sophia, seeing the software engineering ad. In contrast, the right panel illustrates a link prediction algorithm that predicts a connection between David and Adam and between David and Sophia. This algorithm satisfies what is called dyadic fairness (with respect to binary gender), as it predicts man-woman and man-man links at an equal rate; this could have a lower likelihood of amplifying binary gender biases and disparities.
While I have constructed the example above, the authors of the paper “On Dyadic Fairness: Exploring and Mitigating Bias in Graph Connections” provide two other applications of dyadically-fair link prediction:
- delivering “unbiased” recommendations (i.e. recommendations that are independent of sensitive attributes like religion or ethnicity) for other users to friend, follow, or connect with on a social media site
- recommending diverse news sources to users, independent of their political affiliation
While polarization is a problem online, these applications of dyadically-fair link prediction could be problematic. Many marginalized communities (e.g. LGBTQIA+ folks, Black individuals) create and rely on the sanctity of safe spaces online. Thus, recommending users or news sources that are hostile (e.g. promote homophobic, racist, or sexist content) to people in these communities can result in severe psychological harm and a violation of privacy. Furthermore, many individuals in these communities, because they feel isolated in real life, actually yearn to find other users online who share their identity, to which dyadic fairness is antithetical. In these cases, dyadic fairness doesn’t distribute justice.
High-Level Idea
The paper “On Dyadic Fairness: Exploring and Mitigating Bias in Graph Connections” contributes the following:
- mathematical formalizations of dyadic fairness
- a theoretical analysis of the relationship between the dyadic fairness of a graph convolutional network (GCN) and graph structure (for homogeneous graphs)
- based on the theoretical analysis, FairAdj, an algorithm that jointly optimizes the utility of link prediction and the dyadic fairness of a GNN
Formalizations of Dyadic Fairness
Suppose we have a directed, homogeneous graph $ G = (V, E) $, consisting of a fixed set of nodes $V$ and fixed set of edges $E$. Furthermore, assume that every node in $ V $ has a binary sensitive attribute, that is, it belongs to one of two sensitive groups.
To prevent notation overload upfront, I will present one mathematical formalization of dyadic fairness first and then dissect the notation. This formalization is based on Independence (also known as demographic parity or statistical parity) from the observational group fairness literature.
In Definition 3.1, which requires that $ Pr(g(u, v) | S(u) = S(v)) = Pr(g(u, v) | S(u) \neq S(v)) $ over the set of candidate node pairs, $ g $ is the link prediction algorithm. It takes as input the representations of two nodes, which we will denote as $u$ and $v$, and outputs a predictive score representing the likelihood of a connection between $u$ and $v$. $ S $ is a function that takes as input a node $ i $ and outputs the sensitive group membership of $i$. For instance, in the running social network example, $ S(\text{Sophia}) = \text{woman} $.
We define intra-links as edges connecting nodes belonging to the same sensitive group, and similarly, inter-links as edges connecting nodes belonging to different sensitive groups. As shown in Figure 4, $(David, Adam)$ is an intra-link, while $(David, Sophia)$ is an inter-link.
Then, we can see that this formalization of dyadic fairness simply requires that our link prediction algorithm predicts intra-links and inter-links at the same rate from the set of candidate links.
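To make this concrete, here is a minimal sketch (in Python with NumPy; all names are illustrative and not from the paper) of estimating this disparity from a set of candidate links, their predicted scores, and the endpoints' sensitive groups:

```python
import numpy as np

def dyadic_disparity(scores, src_groups, dst_groups):
    """Estimate the Independence-based dyadic disparity (Delta_DP):
    the absolute difference between the mean predicted score of
    intra-links and the mean predicted score of inter-links."""
    scores = np.asarray(scores, dtype=float)
    intra = np.asarray(src_groups) == np.asarray(dst_groups)
    return abs(scores[intra].mean() - scores[~intra].mean())

# Toy example: four candidate links with predicted scores in [0, 1].
scores = [0.9, 0.8, 0.3, 0.4]   # predictive scores g(u, v)
src_groups = [0, 0, 0, 1]        # S(u) for each candidate link
dst_groups = [0, 0, 1, 0]        # S(v) for each candidate link
print(dyadic_disparity(scores, src_groups, dst_groups))  # |0.85 - 0.35| = 0.5
```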
The authors do empirically explore other formalizations of dyadic fairness based on Separation:
- the disparity in predictive score between intra-links and inter-links for only positive links, i.e. $ Pr(g(u, v) | S(u) = S(v), (u, v) \in E) = Pr(g(u, v) | S(u) \neq S(v), (u, v) \in E) $
- the disparity in predictive score between intra-links and inter-links for only negative links, i.e. $ Pr(g(u, v) | S(u) = S(v), (u, v) \notin E) = Pr(g(u, v) | S(u) \neq S(v), (u, v) \notin E) $
- the maximum difference in the true negative rate (over all possible thresholds on the predictive score) between intra-links and inter-links
- the maximum difference in the false negative rate (over all possible thresholds on the predictive score) between intra-links and inter-links
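As a rough sketch (again NumPy, with illustrative names; the paper's exact estimators may differ) of how these Separation-based quantities could be computed, conditioning on whether each candidate link is actually observed:

```python
import numpy as np

def score_gap(scores, is_intra, keep):
    """Mean-score gap between intra- and inter-links within the subset `keep`
    (e.g. only positive links, or only negative links)."""
    return abs(scores[keep & is_intra].mean() - scores[keep & ~is_intra].mean())

def rate_gap(scores, labels, is_intra, rate):
    """Max difference of an error rate between intra- and inter-links
    over all thresholds on the predictive score."""
    return max(abs(rate(scores, labels, is_intra, t) - rate(scores, labels, ~is_intra, t))
               for t in np.unique(scores))

def fnr(scores, labels, mask, t):
    """False negative rate at threshold t among the links selected by `mask`."""
    pos = mask & (labels == 1)
    return ((scores < t) & pos).sum() / max(pos.sum(), 1)

def tnr(scores, labels, mask, t):
    """True negative rate at threshold t among the links selected by `mask`."""
    neg = mask & (labels == 0)
    return ((scores < t) & neg).sum() / max(neg.sum(), 1)

# Usage, given NumPy arrays `scores`, `labels` (1 = observed link), `is_intra`:
# score_gap(scores, is_intra, labels == 1)   # disparity on positive links
# score_gap(scores, is_intra, labels == 0)   # disparity on negative links
# rate_gap(scores, labels, is_intra, tnr)    # max TNR difference
# rate_gap(scores, labels, is_intra, fnr)    # max FNR difference
```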
It appears that the authors don’t explore possible formalizations of dyadic fairness based on Sufficiency, e.g. $ Pr((u, v) \in E | S(u) = S(v), g(u, v)) = Pr((u, v) \in E | S(u) \neq S(v), g(u, v)) $. This could be an area for further exploration.
The formalizations of dyadic fairness based on Independence, Separation, and Sufficiency are notably mutually exclusive except in degenerate cases (for the proof of this, consult Fairness and Machine Learning).
Furthermore, each notion of fairness has its own politics and limitations. While Independence may seem desirable because it ensures that links are predicted independently of (possibly irrelevant) sensitive attributes, it can also have undesirable properties. For instance, a social network may have significantly more training examples of intra-links than inter-links, which could cause a learned link predictor to have a lower error rate for intra-links than for inter-links. To be concrete, suppose this link predictor accurately predicts intra-links at some rate $p$ while predicting inter-links uniformly at random at the same rate $p$. This link predictor satisfies Independence, but has wildly different error rates on intra-links and inter-links.
Furthermore, Independence does not consider correlations between the existence of a link (the target variable) and whether it’s an intra-link or inter-link. In contrast, Separation and Sufficiency “accommodate” correlations between the existence of a link and if it’s an intra-link or inter-link.
All of Independence, Separation, and Sufficiency are limited in that they are based on historical data and observed attributes. Causal and counterfactual reasoning are emerging lenses through which fairness can be assessed under interventions. However, all of the aforementioned methods and criteria assume that sensitive attributes and identities are:
- known, which is often not the case due to privacy laws and the danger involved in disclosing certain sensitive attributes (e.g. disability, queerness, etc.);
- measurable, which is almost never true (e.g. gender from a non-binary-inclusive understanding);
- discrete, which is almost never true and reinforces hegemonic, imperialist categorizations (e.g. race options on the US census, the gender binary, etc.);
- static, which is problematic given that one’s identity can change over time (e.g. genderfluidity).
Furthermore, observational fairness neglects that some communities face complex, intersecting vectors of marginality that preclude their presence in the very datasets observed for fairness.
These described limitations are beyond the scope of this paper, but could motivate future work in the areas of fair graph machine learning without access to sensitive attributes, with human-in-the-loop approaches to modeling fluid, flexible identities, etc.
How does graph structure affect dyadic fairness?
In this section, we only consider the formalization of dyadic fairness in Definition 3.1, i.e. based on Independence. Suppose we have two sensitive groups $S_0$ and $S_1$.
Let’s dissect what Proposition 4.1 means! Proposition 4.1 makes the assumption that our link prediction function $g$ is modeled as an inner product of the two input node representations. In this case, we can show that $\Delta_{DP}$, the disparity in the expected predictive score of intra-links and expected predictive score of inter-links, is bounded by a constant times $\delta$, the disparity in the expected representation of nodes in $S_0$ and expected representation of nodes in $S_1$.
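To see intuitively where such a bound can come from, here is my own simplified calculation (not the paper's proof): assume the two endpoints of a candidate link are drawn independently, intra-links come from each group equally often, and $g$ is the raw inner product. Then, writing $\mu_0$ and $\mu_1$ for the expected representations of the two groups,

$$ \mathbb{E}[g(u, v) \mid \text{intra}] - \mathbb{E}[g(u, v) \mid \text{inter}] = \tfrac{1}{2}\left(\lVert \mu_0 \rVert_2^2 + \lVert \mu_1 \rVert_2^2\right) - \mu_0^T \mu_1 = \tfrac{1}{2}\lVert \mu_0 - \mu_1 \rVert_2^2 = \tfrac{1}{2}\delta^2. $$

In this idealized case the disparity is exactly $\tfrac{1}{2}\delta^2$, so a small representation gap forces a small score disparity; Proposition 4.1 gives the general bound, which is linear in $\delta$.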
Why is this cool? It implies that a low $\delta$ is a sufficient condition for a low $\Delta_{DP}$. Now we need to study how a graph neural network (GNN) affects $\delta$! As usual, I will present Theorem 4.1 and then dissect the notation.
Theorem 4.1 looks at $\Delta_{DP}^{Aggr}$, the disparity in the expected representation of nodes in $S_0$ and expected representation of nodes in $S_1$ after one mean-aggregation over the graph. A mean-aggregation uses the graph filter $D^{-1} A$, where $A$ is the graph’s adjacency matrix with self-loops and $D$ is the diagonal degree matrix corresponding to $A$. However, many other graph filters are used in a variety of message-passing algorithms:
- $D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$, which is the symmetrically normalized adjacency matrix used in Graph Convolutional Networks (GCNs);
- $A D^{-1}$, which is the random walk matrix used in belief propagation and label propagation;
- $\mathrm{softmax}\left(\frac{(Q X)^T (K X)}{\sqrt{d_k}}\right)$, which is the scaled dot-product attention matrix used in Transformers.
Even an iteration of reinforcement learning can be reframed as applying a graph filter to a graph containing nodes representing states and actions!
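To ground these objects, here is a small NumPy sketch (a toy example of my own, not from the paper) that builds the mean-aggregation filter and two of the alternatives above for a 4-node graph with self-loops:

```python
import numpy as np

# Toy undirected graph on 4 nodes, with self-loops added (A_hat = A + I).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
A_hat = A + np.eye(4)
deg = A_hat.sum(axis=1)                          # degrees (with self-loops)

mean_agg = A_hat / deg[:, None]                  # D^{-1} A_hat: row-stochastic mean-aggregation
sym_norm = A_hat / np.sqrt(np.outer(deg, deg))   # D^{-1/2} A_hat D^{-1/2}: GCN filter
rand_walk = A_hat / deg[None, :]                 # A_hat D^{-1}: column-stochastic random-walk filter

X = np.random.randn(4, 8)                        # node features
H = mean_agg @ X                                 # one mean-aggregation step
```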
The beauty of the proof of Theorem 4.1 is that, since aggregation is a central operation in every message-passing algorithm, the general procedure used in the proof can be followed to analyze representation disparities produced by label propagation, Transformers, etc.
Back to dissecting Theorem 4.1! We will only look at the upper bound on $\Delta_{DP}^{Aggr}$. (One can look at the paper for the precise definitions of notation.) $\lVert \mu_0 - \mu_1 \rVert_2$ is the disparity in the expected representation of nodes in $S_0$ and expected representation of nodes in $S_1$ prior to the mean-aggregation over the graph. Hence, we can see that $\alpha_{max}$ functions as a contraction coefficient and $2 \sqrt{M} \sigma$ serves as a sort of error term on the contraction. $\alpha_{max}$ is regulated by the weights of inter-links (relative to the maximum degree) and number of nodes in each sensitive group incident to inter-links (relative to the number of nodes in the group).
The authors provide an excellent analysis in the paper characterizing various networks and their corresponding $\alpha_{max}$! $\alpha_{max}$ must be less than 1 for the mean-aggregation to reduce the disparity in expected representations, and only a few graph families (e.g. complete bipartite graphs) violate this condition.
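As a quick empirical illustration of this contraction (my own toy experiment, not the authors' code), the sketch below measures $\lVert \mu_0 - \mu_1 \rVert_2$ before and after one mean-aggregation on a random homophilous graph; with a non-trivial fraction of inter-links, the gap shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 16
groups = rng.integers(0, 2, size=n)               # sensitive group of each node

# Random graph with more intra-links than inter-links (homophily), plus self-loops.
p_intra, p_inter = 0.10, 0.04
same = groups[:, None] == groups[None, :]
probs = np.where(same, p_intra, p_inter)
A = (rng.random((n, n)) < probs).astype(float)
A = np.triu(A, 1); A = A + A.T + np.eye(n)        # symmetric, with self-loops

mean_agg = A / A.sum(axis=1, keepdims=True)       # D^{-1} A

# Group-biased node features, then one aggregation step.
X = rng.normal(size=(n, d)) + groups[:, None] * 1.0
H = mean_agg @ X

def gap(Z):
    """Disparity in group-mean representations, || mu_0 - mu_1 ||_2."""
    return np.linalg.norm(Z[groups == 0].mean(0) - Z[groups == 1].mean(0))

print(gap(X), gap(H))   # the gap typically shrinks after aggregation
```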
As future work, it would be interesting to explore how the contraction coefficient $\alpha_{max}$ varies for different message-passing algorithms. It would also be exciting to investigate how this analysis changes for graphs that are heterophilic rather than homophilic, or heterogeneous instead of homogeneous.
Corollary 4.1 incorporates the parameters of a GCN into the bound on the disparity in expected representations, but I will not cover the corollary in this post.
FairAdj
FairAdj is based on the idea that since a low $\delta$ is a sufficient condition for a low $\Delta_{DP}$, and the bound on $\delta$ is affected by $\alpha_{max}$, we can modify the graph’s adjacency matrix to improve the fairness of an inner-product link prediction algorithm based on representations learned by a GNN.
The authors propose a simple, effective solution of alternating between training the GNN and optimizing the adjacency matrix for dyadic fairness via projected gradient descent, where the set of feasible solutions is right-stochastic matrices of the form $D^{-1} A$ with the same set of edges as the original adjacency matrix. FairAdj provides a general algorithmic skeleton for improving the fairness of a host of message-passing algorithms via projected gradient descent. This skeleton could be applied to label propagation, for example.
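Here is a heavily simplified, self-contained sketch of this alternating scheme (a PyTorch toy of my own, not the authors' implementation; in particular, instead of the paper's projection onto right-stochastic matrices it keeps the adjacency row-stochastic with a masked softmax, and the encoder is a single mean-aggregation layer rather than the paper's model):

```python
import torch

# Toy setup (all sizes and names illustrative).
n, d_in, d_hid = 50, 8, 16
X = torch.randn(n, d_in)
groups = torch.randint(0, 2, (n,))
A = (torch.rand(n, n) < 0.1).float()
A = torch.triu(A, 1); A = A + A.T + torch.eye(n)   # symmetric adjacency with self-loops
edge_mask = A > 0                                   # only original edges may carry weight

W = torch.randn(d_in, d_hid, requires_grad=True)    # encoder weights (utility step)
logits = torch.zeros(n, n, requires_grad=True)      # edge-weight logits (fairness step)

def normalized_adj():
    # Row-stochastic adjacency supported only on the original edge set.
    return torch.softmax(logits.masked_fill(~edge_mask, float("-inf")), dim=1)

def link_scores(Z, pairs):
    # Inner-product decoder for a batch of (u, v) index pairs.
    return (Z[pairs[:, 0]] * Z[pairs[:, 1]]).sum(dim=1)

pos_pairs = edge_mask.nonzero()                     # observed links as positives
opt_w = torch.optim.Adam([W], lr=1e-2)
opt_a = torch.optim.Adam([logits], lr=1e-2)

for epoch in range(200):
    # (1) Utility step: fit the encoder to the observed links (with random negatives).
    Z = normalized_adj().detach() @ X @ W
    neg_pairs = torch.randint(0, n, pos_pairs.shape)
    pairs = torch.cat([pos_pairs, neg_pairs])
    labels = torch.cat([torch.ones(len(pos_pairs)), torch.zeros(len(neg_pairs))])
    loss_util = torch.nn.functional.binary_cross_entropy_with_logits(link_scores(Z, pairs), labels)
    opt_w.zero_grad(); loss_util.backward(); opt_w.step()

    # (2) Fairness step: adjust edge weights to shrink the gap between
    # group-mean representations (a proxy for the dyadic disparity).
    Z = normalized_adj() @ X @ W.detach()
    gap = (Z[groups == 0].mean(0) - Z[groups == 1].mean(0)).norm()
    opt_a.zero_grad(); gap.backward(); opt_a.step()
```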
The authors also run experiments showing via clustering that FairAdj, as a byproduct, decreases $\delta$. Because a low $\delta$ is a sufficient condition for a low $\Delta_{DP}$, an alternative to FairAdj could be projecting learned node representations onto a set of feasible solutions that satisfy $\delta \leq \epsilon$, for a small, fixed $\epsilon > 0$. The feasible solutions would form a closed, convex set. It would further be interesting to explore the convergence rate, convergence guarantees, and optimality conditions of FairAdj, and compare them to those of this alternative solution.
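For what it's worth, I believe that projection has a simple closed form: shift each group's embeddings by a constant vector so that the group means move minimally toward each other. A sketch (my own, with illustrative names):

```python
import numpy as np

def project_to_fair_embeddings(Z, groups, eps):
    """Project node embeddings Z onto { Z : || mu_0 - mu_1 ||_2 <= eps },
    where mu_k is the mean embedding of sensitive group k, by shifting each
    group's rows by a constant vector (minimal Frobenius-norm change)."""
    Z = Z.copy()
    g0, g1 = groups == 0, groups == 1
    d = Z[g0].mean(0) - Z[g1].mean(0)
    norm = np.linalg.norm(d)
    if norm <= eps:
        return Z
    excess = (norm - eps) * d / norm
    n0, n1 = g0.sum(), g1.sum()
    Z[g0] -= (n1 / (n0 + n1)) * excess
    Z[g1] += (n0 / (n0 + n1)) * excess
    return Z
```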
The authors evaluate FairAdj on six real-world datasets, but discussion of these experiments is out of the scope of this blog post. I would be interested to visualize how tight the proven bounds for $\delta$ and $\Delta_{DP}$ are for the real-world datasets.
As a final note, the authors claim that FairAdj enjoys a superior “fairness-utility tradeoff” compared to baseline dyadic fairness algorithms. In general, for fairness-related work, I believe we should move away from the terminology of “fairness-utility tradeoff” to denote a decrease in accuracy on a specific test set due to a fairness constraint, as it insinuates that fairness is incompatible with a “well-performing” model. Test sets inevitably contain biased samples, and it is possible that a fair algorithm may not yield a lower test accuracy if we somehow had access to the full test distribution. Furthermore, we must ask, “Utility for whom?”; fairness can greatly increase the utility of an algorithm for minoritized groups, even if the overall test accuracy decreases. Thus, I advocate for not perpetuating the notion of a “fairness-utility tradeoff.”
Conclusion
This blog post discusses the ICLR 2021 paper “On Dyadic Fairness: Exploring and Mitigating Bias in Graph Connections” by Li et al., highlighting the importance of its theoretical results while critiquing the notions and applications of dyadic fairness provided. This paper presents a beautiful proof that can be followed to analyze representation disparities produced by various message-passing algorithms, and an algorithmic skeleton for improving the fairness of many message-passing algorithms. At the same time, it is essential that, as a community, we critically analyze for which applications a fair algorithm can distribute justice and contextualize our understandings of the politics and limitations of different notions of fairness in applications.