Title: Distribution-Free Statistical Dispersion Control for Societal Applications

Abstract: Explicit finite-sample statistical guarantees on model performance are an important ingredient in responsible machine learning. Previous work has focused mainly on bounding either the expected loss of a predictor or the probability that an individual prediction will incur a loss value in a specified range. However, for many high-stakes applications it is crucial to understand and control the dispersion of a loss distribution, or the extent to which different members of a population experience unequal effects of algorithmic decisions. We initiate the study of distribution-free control of statistical dispersion measures with societal implications and propose a simple yet flexible framework that allows us to handle a much richer class of statistical functionals beyond previous work. Our methods are verified through experiments in toxic comment detection, medical imaging, and film recommendation.

Section: Introduction
Learning-based predictive algorithms are widely used in real-world systems and have significantly impacted our daily lives. However, many algorithms are deployed without sufficient testing or a thorough understanding of likely failure modes. This is especially worrisome in high-stakes application areas such as healthcare, finance, and autonomous transportation. In order to address this critical challenge and provide tools for rigorous system evaluation prior to deployment, there has been a rise in techniques offering explicit and finite-sample statistical guarantees that hold for any unknown data distribution and black-box algorithm, a paradigm known as distribution-free uncertainty quantification (DFUQ). In [1], a framework is proposed for selecting a model based on bounds on expected loss produced using validation data. Subsequent work [30] goes beyond expected loss to provide distribution-free control for a class of risk measures known as quantile-based risk measures (QBRMs) [8]. This includes (in addition to expected loss): median, value-at-risk (VaR), and conditional value-at-risk (CVaR) [23]. For example, such a framework can be used to get bounds on the 80th percentile loss or the average loss of the 10% worst cases.
While this is important progress towards the sort of robust system verification necessary to ensure the responsible use of machine learning algorithms, in some scenarios measuring the expected loss or value-at-risk is not enough. As models are increasingly deployed in areas with long-lasting societal consequences, we should also be concerned with the dispersion of error across the population, or the extent to which different members of a population experience unequal effects of decisions made based on a model's prediction. For example, a system for promoting content on a social platform may offer less appropriate recommendations for the long tail of niche users in service of a small set of users with high and typical engagement, as shown in [19]. This may be undesirable from both a business and societal point of view, and thus it is crucial to rigorously validate such properties in an algorithm prior to deployment and understand how the outcomes disperse. To this end, we offer a novel study providing rigorous distribution-free guarantees for a broad class of functionals including key measures of statistical dispersion in society. We consider both differences in performance that arise between different demographic groups as well as disparities that can be identified even if one does not have reliable demographic data or chooses not to collect them due to privacy or security concerns. Well-studied risk measures that fit into our framework include the Gini coefficient [33] and other functions of the Lorenz curve as well as differences in group measures such as the median [5]. See Figure 1 for a further illustration of loss dispersion.
Figure 1: Example illustrating how two predictors (here $h_1$ and $h_2$) with the same expected loss can induce very different loss dispersion across the population. Left: The loss CDF produced by each predictor is bounded from below and above. Middle: The Lorenz curve is a popular graphical representation of inequality in some quantity across a population, in our case expressing the cumulative share of the loss experienced by the best-off $\beta$ proportion of the population. CDF upper and lower bounds can be used to bound the Lorenz curve (and thus the Gini coefficient, a function of the shape of the Lorenz curve). Under $h_2$ the worst-off population members experience most of the loss. Right: Predictors with the same expected loss may induce different median loss for (possibly protected) subgroups in the data, and thus we may wish to bound these differences.
In order to provide rigorous guarantees for socially important measures that go beyond expected loss or other QBRMs, we provide two-sided bounds for quantiles and nonlinear functionals of quantiles. Our framework is simple yet flexible and widely applicable to a rich class of nonlinear functionals of quantiles, including Gini coefficient, Atkinson index, and group-based measures of inequality, among many others. Beyond our method for controlling this richer class of functionals, we propose a novel numerical optimization method that significantly tightens the bounds when data is scarce, extending earlier techniques [21,30]. We conduct experiments on toxic comment moderation, detecting genetic mutations in cell images, and online content recommendation, to study the impact of our approach to model selection and tailored bounds.
To summarize our contributions, we: (1) initiate the study of distribution-free control of societal dispersion measures; (2) generalize the framework of [30] to provide bounds for nonlinear functionals of quantiles; (3) develop a novel optimization method that substantially tightens the bounds when data is scarce; (4) apply our framework to high-impact NLP, medical, and recommendation applications.

Section: Problem setup
We consider a black-box model that produces an output $Z$ on every example. Our algorithm selects a predictor $h$, which maps an input $Z \in \mathcal{Z}$ to a prediction $h(Z) \in \hat{\mathcal{Y}}$. A loss function $\ell : \hat{\mathcal{Y}} \times \mathcal{Y} \to \mathbb{R}$ quantifies the quality of a prediction $\hat{Y} \in \hat{\mathcal{Y}}$ with respect to the target output $y \in \mathcal{Y}$. Let $(Z, Y)$ be drawn from an unknown joint distribution $P$ over $\mathcal{Z} \times \mathcal{Y}$. We define the random variable $X^h := \ell(h(Z), Y)$ as the loss induced by $h$ on $P$. The cumulative distribution function (CDF) of the random variable $X^h$ is $F^h(x) := P(X^h \le x)$. For brevity, we sometimes use $X$ and $F$ when we do not need to explicitly consider $h$. We define the inverse of a CDF (also called inverse CDF) $F$ as $F^{-}(p) = \inf\{x : F(x) \ge p\}$ for any $p \in \mathbb{R}$. Finally, we assume access to a set of validation samples $(Z, Y)_{1:n} = \{(Z_1, Y_1), \ldots, (Z_n, Y_n)\}$ for the purpose of achieving distribution-free CDF control with mild assumptions on the loss samples $X_{1:n}$. We emphasize that the "distribution-free" requirement is on $(Z, Y)_{1:n}$ instead of $X_{1:n}$, because the loss studied on the validation dataset is known to us and we can take advantage of properties of the loss such as boundedness.
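The CDF and its generalized inverse defined above can be illustrated on validation losses. The following is a minimal sketch, not the implementation used in this work; the function names and the small floating-point tolerance are our own assumptions.

```python
import numpy as np

def empirical_cdf(losses, x):
    """F_n(x): fraction of loss samples that are <= x."""
    losses = np.sort(np.asarray(losses, dtype=float))
    return np.searchsorted(losses, x, side="right") / losses.size

def empirical_inverse_cdf(losses, p):
    """F_n^-(p) = inf{x : F_n(x) >= p}, for p in (0, 1]."""
    losses = np.sort(np.asarray(losses, dtype=float))
    n = losses.size
    # The smallest order statistic X_(k) with k/n >= p has k = ceil(n * p).
    # The tolerance guards against n * p landing just above an integer.
    k = int(np.ceil(n * p - 1e-12))
    return losses[max(k, 1) - 1]
```

For example, with losses `[0.4, 0.1, 0.3, 0.2]`, half of the samples are at most 0.25, and the smallest loss value whose CDF reaches 0.5 is 0.2.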

Section: Statistical dispersion measures for societal applications
In this section, we motivate our method by studying some widely-used measures of societal statistical dispersion. There are key gaps between the existing techniques for bounding QBRMs and those needed to bound many important measures of statistical dispersion. We first define a QBRM: Definition 1 (Quantile-based Risk Measure). Let $\psi(p)$ be a weighting function such that $\psi(p) \ge 0$ and $\int_0^1 \psi(p)\,dp = 1$. The quantile-based risk measure defined by $\psi$ is
$$R_\psi(F) := \int_0^1 \psi(p) F^{-}(p)\,dp.$$
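To make Definition 1 concrete, the sketch below evaluates the plug-in version of a QBRM on a sample, where the empirical inverse CDF is a step function and the integral reduces to a weighted sum of order statistics. The function names are hypothetical, and passing the antiderivative of $\psi$ (rather than $\psi$ itself) is a design choice made here to avoid numerical integration.

```python
import numpy as np

def empirical_qbrm(losses, psi_cdf):
    """Plug-in estimate of R_psi(F) = int_0^1 psi(p) F^-(p) dp.

    psi_cdf(p) must return Psi(p) = int_0^p psi(u) du, so Psi(0) = 0 and Psi(1) = 1.
    """
    x = np.sort(np.asarray(losses, dtype=float))
    n = x.size
    grid = np.arange(n + 1) / n            # 0, 1/n, ..., 1
    weights = np.diff(psi_cdf(grid))       # mass psi puts on ((i-1)/n, i/n]
    return float(np.dot(weights, x))

def mean_weight(p):
    """Psi for psi(p) = 1, which recovers the expected loss."""
    return np.asarray(p, dtype=float)

def cvar_weight(beta):
    """Psi for psi(p) = 1/(1 - beta) on [beta, 1] and 0 elsewhere (beta-CVaR)."""
    return lambda p: np.clip((np.asarray(p, dtype=float) - beta) / (1.0 - beta), 0.0, 1.0)
```

With losses `[1, 2, 3, 4]`, `empirical_qbrm(losses, mean_weight)` gives the sample mean 2.5, and `empirical_qbrm(losses, cvar_weight(0.5))` gives 3.5, the average of the worst half.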
A QBRM is a linear functional of $F^{-}$, but quantifying many common group-based risk dispersion measures (e.g. Atkinson index) also involves forms like nonlinear functions of the (inverse) CDF or nonlinear functionals of the (inverse) CDF, and some (like maximum group differences) further involve nonlinear functions of functionals of the loss CDF. Thus a much richer framework for achieving bounds is needed here.
For clarity, we use $J$ as a generic term to denote either the CDF $F$ or its inverse $F^{-}$ depending on the context, and summarize the building blocks as below: (i) nonlinear functions of $J$, i.e. $\xi(J)$; (ii) functionals in the form of an integral of nonlinear functions of $J$, i.e. $\int \psi(p)\xi(J(p))\,dp$ for a weight function $\psi$; (iii) composed functionals as nonlinear functions of functionals for the functional $T(J)$ with forms in (ii), i.e. $\zeta(T(J))$ for a nonlinear function $\zeta$.

Section: Standard measures of dispersion
We start by introducing some classic non-group-based measures of dispersion. Those measures usually quantify wealth or consumption inequality within a social group (or a population) instead of quantifying differences among groups. Note that for all of these measures we only consider non-negative losses $X$, and assume that $\int_0^1 F^{-}(p)\,dp > 0$.
Gini family of measures. The Gini coefficient [33,34] is a canonical measure of statistical dispersion, used for quantifying the uneven distribution of resources or losses. It summarizes the Lorenz curve introduced in Figure 1. From the definition of the Lorenz curve, the greater its curvature, the greater the inequality; the Gini coefficient measures the ratio of the area that lies between the line of equality (the 45° line) and the Lorenz curve to the total area under the line of equality. Definition 2 (Gini coefficient). For a non-negative random variable $X$, the Gini coefficient is
$$G(X) := \frac{E|X - X'|}{2EX} = \frac{\int_0^1 (2p - 1)F^{-}(p)\,dp}{\int_0^1 F^{-}(p)\,dp},$$
where $X'$ is an independent copy of $X$. $G(X) \in [0, 1]$, with 0 indicating perfect equality.
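As a sanity check on Definition 2, the following sketch computes both forms of the Gini coefficient for the empirical distribution of a sample, where they coincide. The function names are our own, and the closed-form weights come from integrating $2p - 1$ over each interval $((i-1)/n, i/n]$ on which the empirical inverse CDF is constant.

```python
import numpy as np

def gini_pairwise(x):
    """E|X - X'| / (2 E X), with X and X' drawn independently from the sample."""
    x = np.asarray(x, dtype=float)
    mean_abs_diff = np.abs(x[:, None] - x[None, :]).mean()
    return float(mean_abs_diff / (2.0 * x.mean()))

def gini_quantile(x):
    """int (2p - 1) F^-(p) dp / int F^-(p) dp for the empirical inverse CDF."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)
    # int_{(i-1)/n}^{i/n} (2p - 1) dp = (2i - 1 - n) / n^2
    weights = (2 * i - 1 - n) / n**2
    return float(np.dot(weights, x) / x.mean())
```

For the sample `[1, 2, 3, 4]` both functions return 0.25, and a constant sample gives 0.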
Because of the denominator in the Gini coefficient calculation, unlike in QBRM we need both an upper and a lower bound for $F^{-}$ (see Section 4.1.1). In the appendix, we also introduce the extended Gini family.
Atkinson index. The Atkinson index [2,19] is another renowned dispersion measure defined on the non-negative random variable $X$ (e.g., income, loss), and improves over the Gini coefficient in that it is useful in determining which end of the distribution contributes most to the observed inequality by choosing an appropriate inequality-aversion parameter $\varepsilon \ge 0$. For instance, the Atkinson index becomes more sensitive to changes at the lower end of the income distribution as $\varepsilon$ increases. Definition 3 (Atkinson index). For a non-negative random variable $X$, for any $\varepsilon \ge 0$, the Atkinson index is defined as the following if $\varepsilon \ne 1$:
$$A(\varepsilon, X) := 1 - \frac{\left(E[X^{1-\varepsilon}]\right)^{\frac{1}{1-\varepsilon}}}{E[X]} = 1 - \frac{\left(\int_0^1 (F^{-}(p))^{1-\varepsilon}\,dp\right)^{\frac{1}{1-\varepsilon}}}{\int_0^1 F^{-}(p)\,dp}.$$
And for $\varepsilon = 1$, $A(1, X) := \lim_{\varepsilon \to 1} A(\varepsilon, X)$, which will converge to a form involving the geometric mean of $X$. $A(\varepsilon, X) \in [0, 1]$, and 0 indicates perfect equality (see appendix for details).
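Definition 3 can be evaluated directly on a sample. The sketch below is a plug-in computation under the assumption that, at $\varepsilon = 1$, the limit takes the standard form of one minus the ratio of the geometric mean to the arithmetic mean; the function name is hypothetical, and strictly positive values are needed in that case.

```python
import numpy as np

def atkinson_index(x, eps):
    """Plug-in Atkinson index of a non-negative sample for aversion parameter eps >= 0."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    if np.isclose(eps, 1.0):
        # Limit eps -> 1: geometric mean over arithmetic mean (requires x > 0).
        return float(1.0 - np.exp(np.mean(np.log(x))) / mean)
    generalized_mean = np.mean(x ** (1.0 - eps)) ** (1.0 / (1.0 - eps))
    return float(1.0 - generalized_mean / mean)
```

For the sample `[1, 4]` this gives 0 at $\varepsilon = 0$, 0.1 at $\varepsilon = 0.5$, and 0.2 at $\varepsilon = 1$, so the index grows as more weight is placed on the low end of the distribution.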
The form of the Atkinson index includes a nonlinear function of $F^{-}$, i.e. $(F^{-})^{1-\varepsilon}$, but this type of nonlinearity is easy to tackle since the function is monotonic w.r.t. the range of $F^{-}$ (see Section 4.2.1). Remark 1. The reason we study the CDF of $X$ and not $X^{1-\varepsilon}$ is that it allows us to simultaneously control the Atkinson index for all $\varepsilon$'s.
In addition, there are many other important measures of dispersion involving more complicated types of nonlinearity such as the quantile of extreme observations and mean of range. Those measures are widely used in forecasting weather events or food supply. We discuss and formulate these dispersion measures in the appendix.

Section: Group-based measures of dispersion
Another family of dispersion measures refers to minimizing differences in performance across possibly overlapping groups in the data defined by (protected) attributes like race and gender. Under equal opportunity [11], false negative rates are made commensurate, while equalized odds [11] aims to equalize both false positive rates and false negative rates among groups. More general attempts to induce fairly-dispersed outcomes include CVaR fairness [32] and multi-calibration [14,15]. Our framework offers the flexibility to control a wide range of measures of a group CDF $F_g$, i.e. $T(F_g)$, as well as the variation of $T(F_g)$ between groups. As an illustration of the importance of such bounds, [5] finds that the median white family in the United States has eight times as much wealth as the median black family; this motivates a dispersion measure based on the difference in group medians.
Absolute/quadratic difference of risks and beyond. The simplest way to measure the dispersion of a risk measure (like the median) between two groups is through quantities such as $|T(F_g) - T(F_{g'})|$ or $[T(F_g) - T(F_{g'})]^2$. Moreover, one can study $\xi(T(F_g) - T(F_{g'}))$ for some general nonlinear function $\xi$. These types of dispersion measures are widely used in algorithmic fairness [11,20].
CVaR-fairness risk measure and its extensions. In [32], the authors further consider a distribution for each group, $P_g$, and a distribution over group indices, $P_{Idx}$. Letting $\mathrm{CVaR}_{\alpha,P_Z}(Z) := E_{Z \sim P_Z}[Z \mid Z > \alpha]$ for any distribution $P_Z$, they define the following dispersion for the expected loss of group $g$ (i.e. $\mu_g := E_{X \sim P_g}[X]$) for $\alpha \in (0, 1)$:
$$D_{CV,\alpha}(\mu_g) := \mathrm{CVaR}_{\alpha,P_{Idx}}\left(\mu_g - E_{g \sim P_{Idx}}[\mu_g]\right).$$
A natural extension would be $D_{CV,\alpha}(T(F_g))$ for a general functional $T(F_g)$, which we can write in a more explicit way [23]:
$$D_{CV,\alpha}(T(F_g)) = \min_{\rho \in \mathbb{R}} \left\{\rho + \frac{1}{1-\alpha} \cdot E_{g \sim P_{Idx}}[T(F_g) - \rho]_{+}\right\} - E_{g \sim P_{Idx}}[T(F_g)].$$
The function $[T(F_g) - \rho]_{+}$ is a nonlinear function of $T(F_g)$, but it is a monotonic function when $\rho$ is fixed and its further composition with the expectation operation is still monotonic, which can be easily dealt with.
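The explicit minimization form above is straightforward to evaluate when there are finitely many groups. The sketch below assumes a discrete distribution over group indices and uses the fact that the objective is convex and piecewise linear in $\rho$, so its minimum is attained at one of the group risk values; the function name and argument layout are our own.

```python
import numpy as np

def cvar_dispersion(group_risks, group_probs, alpha):
    """min_rho {rho + E[(T - rho)_+] / (1 - alpha)} - E[T] over a discrete set of groups."""
    t = np.asarray(group_risks, dtype=float)
    w = np.asarray(group_probs, dtype=float)
    # Rows index candidate rho values (the breakpoints t[k]); columns index groups.
    excess = np.maximum(t[None, :] - t[:, None], 0.0)
    objective = t + excess @ w / (1.0 - alpha)
    return float(objective.min() - np.dot(w, t))
```

For four equally likely groups with risks 0.1, 0.2, 0.3, 0.4 and $\alpha = 0.5$, the value is 0.1: the average risk of the worst-off half of the groups (0.35) minus the overall mean (0.25).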
Uncertainty quantification of risk measures. In [4], the authors study the problem of uncertainty of risk assessment, which has important consequences for societal measures. They formulate a deviation-based approach to quantify uncertainty for risks, which includes forms like
$$\rho_\xi(P_{Idx}) := E_{g \sim P_{Idx}}[\xi(T(F_g))]$$
for different types of nonlinear functions $\xi$. Examples include variance uncertainty quantification, where the quantity of interest is
$$E_{g \sim P_{Idx}}\left(T(F_g) - E_{g \sim P_{Idx}} T(F_g)\right)^2;$$
and
$$E_\psi[\xi(F^{-}(\alpha))] := \int_0^1 \xi(F^{-}(\alpha))\psi(\alpha)\,d\alpha,$$
which quantifies how sensitive the $\alpha$-VaR value is w.r.t. its parameter $\alpha$ for some non-negative weight function $\psi$.

Section: Distribution-free control of societal dispersion measures
In this section, we introduce a simple yet general framework to obtain rigorous upper bounds on the statistical dispersion measures discussed in the previous section. We will provide a high-level summary of our framework in this section, and leave detailed derivations and most examples to the appendix. Our discussion will focus on quantities related to the inverse of CDFs, but similar results could be obtained for CDFs.
In short, our framework involves two steps: produce upper and lower bounds on the CDF (and thus inverse CDF) of the loss distribution, and use these to calculate bounds on a chosen target risk measure. First, we will describe our extension of the one-sided bounds in [30] to the two-sided bounds necessary to control many societal dispersion measures of interest. Then we will describe how these CDF bounds can be post-processed to provide control on risk measures defined by nonlinear functions and functionals of the CDF. Finally, we will offer a novel optimization method for tightening the bounds for a chosen, possibly complex, risk measure.

Section: Methods to obtain confidence two-sided bounds for CDFs
For loss values $\{X_i\}_{i=1}^n$, let $X_{(1)} \le \ldots \le X_{(n)}$ denote the corresponding order statistics. For the uniform distribution over $[0,1]$, i.e. $\mathcal{U}(0, 1)$, let $U_1, \ldots, U_n \overset{iid}{\sim} \mathcal{U}(0, 1)$, with corresponding order statistics $U_{(1)} \le \ldots \le U_{(n)}$. We will also make use of the following:
Proposition 1. For the CDF $F$ of $X$, if there exist two CDFs $F_U, F_L$ such that $F_U \succeq F \succeq F_L$, then we have $F_L^{-} \succeq F^{-} \succeq F_U^{-}$.
We use $(\hat{F}^{\delta}_{n,L}, \hat{F}^{\delta}_{n,U})$ to denote a $(1-\delta)$-confidence bound pair ($(1-\delta)$-CBP), which satisfies
$$P(\hat{F}^{\delta}_{n,U} \succeq F \succeq \hat{F}^{\delta}_{n,L}) \ge 1 - \delta.$$
We extend the techniques developed in [30], wherein one-sided (lower) confidence bounds on the uniform order statistics are used to bound $F$. This is done by considering a one-sided minimum goodness-of-fit (GoF) statistic of the following form: $S := \min_{1 \le i \le n} s_i(U_{(i)})$, where $s_1, \ldots, s_n : [0, 1] \to \mathbb{R}$ are right continuous monotone nondecreasing functions. Thus,
$$P(\forall i : F(X_{(i)}) \ge s_i^{-}(s_\delta)) \ge 1 - \delta, \quad \text{for } s_\delta = \inf_r \{r : P(S \ge r) \ge 1 - \delta\}.$$
Given this step function defined by $s_1, \ldots, s_n$, it is easy to construct $\hat{F}^{\delta}_{n,L}$ via conservative completion of the CDF. [30] found that a Berk-Jones bound could be used to choose appropriate $s_i$'s, and is typically much tighter than using the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality to construct a bound.
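As a point of reference for the baseline mentioned above, the sketch below computes lower-bound values $L_i$ for $F(X_{(i)})$ from the one-sided DKW inequality, which gives $F(x) \ge F_n(x) - \sqrt{\ln(1/\delta)/(2n)}$ uniformly in $x$ with probability at least $1 - \delta$. This illustrates DKW only, not the Berk-Jones construction; the function name is hypothetical, and we assume $\delta \le 1/2$, the regime in which the one-sided inequality with this constant is known to hold.

```python
import math
import numpy as np

def dkw_lower_bounds(n, delta):
    """Values L_i such that P(for all i: F(X_(i)) >= L_i) >= 1 - delta (one-sided DKW)."""
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    i = np.arange(1, n + 1)
    # F_n(X_(i)) >= i/n, so F(X_(i)) >= i/n - eps; clip to keep values in [0, 1].
    return np.clip(i / n - eps, 0.0, 1.0)
```

With $n = 100$ and $\delta = 0.05$ the margin is about 0.122, so the bound is vacuous (zero) for the smallest order statistics and reaches about 0.878 at the largest one.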

Section: A reduction approach to constructing upper bounds of CDFs
Now we show how we can leverage this approach to produce two-sided bounds. In the following lemma we show how a CDF upper bound can be reduced to constructing lower bounds.
Lemma 1. For $0 \le L_1 \le L_2 \le \cdots \le L_n \le 1$, if $P(\forall i : F(X_{(i)}) \ge L_i) \ge 1 - \delta$, then we have $P(\forall i : \lim_{\epsilon \to 0^{+}} F(X_{(i)} - \epsilon) \le 1 - L_{n-i+1}) \ge 1 - \delta$. Furthermore, let $R(x)$ be defined as $1 - L_n$ if $x < X_{(1)}$; $1 - L_{n-i+1}$ if $X_{(i)} \le x < X_{(i+1)}$ for $i \in \{1, 2, \cdots, n-1\}$; and $1$ if $X_{(n)} \le x$. Then, $F \preceq R$.
Thus, we can simultaneously obtain $(\hat{F}^{\delta}_{n,L}, \hat{F}^{\delta}_{n,U})$ by setting $L_i = s_i^{-}(s_\delta)$ and applying (different) CDF conservative completions. In practice, the CDF upper bound can be produced via post-processing of the lower bound. One clear advantage of this approach is that it avoids the need to independently produce a pair of bounds where each bound must hold with probability $1 - \delta/2$.
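The post-processing step in the first part of Lemma 1 amounts to reversing and reflecting the lower-bound values. The sketch below shows only that step, for the left limits of $F$ at the order statistics; the function name is our own, and the subsequent conservative completion into a full CDF upper bound is not shown.

```python
import numpy as np

def left_limit_upper_bounds(lower):
    """Given L_1 <= ... <= L_n, return U_i = 1 - L_{n-i+1}, bounding F just below X_(i)."""
    lower = np.asarray(lower, dtype=float)
    return 1.0 - lower[::-1]
```

For example, lower-bound values `[0.0, 0.1, 0.4, 0.7]` yield upper-bound values `[0.3, 0.6, 0.9, 1.0]`, and both sets of bounds hold together with the same probability $1 - \delta$.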

Section: Controlling statistical dispersion measures
Having described how to obtain the CDF upper and lower bounds $(\hat{F}^{\delta}_{n,L}, \hat{F}^{\delta}_{n,U})$, we next turn to using these to control various important risk measures such as the Gini coefficient and group differences. We will only provide high-level descriptions here and leave details to the appendix.

Section: Control of nonlinear functions of CDFs
First we consider bounding $\xi(F^{-})$, which maps $F^{-}$ to another function of $\mathbb{R}$.
Control for a monotonic function. We start with the simplest case, where $\xi$ is a monotonic function in the range of $X$. For example, if $\xi$ is an increasing function, and with probability at least $1 - \delta$, $\hat{F}^{\delta}_{n,U} \succeq F \succeq \hat{F}^{\delta}_{n,L}$; then further by Proposition 1, we have that $\xi(\hat{F}^{\delta,-}_{n,L}) \succeq \xi(F^{-}) \succeq \xi(\hat{F}^{\delta,-}_{n,U})$ holds with probability at least $1 - \delta$. This property could be utilized to provide bounds for the Gini coefficient or Atkinson index by controlling the numerator and denominator separately as integrals of monotonic functions of $F^{-}$.
Example 1 (Gini coefficient). If given a $(1-\delta)$-CBP $(\hat{F}^{\delta}_{n,L}, \hat{F}^{\delta}_{n,U})$ and $\hat{F}^{\delta}_{n,L} \succeq 0$, we can provide the following bound for the Gini coefficient. Notice that
$$G(X) = \frac{\int_0^1 (2p - 1)F^{-}(p)\,dp}{\int_0^1 F^{-}(p)\,dp} = \frac{\int_0^1 2pF^{-}(p)\,dp}{\int_0^1 F^{-}(p)\,dp} - 1.$$
Given $F^{-}(p) \ge 0$ (since we only consider non-negative losses, i.e. $X$ is always non-negative), we know
$$G(X) \le \frac{\int_0^1 2p\,\hat{F}^{\delta,-}_{n,L}(p)\,dp}{\int_0^1 \hat{F}^{\delta,-}_{n,U}(p)\,dp} - 1,$$
with probability at least $1 - \delta$.
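Example 1 can be sketched numerically once the two inverse-CDF bounds are available on a grid. The code below is an illustration under our own assumptions: the bounds are supplied as arrays evaluated at midpoints of an equal-width grid on $(0, 1)$, the integrals are approximated with the midpoint rule, and the result is capped at 1 because $G(X) \in [0, 1]$. The function and argument names are hypothetical.

```python
import numpy as np

def gini_upper_bound(p, q_upper, q_lower):
    """Upper bound on the Gini coefficient from pointwise bounds on the inverse CDF.

    p:        midpoints of an equal-width grid on (0, 1)
    q_upper:  values >= F^-(p), e.g. the inverse of the CDF lower bound
    q_lower:  values <= F^-(p) and >= 0, e.g. the inverse of the CDF upper bound
    """
    p = np.asarray(p, dtype=float)
    numerator = np.mean(2.0 * p * np.asarray(q_upper, dtype=float))   # bounds int 2p F^-(p) dp from above
    denominator = np.mean(np.asarray(q_lower, dtype=float))          # bounds int F^-(p) dp from below
    return float(min(numerator / denominator - 1.0, 1.0))
```

When both bounds coincide with the inverse CDF of a uniform loss on $[0, 1]$, the function recovers the Gini coefficient $1/3$ up to discretization error; widening the band by 0.05 on each side raises the bound to roughly 0.59, which shows how sensitive the ratio is to the denominator.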
Control for absolute and polynomial functions. Many societal dispersion measures involve absolute-value functions, e.g., the Hoover index or maximum group differences. We must also control polynomial functions of inverse CDFs, such as in the CDFs of extreme observations. For any polynomial function $\phi(s) = \sum_{k=0} \alpha_k s^k$, if $k$ is odd, $s^k$ is monotonic w.r.t. $s$; if $k$ is even, $s^k = |s|^k$.
Thus, we can group the terms $\alpha_k s^k$ according to the sign of $\alpha_k$ and whether $k$ is even or odd, and flexibly use the upper and lower bounds already established for the absolute value function and monotonic functions to obtain an overall upper bound.
Example 2. If we have $(T^{\delta}_L(F_g), T^{\delta}_U(F_g))$ such that $T^{\delta}_L(F_g) \le T(F_g) \le T^{\delta}_U(F_g)$ holds for all $g$ we consider, then we can provide high probability upper bounds for $\xi(T(F_{g_1}) - T(F_{g_2}))$ for any polynomial function or the absolute value function $\xi$. For example, with probability at least $1 - \delta$,
$$|T(F_{g_1}) - T(F_{g_2})| \le \max\{|T^{\delta}_U(F_{g_1}) - T^{\delta}_L(F_{g_2})|,\ |T^{\delta}_L(F_{g_1}) - T^{\delta}_U(F_{g_2})|\}.$$
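The bound in Example 2 is simple to compute from per-group intervals, and extends to a maximum over all pairs of groups. The sketch below assumes the intervals are supplied in a dictionary and hold simultaneously for all groups (for instance after a union bound over groups); the function names are our own.

```python
from itertools import combinations

def abs_difference_upper_bound(lo1, hi1, lo2, hi2):
    """Upper bound on |T1 - T2| given lo1 <= T1 <= hi1 and lo2 <= T2 <= hi2."""
    return max(abs(hi1 - lo2), abs(lo1 - hi2))

def max_group_gap_upper_bound(intervals):
    """Upper bound on the largest pairwise gap; intervals maps group -> (lower, upper)."""
    return max(
        abs_difference_upper_bound(*intervals[a], *intervals[b])
        for a, b in combinations(intervals, 2)
    )
```

For instance, if one group's median loss lies in [0.10, 0.20] and another's in [0.15, 0.30], the difference in medians is at most 0.20.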
Control for a general function. To handle general nonlinearities, we introduce the class of functions of bounded total variation. Roughly speaking, if a function is of bounded total variation on an interval, it means that its range is bounded on that interval. This is a very rich class including all continuously differentiable or Lipschitz continuous functions. The following theorem shows that such functions can always be decomposed into two monotonic functions.
Theorem 1. For $(\hat{F}^{\delta}_{n,L}, \hat{F}^{\delta}_{n,U})$, if $\xi$ is a function with bounded total variation on the range of $X$, there exist increasing functions $f_1, f_2$ with explicit and calculable forms, such that with probability at least $1 - \delta$,
$$\xi(F^{-}) \preceq f_1(\hat{F}^{\delta,-}_{n,L}) - f_2(\hat{F}^{\delta,-}_{n,U}).$$
As an example, recall that [4] studies forms like
$$\int_0^1 \xi(F^{-}(\alpha))\psi(\alpha)\,d\alpha$$
to quantify how sensitive the $\alpha$-VaR value is w.r.t. its parameter $\alpha$. For nonlinear functions beyond polynomials, consider the example where $\xi(x) = e^{x} + e^{-x}$. This can be readily bounded since it is a mixture of monotone functions.
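One way to see how a function of bounded variation splits into two increasing parts is the classical Jordan decomposition, sketched below on a grid. This is an assumption on our part about how such $f_1, f_2$ can be obtained numerically; the explicit forms used in Theorem 1 are given in the appendix and may differ. The function name is hypothetical.

```python
import numpy as np

def jordan_decomposition(xi_values):
    """Split grid values of xi into f1 - f2, with f1 and f2 nondecreasing along the grid."""
    v = np.asarray(xi_values, dtype=float)
    steps = np.diff(v)
    # f1 accumulates the upward moves of xi, f2 accumulates the downward moves.
    f1 = v[0] + np.concatenate([[0.0], np.cumsum(np.maximum(steps, 0.0))])
    f2 = np.concatenate([[0.0], np.cumsum(np.maximum(-steps, 0.0))])
    return f1, f2
```

With $\xi(x) = e^{x} + e^{-x}$ on a grid over $[-1, 1]$, `f1 - f2` reproduces $\xi$ exactly on the grid. If a quantile is only known to lie between grid indices `lo` and `hi`, then `f1[hi] - f2[lo]` upper-bounds $\xi$ at that quantile, mirroring the shape of the bound in Theorem 1.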

Section: Control of nonlinear functionals of CDFs and beyond
Finally, we show how our techniques from the previous section can be applied to handle the general form
$$\int_0^1 \xi(T(F^{-}(\alpha)))\psi(\alpha)\,d\alpha,$$
where $\psi$ can be a general function (not necessarily non-negative) and $\xi$ can be a functional of the inverse CDF. To control $\xi(T(F^{-}))$, we can first obtain two-sided bounds for $T(F^{-})$ if $T(F^{-})$ is in the class of QBRMs or in the form of $\int \psi(p)\xi_2(F^{-}(p))\,dp$ for some nonlinear function $\xi_2$ (as in [4]). We can also generalize the weight functions in QBRMs from non-negative to general weight functions once we notice that $\psi$ can be decomposed into two non-negative functions, i.e. $\psi = \max\{\psi, 0\} - \max\{-\psi, 0\}$. Then, we can provide upper bounds for terms like $\int \max\{\psi, 0\}\,\xi(F^{-}(p))\,dp$ by adopting an upper bound for $\xi(F^{-})$.
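The decomposition of a signed weight function can be sketched as follows: the positive part of $\psi$ is paired with an upper bound on $\xi(F^{-})$ and the negative part with a lower bound. The code assumes all inputs are evaluated at midpoints of an equal-width grid on $(0, 1)$ and approximates the integral with the midpoint rule; the names are hypothetical.

```python
import numpy as np

def signed_weight_upper_bound(psi, xi_upper, xi_lower):
    """Upper bound on int_0^1 psi(p) xi(F^-(p)) dp for a weight psi of arbitrary sign."""
    psi = np.asarray(psi, dtype=float)
    positive_part = np.maximum(psi, 0.0)    # max{psi, 0}
    negative_part = np.maximum(-psi, 0.0)   # max{-psi, 0}
    integrand = positive_part * np.asarray(xi_upper, dtype=float) \
        - negative_part * np.asarray(xi_lower, dtype=float)
    return float(np.mean(integrand))
```

Taking $\psi(p) = 2p - 1$ (the Gini numerator weight), $\xi$ the identity, and a uniform loss on $[0, 1]$, exact bounds give $1/6$; a band of width 0.05 on each side adds $0.05 \int_0^1 |2p - 1|\,dp = 0.025$ to the bound.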

Section: Numerical optimization towards tighter bounds for statistical functionals
Having described our framework for obtaining CDF bounds and controlling rich families of risk measures, we return to the question of how to produce the CDF bounds. One drawback of directly using the bound returned by Berk-Jones is that it is not weight function aware, i.e., it does not leverage knowledge of the target risk measures. This motivates the following numerical optimization method, which shows significant improvement over previous bounds including DKW and Berk-Jones bounds (as well as the truncated version proposed in [30]).
Our key observation is that for any $0 \le L_1 \le \cdots \le L_n \le 1$, we have
$$P(\forall i,\ F(X_{(i)}) \ge L_i) \ge n! \int_{L_n}^{1} dx_n \int_{L_{n-1}}^{x_n} dx_{n-1} \cdots \int_{L_1}^{x_2} dx_1,$$
where the right-hand side integral is a function of $\{L_i\}_{i=1}^n$ and its partial derivatives can be calculated exactly by the package in [21]. Consider controlling $\int_0^1 \psi(p)F^{-}(p)\,dp$ as an example. For any $\{L_i\}_{i=1}^n$ satisfying $P(\forall i,\ F(X_{(i)}) \ge L_i) \ge 1 - \delta$, one can use conservative CDF completion to obtain $\hat{F}^{\delta}_{n,L}$, i.e.
$$\int_0^1 \psi(p)\xi(\hat{F}^{\delta,-}_{n,L}(p))\,dp = \sum_{i=1}^{n+1} \xi(X_{(i)}) \int_{L_{i-1}}^{L_i} \psi(p)\,dp,$$
where $L_{n+1} = 1$, $L_0 = 0$, and $X_{(n+1)} = \infty$ or a known upper bound for $X$. Then, we can formulate tightening the upper bound as an optimization problem:
$$\min_{\{L_i\}_{i=1}^n} \sum_{i=1}^{n+1} \xi(X_{(i)}) \int_{L_{i-1}}^{L_i} \psi(p)\,dp \quad \text{such that } P(\forall i,\ F(X_{(i)}) \ge L_i) \ge 1 - \delta, \text{ and } 0 \le L_1 \le \cdots \le L_n \le 1.$$
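The objective of this optimization problem is a weighted sum that is cheap to evaluate for any candidate $\{L_i\}$. The sketch below computes it under our own conventions: the weight function is supplied through its antiderivative $\Psi(p) = \int_0^p \psi(u)\,du$, $X_{(n+1)}$ is a known upper bound on the loss, and $\xi$ defaults to the identity. All names are hypothetical.

```python
import numpy as np

def completed_bound(sorted_losses, lower, psi_cdf, loss_max, xi=lambda x: x):
    """sum_{i=1}^{n+1} xi(X_(i)) * int_{L_{i-1}}^{L_i} psi(p) dp with L_0 = 0, L_{n+1} = 1."""
    x = np.append(np.asarray(sorted_losses, dtype=float), loss_max)    # X_(n+1) = known upper bound
    edges = np.concatenate([[0.0], np.asarray(lower, dtype=float), [1.0]])
    mass = np.diff(psi_cdf(edges))                                     # int of psi over (L_{i-1}, L_i]
    return float(np.dot(xi(x), mass))
```

For sorted losses `[0.1, 0.2, 0.3]`, lower-bound values `[0.1, 0.3, 0.6]`, the expected-loss weight ($\Psi(p) = p$) and a loss upper bound of 1, the value is 0.54; most of it comes from the mass $1 - L_n = 0.4$ assigned to the upper bound, which is why raising the $L_i$'s where $\psi$ is large tightens the result.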
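We optimize the above problem with gradient descent and a simple post-processing procedure to make sure the obtained $\{\hat{L}_i\}_{i=1}^n$ strictly satisfy the above constraints. In practice, we re-parameterize $\{L_i\}_{i=1}^n$ with a network $\phi_\theta$ that maps $n$ random seeds to a function of the $L_i$'s, and transform the optimization objective from $\{L_i\}_{i=1}^n$ to $\theta$. We find that a simple parameterized neural network model with 3 fully-connected hidden layers of dimension 64 is enough for good performance and robust to hyper-parameter settings. For the post-processing step, given the optimized parameters $\hat{\theta}$, we shift the resulting values down by
$$\gamma^{*} = \inf\{\gamma : n!\,\upsilon(L_1(\hat{\theta}) - \gamma, \cdots, L_n(\hat{\theta}) - \gamma, 1) \ge 1 - \delta,\ \gamma \ge 0\},$$
where $n!\,\upsilon(\cdot)$ is the iterated-integral lower bound on the coverage probability given above, evaluated at the shifted values.
Notice that there is always a feasible solution. We can use binary search to efficiently find (a good approximation of) $\gamma^{*}$.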

Section: Experiments
With our experiments, we aim to examine the contributions of our methodology in two areas: bound formation and responsible model selection.

Section: Learn then calibrate for detecting toxic comments
Using the CivilComments dataset [6], we study the application of our approach to toxic comment detection under group-based fairness measures. CivilComments is a large dataset of online comments labeled for toxicity as well as the mention of protected sensitive attributes such as gender, race, and religion. Our loss function is the Brier Score, a proper scoring rule that measures the accuracy of probabilistic predictions, and we work in the common setting where a trained model is calibrated post-hoc to produce confidence estimates that are more faithful to ground-truth label probabilities. We use a pre-trained toxicity model and apply a Platt scaling model controlled by a single parameter to optimize confidence calibration. Our approach is then used to select from a set of hypotheses, determined by varying the scaling parameter in the range [0.25, 2] (where scaling parameter 1 recovers the original model). See the Appendix for more details on the experimental settings and our bound optimization technique.
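To illustrate how a hypothesis set of this kind gives rise to per-example losses, the sketch below rescales model logits by a single parameter and computes the Brier score of each prediction. The exact form of the scaling model is not specified here, so multiplying the logit before the sigmoid is an assumption, as are the function name and the evenly spaced grid of 50 scaling values.

```python
import numpy as np

def scaled_brier_losses(logits, labels, scale):
    """Per-example Brier score after rescaling logits by a single parameter."""
    probs = 1.0 / (1.0 + np.exp(-scale * np.asarray(logits, dtype=float)))
    return (probs - np.asarray(labels, dtype=float)) ** 2

# Hypothetical hypothesis set: one candidate predictor per scaling value in [0.25, 2].
scales = np.linspace(0.25, 2.0, 50)
```

Each scaling value yields one loss sample per validation example, and these loss samples are what the CDF bounds are computed from; a scale of 1 leaves the original model's confidence unchanged.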

Section: Bounding complex expressions of group dispersion
First, we investigate the full power of our framework by applying it to a complex statistical dispersion objective. Our overall loss objective considers both the expected loss averaged across groups and the maximum difference between group medians, and can be expressed as:
$$\mathcal{L} = E_g[T_1(F_g)] + \lambda \sup_{g,g'} |T_2(F_g) - T_2(F_{g'})|,$$
where $T_1$ is expected loss and $T_2$ is a smoothed version of a median (centered around $\beta = 0.5$ with spread parameter $a = 0.1$). Groups are defined by intersectional attributes: $g \in G = \{\text{black female}, \text{white female}, \text{black male}, \text{white male}\}$. We use 100 and 200 samples from each group, and select among 50 predictors. For each group, we use our numerical optimization framework to optimize a bound on $O = T_1(F_g) + T_2(F_g)$ using the predictor (and accompanying loss distribution) chosen under the Berk-Jones method. Results are shown in Table 1. We compare our numerically-optimized bound (NN-Opt.) to the bound given by Berk-Jones as well as an application of the DKW inequality to lower-bounding a CDF.
Our framework enables us to choose a predictor that fits our specified fairness criterion, and produces reasonably tight bounds given the small sample size and the convergence rate of $1/\sqrt{n}$. Moreover, there is a large gain in tightness from numerical optimization in the case where $n = 100$, especially with respect to the bound on the maximum difference in median losses (0.076 vs. 0.016). These results show that a single bound can be flexibly optimized to improve on multiple objectives at once via our numerical method, a key innovation for optimizing bounds reflecting complex societal concerns like differences in group medians [5].

Section: Optimizing bounds on measures of group dispersion
Having studied the effects of applying the full framework, we further investigate whether our method for numerical optimization can be used to get tight and flexible bounds on functionals of interest. First, β-CVaR is a canonical tail measure, and we bound the loss for the worst-off 1 -β proportion of predictions (with β = 0.75). Next, we bound a specified interval of the VaR ([0.5, 0.9]), which is useful when a range of quantiles of interest is known but flexibility to answer different queries within the range is important. Finally, we consider a worst-quantile weighting function ψ(p) = p, which penalizes higher loss values on higher quantiles, and study a smooth delta function around β = 0.5, a more robust version of a median measure. We focus on producing bounds using only 100 samples from a particular intersectionally-defined protected group, in this case black females, and all measures are optimized with the same hyperparameters. The bounds produced via numerical optimization (NN-Opt.) are compared to the bounds in [30] (as DKW has been previously shown to produce weak CDF bounds), including the typical Berk-Jones bound as well as a truncated version tailored to particular quantile ranges. See Table 2 and the Appendix for results.
The numerical optimization method induces much tighter bounds than Berk-Jones on all measures, and also improves over the truncated Berk-Jones where it is applicable. Further, whereas the truncated Berk-Jones bound will give trivial control outside of $[\beta_{min}, \beta_{max}]$, the numerically-optimized bound not only retains a reasonable bound on the entire CDF, but even improves on Berk-Jones with respect to the bound on expected loss in all cases. For example, after adapting to CVaR, the numerically-optimized bound gives a bound on the expected loss of 0.23, versus 0.25 for Berk-Jones and 0.50 for Truncated Berk-Jones. Thus numerical optimization produces both the best bound in the range of interest as well as across the rest of the distribution, showing the value of adapting the bound to the particular functional and loss distribution while still retaining the distribution-free guarantee. Next, we aim to explore the application of our approach to responsible model selection under non-group-based fairness measures, and show how using our framework leads to a more balanced distribution of loss across the population. Further details for both experiments can be found in the Appendix.

Section: Controlling balanced accuracy in detection of genetic mutation
RxRx1 [31] is a task where the input is a 3-channel image of cells obtained by fluorescent microscopy, the label indicates which of 1,139 genetic treatments the cells received, and there is a batch effect that creates a challenging distribution shift across domains. Using a model trained on the train split of the RxRx1 dataset, we evaluate our method with an out-of-distribution validation set to highlight the distribution-free nature of the bounds. We apply a threshold to model output in order to produce prediction sets, or sets of candidate labels for a particular task instance. Prediction sets are scored with a balanced accuracy metric that equally weights sensitivity and specificity, and our overall objective is $\mathcal{L} = T_1(F) + \lambda T_2(F)$, where $T_1$ is expected loss, $T_2$ is the Gini coefficient, and $\lambda = 0.2$. We choose among 50 predictors (i.e. model plus threshold) and use 2500 population samples to produce our bounds. Results are shown in Figure 2.

Section: Producing recommendation sets for the whole population
Using the MovieLens dataset [12], we test whether better control on another important non-group-based dispersion measure, the Atkinson index (with $\varepsilon = 0.5$), leads to a more even distribution of loss across the population. We train a user/item embedding model, and compute a loss that balances precision and recall for each set of user recommendations. Results are shown in Figure 3. Tighter control of the Atkinson index leads to a more evenly spread distribution of loss across the population, even for subgroups defined by protected attributes like age and gender that are unidentified for privacy or security reasons.

Section: Related work
The field of distribution-free uncertainty quantification has its roots in conformal prediction [27].
The coverage guarantees of conformal prediction have recently been extended and generalized to controlling the expected loss of loss functions beyond coverage [1,3]. The framework proposed by [30] offers the ability to select predictors beyond expected loss, to include a rich class of quantile-based risk measures (QBRMs) like CVaR and intervals of the VaR; they also introduce a method for achieving tighter bounds on certain QBRMs by focusing the statistical power of the Berk-Jones bound on a certain quantile range. Note that these measures cannot cover the range of dispersion measures studied in this work.
There is a rich literature studying both standard and group-based statistical dispersion measures, and their use in producing fairer outcomes in machine learning systems. Some work in fairness has aimed at achieving coverage guarantees across groups [24,25], but to our knowledge there has not been prior work exploring controlling loss functions beyond coverage, such as the plethora of loss functions aimed at characterizing fairness, which can be expressed as group-based measures (cf. Section 3.2).
Other recent fairness work has adapted some of the inequality measures found in economics. [32] aims to enforce that outcomes are not too different across groups defined by protected attributes, and introduces a convex notion of group CVaR, and [24] propose a DFUQ method of equalizing coverage between groups. [19] studies distributional inequality measures like Gini and Atkinson index since demographic group information is often unavailable, while [7] use the notion of Lorenz efficiency to generate rankings that increase the utility of both the worst-off users and producers.

Section: Conclusion
In this work, we focus on a rich class of statistical dispersion measures, both standard and group-based, and show how these measures can be controlled. In addition, we offer a novel numerical optimization method for achieving tighter bounds on these quantities. We investigate the effects of applying our framework via several experiments and show that our methods lead to fairer model selection and tighter bounds. We believe our study offers a significant step towards the sort of thorough and transparent validation that is critical for applying machine learning algorithms to applications with societal implications.

Section: Appendix
Broader impact. The broader impact of the proposed framework is significant, as it extends the ability to gain trust in machine learning systems. However, there are important concerns and limitations.
• Focus on performance metrics. In this paper we propose a range of performance metrics, which extend well beyond standard metrics concerning expected loss. However, in many situations these metrics are not sufficient to capture the effects of the machine learning system. Often a number of different metrics are required to provide a clearer picture of model performance, while some effects are difficult to capture in any metric. Also, while the measures studied offer the ability to more evenly distribute a quantity across a population, they do not offer guarantees to individuals. Finally, achieving a more equal distribution of the relevant quantity (e.g., loss or income) may have negative impacts on some segments of the population.
• Limitations. These are summarized in the Conclusion but are expanded upon here. An important assumption in this work, and in distribution-free uncertainty quantification more generally, is that the examples seen in deployment are drawn from the same distribution as those in the validation set that are used to construct the bounds. Although this is an active area of research, here we make this assumption, and the quality of the bounds produced may degrade if the assumption is violated. A second limitation is that the scope of hypotheses and predictors we can select from is limited, due to theoretical constraints: a correction must be performed based on the size of the hypothesis set. Finally, the generated bounds may not be tight, depending on the amount of available validation data and unavoidable limits of the techniques used to produce the bounds. We compared the bounds to empirical values of the corresponding measures in the experiments; more extensive studies would be useful to elucidate the value of the bounds in practice.
Organization of the Appendix. (1) In Appendix A, we provide detailed statements and derivations of our methodology presented in Section 4.2.1, including how to obtain bounds for those measures mentioned in Section 3; (2) in Appendix B, we introduce further societal dispersion measures, beyond those presented in Section 3, and corresponding bounds; (3) in Appendix C, we investigate the extension of our results to multi-dimensional settings; (4) lastly, in Appendices D and E, we provide more complete details and results from our experiments (Section 5).

Section: A Derivations and proofs for bounding methods
In Section A.1, we first consider how to control, or provide upper bounds on, various quantities when we are given $(\hat F^{\delta}_{n,L}, \hat F^{\delta}_{n,U})$, constructed from $\{X_i\}_{i=1}^{n}$, such that
$$P\big(\hat F^{\delta}_{n,L} \preceq F \preceq \hat F^{\delta}_{n,U}\big) \ge 1-\delta,$$
where the randomness is taken over $\{X_i\}_{i=1}^{n}$. Then, in Section A.2, we show how we obtain $(\hat F^{\delta}_{n,L}, \hat F^{\delta}_{n,U})$ by extending the arguments in [30]. In addition, Section A.2.2 details how we go beyond the methods in [30] and provide a numerical optimization method for tighter bounds.
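Proof of Proposition 1. We briefly describe the proof of Proposition 1. The proof is mainly based on [30], but we include it here for completeness. Notice that for any non-decreasing function $G : \mathbb{R} \to \mathbb{R}$ (not just a CDF), the (general) inverse of $G$ is defined as $G^{-}(p) = \inf\{x : G(x) \ge p\}$ for any $p \in \mathbb{R}$.
Proposition 2 (Restatement of Proposition 1). For the CDF $F$ of $X$, if there exist two increasing functions $F_U, F_L$ such that $F_U \succeq F \succeq F_L$, then we have $F_L^{-} \succeq F^{-} \succeq F_U^{-}$.
Proof. For any $p$, the ordering $F_U \succeq F \succeq F_L$ gives $\{x : F_L(x) \ge p\} \subseteq \{x : F(x) \ge p\} \subseteq \{x : F_U(x) \ge p\}$, and taking infima yields $F_U^{-}(p) \le F^{-}(p) \le F_L^{-}(p)$.
A.1 Control of nonlinear functions of CDFs (Section 4.2.1)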

Section: A.1.1 Control for monotonic functions
Recall that we start with the simplest case where ξ is a monotonic function in the range of X. It is straightforward to have the following claim.
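Claim 1. Suppose $\hat F^{\delta}_{n,L} \preceq F \preceq \hat F^{\delta}_{n,U}$ holds with probability at least $1-\delta$ for some $\delta \in (0, 1)$. If $\xi$ is an increasing function, then
$$\xi(\hat F^{\delta,-}_{n,L}) \succeq \xi(F^{-}) \succeq \xi(\hat F^{\delta,-}_{n,U})$$
with probability at least $1-\delta$. Similarly, if $\xi$ is a decreasing function, then
$$\xi(\hat F^{\delta,-}_{n,L}) \preceq \xi(F^{-}) \preceq \xi(\hat F^{\delta,-}_{n,U})$$
with probability at least $1-\delta$.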
We show how this can be applied to bound the Gini coefficient and the Atkinson index by controlling the numerator and denominator separately as integrals of monotonic functions of $F^{-}$.
Example 3 (Gini coefficient). If given a $(1-\delta)$-CBP $(\hat F^{\delta}_{n,L}, \hat F^{\delta}_{n,U})$ and $\hat F^{\delta}_{n,L} \succeq 0$, we can provide the following bound for the Gini coefficient. Notice that
$$G(X) = \frac{\int_0^1 (2p-1)\,F^{-}(p)\,dp}{\int_0^1 F^{-}(p)\,dp} = \frac{\int_0^1 2p\,F^{-}(p)\,dp}{\int_0^1 F^{-}(p)\,dp} - 1.$$
Given $F^{-}(p) \ge 0$ (since we only consider non-negative losses, i.e., $X$ is always non-negative), we know
$$G(X) \le \frac{\int_0^1 2p\,\hat F^{\delta,-}_{n,L}(p)\,dp}{\int_0^1 \hat F^{\delta,-}_{n,U}(p)\,dp} - 1,$$
with probability at least $1-\delta$.
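As an illustration, the Gini bound above can be evaluated numerically once the two quantile-function bounds are available on a grid. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the function name `gini_upper_bound`, the midpoint grid, and the toy quantile bounds are all hypothetical choices made for illustration.

```python
import numpy as np

def gini_upper_bound(q_hi, q_lo, p):
    """Upper bound on the Gini coefficient from quantile-function bounds.

    Assumes q_hi[i] >= F^-(p[i]) >= q_lo[i] >= 0 on a uniform midpoint grid p
    of (0, 1). Here q_hi plays the role of the inverse of the lower CDF bound
    and q_lo the inverse of the upper CDF bound.
    """
    numerator = np.mean(2.0 * p * q_hi)   # midpoint rule for int_0^1 2p q_hi(p) dp
    denominator = np.mean(q_lo)           # midpoint rule for int_0^1 q_lo(p) dp
    return numerator / denominator - 1.0

# Toy check (hypothetical data): X ~ Uniform(0, 1) has F^-(p) = p and Gini 1/3.
m = 10_000
p = (np.arange(m) + 0.5) / m
exact = gini_upper_bound(p, p, p)  # bounds coincide with the true quantile function
bound = gini_upper_bound(np.minimum(p + 0.05, 1.0), np.maximum(p - 0.05, 0.0), p)
```

When the two quantile bounds coincide, the expression reduces to the Gini coefficient itself; widening the band can only increase the returned value.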
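Example 4 (Atkinson index). First, we present the complete version of the Atkinson index. Namely,
$$A(\varepsilon, X) := \begin{cases} 1 - \dfrac{\left(\int_0^1 (F^{-}(p))^{1-\varepsilon}\,dp\right)^{\frac{1}{1-\varepsilon}}}{\int_0^1 F^{-}(p)\,dp}, & \text{if } \varepsilon \ge 0,\ \varepsilon \ne 1; \\[2ex] 1 - \dfrac{\exp\left(\int_0^1 \ln(F^{-}(p))\,dp\right)}{\int_0^1 F^{-}(p)\,dp}, & \text{if } \varepsilon = 1. \end{cases}$$
Notice that for $\varepsilon \in [0, 1)$, $(\cdot)^{1-\varepsilon}$ and $\ln(\cdot)$ are increasing functions. Thus, for the Atkinson index and a $(1-\delta)$-CBP $(\hat F^{\delta}_{n,L}, \hat F^{\delta}_{n,U})$, if $\hat F^{\delta}_{n,L} \succeq 0$, let us define
$$A^{\delta}_{U}(\varepsilon, X) := \begin{cases} 1 - \dfrac{\left(\int_0^1 (\hat F^{\delta,-}_{n,U}(p))^{1-\varepsilon}\,dp\right)^{\frac{1}{1-\varepsilon}}}{\int_0^1 \hat F^{\delta,-}_{n,L}(p)\,dp}, & \text{if } \varepsilon \ge 0,\ \varepsilon \ne 1; \\[2ex] 1 - \dfrac{\exp\left(\int_0^1 \ln(\hat F^{\delta,-}_{n,U}(p))\,dp\right)}{\int_0^1 \hat F^{\delta,-}_{n,L}(p)\,dp}, & \text{if } \varepsilon = 1. \end{cases}$$
Then, with probability at least $1-\delta$, $A^{\delta}_{U}(\varepsilon, X)$ is an upper bound for $A(\varepsilon, X)$ for all $\varepsilon \in [0, 1)$.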
As mentioned in Remark 1, instead of calculating bounds separately for each $\varepsilon$, simple post-processing enables us to efficiently issue a family of bounds.
Example 5 (CVaR fairness-risk measures and beyond). Recall that for $\alpha \in (0, 1)$,
$$D_{CV,\alpha}(T(F_g)) = \min_{\rho \in \mathbb{R}} \left\{ \rho + \frac{1}{1-\alpha} \cdot \mathbb{E}_{g \sim P_{\mathrm{Idx}}}\big[T(F_g) - \rho\big]_{+} \right\} - \mathbb{E}_{g \sim P_{\mathrm{Idx}}}[T(F_g)].$$
The function $[T(F_g) - \rho]_{+}$ is an increasing function of $T(F_g)$ when $\rho$ is fixed, and its further composition with the expectation operation is still increasing. If we have $(T^{\delta}_{L}(F_g), T^{\delta}_{U}(F_g))$ such that $T^{\delta}_{L}(F_g) \le T(F_g) \le T^{\delta}_{U}(F_g)$ for all $g$ with probability at least $1-\delta$, then we have
$$D_{CV,\alpha}(T(F_g)) \le \min_{\rho \in \mathbb{R}} \left\{ \rho + \frac{1}{1-\alpha} \cdot \mathbb{E}_{g \sim P_{\mathrm{Idx}}}\big[T^{\delta}_{U}(F_g) - \rho\big]_{+} \right\} - \mathbb{E}_{g \sim P_{\mathrm{Idx}}}[T^{\delta}_{L}(F_g)],$$
and the first term of the RHS can be minimized easily since it is a convex function of $\rho$.
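To make the last remark concrete: for finitely many groups, the first term is a convex piecewise-linear function of $\rho$ whose kinks sit at the per-group upper bounds, so its minimum is attained at one of those values. The sketch below uses this fact; the function name `cvar_fairness_upper_bound`, the argument names, and the toy numbers are hypothetical, and the code is only an illustration of the bound, not the authors' implementation.

```python
import numpy as np

def cvar_fairness_upper_bound(t_lo, t_hi, alpha, probs=None):
    """Upper bound on D_{CV,alpha} from per-group bounds t_lo[g] <= T(F_g) <= t_hi[g].

    probs are the group probabilities under P_Idx (uniform if omitted).
    """
    t_lo = np.asarray(t_lo, dtype=float)
    t_hi = np.asarray(t_hi, dtype=float)
    if probs is None:
        probs = np.full(len(t_hi), 1.0 / len(t_hi))
    probs = np.asarray(probs, dtype=float)
    # excess[c, g] = [t_hi[g] - rho_c]_+ for each candidate rho_c = t_hi[c].
    excess = np.maximum(t_hi[None, :] - t_hi[:, None], 0.0)
    cvar_term = t_hi + (excess @ probs) / (1.0 - alpha)
    # Minimum over the candidate kinks, minus the lower bound on the mean.
    return float(cvar_term.min() - probs @ t_lo)

# Toy example (hypothetical values): four equally likely groups.
tight = cvar_fairness_upper_bound([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4], alpha=0.5)
loose = cvar_fairness_upper_bound([0.05, 0.15, 0.25, 0.35], [0.15, 0.25, 0.35, 0.45], alpha=0.5)
```

With degenerate intervals the function returns the measure itself (mean of the worst half minus the overall mean in this example), and wider intervals give a larger bound.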

Section: A.1.2 Control for absolute and polynomial functions
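Recall that if $s_L \le s \le s_U$, then
$$s_L \mathbb{1}\{s_L \ge 0\} - s_U \mathbb{1}\{s_U \le 0\} \le |s| \le \max\{|s_U|, |s_L|\}.$$
More generally, consider any polynomial function $\phi(s) = \sum_{k \ge 0} \alpha_k s^k$. Notice that if $k$ is odd, $s^k$ is monotonic w.r.t. $s$, and if $k$ is even, $s^k = |s|^k$, so we can bound
$$\phi(s) \le \sum_{k \text{ odd},\ \alpha_k \ge 0} \alpha_k s_U^k + \sum_{k \text{ odd},\ \alpha_k < 0} \alpha_k s_L^k + \sum_{k \text{ even},\ \alpha_k \ge 0} \alpha_k \max\{|s_L|^k, |s_U|^k\} + \sum_{k \text{ even},\ \alpha_k < 0} \alpha_k \big(s_L \mathbb{1}\{s_L \ge 0\} - s_U \mathbb{1}\{s_U \le 0\}\big)^k.$$
So, for $\phi(F^{-})$, we can plug in $\hat F^{\delta,-}_{n,L}$ and $\hat F^{\delta,-}_{n,U}$ to replace $s_U$ and $s_L$, respectively, to obtain an upper bound with probability at least $1-\delta$. The derivation for the lower bound is similar. We summarize our results as the following proposition.
Proposition 3. If given a $(1-\delta)$-CBP $(\hat F^{\delta}_{n,L}, \hat F^{\delta}_{n,U})$, then with probability at least $1-\delta$,
$$\hat F^{\delta,-}_{n,U} \mathbb{1}\{\hat F^{\delta,-}_{n,U} \ge 0\} - \hat F^{\delta,-}_{n,L} \mathbb{1}\{\hat F^{\delta,-}_{n,L} \le 0\} \preceq |F^{-}| \preceq \max\{|\hat F^{\delta,-}_{n,L}|, |\hat F^{\delta,-}_{n,U}|\}.$$
Moreover, for any polynomial function $\phi(s) = \sum_{k \ge 0} \alpha_k s^k$, we have
$$\phi(F^{-}) \preceq \sum_{k \text{ odd},\ \alpha_k \ge 0} \alpha_k (\hat F^{\delta,-}_{n,L})^k + \sum_{k \text{ odd},\ \alpha_k < 0} \alpha_k (\hat F^{\delta,-}_{n,U})^k + \sum_{k \text{ even},\ \alpha_k \ge 0} \alpha_k \max\{|\hat F^{\delta,-}_{n,U}|^k, |\hat F^{\delta,-}_{n,L}|^k\} + \sum_{k \text{ even},\ \alpha_k < 0} \alpha_k \big(\hat F^{\delta,-}_{n,U} \mathbb{1}\{\hat F^{\delta,-}_{n,U} \ge 0\} - \hat F^{\delta,-}_{n,L} \mathbb{1}\{\hat F^{\delta,-}_{n,L} \le 0\}\big)^k.$$
Example 6. If we have $(T^{\delta}_{L}(F_g), T^{\delta}_{U}(F_g))$ such that $T^{\delta}_{L}(F_g) \le T(F_g) \le T^{\delta}_{U}(F_g)$ holds for all $g$ we consider, then we can provide high-probability upper bounds for
$$\xi(T(F_{g_1}) - T(F_{g_2}))$$
for any polynomial function or the absolute value function $\xi$. For example, with probability at least $1-\delta$,
$$|T(F_{g_1}) - T(F_{g_2})| \le \max\{|T^{\delta}_{U}(F_{g_1}) - T^{\delta}_{L}(F_{g_2})|, |T^{\delta}_{L}(F_{g_1}) - T^{\delta}_{U}(F_{g_2})|\}.$$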
We will further show in Appendix B how our results are applied to specific examples.
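The interval rules of this subsection are easy to check numerically for scalars. Below is a small Python sketch, assuming only that $s$ is known to lie in $[s_L, s_U]$; the function names `abs_bounds` and `poly_upper_bound` and the example polynomial are hypothetical and chosen for illustration.

```python
import numpy as np

def abs_bounds(s_lo, s_hi):
    """Lower and upper bounds on |s| for any s in [s_lo, s_hi]."""
    lower = s_lo * (s_lo >= 0) - s_hi * (s_hi <= 0)
    upper = max(abs(s_lo), abs(s_hi))
    return lower, upper

def poly_upper_bound(coeffs, s_lo, s_hi):
    """Upper bound on sum_k coeffs[k] * s**k for any s in [s_lo, s_hi]."""
    abs_lo, abs_hi = abs_bounds(s_lo, s_hi)
    total = 0.0
    for k, a in enumerate(coeffs):
        if k % 2 == 1:
            # Odd powers are monotone in s.
            total += a * (s_hi ** k if a >= 0 else s_lo ** k)
        else:
            # Even powers depend on s only through |s|.
            total += a * (abs_hi ** k if a >= 0 else abs_lo ** k)
    return total

# Example: phi(s) = 1 - 2s + 3s^2 - 0.5s^3 - s^4 on the interval [-1, 2].
coeffs = [1.0, -2.0, 3.0, -0.5, -1.0]
bound = poly_upper_bound(coeffs, -1.0, 2.0)
grid = np.linspace(-1.0, 2.0, 2001)
values = np.polynomial.polynomial.polyval(grid, coeffs)
```

The bound treats each monomial separately, so it is valid but generally not tight: in this example it evaluates to 15.5, well above the largest value of the polynomial on the interval.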

Section: A.1.3 Control for a general function
To handle general non-linearity, we need to introduce the class of functions of bounded variation on a certain interval, which is a very rich class that includes all functions that are continuously differentiable or Lipschitz continuous on that interval.
Definition 4 (Functions of bounded total variation [26]). Define the set of partitions of $[a, b]$ as
$$\Pi = \{\pi = (x_0, x_1, \cdots, x_{n_\pi}) \mid \pi \text{ is a partition of } [a, b] \text{ satisfying } x_i \le x_{i+1} \text{ for all } 0 \le i \le n_\pi - 1\}.$$
Then, the total variation of a continuous real-valued function $\xi$ defined on $[a, b] \subset \mathbb{R}$ is
$$V_a^b(\xi) := \sup_{\pi \in \Pi} \sum_{i=0}^{n_\pi - 1} |\xi(x_{i+1}) - \xi(x_i)|,$$
and we say a function $\xi$ is of bounded variation, i.e., $\xi \in BV([a, b])$, iff $V_a^b(\xi) < \infty$.
Recall that $X \ge 0$ in our cases. Then, for $\xi(F^{-})$, we have the following bound.
Theorem 2 (A restatement & formal version of Theorem 1). Given a $(1-\delta)$-CBP $(\hat F^{\delta}_{n,L}, \hat F^{\delta}_{n,U})$, for any $p \in [0, 1]$ such that the total variation of $\xi$ is finite on $[0, \hat F^{\delta,-}_{n,L}(p)]$, we have
$$\xi(F^{-}(p)) \le V_0^{\hat F^{\delta,-}_{n,L}(p)}(\xi) - V_0^{\hat F^{\delta,-}_{n,U}(p)}(\xi) + \xi(\hat F^{\delta,-}_{n,U}(p)).$$
Moreover, if $\xi$ is continuously differentiable on $[0, \hat F^{\delta,-}_{n,L}(p)]$, we can express $V_0^x(\xi)$ as
$$\int_0^x \left|\frac{d\xi}{ds}(s)\right| ds \quad \text{for any } x \in [0, \hat F^{\delta,-}_{n,L}(p)].$$
Proof. By the property of functions of bounded total variation [26], if $\xi$ is of bounded total variation on $[0, \hat F^{\delta,-}_{n,L}(p)]$, then for any $x \in [0, \hat F^{\delta,-}_{n,L}(p)]$,
$$\xi(x) = V_0^x(\xi) - \big(V_0^x(\xi) - \xi(x)\big),$$
where both $f_1(x) := V_0^x(\xi)$ and $f_2(x) := V_0^x(\xi) - \xi(x)$ are increasing functions. Moreover, $V_0^x(\xi) = \int_0^x \left|\frac{d\xi}{ds}(s)\right| ds$ if $\xi$ is continuously differentiable.
Thus, by taking advantage of the monotonicity, we have
$$\xi(F^{-}(p)) \le V_0^{\hat F^{\delta,-}_{n,L}(p)}(\xi) - V_0^{\hat F^{\delta,-}_{n,U}(p)}(\xi) + \xi(\hat F^{\delta,-}_{n,U}(p)).$$
So, if $\xi$ is of bounded variation on the range of $X$, then
$$\xi(F^{-}) \preceq V_0^{\hat F^{\delta,-}_{n,L}}(\xi) - V_0^{\hat F^{\delta,-}_{n,U}}(\xi) + \xi(\hat F^{\delta,-}_{n,U}) = f_1(\hat F^{\delta,-}_{n,L}) - f_2(\hat F^{\delta,-}_{n,U}).$$
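For a single quantile level, the bound of Theorem 2 only needs the variation of $\xi$ between the two quantile bounds, since total variation is additive over adjacent intervals: $V_0^{b}(\xi) - V_0^{a}(\xi) = V_a^{b}(\xi)$ for $a \le b$. The Python sketch below approximates that variation on a fine grid. It is an illustration under assumptions: `bv_upper_bound` is a hypothetical name, `x_lo` and `x_hi` stand for the values of $\hat F^{\delta,-}_{n,U}(p)$ and $\hat F^{\delta,-}_{n,L}(p)$, and a grid sum approximates the total variation from below, so a rigorous implementation would use an analytic expression or an upper estimate instead.

```python
import numpy as np

def bv_upper_bound(xi, x_lo, x_hi, num=10_001):
    """Approximate V_{x_lo}^{x_hi}(xi) + xi(x_lo), an upper bound on xi(x)
    for every x in [x_lo, x_hi], where 0 <= x_lo <= x_hi."""
    grid = np.linspace(x_lo, x_hi, num)
    variation = np.abs(np.diff(xi(grid))).sum()  # grid approximation of the total variation
    return float(variation + xi(x_lo))

# Non-monotone example: xi = sin on [0.5, 4.0].
sin_bound = bv_upper_bound(np.sin, 0.5, 4.0)
# Increasing example: xi(s) = s^2 on [1, 2], where the bound reduces to xi(x_hi).
square_bound = bv_upper_bound(np.square, 1.0, 2.0)
```

For an increasing $\xi$ the bound coincides with the monotone case of Section A.1.1; for a non-monotone $\xi$ it is valid but can be loose.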
A.2 Methods to obtain confidence two-sided bounds for CDFs (Section 4.1)
We provide details for two-sided bounds and our numerical methods in the following.
A.2.1 The reduction approach to constructing upper bounds of CDFs (Section 4.1.1)
We here provide the proof of Lemma 1.
Lemma 2 (A restatement & formal version of Lemma 1). For $0 \le L_1 \le L_2 \le \cdots \le L_n \le 1$, since $P(\forall i : F(X_{(i)}) \ge L_i) \ge P(\forall i : U_{(i)} \ge L_i)$ by [30], if we further have $P(\forall i : U_{(i)} \ge L_i) \ge 1-\delta$, then we have
$$P\Big(\forall i : \lim_{\epsilon \to 0^{+}} F(X_{(i)} - \epsilon) \le 1 - L_{n-i+1}\Big) \ge 1-\delta.$$
Furthermore, let $R(x)$ be defined as
$$R(x) = \begin{cases} 1 - L_n, & \text{for } x < X_{(1)} \\ 1 - L_{n-1}, & \text{for } X_{(1)} \le x < X_{(2)} \\ \ \vdots \\ 1 - L_1, & \text{for } X_{(n-1)} \le x < X_{(n)} \\ 1, & \text{for } X_{(n)} \le x. \end{cases}$$
Then $F \preceq R$ with probability at least $1-\delta$.
Proof. For given order statistics $\{X_{(i)}\}_{i=1}^{n}$, let $P_{\{X_{(i)}\}_{i=1}^{n}}$ denote the probability taken over the randomness of $\{X_{(i)}\}_{i=1}^{n}$, and let $P_X$ denote the probability taken over the randomness of $X$, which is an independent random variable drawn from $F$. Let us denote $B = -X$, and let $B_{(i)}$ be the $i$-th order statistic of the samples $\{-X_i\}_{i=1}^{n}$. It is easy to see that $B_{(n-i+1)} = -X_{(i)}$. We also denote by $P_B$ the probability taken over the randomness of $B$, and by $F_B$ the CDF of $B$. Then
$$\begin{aligned} & P_{\{X_{(i)}\}_{i=1}^{n}}\Big(\forall i : \lim_{\epsilon \to 0^{+}} F(X_{(i)} - \epsilon) \le 1 - L_{n-i+1}\Big) \\ &= P_{\{X_{(i)}\}_{i=1}^{n}}\big(\forall i : P_X(X \ge X_{(i)}) > L_{n-i+1}\big) \\ &= P_{\{X_{(i)}\}_{i=1}^{n}}\big(\forall i : P_X(-X \le -X_{(i)}) > L_{n-i+1}\big) \\ &= P_{\{X_{(i)}\}_{i=1}^{n}}\big(\forall i : P_B(B \le B_{(n-i+1)}) > L_{n-i+1}\big) \\ &= P\big(\forall i : F_B \circ F_B^{-}(U_{(n-i+1)}) > L_{n-i+1}\big) \\ &\ge P\big(\forall i : U_{(n-i+1)} > L_{n-i+1}\big), \end{aligned}$$
where we use the fact that $F_B^{-}(U_{(n-i+1)})$ has the same distribution as $B_{(n-i+1)}$, and the last inequality follows from Proposition 1, eq. 24 on p.5 of [28].
Notice that $P(\forall i : U_{(n-i+1)} > L_{n-i+1}) = P(\forall i : U_{(n-i+1)} \ge L_{n-i+1})$, and according to [30] and our assumption, $P(\forall i : F(X_{(i)}) \ge L_i) \ge P(\forall i : U_{(i)} \ge L_i) \ge 1-\delta$. The conservative construction of $R$ satisfies $R \succeq F$ straightforwardly if $\forall i : \lim_{\epsilon \to 0^{+}} F(X_{(i)} - \epsilon) \le 1 - L_{n-i+1}$ holds. Thus, we know $R \succeq F$ with probability at least $1-\delta$. Our proof is complete.
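The step function $R$ from Lemma 2 is straightforward to build from the sorted samples and the levels $L_1 \le \cdots \le L_n$. The following Python sketch shows one way to do so; the helper name `cdf_upper_bound` and the toy inputs are hypothetical, and ties among samples are not treated specially.

```python
import numpy as np

def cdf_upper_bound(samples, levels):
    """Return the step function R of Lemma 2 as a callable.

    levels = (L_1, ..., L_n), non-decreasing, one level per sample.
    """
    xs = np.sort(np.asarray(samples, dtype=float))
    levels = np.asarray(levels, dtype=float)
    # values[j] is the value of R when exactly j samples are <= x:
    # 1 - L_n, 1 - L_{n-1}, ..., 1 - L_1, and finally 1.
    values = np.concatenate([1.0 - levels[::-1], [1.0]])

    def R(x):
        j = np.searchsorted(xs, x, side="right")  # number of samples <= x
        return values[j]

    return R

# Toy example with n = 3 (hypothetical numbers).
R = cdf_upper_bound([0.3, 0.1, 0.2], [0.1, 0.3, 0.6])
```

Here `R(0.05)` returns $1 - L_3$, `R(0.25)` returns $1 - L_1$, and any argument at or above the largest sample returns 1.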

Section: A.2.2 Details of numerical optimization method (Section 4.3)
Now, we introduce the details of our numerical optimization method. Recall that one drawback of the QBRM bounding approach is that it is not weight-function aware: when controlling $\int_0^1 \psi(p) F^{-}(p)\,dp$ for a non-negative weight function $\psi$, the procedure ignores the structure of $\psi$, as it first obtains $\hat F^{\delta}_{n,L}$ and then provides the upper bound $\int_0^1 \psi(p) \hat F^{\delta,-}_{n,L}(p)\,dp$. Our numerical approach overcomes that drawback and can also easily be applied to handle mixtures of multiple functionals. The bounds obtained by our method are significantly tighter than those provided by the methods in [30] in the regime of small data size. The small-data regime is the one of practical interest: as the data size grows, all the bounds we discussed converge to the same value, and the gap between different bounds shrinks to 0.
First, by [? ] and Proposition 1, eq. 24 on p.5 of [28], we have for any $0 \le L_1 \le \cdots \le L_n \le 1$,
$$P\big(\forall i,\ F(X_{(i)}) \ge L_i\big) \ge P\big(\forall i,\ U_{(i)} \ge L_i\big) \ge n! \int_{L_n}^{1} dx_n \int_{L_{n-1}}^{x_n} dx_{n-1} \cdots \int_{L_1}^{x_2} dx_1,$$
where the right-hand side integral is a function of $\{L_i\}_{i=1}^{n}$ and its partial derivatives can be exactly calculated by the package in [21]. Specifically, the package in [21] enables us to calculate
$$\upsilon(L_1, L_2, \cdots, L_n, 1) := \int_{L_n}^{1} dx_n \int_{L_{n-1}}^{x_n} dx_{n-1} \cdots \int_{L_1}^{x_2} dx_1$$
for any positive integer $n$. Notice that the partial derivative of $\upsilon(L_1, L_2, \cdots, L_n, 1)$ with respect to $L_i$ is
$$\begin{aligned} \partial_{L_i} \upsilon(L_1, L_2, \cdots, L_n, 1) &= -\int_{L_n}^{1} dx_n \int_{L_{n-1}}^{x_n} dx_{n-1} \cdots \int_{L_{i+1}}^{x_{i+2}} dx_{i+1} \cdot \int_{L_{i-1}}^{L_i} dx_{i-1} \cdots \int_{L_1}^{x_2} dx_1 \\ &= -\upsilon(L_{i+1}, \cdots, L_n, 1) \cdot \upsilon(L_1, \cdots, L_{i-1}, L_i), \end{aligned}$$
so the partial derivatives can also be calculated with the package in [21].
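For small $n$, the iterated integral $\upsilon$ and its partial derivatives can be evaluated directly, because each inner integral is a polynomial in its upper limit. The Python sketch below does this with NumPy polynomials. It is not the package in [21]: the names `upsilon`, `upsilon_partial`, and `coverage_probability` are hypothetical, and this naive recursion is expected to lose numerical accuracy as $n$ grows, so it is meant only to illustrate the formulas above.

```python
from math import factorial

import numpy as np
from numpy.polynomial import Polynomial

def upsilon(levels, upper=1.0):
    """Iterated integral upsilon(L_1, ..., L_m, upper) for non-decreasing levels."""
    q = Polynomial([1.0])  # integrand as a polynomial in the next upper limit
    for level in levels:
        antiderivative = q.integ()
        q = antiderivative - antiderivative(level)  # integral from `level` to x
    return float(q(upper))

def upsilon_partial(levels, i):
    """Partial derivative of upsilon(L_1, ..., L_n, 1) w.r.t. levels[i] (0-indexed)."""
    levels = list(levels)
    return -upsilon(levels[i + 1:], 1.0) * upsilon(levels[:i], levels[i])

def coverage_probability(levels):
    """n! * upsilon(L_1, ..., L_n, 1), the right-hand side of the display above."""
    return factorial(len(levels)) * upsilon(levels)

# Toy example with n = 2: P(U_(1) >= 0.2, U_(2) >= 0.5) for two uniform draws.
prob = coverage_probability([0.2, 0.5])
grad = upsilon_partial([0.2, 0.5, 0.7], 1)
```

In the two-sample example the value can be checked by hand: $P(\min \ge 0.2) - P(\text{both in } [0.2, 0.5)) = 0.64 - 0.09 = 0.55$.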
Consider providing upper or lower bounds for $\int_0^1 \psi(p) F^{-}(p)\,dp$ for a non-negative weight function $\psi$ as an example. For any $\{L_i\}_{i=1}^{n}$ satisfying $P(\forall i,\ F(X_{(i)}) \ge L_i) \ge 1-\delta$, one can use the conservative CDF completion in [30] to obtain $\hat F^{\delta}_{n,L}$, i.e.,
$$\int_0^1 \psi(p)\,\xi(\hat F^{\delta,-}_{n,L}(p))\,dp = \sum_{i=1}^{n+1} \xi(X_{(i)}) \int_{L_{i-1}}^{L_i} \psi(p)\,dp,$$
where $L_{n+1} = 1$, $L_0 = 0$, and $X_{(n+1)} = \infty$ or a known upper bound for $X$. Then, we can formulate tightening the upper bound as an optimization problem:
$$\min_{\{L_i\}_{i=1}^{n}} \sum_{i=1}^{n+1} \xi(X_{(i)}) \int_{L_{i-1}}^{L_i} \psi(p)\,dp \quad \text{such that } P\big(\forall i,\ F(X_{(i)}) \ge L_i\big) \ge 1-\delta \text{ and } 0 \le L_1 \le \cdots \le L_n \le 1.$$
Similarly, for the lower bound, we can use the CDF completion mentioned in Theorem 1 to construct $\hat F^{\delta}_{n,U}$, and then study the following lower bound for $\int_0^1 \psi(p) F^{-}(p)\,dp$:
$$\sum_{i=1}^{n} \xi(X_{(i)}) \int_{L_{n-i}}^{L_{n-i+1}} \psi(p)\,dp,$$
where $X_{(0)} = 0$.
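The upper-bound objective is a finite weighted sum, so it can be computed in closed form whenever an antiderivative of $\psi$ is available. The Python sketch below evaluates it; the name `weighted_quantile_upper_bound`, the argument `Psi` (an antiderivative of $\psi$), and the toy inputs are assumptions for illustration, and a finite `x_max` is required to avoid an infinite last term.

```python
import numpy as np

def weighted_quantile_upper_bound(sorted_x, levels, Psi, x_max, xi=lambda s: s):
    """Evaluate sum_{i=1}^{n+1} xi(X_(i)) * (Psi(L_i) - Psi(L_{i-1})).

    Uses L_0 = 0, L_{n+1} = 1 and X_(n+1) = x_max (a known upper bound for X).
    Psi is an antiderivative of the weight function psi.
    """
    knots = np.concatenate([[0.0], np.asarray(levels, dtype=float), [1.0]])
    mass = np.diff(Psi(knots))                       # int_{L_{i-1}}^{L_i} psi(p) dp
    values = xi(np.append(np.asarray(sorted_x, dtype=float), x_max))
    return float(np.sum(values * mass))

# Toy example (hypothetical numbers): psi(p) = 1, so Psi(p) = p and the
# objective is an upper bound on the mean.
mean_bound = weighted_quantile_upper_bound(
    [0.1, 0.4, 0.8], [0.2, 0.5, 0.9], Psi=lambda p: p, x_max=1.0
)
# Quantile-weighted example: psi(p) = p, so Psi(p) = p^2 / 2.
weighted_bound = weighted_quantile_upper_bound(
    [0.1, 0.4, 0.8], [0.2, 0.5, 0.9], Psi=lambda p: 0.5 * p**2, x_max=1.0
)
```

In the first example the sum is $0.1 \cdot 0.2 + 0.4 \cdot 0.3 + 0.8 \cdot 0.4 + 1 \cdot 0.1 = 0.56$.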
Parameterized model approach. Notice that the above optimization problem formulation has a drawback: if more samples are drawn, i.e., $n$ increases, then the number of parameters we need to optimize also increases. In practice, we re-parameterize $\{L_i\}_{i=1}^{n}$ as follows:
$$L_i(\theta) = \frac{\sum_{j=1}^{i} \exp(\phi_\theta(g_j))}{1 + \sum_{j=1}^{n} \exp(\phi_\theta(g_j))},$$
where the $g_j$ are random Gaussian seeds. This is in the same spirit as using random seeds in generative models. We find that a simple parameterized neural network model with 3 fully-connected hidden layers of dimension 64 is enough for good performance and is robust to hyper-parameter settings. Taking the upper bound optimization problem as an example, with the new parameterized model we have
$$\min_{\theta} \sum_{i=1}^{n+1} \xi(X_{(i)}) \int_{L_{i-1}(\theta)}^{L_i(\theta)} \psi(p)\,dp \tag{1}$$
$$\text{such that } n! \int_{L_n(\theta)}^{1} dx_n \int_{L_{n-1}(\theta)}^{x_n} dx_{n-1} \cdots \int_{L_1(\theta)}^{x_2} dx_1 \ge 1-\delta,$$
where $L_0 = 0$, $L_{n+1} = 1$, and $X_{(n+1)} = \infty$ or a known upper bound for $X$. We can solve the above optimization problem using heuristic methods such as [9].
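The re-parameterization maps unconstrained network outputs to levels that are automatically increasing and lie strictly between 0 and 1. The Python sketch below shows the mapping with an untrained network; the ReLU activation, the initialization, and the names `mlp_scores` and `levels_from_scores` are assumptions made for illustration and are not specified in the text (only the 3 hidden layers of width 64 and seed size 32 come from the paper).

```python
import numpy as np

def mlp_scores(seeds, weights, biases):
    """Forward pass of a small fully connected network phi_theta (ReLU assumed)."""
    h = seeds
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(h @ W + b, 0.0)
    return (h @ weights[-1] + biases[-1]).ravel()

def levels_from_scores(scores):
    """L_i = sum_{j<=i} exp(s_j) / (1 + sum_j exp(s_j)), computed stably."""
    scores = np.asarray(scores, dtype=float)
    shift = max(scores.max(), 0.0)        # subtract a common shift before exponentiating
    w = np.exp(scores - shift)
    return np.cumsum(w) / (np.exp(-shift) + w.sum())

# Untrained example network (hypothetical initialization).
rng = np.random.default_rng(0)
n, seed_dim, hidden = 100, 32, 64
sizes = [seed_dim, hidden, hidden, hidden, 1]
weights = [rng.normal(0.0, 0.1, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]
seeds = rng.standard_normal((n, seed_dim))   # one Gaussian seed g_j per level
L = levels_from_scores(mlp_scores(seeds, weights, biases))
```

Because the number of network parameters is fixed, the size of the optimization problem no longer grows with $n$; only the number of seeds does.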
Post-processing for a rigorous guarantee for constraints. Notice that the constraint
$$n! \int_{L_n(\theta)}^{1} dx_n \int_{L_{n-1}(\theta)}^{x_n} dx_{n-1} \cdots \int_{L_1(\theta)}^{x_2} dx_1 \ge 1-\delta$$
may not be satisfied by the above optimization, because we may use surrogates such as Lagrangian forms in our optimization process. To make sure the constraint is strictly satisfied, we can do the following post-processing: let us denote the $L_i$'s obtained by optimizing (1) as $L_i(\hat\theta)$. Then, we look for $\gamma^{*} \in [0, L_n(\hat\theta)]$ such that
$$\gamma^{*} = \inf\{\gamma : n!\,\upsilon(L_1(\hat\theta) - \gamma, \cdots, L_n(\hat\theta) - \gamma, 1) \ge 1-\delta,\ \gamma \ge 0\}.$$
Notice there is always a feasible solution, since when $\gamma = L_n(\hat\theta)$,
$$n!\,\upsilon(L_1(\hat\theta) - \gamma, \cdots, L_n(\hat\theta) - \gamma, 1) \ge P\big(\forall i,\ U_{(i)} \ge 0\big) = 1,$$
and $\upsilon(L_1(\hat\theta) - \gamma, \cdots, L_n(\hat\theta) - \gamma, 1)$ is a non-decreasing function of $\gamma$, because lowering every level makes the constraint easier to satisfy. We can use binary search to efficiently find (a good approximation of) $\gamma^{*}$.
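The binary search over $\gamma$ can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the coverage probability is computed with a naive polynomial recursion that is only suitable for small $n$ (it is not the package in [21]), shifted levels are clipped at 0 (an assumption, since a negative level imposes no constraint on a uniform order statistic), and the names `coverage_probability` and `smallest_shift` are hypothetical.

```python
from math import factorial

import numpy as np
from numpy.polynomial import Polynomial

def coverage_probability(levels):
    """n! * upsilon(L_1, ..., L_n, 1) via iterated polynomial integration (small n only)."""
    q = Polynomial([1.0])
    for level in levels:
        antiderivative = q.integ()
        q = antiderivative - antiderivative(level)
    return factorial(len(levels)) * float(q(1.0))

def smallest_shift(levels, delta, tol=1e-8):
    """Approximate the smallest gamma >= 0 for which the shifted levels are feasible."""
    levels = np.asarray(levels, dtype=float)

    def feasible(gamma):
        shifted = np.clip(levels - gamma, 0.0, 1.0)  # clipping at 0 is an assumption
        return coverage_probability(shifted) >= 1.0 - delta

    if feasible(0.0):
        return 0.0
    lo, hi = 0.0, float(levels[-1])   # gamma = L_n is always feasible
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi                          # feasible endpoint of the final bracket

# Toy example (hypothetical levels returned by an optimizer).
levels = np.array([0.3, 0.6, 0.9])
gamma = smallest_shift(levels, delta=0.05)
```

Returning the feasible endpoint `hi` keeps the guarantee intact even though the search stops at a finite tolerance.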

Section: B Other dispersion measures and calculation
$$L(t) = \frac{\int_0^t F^{-1}(p)\,dp}{\int_0^1 F^{-1}(p)\,dp}.$$
We can obtain a lower bound and an upper bound function for the Lorenz curve. Given a $(1-\delta)$-CBP $(\hat F^{\delta}_{n,L}, \hat F^{\delta}_{n,U})$ and $\hat F^{\delta}_{n,L} \succeq 0$, we can construct a lower bound function $L^{\delta}_{L}(t)$:
$$L^{\delta}_{L}(t) = \frac{\int_0^t \hat F^{\delta,-}_{n,U}(p)\,dp}{\int_0^1 \hat F^{\delta,-}_{n,L}(p)\,dp},$$
and an upper bound can be obtained by
$$L^{\delta}_{U}(t) = \frac{\int_0^t \hat F^{\delta,-}_{n,L}(p)\,dp}{\int_0^1 \hat F^{\delta,-}_{n,U}(p)\,dp}.$$
With probability at least $1-\delta$, the true Lorenz curve sits between the upper bound function and the lower bound function for all $t \in [0, 1]$.
The extended Gini family. The Gini coefficient can further give rise to the extended Gini family, which is a family of variability and inequality measures that depends on one parameter, the extended Gini parameter. The definition is as follows.
Definition 6 (The extended Gini family [34]). The extended Gini coefficient is given by
$$G(\nu, X) := -\frac{\nu\,\mathrm{Cov}\big(X, [1 - F(X)]^{\nu-1}\big)}{\mathbb{E}[X]} = 1 - \frac{\nu \int_0^1 (1-p)^{\nu-1} F^{-}(p)\,dp}{\int_0^1 F^{-}(p)\,dp},$$
where $\nu > 0$ is the extended Gini parameter and $\mathrm{Cov}(\cdot, \cdot)$ is the covariance.
For the extended Gini coefficient, choosing different $\nu$'s corresponds to different weighting schemes applied to the vertical distance between the egalitarian line and the Lorenz curve; if $\nu = 2$, it is the standard Gini coefficient.
Given a $(1-\delta)$-CBP $(\hat F^{\delta}_{n,L}, \hat F^{\delta}_{n,U})$ and $\hat F^{\delta}_{n,L} \succeq 0$, we can construct an upper bound for $G$. Let
$$G^{\delta}_{U}(\nu, X) := 1 - \frac{\nu \int_0^1 (1-p)^{\nu-1} \hat F^{\delta,-}_{n,U}(p)\,dp}{\int_0^1 \hat F^{\delta,-}_{n,L}(p)\,dp};$$
then $G^{\delta}_{U}(\nu, X) \ge G(\nu, X)$ with probability at least $1-\delta$.

Section: B.2 Generalized entropy index
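The generalized entropy index [29] is another measure of inequality in a population. Specifically, for a real number $\alpha$, the definition is
$$GE(\alpha, X) := \begin{cases} \dfrac{1}{\alpha(\alpha-1)}\,\mathbb{E}\left[\left(\dfrac{X}{\mathbb{E}X}\right)^{\alpha} - 1\right], & \alpha \ne 0, 1 \\[2ex] \mathbb{E}\left[\dfrac{X}{\mathbb{E}X} \ln\left(\dfrac{X}{\mathbb{E}X}\right)\right], & \text{if } \alpha = 1 \\[2ex] -\mathbb{E}\left[\ln\left(\dfrac{X}{\mathbb{E}X}\right)\right], & \text{if } \alpha = 0. \end{cases}$$
It is not hard to further expand the expressions and write the generalized entropy index as
$$GE(\alpha, X) := \begin{cases} \dfrac{1}{\alpha(\alpha-1)} \displaystyle\int_0^1 \left[\left(\dfrac{F^{-}(p)}{\int_0^1 F^{-}(p)\,dp}\right)^{\alpha} - 1\right] dp, & \alpha \ne 0, 1 \\[2ex] \displaystyle\int_0^1 \dfrac{F^{-}(p)}{\int_0^1 F^{-}(p)\,dp} \ln\left(\dfrac{F^{-}(p)}{\int_0^1 F^{-}(p)\,dp}\right) dp, & \text{if } \alpha = 1 \\[2ex] -\displaystyle\int_0^1 \ln\left(\dfrac{F^{-}(p)}{\int_0^1 F^{-}(p)\,dp}\right) dp, & \text{if } \alpha = 0. \end{cases}$$
Notice that $(\cdot)^{\alpha}$ is a monotonic function for the case $\alpha \ne 0, 1$, and $\ln(\cdot)$ is also a monotonic function, so the bound can be obtained similarly as in the case of the Atkinson index. For instance, for $\alpha > 1$, given a $(1-\delta)$-CBP $(\hat F^{\delta}_{n,L}, \hat F^{\delta}_{n,U})$,
$$\frac{1}{\alpha(\alpha-1)} \int_0^1 \left[\left(\frac{F^{-}(p)}{\int_0^1 F^{-}(p)\,dp}\right)^{\alpha} - 1\right] dp \le \frac{1}{\alpha(\alpha-1)} \int_0^1 \left[\left(\frac{\hat F^{\delta,-}_{n,L}(p)}{\int_0^1 \hat F^{\delta,-}_{n,U}(p)\,dp}\right)^{\alpha} - 1\right] dp.$$
Other cases can be tackled in a similar way, which we will not reiterate here.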

Section: B.3 Hoover index
The Hoover index [16] is equal to the percentage of the total population's income that would have to be redistributed to make all the incomes equal.
Definition 7 (Hoover index). For a non-negative random variable $X$, the Hoover index is defined as
$$H(X) = \frac{\int_0^1 \left|F^{-}(p) - \int_0^1 F^{-}(q)\,dq\right| dp}{2 \int_0^1 F^{-}(p)\,dp}.$$
For the Hoover index and a $(1-\delta)$-CBP $(\hat F^{\delta}_{n,L}, \hat F^{\delta}_{n,U})$, let us define
$$H_U(X) = \frac{\int_0^1 \max\left\{\left|\hat F^{\delta,-}_{n,L}(p) - \int_0^1 \hat F^{\delta,-}_{n,U}(q)\,dq\right|, \left|\hat F^{\delta,-}_{n,U}(p) - \int_0^1 \hat F^{\delta,-}_{n,L}(q)\,dq\right|\right\} dp}{2 \int_0^1 \hat F^{\delta,-}_{n,U}(p)\,dp}.$$
Then, with probability at least $1-\delta$, $H_U(X)$ is an upper bound for $H(X)$.

Section: B.4 Extreme observations & mean range
For example, a city may need to estimate the cost of damage to public amenities due to rain in a certain month. The losses for the days of a month, $X_1, \cdots, X_k$, are drawn i.i.d. from $F$, and the administration hopes to estimate and control the dispersion of the losses in a month so that they can accurately allocate resources. This involves quantities such as the range ($\max_{i \in [k]} X_i - \min_{j \in [k]} X_j$) or quantiles of extreme observations ($\max_{i \in [k]} X_i$). The CDF of extreme observations such as $\max_{i \in [k]} X_i$ involves a nonlinear function of $F$, i.e., $(F(x))^k$.
Example 7 (Quantiles of extreme observations). The CDF of $\max_{i \in [k]} X_i$ is $F^k$. Thus, by the result of Appendix A.1.2, if given a $(1-\delta)$-CBP $(\hat F^{\delta}_{n,L}, \hat F^{\delta}_{n,U})$ and $1 \succeq \hat F^{\delta}_{n,U} \succeq \hat F^{\delta}_{n,L} \succeq 0$, with probability at least $1-\delta$, $(\hat F^{\delta}_{n,L})^k \preceq F^k \preceq (\hat F^{\delta}_{n,U})^k$. For the corresponding quantile functions, we also have $\big((\hat F^{\delta}_{n,U})^k\big)^{-} \preceq (F^k)^{-} \preceq \big((\hat F^{\delta}_{n,L})^k\big)^{-}$. Similarly, for $\min_{i \in [k]} X_i$, the CDF is $1 - (1 - F)^k$; thus, we have $\big(1 - (1 - \hat F^{\delta}_{n,U})^k\big)^{-} \preceq \big(1 - (1 - F)^k\big)^{-} \preceq \big(1 - (1 - \hat F^{\delta}_{n,L})^k\big)^{-}$.
We also want to emphasize that, even if $X$ is not necessarily non-negative, we can apply the polynomial method in Appendix A.1.2 to $\hat F^{\delta}_{n,U}$ and $\hat F^{\delta}_{n,L}$.
Example 8 (Mean range). By [13], if we further have prior knowledge that $X$ has a continuous distribution, the mean of $\max_{i \in [k]} X_i - \min_{j \in [k]} X_j$ can be expressed as
$$k \int F^{-}(x)\big[F^{k-1}(x) - F^{k}(x)\big]\,dF(x) = k \int_0^1 F^{-}(F^{-}(p))\big[F^{k-1}(F^{-}(p)) - F^{k}(F^{-}(p))\big]\,dp.$$
Notice that both $F$ and $F^{-}$ are increasing. Thus, if given a $(1-\delta)$-CBP $(\hat F^{\delta}_{n,L}, \hat F^{\delta}_{n,U})$ with $\hat F^{\delta}_{n,L} \succeq 0$, then with probability at least $1-\delta$,
$$\int_0^1 \hat F^{\delta,-}_{n,L}\big(\hat F^{\delta,-}_{n,L}(p)\big)\Big[(\hat F^{\delta}_{n,U})^{k}\big(\hat F^{\delta,-}_{n,L}(p)\big) - (\hat F^{\delta}_{n,L})^{k}\big(\hat F^{\delta,-}_{n,U}(p)\big)\Big]\,dp$$
is an upper bound of the mean range.
There are many other interesting societal dispersion measures that could be handled by our framework, such as those in [19]. For example, they study the tail share, which captures statements such as "the top 1% of people own X share of wealth", and which could be easily handled with the tools provided here. We leave those examples to readers.

Section: C Extension to multi-dimensional cases and applications
We briefly discuss extending our approach to multi-dimensional losses. Unfortunately, there is not a gold-standard definition of quantiles in the multi-dimensional case, and thus we only discuss functionals of CDFs and provide an example. For multi-dimensional samples $\{X^i\}_{i=1}^{n}$, each of $k$ dimensions, i.e., $X^i = (X^i_1, \cdots, X^i_k)$, and for any $k$-dimensional vector $x = (x_1, \cdots, x_k)$, define the empirical CDF
$$\hat F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{X^i \preceq x\},$$
where we abuse the notation $\preceq$ to mean that every coordinate of $X^i$ is no larger than the corresponding coordinate of $x$.
By the classic DKW inequality, we have with probability at least $1-\delta$,
$$|\hat F_n(x) - F(x)| \le \sqrt{\frac{\ln(k(n+1)/\delta)}{2n}}.$$
Meanwhile, we can further adopt the Fréchet-Hoeffding bound, which gives
$$\max\Big\{1 - k + \sum_{i=1}^{k} F_i(x_i),\ 0\Big\} \le F(x) \le \min\{F_1(x_1), \cdots, F_k(x_k)\},$$
where $F_i$ is the CDF of the $i$-th coordinate. Then, we can construct $(\hat F^{\delta/k,i}_{n,L}, \hat F^{\delta/k,i}_{n,U})$ such that $\hat F^{\delta/k,i}_{n,L} \preceq F_i \preceq \hat F^{\delta/k,i}_{n,U}$ with probability at least $1-\delta/k$. Thus, by the union bound,
$$\max\Big\{1 - k + \sum_{i=1}^{k} \hat F^{\delta/k,i}_{n,L}(x_i),\ 0\Big\} \le F(x) \le \min\{\hat F^{\delta/k,1}_{n,U}(x_1), \cdots, \hat F^{\delta/k,k}_{n,U}(x_k)\}$$
for all $x$ with probability at least $1-\delta$.
Combining the two, we have
$$F(x) \ge \max\Big\{1 - k + \sum_{i=1}^{k} \hat F^{\delta/k,i}_{n,L}(x_i),\ 0,\ \hat F_n(x) - \sqrt{\frac{\ln(k(n+1)/\delta)}{2n}}\Big\},$$
$$F(x) \le \min\Big\{\hat F^{\delta/k,1}_{n,U}(x_1), \cdots, \hat F^{\delta/k,k}_{n,U}(x_k),\ \hat F_n(x) + \sqrt{\frac{\ln(k(n+1)/\delta)}{2n}}\Big\}$$
with probability at least $1-2\delta$.
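At a single point $x$, the combined bound only needs the $k$ marginal CDF bounds evaluated at the coordinates of $x$, the empirical joint CDF at $x$, and the DKW-type margin. The Python sketch below assumes the margin has the square-root form $\sqrt{\ln(k(n+1)/\delta)/(2n)}$; the function name `joint_cdf_bounds` and the numbers in the example are hypothetical.

```python
import numpy as np

def joint_cdf_bounds(marg_lo, marg_hi, emp_cdf, n, delta):
    """Lower and upper bounds on the joint CDF F(x) at one point x.

    marg_lo[i], marg_hi[i]: bounds on the i-th marginal CDF at x_i.
    emp_cdf: empirical joint CDF at x computed from n samples.
    """
    k = len(marg_lo)
    eps = np.sqrt(np.log(k * (n + 1) / delta) / (2.0 * n))  # DKW-type margin (assumed form)
    lower = max(1.0 - k + float(np.sum(marg_lo)), 0.0, emp_cdf - eps)
    upper = min(float(np.min(marg_hi)), emp_cdf + eps)
    return lower, upper

# Toy example with k = 2 coordinates (hypothetical values).
lower, upper = joint_cdf_bounds([0.80, 0.85], [0.90, 0.95], emp_cdf=0.75, n=1000, delta=0.05)
```

Taking the maximum of the lower bounds and the minimum of the upper bounds means the combined interval is never wider than either ingredient alone.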
Example 9 (Gini correlation coefficient [34]). The Gini correlation coefficient for two non-negative random variables $X$ and $Y$ is defined as
$$\Gamma_{X,Y} := \frac{\mathrm{Cov}(X, F_Y(Y))}{\mathrm{Cov}(X, F_X(X))} = \frac{\iint \big(F_{X,Y}(x, y) - F_X(x) F_Y(y)\big)\,dx\,dF_Y(y)}{\mathrm{Cov}(X, F_X(X))},$$
where $F_X, F_Y$ are the marginal CDFs of $X, Y$ and $F_{X,Y}$ is the joint CDF. One can use the multi-dimensional CDF bounds and our previous methods to provide bounds for the Gini correlation coefficient.

Section: D Experiment details
This section contains additional details for the experiments in Section 5. We set δ = 0.05 (before statistical corrections for multiple tests) in all experiments unless otherwise explicitly stated. Whenever we are bounding measures on multiple hypotheses, we perform a correction for the size of the hypothesis set. Additionally, when we bound measures on multiple distributions (e.g. demographic groups), we also perform a correction. Our code will be released publicly upon the publication of this article.
D.1 CivilComments (Section 5.1)
Each hypothesis is a toxicity model combined with a Platt scaler [22], where the model is fixed and we vary the scaling parameter in the range [0.25, 2] while fixing the bias term to 0. We use a pre-trained toxicity model from the popular Python library Detoxify [10] and perform Platt scaling using code from the Python library released by [18]. A Platt calibrator produces output according to
$$h(v) = \frac{1}{1 + \exp(wv + b)},$$
where $w, b$ are learnable parameters and $v$ is the log odds of the prediction. Thus we form our hypothesis set by varying the parameter $w$ while fixing $b$ to 0. Examples are drawn from the train split of CivilComments, which totals 269,038 data points.
The loss metric for our CivilComments experiments is the Brier score. For $n$ data points, the Brier score is calculated as
$$L = \frac{1}{n} \sum_{i=1}^{n} (f_i - o_i)^2,$$
where $f_i$ is the prediction confidence and $o_i$ is the outcome (0 or 1).
D.1.1 Bounding complex objectives (Section 5.1.1)
We randomly sample 100,000 test points for calculating the empirical values in Table 1, and draw our validation points from the remaining data. We perform a Bonferroni correction on δ = 0.05 for the size of the set of hypotheses as well as the number of distributions on which we bound our measures (in this case the number of groups, 4). We set λ = 1.0.
Numerical optimization details (including training strategy and hyperparameters) are the same as in Section 5.1.2, explained below in Appendix D.1.2. For group $g$ we optimize the objective
$$O = T_1(F_g) + T_2(F_g),$$
where $F_g$ is the CDF bound for group $g$, $T_1$ is expected loss, and $T_2$ is a smoothed version of a median with $a = 0.01$ (see Appendix D.1.2 and Figure 5).
For comparison, the DKW inequality is applied to get a CDF lower bound, which is then transformed to an upper bound via the reduction approach in Section 4. We parameterize the bounds with a fully connected network with 3 hidden layers of dimension 64.
The $n$ Gaussian seeds are of size 32, which is also the input dimension for the network. Training is performed in two stages, where the network is first trained to approximate a Berk-Jones bound, and then optimized for some specified objective $O$. In both stages of training we aim to push the training error to zero or as close as possible (i.e., "overfit"), since we are optimizing a bound and do not seek generalization. The model is first trained for 100,000 epochs to output the Berk-Jones bound using a mean-squared error loss. Then optimization on $O$ is performed for a maximum of 10,000 epochs, and validation is performed every 25 epochs, where we choose the best model according to the bound on $O$. Both stages of optimization use the Adam optimizer with a learning rate of 0.00005, and for the second stage the constraint weight is set to $\lambda = 0.00005$. We perform post-processing to ensure the constraint holds (see Section A.2.2). For some denominator $m$ (in our case $m = 10^6$) we set $\gamma = \frac{1}{m}, \frac{2}{m}, \frac{3}{m}, \dots$ and check the constraint until it is satisfied. This approach is applied to both the experiments in Section 5.1.1 and Section 5.1.2. Details on the objective for Section 5.1.1 are above in Appendix D.1.1. In Section 5.1.2, we set $\delta = 0.01$ and our metrics for optimization are described below:

Section: CVaR
CVaR is a measure of the expected loss for the items at or above some quantile level $\beta$. We set $\beta = 0.75$, and thus we bound the expected loss for the worst-off 25% of the population.

Section: VaR-Interval
In the event that different stakeholders are interested in the VaR for different quantile levels $\beta$, we may want to select a bound based on some interval of the VaR $[\beta_{\min}, \beta_{\max}]$. We perform our experiment with $\beta_{\min} = 0.5$, $\beta_{\max} = 0.9$, which covers the median ($\beta = 0.5$) through the worst-case loss excluding a small batch of outliers ($\beta = 0.9$).

Section: Quantile-Weighted
We apply the weighting function $\psi(p) = p$ to the quantile loss, such that the loss incurred by the worst-off members of a population is weighted more heavily.

Section: Smoothed Median
We study a more robust version of a median:
$$\psi(p; \beta) = \frac{1}{a\sqrt{\pi}} \exp\left(-\frac{(p - \beta)^2}{a^2}\right)$$
with $\beta = 0.5$ and $a = 0.01$, similar to a normal distribution extremely concentrated around its mean. See Figure 5 for an illustration of such a weighting.
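As a quick sanity check of this weighting, the Python sketch below verifies numerically that $\psi$ integrates to approximately 1 over $[0, 1]$ and that the weighted quantile integral is close to the ordinary median. The function name `smoothed_median_weight` and the choice of an Exponential(1) quantile function as a test case are assumptions for illustration only.

```python
import numpy as np

def smoothed_median_weight(p, beta=0.5, a=0.01):
    """Weight psi(p; beta) = exp(-(p - beta)^2 / a^2) / (a * sqrt(pi))."""
    return np.exp(-((p - beta) ** 2) / a**2) / (a * np.sqrt(np.pi))

# Midpoint grid on (0, 1); means over the grid approximate integrals over [0, 1].
m = 200_000
p = (np.arange(m) + 0.5) / m
psi = smoothed_median_weight(p)
total_mass = psi.mean()

# Hypothetical test case: F^-(p) = -ln(1 - p) for an Exponential(1) loss,
# whose true median is ln 2.
quantile_fn = -np.log1p(-p)
smoothed_median = np.mean(psi * quantile_fn)
```

Because the weight is smooth in $p$, the resulting functional varies smoothly with the quantile levels, which suits the gradient-based optimization described above better than a single-quantile functional would.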

Section: D.2 Bounds on standard measures (Section 5.2)
This section contains additional details for the experiments in Section 5.2. where Y is the set of ground truth labels (which in this experiment will always be one label), Ŷ is a set of predictions, and k is the number of classes.

Section: D.2.2 MovieLens-1M (Section 5.2.2)
MovieLens-1M [12] is a publicly available dataset. We filter out all ratings below 5 stars, a typical pre-processing step, and filter out any users with fewer than 15 five-star ratings, leaving us with 4050 users. For each user, the 5 most recently watched items are added to the test set, while the remaining (earlier) items are added to the train set. We train a user/item embedding model using the popular Python recommender library LightFM with a WARP ranking loss for 30 epochs and an embedding dimension of 16.
For recommendation set Î we compute a loss combining recall and precision against a user test set I of size k: L = αl r ( Î, I)   In both cases the optimized bounds are tightest on both the target metric as well as the mean, illustrating the power of adaptation both to particular quantile ranges as well as real loss distributions.

Section: Acknowledgements
We thank the Google Cyber Research Program and ONR (Award N00014-23-1-2436) for their generous support. J. Snell gratefully acknowledges financial support from the Schmidt DataX Fund at Princeton University made possible through a major gift from the Schmidt Futures Foundation.


References:
[b0] Anastasios N Angelopoulos; Stephen Bates; Emmanuel J Candès; Michael I Jordan; Lihua Lei (2021-11). Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control. 
[b1]  Anthony B Atkinson (1970). On the Measurement of Inequality. Journal of Economic Theory
[b2] Stephen Bates; Anastasios Angelopoulos; Lihua Lei; Jitendra Malik; Michael I Jordan (2021-08). Distribution-Free, Risk-Controlling Prediction Sets. 
[b3] Mohammed Berkhouch; Ghizlane Lakhnati; Marcelo Brutti Righi (2019). Spectral Risk Measures and Uncertainty. 
[b4] Neil Bhutta; Andrew Chang; Lisa Dettling; Joanne Hsu (2020). Disparities in Wealth by Race and Ethnicity in the 2019 Survey of Consumer Finances. FEDS Notes
[b5] Daniel Borkan; Lucas Dixon; Jeffrey Sorensen; Nithum Thain; Lucy Vasserman (2019). Nuanced metrics for measuring unintended bias with real data for text classification. 
[b6] Virginie Do; Sam Corbett-Davies; Jamal Atif; Nicolas Usunier (2021). Two-Sided Fairness in Rankings via Lorenz Dominance. Advances in Neural Information Processing Systems
[b7] Kevin Dowd; David Blake (2006-06). After VaR: The Theory, Estimation, and Insurance Applications of Quantile-Based Risk Measures. Journal of Risk & Insurance
[b8] Chengyue Gong; Xingchao Liu (2021). Bi-Objective Trade-off with Dynamic Barrier Gradient Descent. 
[b9] Laura Hanu (2020). Detoxify. Github. 
[b10] Moritz Hardt; Eric Price; Nati Srebro (2016). Equality of Opportunity in Supervised Learning. Advances in Neural Information Processing Systems
[b11] F Maxwell Harper; Joseph A Konstan (2015). The Movielens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems
[b12] H O Hartley; H A David (1954). Universal Bounds for Mean Range and Extreme Observation. The Annals of Mathematical Statistics
[b13] Ursula Hébert-Johnson; Michael Kim; Omer Reingold; Guy Rothblum (2018). Multicalibration: Calibration for the (Computationally-Identifiable) Masses. PMLR
[b14] Michael Kearns; Seth Neel; Aaron Roth; Zhiwei Steven Wu (2018). Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness. PMLR
[b15] Bruce P Kennedy; Ichiro Kawachi; Deborah Prothrow-Stith (1996). Income Distribution and Mortality: Cross-Sectional Ecological Study of the Robin Hood Index in the United States. British Medical Journal
[b16] Pang Wei Koh; Shiori Sagawa; Henrik Marklund; Sang Michael Xie; Marvin Zhang; Akshay Balsubramani; Weihua Hu; Michihiro Yasunaga; Richard Lanas Phillips; Irena Gao; Tony Lee; Etienne David; Ian Stavness; Wei Guo; Berton A Earnshaw; Imran S Haque; Sara Beery; Jure Leskovec; Anshul Kundaje; Emma Pierson; Sergey Levine; Chelsea Finn; Percy Liang (2021). WILDS: A Benchmark of in-the-Wild Distribution Shifts. 
[b17] Ananya Kumar; Percy S Liang; Tengyu Ma (2019). Verified Uncertainty Calibration. Advances in Neural Information Processing Systems
[b18] Tomo Lazovich; Luca Belli; Aaron Gonzales; Amanda Bower; Uthaipon Tantipongpipat; Kristian Lum; Ferenc Huszar; Rumman Chowdhury (2022). Measuring Disparate Outcomes of Content Recommendation Algorithms with Distributional Inequality Metrics. Patterns
[b19] Jean-Yves Le Boudec (2005-11-04). Rate Adaptation, Congestion Control and Fairness: A Tutorial. Web Page
[b20] Amit Moscovich; Boaz Nadler; Clifford Spiegelman (2016). On the Exact Berk-Jones Statistics and Their p-Value Calculation. Electronic Journal of Statistics
[b21] John Platt (1999). Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. Advances in Large Margin Classifiers
[b22] R Tyrrell Rockafellar; Stanislav Uryasev (2000). Optimization of Conditional Value-at-Risk. Journal of Risk
[b23] Yaniv Romano; Rina Foygel Barber; Chiara Sabatti; Emmanuel Candès (2020). With Malice Toward None: Assessing Uncertainty via Equalized Coverage. Harvard Data Science Review
[b24] Yaniv Romano; Stephen Bates; Emmanuel J Candès (2020-06). Achieving Equalized Odds by Resampling Sensitive Attributes. 
[b25] Halsey Lawrence Royden; Patrick Fitzpatrick (1988). Real Analysis. Macmillan
[b26] Glenn Shafer; Vladimir Vovk (2008). A Tutorial on Conformal Prediction. Journal of Machine Learning Research
[b27] Galen R Shorack; Jon A Wellner (2009-01). Empirical Processes with Applications to Statistics. 
[b28] Anthony F Shorrocks (1980). The Class of Additively Decomposable Inequality Measures. Econometrica: Journal of the Econometric Society
[b29] Jake C Snell; Thomas P Zollo; Zhun Deng; Toniann Pitassi; Richard Zemel (2023). Quantile Risk Control: A Flexible Framework for Bounding the Probability of High-Loss Predictions. 
[b30] James Taylor; Berton Earnshaw; Ben Mabey; Mason Victors; Jason Yosinski (2019). Rxrx1: An Image Set for Cellular Morphological Variation Across Many Experimental Batches. 
[b31] Robert Williamson; Aditya Menon (2019). Fairness Risk Measures. PMLR
[b32] Shlomo Yitzhaki (1979). Relative Deprivation and the Gini Coefficient. The Quarterly Journal of Economics
[b33] Shlomo Yitzhaki; Edna Schechtman (2013). The Gini Methodology: a Primer on a Statistical Methodology. Springer

Figures:
Figure fig_0: 2
Type: figure
Caption: Figure 2: Left: Bounds on the expected loss, scaled Gini coefficient, and total objective across different hypotheses. Right: Lorenz curves induced by choosing a hypothesis based on the expected loss bound versus the bound on the total objective. The y-axis shows the cumulative share of the loss that is incurred by the best-off $\beta$ proportion of the population, where a perfectly fair predictor would produce a distribution along the line $y = x$. The plot on the left shows how the bounds on the expected loss $T_1$, scaled Gini coefficient $\lambda T_2$, and total objective $L$ vary across the different hypotheses (i.e. model and threshold combination for producing prediction sets). The bold points indicate the optimal threshold choice for each quantity. The plot on the right shows the Lorenz curves (a typical graphical expression of Gini) of the loss distributions induced by choosing a hypothesis based on the expected loss bound versus the bound on the total objective. Incorporating the bound on the Gini coefficient in hypothesis selection leads to a more equal loss distribution. Taken together, these figures illustrate how the ability to bound a non-group based
Data: 

Figure fig_1: 3
Type: figure
Caption: Figure 3: We select two hypotheses $h_0$ and $h_1$ with different bounds on the Atkinson index produced using 2000 validation samples, and once again visualize the Lorenz curves induced by each. Tighter control on the Atkinson index leads to a more equal distribution of the loss (especially across the middle of the distribution, which aligns with the choice of $\epsilon$), highlighting the utility of being able to target such a metric in conservative model selection.
Data: 

Figure fig_2: 
Type: figure
Caption: For any two non-decreasing functions $G(p)$ and $C(p)$, by the definition of the general inverse function, $G(G^{-}(p)) \ge p$. If $C \succeq G$, we therefore have $C(G^{-}(p)) \ge G(G^{-}(p)) \ge p$. Applying $C^{-}$ to both sides yields $C^{-}(C(G^{-}(p))) \ge C^{-}(p)$. But $x \ge C^{-} \circ C(x)$ (see e.g. Proposition 3 on p. 6 of [28]) and thus $G^{-}(p) \ge C^{-}(p)$. Plugging in $F$ and $F_U$ as $G$ and $C$ yields $F^{-} \succeq F_U^{-}$. The other direction is similar.
Data: 

Figure fig_3: 4
Type: figure
Caption: Figure 4: Example illustrating the construction of distribution-free CDF lower and upper bounds by bounding order statistics. On the left, order statistics are drawn from a uniform distribution. On the right, samples are drawn from a real loss distribution, and the corresponding Berk-Jones CDF lower and upper bounds are shown in black. Our distribution-free method gives bounds $b_i^{(l)}$ and $b_i^{(u)}$ on each sorted order statistic such that the bound depends only on $i$, as illustrated in the plots for $i = 5$ (shown in blue). On the left, 1000 realizations of $x_{(1)}, \ldots, x_{(n)}$ are shown in yellow. On the right, 1000 empirical CDFs are shown in yellow, and the true CDF $F$ is shown in red.
Data: 

Figure fig_4: 1
Type: figure
Caption: B.1 Lorenz curve & the extended Gini family. Lorenz curve. In the main text, the Lorenz curve has been mentioned in reference to the Gini coefficient and Atkinson index. For completeness, we give its definition in mathematical form. Definition 5 (Lorenz curve). The Lorenz curve is the function defined, for $t \in [0, 1]$, by $L(t) = \frac{\int_0^t F^{-1}(p)\, dp}{\int_0^1 F^{-1}(p)\, dp}$.
Data: 

Figure fig_6: 11
Type: figure
Caption: The Hoover index involves forms like $|F^{-} - \mu|$ for $\mu = \int_0^1 F^{-}(p)\, dp$. This type of nonlinear structure can be dealt with using the absolute function results mentioned in Appendix A.1.2.
Data: 

Figure fig_7: 112
Type: figure
Caption: To get the lower bound $b^{l}_{1:n}$, we set: Numerical optimization examples (Section 5.1.2)
Data: 

Figure fig_8: 52
Type: figure
Caption: Figure 5: Plot of smoothed median function with $\beta = 0.5$ and $a = 0.01$
Data: 

Figure fig_9: 211
Type: figure
Caption: $L = \alpha\, l_r(\hat{I}, I)^2 + (1 - \alpha)\, l_p(\hat{I}, I)^2$, where $l_r(\hat{I}, I) = 1 - \frac{1}{k} \sum_{i \in I} \mathbb{1}\{i \in \hat{I}\}$ and $l_p(\hat{I}, I) = 1 - \frac{1}{|\hat{I}|} \sum_{i \in \hat{I}} \mathbb{1}\{i \in I\}$, with $\alpha = 0.5$. We randomly sample 1500 users for validation, and use the remaining users to plot the empirical distributions. The 100 hypotheses tested are evenly spaced between the minimum and maximum scores of any user/item pair in the score matrix.
Data: 

Figure fig_10: 6
Type: figure
Caption: Figure 6 compares the learned bounds $G_{\mathrm{opt}}$ to the Berk-Jones ($G_{\mathrm{BJ}}$) and Truncated Berk-Jones ($G_{\mathrm{BJ\text{-}t}}$) bounds, as well as the empirical CDF of the real loss distribution.
Data: 

Figure fig_11: 6
Type: figure
Caption: Figure 6: Learning tighter bounds on functionals of interest for protected groups. On the left, a bound is optimized for CVaR with $\beta = 0.75$, and on the right a bound is optimized for the VaR Interval $[0.5, 0.9]$. In both cases the optimized bounds are tightest on both the target metric as well as the mean, illustrating the power of adaptation both to particular quantile ranges as well as real loss distributions.
Data: 

Figure tab_0: 1
Type: table
Caption: Applying our full framework to control an objective considering expected group loss as well as a maximum difference in group medians for n = 100 and n = 200 samples.
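Data:
| Method         | Exp. Grp. (n = 100) | Max Diff. (n = 100) | Total (n = 100) | Exp. Grp. (n = 200) | Max Diff. (n = 200) | Total (n = 200) |
| DKW            | 0.36795 | 0.90850 | 1.27645 | 0.32236 | 0.96956 | 1.29193 |
| BJ             | 0.34532 | 0.07549 | 0.42081 | 0.31165 | 0.00666 | 0.31831 |
| NN-Opt. (ours) | 0.32669 | 0.01612 | 0.34281 | 0.30619 | 0.00292 | 0.30911 |
| Empirical      | 0.20395 | 0.00004 | 0.20399 | 0.20148 | 0.00010 | 0.20158 |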

Figure tab_1: 2
Type: table
Caption: Optimizing bounds on measures for protected groups.
Data:
| Method               | CVaR    | VaR-Interval | Quantile-Weighted | Smoothed-Median |
| Berk-Jones           | 0.91166 | 0.38057      | 0.19152           | 0.00038         |
| Truncated Berk-Jones | 0.86379 | 0.34257      | -                 | -               |
| NN-Opt. (ours)       | 0.85549 | 0.32656      | 0.17922           | 0.00021         |


Formulas:
Formula formula_0: $R_\psi(F) := \int_0^1 \psi(p)\, F^{-}(p)\, dp.$

Formula formula_1: $\int_0^1 F^{-}(p)\, dp > 0.$

Formula formula_2: $G(X) := \frac{\mathbb{E}|X - X'|}{2\,\mathbb{E}X} = \frac{\int_0^1 (2p - 1)\, F^{-}(p)\, dp}{\int_0^1 F^{-}(p)\, dp}$
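
As a rough check of the two forms of the Gini coefficient above, the sketch below evaluates both on the empirical distribution of a sample. It only computes the plug-in value and does not produce the distribution-free bound; the function names are hypothetical. For sorted values $x_{(1)} \le \cdots \le x_{(n)}$, the quantile-integral form reduces to $\sum_i (2i - n - 1)\, x_{(i)} / (n \sum_i x_{(i)})$, because $F^{-}$ is constant on each interval $((i-1)/n, i/n]$.

```python
def gini_quantile(xs):
    """Gini coefficient via the quantile-integral form, for the empirical CDF."""
    xs = sorted(xs)
    n, total = len(xs), sum(xs)
    if total <= 0:
        raise ValueError("the mean must be positive")
    return sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1)) / (n * total)


def gini_pairwise(xs):
    """Gini coefficient via E|X - X'| / (2 E X), with X, X' drawn from the sample."""
    n = len(xs)
    mean = sum(xs) / n
    mean_abs_diff = sum(abs(a - b) for a in xs for b in xs) / n ** 2
    return mean_abs_diff / (2 * mean)


# All loss concentrated on one of four individuals.
concentrated = gini_quantile([0, 0, 0, 1])
```

The quantile form costs one sort, while the pairwise form is quadratic in the sample size; the two agree on any sample with positive mean.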

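Formula formula_3: $A(\varepsilon, X) := 1 - \frac{\left(\mathbb{E}[X^{1-\varepsilon}]\right)^{\frac{1}{1-\varepsilon}}}{\mathbb{E}[X]} = 1 - \frac{\left(\int_0^1 (F^{-}(p))^{1-\varepsilon}\, dp\right)^{\frac{1}{1-\varepsilon}}}{\int_0^1 F^{-}(p)\, dp}.$

Formula formula_4: $|T(F_g) - T(F_{g'})|$ or $[T(F_g) - T(F_{g'})]$

Formula formula_5: $D_{CV,\alpha}(\mu_g) := \mathrm{CVaR}_{\alpha, P_{\mathrm{Idx}}}\big(\mu_g - \mathbb{E}_{g \sim P_{\mathrm{Idx}}}[\mu_g]\big).$

Formula formula_6: $D_{CV,\alpha}(T(F_g)) = \min_{\rho \in \mathbb{R}} \left\{ \rho + \frac{1}{1-\alpha} \cdot \mathbb{E}_{g \sim P_{\mathrm{Idx}}}[T(F_g) - \rho]_{+} \right\} - \mathbb{E}_{g \sim P_{\mathrm{Idx}}}[T(F_g)].$

Formula formula_7: $\rho_\xi(P_{\mathrm{Idx}}) := \mathbb{E}_{g \sim P_{\mathrm{Idx}}}[\xi(T(F_g))]$

Formula formula_8: $\mathbb{E}_{g \sim P_{\mathrm{Idx}}}\big(T(F_g) - \mathbb{E}_{g \sim P_{\mathrm{Idx}}} T(F_g)\big)^2$; and $\mathbb{E}_\psi[\xi(F^{-}(\alpha))] := \int_0^1 \xi(F^{-}(\alpha)$

Formula formula_9: Proposition 1. For the CDF $F$ of $X$, if there exist two CDFs $F_U$, $F_L$ such that $F_U \succeq F \succeq F_L$, then we have $F_L^{-} \succeq F^{-} \succeq F_U^{-}$.

Formula formula_10: $P(\hat{F}^{\delta}_{n,U} \succeq F \succeq \hat{F}^{\delta}_{n,L}) \ge 1 - \delta.$

Formula formula_11: $F(X_{(i)}) \ge s_i^{-}(s_\delta)) \ge 1 - \delta$, for $s_\delta = \inf_r \{r : P(S \ge r) \ge 1 - \delta\}$.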

Formula formula_12: Lemma 1. For $0 \le L_1 \le L_2 \le \cdots \le L_n \le 1$, if $P(\forall i : F(X_{(i)}) \ge L_i) \ge 1 - \delta$, then we have $P(\forall i : \lim_{\epsilon \to 0^+} F(X_{(i)} - \epsilon) \le 1 - L_{n-i+1}) \ge 1 - \delta$. Furthermore, let $R(x)$ be defined as $1 - L_n$ if $x < X_{(1)}$; $1 - L_{n-i}$ if $X_{(i)} \le x < X_{(i+1)}$ for $i \in \{1, 2, \cdots, n-1\}$; $1$ if $X_{(n)} \le x$. Then, $F \preceq R$.

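Formula formula_13: $\hat{F}^{\delta}_{n,U} \succeq F \succeq \hat{F}^{\delta}_{n,L}$; then further by Proposition 1, we have that $\xi(\hat{F}^{\delta,-}_{n,L}) \succeq \xi(F^{-}) \succeq \xi(\hat{F}^{\delta,-}_{n,U}$

Formula formula_14: $F^{-}$. Example 1 (Gini coefficient). If given a $(1-\delta)$-CBP $(\hat{F}^{\delta}_{n,L}, \hat{F}^{\delta}_{n,U}$

Formula formula_15: $G(X) = \frac{\int_0^1 (2p - 1)\, F^{-}(p)\, dp}{\int_0^1 F^{-}(p)\, dp} = \frac{\int_0^1 2p\, F^{-}(p)\, dp}{\int_0^1 F^{-}(p)\, dp} - 1.$

Formula formula_16: $G(X) \le \frac{\int_0^1 2p\, \hat{F}^{\delta,-}_{n,L}(p)\, dp}{\int_0^1 \hat{F}^{\delta,-}_{n,U}(p)\, dp} - 1,$

Formula formula_17: $\phi(s) = \sum_{k=0} \alpha_k s^k$, if $k$ is odd, $s^k$ is monotonic w.r.t. $s$; if $k$ is even, $s^k = |s|^k$.

Formula formula_18: Example 2. If we have $(T^{\delta}_L(F_g), T^{\delta}_U(F_g))$ such that $T^{\delta}_L(F_g) \le T(F_g) \le T^{\delta}_U(F_g$

Formula formula_19: $\xi(T(F_{g_1}) - T(F_{g_2}))$

Formula formula_20: $|T(F_{g_1}) - T(F_{g_2})| \le \max\{|T^{\delta}_U(F_{g_1}) - T^{\delta}_L(F_{g_2})|,\ |T^{\delta}_L(F_{g_1}) - T^{\delta}_U(F_{g_2})|\}.$

Formula formula_21: Theorem 1. For $(\hat{F}^{\delta}_{n,L}, \hat{F}^{\delta}_{n,U}$

Formula formula_22: $1 - \delta$, $\xi(F^{-}) \preceq f_1(\hat{F}^{\delta,-}_{n,L}) - f_2(\hat{F}^{\delta,-}_{n,U}).$

Formula formula_23: $\int_0^1 \xi(F^{-}(\alpha))\, \psi(\alpha)\, d\alpha$

Formula formula_24: $\int_0^1 \xi(T(F^{-}(\alpha)))\, \psi(\alpha)\, d\alpha$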

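Formula formula_25: $0 \le L_1 \le \cdots \le L_n \le 1$, we have $P\big(\forall i,\ F(X_{(i)}) \ge L_i\big) \ge n! \int_{L_n}^{1} dx_n \int_{L_{n-1}}^{x_n} dx_{n-1} \cdots \int_{L_1}^{x_2} dx_1$, where the right-hand side integral is a function of $\{L_i\}_{i=1}^{n}$

Formula formula_26: $\int_0^1 \psi(p)\, F^{-}(p)\, dp$ as an example. For any $\{L_i\}_{i=1}^{n}$ satisfying $P\big(\forall i,\ F(X_{(i)}) \ge L_i\big) \ge 1 - \delta$, one can use conservative CDF completion to obtain $\hat{F}^{\delta}_{n,L}$, i.e. $\int_0^1 \psi(p)\, \xi(\hat{F}^{\delta,-}_{n,L}(p))\, dp = \sum_{i=1}^{n+1} \xi(X_{(i)}) \int_{L_{i-1}}^{L_i} \psi(p)\, dp,$

Formula formula_27: $\min_{\{L_i\}_{i=1}^{n}} \sum_{i=1}^{n+1} \xi(X_{(i)}) \int_{L_{i-1}}^{L_i} \psi(p)\, dp$ such that $P\big(\forall i,\ F(X_{(i)}) \ge L_i\big) \ge 1 - \delta$, and $0 \le L_1 \le \cdots \le L_n \le 1$.

Formula formula_28: $\gamma^{*} = \inf\{\gamma : n!\, \upsilon(L_1(\hat{\theta}) - \gamma, \cdots, L_n(\hat{\theta}) - \gamma, 1) \ge 1 - \delta,\ \gamma \ge 0\}.$

Formula formula_29: $L = \mathbb{E}_g[T_1(F_g)] + \lambda \sup_{g, g'} |T_2(F_g) - T_2(F_{g'})|$

Formula formula_30: $P(\hat{F}^{\delta,-}_{n,L} \preceq F \preceq \hat{F}^{\delta,-}_{n,U}$

Formula formula_31: $F_U$, $F_L$ such that $F_U \succeq F \succeq F_L$, then we have $F_L^{-} \succeq F^{-} \succeq F_U^{-}$.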

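Formula formula_32: $\xi(\hat{F}^{\delta,-}_{n,L}) \succeq \xi(F^{-}) \succeq \xi(\hat{F}^{\delta,-}_{n,U})$ with probability at least $1 - \delta$. Similarly, if $\xi$ is a decreasing function, then $\xi(\hat{F}^{\delta,-}_{n,L}) \preceq \xi(F^{-}) \preceq \xi(\hat{F}^{\delta,-}_{n,U}$

Formula formula_33: Example 3 (Gini coefficient). If given a $(1-\delta)$-CBP $(\hat{F}^{\delta}_{n,L}, \hat{F}^{\delta}_{n,U}$

Formula formula_34: $G(X) = \frac{\int_0^1 (2p - 1)\, F^{-}(p)\, dp}{\int_0^1 F^{-}(p)\, dp} = \frac{\int_0^1 2p\, F^{-}(p)\, dp}{\int_0^1 F^{-}(p)\, dp} - 1.$

Formula formula_35: $G(X) \le \frac{\int_0^1 2p\, \hat{F}^{\delta,-}_{n,L}(p)\, dp}{\int_0^1 \hat{F}^{\delta,-}_{n,U}(p)\, dp} - 1,$

Formula formula_36: $A(\varepsilon, X) := \begin{cases} 1 - \frac{\left(\int_0^1 (F^{-}(p))^{1-\varepsilon}\, dp\right)^{\frac{1}{1-\varepsilon}}}{\int_0^1 F^{-}(p)\, dp}, & \text{if } \varepsilon \ge 0,\ \varepsilon \ne 1; \\ 1 - \frac{\exp\left(\int_0^1 \ln(F^{-}(p))\, dp\right)}{\int_0^1 F^{-}(p)\, dp}, & \text{if } \varepsilon = 1. \end{cases}$

Formula formula_37: CBP $(\hat{F}^{\delta}_{n,L}, \hat{F}^{\delta}_{n,U})$, if $\hat{F}^{\delta}_{n,L} \succeq 0$, let us define $A^{\delta}_U(\varepsilon, X) := \begin{cases} 1 - \frac{\left(\int_0^1 (\hat{F}^{\delta,-}_{n,U}(p))^{1-\varepsilon}\, dp\right)^{\frac{1}{1-\varepsilon}}}{\int_0^1 \hat{F}^{\delta,-}_{n,L}(p)\, dp}, & \text{if } \varepsilon \ge 0,\ \varepsilon \ne 1; \\ 1 - \frac{\exp\left(\int_0^1 \ln(\hat{F}^{\delta,-}_{n,U}(p))\, dp\right)}{\int_0^1 \hat{F}^{\delta,-}_{n,L}(p)\, dp}, & \text{if } \varepsilon = 1. \end{cases}$ Then, with probability at least $1 - \delta$, $A^{\delta}_U(\varepsilon, X)$ is an upper bound for $A(\varepsilon, X)$ for all $\varepsilon \in [0, 1)$.

Formula formula_38: $D_{CV,\alpha}(T(F_g)) = \min_{\rho \in \mathbb{R}} \left\{ \rho + \frac{1}{1-\alpha} \cdot \mathbb{E}_{g \sim P_{\mathrm{Idx}}}[T(F_g) - \rho]_{+} \right\} - \mathbb{E}_{g \sim P_{\mathrm{Idx}}}[T(F_g)].$

Formula formula_39: $(T^{\delta}_L(F_g), T^{\delta}_U(F_g))$ such that $T^{\delta}_L(F_g) \le T(F_g) \le T^{\delta}_U(F_g)$

Formula formula_40: $D_{CV,\alpha}(T(F_g)) \le \min_{\rho \in \mathbb{R}} \left\{ \rho + \frac{1}{1-\alpha} \cdot \mathbb{E}_{g \sim P_{\mathrm{Idx}}}[T^{\delta}_U(F_g) - \rho]_{+} \right\} - \mathbb{E}_{g \sim P_{\mathrm{Idx}}}[T^{\delta}_L(F_g)],$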

Formula formula_41: $s_L \mathbb{1}\{s_L \ge 0\} - s_U \mathbb{1}\{s_U \le 0\} \le |s| \le \max\{|s_U|, |s_L|\}.$
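
The absolute-value inequality above can be read as a small interval-arithmetic rule: if $s_L \le s \le s_U$, it brackets $|s|$. The sketch below is one way to write it, with hypothetical function names. The second helper applies the rule to a difference of two bounded group functionals, taking $s_L = T^{\delta}_L(F_{g_1}) - T^{\delta}_U(F_{g_2})$ and $s_U = T^{\delta}_U(F_{g_1}) - T^{\delta}_L(F_{g_2})$.

```python
def abs_bounds(s_lo, s_hi):
    """Return (lower, upper) bounds on |s| for any s with s_lo <= s <= s_hi."""
    if s_lo > s_hi:
        raise ValueError("expected s_lo <= s_hi")
    # s_L * 1{s_L >= 0} - s_U * 1{s_U <= 0}
    lower = (s_lo if s_lo >= 0 else 0.0) - (s_hi if s_hi <= 0 else 0.0)
    upper = max(abs(s_lo), abs(s_hi))
    return lower, upper


def group_difference_upper(t1_lo, t1_hi, t2_lo, t2_hi):
    """Upper bound on |T(F_g1) - T(F_g2)| from interval bounds on each term."""
    _, upper = abs_bounds(t1_lo - t2_hi, t1_hi - t2_lo)
    return upper


# An interval straddling zero only gives the trivial lower bound 0.
straddling = abs_bounds(-2.0, 3.0)
```

When the interval lies entirely on one side of zero the lower bound is informative; otherwise only the upper bound carries information.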

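Formula formula_42: $\phi(s) \le \sum_{\{k \text{ odd},\ \alpha_k \ge 0\}} \alpha_k s_U^k + \sum_{\{k \text{ odd},\ \alpha_k < 0\}} \alpha_k s_L^k + \sum_{\{k \text{ even},\ \alpha_k \ge 0\}} \alpha_k \max\{|s_L|^k, |s_U|^k\} + \sum_{\{k \text{ even},\ \alpha_k < 0\}} \alpha_k \big(s_L \mathbb{1}\{s_L \ge 0\} - s_U \mathbb{1}\{s_U \le 0\}\big)^k.$

Formula formula_43: Proposition 3. If given a $(1-\delta)$-CBP $(\hat{F}^{\delta}_{n,L}, \hat{F}^{\delta}_{n,U})$, then with probability at least $1 - \delta$, $\hat{F}^{\delta,-}_{n,U} \mathbb{1}\{\hat{F}^{\delta,-}_{n,U} \ge 0\} - \hat{F}^{\delta,-}_{n,L} \mathbb{1}\{\hat{F}^{\delta,-}_{n,L} \le 0\} \preceq |F^{-}| \preceq \max\{|\hat{F}^{\delta,-}_{n,L}|, |\hat{F}^{\delta,-}_{n,U}|\}.$

Formula formula_44: $\phi(F^{-}) \preceq \sum_{\{k \text{ odd},\ \alpha_k \ge 0\}} \alpha_k (\hat{F}^{\delta,-}_{n,L})^k + \sum_{\{k \text{ odd},\ \alpha_k < 0\}} \alpha_k (\hat{F}^{\delta,-}_{n,U})^k + \sum_{\{k \text{ even},\ \alpha_k \ge 0\}} \alpha_k \max\{|\hat{F}^{\delta,-}_{n,U}|^k, |\hat{F}^{\delta,-}_{n,L}|^k\} + \sum_{\{k \text{ even},\ \alpha_k < 0\}} \alpha_k \big(\hat{F}^{\delta,-}_{n,U} \mathbb{1}\{\hat{F}^{\delta,-}_{n,U} \ge 0\} - \hat{F}^{\delta,-}_{n,L} \mathbb{1}\{\hat{F}^{\delta,-}_{n,L} \le 0\}\big)^k.$ Example 6. If we have $(T^{\delta}_L(F_g), T^{\delta}_U(F_g))$ such that $T^{\delta}_L(F_g) \le T(F_g) \le T^{\delta}_U(F_g$

Formula formula_45: $\xi(T(F_{g_1}) - T(F_{g_2}))$

Formula formula_46: $|T(F_{g_1}) - T(F_{g_2})| \le \max\{|T^{\delta}_U(F_{g_1}) - T^{\delta}_L(F_{g_2})|,\ |T^{\delta}_L(F_{g_1}) - T^{\delta}_U(F_{g_2})|\}.$

Formula formula_47: $\Pi = \{\pi = (x_0, x_1, \cdots, x_{n_\pi}) \mid \pi \text{ is a partition of } [a, b] \text{ satisfying } x_i \le x_{i+1} \text{ for all } 0 \le i \le n_\pi - 1\}.$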

Formula formula_48: $V_a^b(\xi) := \sup_{\pi \in \Pi} \sum_{i=0}^{n_\pi - 1} |\xi(x_{i+1}) - \xi(x_i)|$

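Formula formula_49: $\xi \in BV([a, b])$ iff $V_a^b(\xi) < \infty$.

Formula formula_50: $\xi(F^{-}(p)) \le V_0^{\hat{F}^{\delta,-}_{n,L}(p)}(\xi) - V_0^{\hat{F}^{\delta,-}_{n,U}(p)}(\xi) + \xi(\hat{F}^{\delta,-}_{n,U}(p)).$

Formula formula_51: $\int_0^x \left|\frac{d\xi}{ds}(s)\right| ds$ for any $x \in [0, \hat{F}^{\delta,-}_{n,L}(p)]$.

Formula formula_52: $x \in [0, \hat{F}^{\delta,-}_{n,L}(p)]$, $\xi(x) = V_0^x(\xi) - (V_0^x(\xi) - \xi(x))$

Formula formula_53: $V_0^x(\xi) = \int_0^x \left|\frac{d\xi}{ds}(s)\right| ds$ if $\xi$ is continuously differentiable.

Formula formula_54: $\xi(F^{-}(p)) \le V_0^{\hat{F}^{\delta,-}_{n,L}(p)}(\xi) - V_0^{\hat{F}^{\delta,-}_{n,U}(p)}(\xi) + \xi(\hat{F}^{\delta,-}_{n,U}(p)).$

Formula formula_55: $\xi(F^{-}) \preceq V_0^{\hat{F}^{\delta,-}_{n,L}}(\xi) - V_0^{\hat{F}^{\delta,-}_{n,U}}(\xi) + \xi(\hat{F}^{\delta,-}_{n,U}) = f_1(\hat{F}^{\delta,-}_{n,L}) - f_2(\hat{F}^{\delta,-}_{n,U}).$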

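Formula formula_56: Lemma 2 (A restatement & formal version of Lemma 1). For $0 \le L_1 \le L_2 \le \cdots \le L_n \le 1$, since $P(\forall i : F(X_{(i)}) \ge L_i) \ge P(\forall i : U_{(i)} \ge L_i)$ by

Formula formula_57: $U_{(i)} \ge L_i) \ge 1 - \delta$, then we have $P(\forall i : \lim_{\epsilon \to 0^+} F(X_{(i)} - \epsilon) \le 1 - L_{n-i+1}) \ge 1 - \delta$. Furthermore, let $R(x)$ be defined as $R(x) = \begin{cases} 1 - L_n, & \text{for } x < X_{(1)} \\ 1 - L_{n-1}, & \text{for } X_{(1)} \le x < X_{(2)} \\ \ \vdots \\ 1 - L_1, & \text{for } X_{(n-1)} \le x < X_{(n)} \\ 1, & \text{for } X_{(n)} \le x. \end{cases}$

Formula formula_58: $\begin{aligned}
&P_{\{X_{(i)}\}_{i=1}^{n}}\big(\forall i : \lim_{\epsilon \to 0^+} F(X_{(i)} - \epsilon) \le 1 - L_{n-i+1}\big) \\
&= P_{\{X_{(i)}\}_{i=1}^{n}}\big(\forall i : P_X(X \ge X_{(i)}) > L_{n-i+1}\big) \\
&= P_{\{X_{(i)}\}_{i=1}^{n}}\big(\forall i : P_X(-X \le -X_{(i)}) > L_{n-i+1}\big) \\
&= P_{\{X_{(i)}\}_{i=1}^{n}}\big(\forall i : P_B(B \le B_{(n-i+1)}) > L_{n-i+1}\big) \\
&= P\big(\forall i : F_B \circ F_B^{-}(U_{(n-i+1)}) > L_{n-i+1}\big) \\
&\ge P\big(\forall i : U_{(n-i+1)} > L_{n-i+1}\big).
\end{aligned}$

Formula formula_59: $F(X_{(i)}) \ge L_i) \ge P(\forall i : U_{(i)} \ge L_i) \ge 1 - \delta$. The conservative construction of $R$ satisfies $R \succeq F$ straightforwardly if $\forall i : \lim_{\epsilon \to 0^+} F(X_{(i)} - \epsilon) \le 1 - L_{n-i+1}$

Formula formula_60: $0 \le L_1 \le \cdots \le L_n \le 1$, $P\big(\forall i,\ F(X_{(i)}) \ge L_i\big) \ge P\big(\forall i,\ U_{(i)} \ge L_i\big) \ge n! \int_{L_n}^{1} dx_n \int_{L_{n-1}}^{x_n} dx_{n-1} \cdots \int_{L_1}^{x_2} dx_1,$

Formula formula_61: $\upsilon(L_1, L_2, \cdots, L_n, 1) := \int_{L_n}^{1} dx_n \int_{L_{n-1}}^{x_n} dx_{n-1} \cdots \int_{L_1}^{x_2} dx_1$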

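Formula formula_62: $\partial_{L_i} \upsilon(L_1, L_2, \cdots, L_n, 1) = -\int_{L_n}^{1} dx_n \int_{L_{n-1}}^{x_n} dx_{n-1} \cdots \int_{L_{i+1}}^{x_{i+2}} dx_{i+1} \cdot \int_{L_{i-1}}^{L_i} dx_{i-1} \cdots \int_{L_1}^{x_2} dx_1 = -\upsilon(L_{i+1}, \cdots, L_n, 1) \cdot \upsilon(L_1, \cdots, L_{i-1}, L_i),$

Formula formula_63: $\{L_i\}_{i=1}^{n}$ satisfying $P\big(\forall i,\ F(X_{(i)}) \ge L_i\big) \ge 1 - \delta$, one can use conservative CDF completion in

Formula formula_64: $\int_0^1 \psi(p)\, \xi(\hat{F}^{\delta,-}_{n,L}(p))\, dp = \sum_{i=1}^{n+1} \xi(X_{(i)})$

Formula formula_65: $\min_{\{L_i\}_{i=1}^{n}} \sum_{i=1}^{n+1} \xi(X_{(i)}) \int_{L_{i-1}}^{L_i} \psi(p)\, dp$ such that $P\big(\forall i,\ F(X_{(i)}) \ge L_i\big) \ge 1 - \delta$, and $0 \le L_1 \le \cdots \le L_n \le 1$.

Formula formula_66: $\int_0^1 \psi(p)\, F^{-}(p)\, dp$, $\sum_{i=1}^{n} \xi(X_{(i)}) \int_{L_{n-i}}^{L_{n-i+1}} \psi(p)\, dp$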

Formula formula_67: $L_i(\theta) = \frac{\sum_{j=1}^{i} \exp(\phi_\theta(g_j))}{1 + \sum_{j=1}^{n} \exp(\phi_\theta(g_j))}$
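
The parametrization above maps $n$ unconstrained scores to levels satisfying $0 < L_1 < \cdots < L_n < 1$, so the ordering constraint holds by construction. A minimal sketch follows, assuming the scores $\phi_\theta(g_j)$ have already been computed and are passed in as a plain list; the function name and the stability shift are illustrative choices, not taken from the text.

```python
import math


def cumulative_levels(scores):
    """Map scores phi_theta(g_1..g_n) to increasing levels L_1 < ... < L_n in (0, 1)."""
    # Shift by m for numerical stability; the constant 1 in the denominator
    # equals exp(0), so it becomes exp(-m) after the shift.
    m = max(max(scores), 0.0)
    weights = [math.exp(s - m) for s in scores]
    denom = math.exp(-m) + sum(weights)
    levels, running = [], 0.0
    for w in weights:
        running += w
        levels.append(running / denom)
    return levels


# Equal scores give evenly spaced levels: with n = 3 these are 1/4, 1/2, 3/4.
even = cumulative_levels([0.0, 0.0, 0.0])
```

Because every increment is a positive softmax-style weight, a gradient-based optimizer can move the scores freely without ever leaving the feasible ordering.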

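Formula formula_68: $\min_{\theta} \sum_{i=1}^{n+1} \xi(X_{(i)}) \int_{L_{i-1}(\theta)}^{L_i(\theta)} \psi(p)\, dp \quad (1)$ such that $n! \int_{L_n(\theta)}^{1} dx_n \int_{L_{n-1}(\theta)}^{x_n} dx_{n-1} \cdots \int_{L_1(\theta)}^{x_2} dx_1 \ge 1 - \delta,$

Formula formula_69: $\int_{L_n(\theta)}^{1} dx_n \int_{L_{n-1}(\theta)}^{x_n} dx_{n-1} \cdots \int_{L_1(\theta)}^{x_2} dx_1 \ge 1 - \delta$ is

Formula formula_70: $\gamma^{*} = \inf\{\gamma : n!\, \upsilon(L_1(\hat{\theta}) - \gamma, \cdots, L_n(\hat{\theta}) - \gamma, 1) \ge 1 - \delta,\ \gamma \ge 0\}.$

Formula formula_71: $n!\, \upsilon(L_1(\hat{\theta}) - \gamma, \cdots, L_n(\hat{\theta}) - \gamma, 1) \ge P\big(\forall i,\ U_{(i)} \ge 0\big) = 1$ and $\upsilon(L_1(\hat{\theta}) - \gamma, \cdots, L_n(\hat{\theta}) - \gamma, 1$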

Formula formula_72: $L(t) = \frac{\int_0^t F^{-1}(p)\, dp}{\int_0^1 F^{-1}(p)\, dp}.$
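
To make the Lorenz curve concrete, the sketch below evaluates it for the empirical distribution of a sample, where the quantile function is a step function and the integral in the numerator is piecewise linear in $t$. The function name is hypothetical, and this computes the plug-in curve only, not the bounded versions $L^{\delta}_L$ and $L^{\delta}_U$.

```python
def lorenz_curve(xs, t):
    """Empirical Lorenz curve: share of the total held by the lowest t fraction."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    xs = sorted(xs)
    n, total = len(xs), sum(xs)
    if total <= 0:
        raise ValueError("the mean must be positive")
    k = min(int(t * n), n)            # number of order statistics fully covered
    area = sum(xs[:k]) / n            # integral of the quantile function over [0, k/n]
    if k < n:
        area += (t - k / n) * xs[k]   # partial step up to t
    return area / (total / n)


# With all loss on one of four individuals, the curve stays at 0 until t = 0.75.
flat_part = lorenz_curve([0, 0, 0, 1], 0.75)
```

A sample with identical values gives $L(t) = t$, the line of perfect equality referred to in the Figure 2 caption.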

Formula formula_73: $L^{\delta}_L(t) = \frac{\int_0^t \hat{F}^{\delta,-}_{n,U}(p)\, dp}{\int_0^1 \hat{F}^{\delta,-}_{n,L}(p)\, dp}$

Formula formula_74: $L^{\delta}_U(t) = \frac{\int_0^t \hat{F}^{\delta,-}_{n,L}(p)\, dp}{\int_0^1 \hat{F}^{\delta,-}_{n,U}(p)\, dp}$

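Formula formula_75: $G(\nu, X) := \frac{-\nu\, \mathrm{Cov}(X, [1 - F(X)]^{\nu - 1})}{\mathbb{E}[X]} = 1 - \frac{\nu \int_0^1 (1 - p)^{\nu - 1} F^{-}(p)\, dp}{\int_0^1 F^{-}(p)\, dp},$

Formula formula_76: $G^{\delta}_U(\nu, X) := 1 - \frac{\nu \int_0^1 (1 - p)^{\nu - 1} \hat{F}^{\delta,-}_{n,U}(p)\, dp}{\int_0^1 \hat{F}^{\delta,-}_{n,L}(p)\, dp}$, then $G^{\delta}_U(\nu, X) \succeq G(\nu, X$

Formula formula_77: $\begin{cases} \frac{1}{\alpha(\alpha - 1)}\, \mathbb{E}\left[\left(\frac{X}{\mathbb{E}X}\right)^{\alpha} - 1\right], & \alpha \ne 0, 1 \\ \mathbb{E}\left[\frac{X}{\mathbb{E}X} \ln\left(\frac{X}{\mathbb{E}X}\right)\right], & \text{if } \alpha = 1 \\ -\mathbb{E}\left[\ln\left(\frac{X}{\mathbb{E}X}\right)\right], & \text{if } \alpha = 0. \end{cases}$

Formula formula_78: $GE(\alpha, X) := \begin{cases} \frac{1}{\alpha(\alpha - 1)} \int_0^1 \left[\left(\frac{F^{-}(p)}{\int_0^1 F^{-}(p)\, dp}\right)^{\alpha} - 1\right] dp, & \alpha \ne 0, 1 \\ \int_0^1 \frac{F^{-}(p)}{\int_0^1 F^{-}(p)\, dp} \ln\left(\frac{F^{-}(p)}{\int_0^1 F^{-}(p)\, dp}\right) dp, & \text{if } \alpha = 1 \\ -\int_0^1 \ln\left(\frac{F^{-}(p)}{\int_0^1 F^{-}(p)\, dp}\right) dp, & \text{if } \alpha = 0. \end{cases}$

Formula formula_79: $\frac{1}{\alpha(\alpha - 1)} \int_0^1 \left[\left(\frac{F^{-}(p)}{\int_0^1 F^{-}(p)\, dp}\right)^{\alpha} - 1\right] dp \le \frac{1}{\alpha(\alpha - 1)} \int_0^1 \left[\left(\frac{\hat{F}^{\delta,-}_{n,L}(p)}{\int_0^1 \hat{F}^{\delta,-}_{n,U}(p)\, dp}\right)^{\alpha} - 1\right] dp.$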

Formula formula_80: $H(X) = \frac{\int_0^1 \left|F^{-}(p) - \int_0^1 F^{-}(q)\, dq\right| dp}{2 \int_0^1 F^{-}(p)\, dp}$

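Formula formula_81: $H_U(X) = \frac{\int_0^1 \max\left\{\left|\hat{F}^{\delta,-}_{n,L}(p) - \int_0^1 \hat{F}^{\delta,-}_{n,U}(q)\, dq\right|,\ \left|\hat{F}^{\delta,-}_{n,U}(p) - \int_0^1 \hat{F}^{\delta,-}_{n,L}(q)\, dq\right|\right\} dp}{2 \int_0^1 \hat{F}^{\delta,-}_{n,U}(p)\, dp}.$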

Formula formula_82: i∈[k] X i is F k . Thus, by the result of Appendix A.1.2, if given a (1 -δ)-CBP ( F δ n,L , F δ n,U ) and 1 ⪰ F δ,- n,U ⪰ F δ,- n,L ⪰ 0, with probability at least 1 -δ, ( F δ,- n,L ) k ⪯ F k ⪯ ( F δ,- n,U ) k . We also have ( F δ,- n,U ) k ⪯ F k ⪯ ( F δ,- n,L ) k . Similarly, for min i∈[k] X i , the CDF is 1 -(1 -F ) k , thus, we have 1 -(1 -F δ,- n,U ) k ⪯ F k ⪯ 1 -(1 -F δ,- n,L ) k .

Formula formula_83: k F -(x)[F k-1 (x) -F k (x)]dF (x) = k 1 0 F -(F -(p))[F k-1 (F -(p)) -F k (F -(p))]dp Notice that both F and F -are increasing. Thus, if given a (1 -δ)-CBP ( F δ n,L , F δ n,U ), F δ n,L ⪰ 0, then with probability at least 1 -δ, 1 0 F δ,- n,L F δ,- n,L (p) ( F δ n,U ) k F δ,- n,L (p) -( F δ n,L ) k F δ,- n,U (p) dp

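Formula formula_84: $\{X^i\}_{i=1}^{n}$, each of $k$ dimensions, i.e. $X^i = (X^i_1, \cdots, X^i_k)$, for any $k$-dimensional vector $x = (x_1, \cdots, x_k)$, define empirical CDF $\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{X^i \preceq x\}.$

Formula formula_85: $|\hat{F}_n(x) - F(x)| \le \sqrt{\frac{\ln(k(n+1)/\delta)}{2n}}.$

Formula formula_86: $\max\left\{1 - k + \sum_{i=1}^{k} F_i(x_i),\ 0\right\} \le F(x) \le \min\{F_1(x_1), \cdots, F_k(x_k)\}$

Formula formula_87: $(\hat{F}^{\delta/k,i}_{n,L}, \hat{F}^{\delta/k,i}_{n,U})$ such that $(\hat{F}^{\delta/k,i}_{n,L} \preceq F_i \preceq \hat{F}^{\delta/k,i}_{n,U})$

Formula formula_88: $\max\left\{1 - k + \sum_{i=1}^{k} \hat{F}^{\delta/k,i}_{n,L}(x_i),\ 0\right\} \le F(x) \le \min\{\hat{F}^{\delta/k,1}_{n,U}(x_1), \cdots, \hat{F}^{\delta/k,k}_{n,U}(x_k)\}$

Formula formula_89: $F(x) \ge \max\left\{1 - k + \sum_{i=1}^{k} \hat{F}^{\delta/k,i}_{n,L}(x_i),\ 0,\ \hat{F}_n(x) - \sqrt{\frac{\ln(k(n+1)/\delta)}{2n}}\right\}$, $\quad F(x) \le \min\left\{\hat{F}^{\delta/k,1}_{n,U}(x_1), \cdots, \hat{F}^{\delta/k,k}_{n,U}(x_k),\ \hat{F}_n(x) + \sqrt{\frac{\ln(k(n+1)/\delta)}{2n}}\right\}$

Formula formula_90: $\Gamma_{X,Y} := \frac{\mathrm{Cov}(X, F_Y(Y))}{\mathrm{Cov}(X, F_X(X))} = \frac{\int\!\!\int \big(F_{X,Y}(x, y) - F_X(x) F_Y(y)\big)\, dx\, dF_Y(y)}{\mathrm{Cov}(X, F_X(X))},$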

Formula formula_91: $h(v) = \frac{1}{1 + \exp(wv + b)}$

Formula formula_92: $L = \frac{1}{n} \sum_{i=1}^{n} (f_i - o_i)^2$

Formula formula_93: $O = T_1(F_g) + T_2(F_g)$

Formula formula_94: $\psi(p; \beta) = \frac{1}{a \sqrt{\pi}} \exp\left(-\frac{(p - \beta)^2}{a^2}\right)$
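
The weighting function above is a Gaussian density centered at $\beta$ with variance $a^2/2$, so its integral over an interval has a closed form in terms of the error function. The sketch below uses this to compute a plug-in smoothed quantile $\int_0^1 \psi(p; \beta)\, F^{-}(p)\, dp$ for the empirical distribution of a sample, with $\beta = 0.5$ and $a = 0.01$ as in Figure 5. The function names are hypothetical, and the sketch computes the empirical value only, not the distribution-free bound.

```python
import math


def psi(p, beta=0.5, a=0.01):
    """Weighting function psi(p; beta) = exp(-(p - beta)^2 / a^2) / (a * sqrt(pi))."""
    return math.exp(-((p - beta) / a) ** 2) / (a * math.sqrt(math.pi))


def psi_mass(lo, hi, beta=0.5, a=0.01):
    """Exact integral of psi over [lo, hi], via the error function."""
    return 0.5 * (math.erf((hi - beta) / a) - math.erf((lo - beta) / a))


def smoothed_quantile(xs, beta=0.5, a=0.01):
    """Plug-in value of the psi-weighted quantile integral for a sample."""
    xs = sorted(xs)
    n = len(xs)
    # The empirical quantile function equals x_(i) on ((i-1)/n, i/n].
    return sum(x * psi_mass((i - 1) / n, i / n, beta, a)
               for i, x in enumerate(xs, start=1))


# For the values 1..101 the weights are symmetric around the median, 51.
smoothed_median = smoothed_quantile(list(range(1, 102)))
```

With a small $a$ almost all of the weight falls on the few order statistics closest to the $\beta$-quantile, which is what makes this a smooth surrogate for the median.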

