Keywords: therapeutic AI, evaluation metrics, fairness, clinical deployment, accountability, machine learning, medical AI, subgroup fairness, tail risk, AUC, performance metrics
TL;DR: Standard ML metrics like AUC encode implicit social norms about acceptable risk; we prove high AUC can mask severe subgroup failures, and propose a framework linking metric choices to explicit clinical accountability.
Abstract: Clinical and regulatory discussions about trustworthy therapeutic AI speak in ethical and legal terms, while technical work reports performance through AUC, F1, PPV, or survival concordance. The mapping between these numerical summaries and the social norms they support is implicit and often incoherent. This gap is acute for therapeutic AI systems built on spatially resolved data---such as digital pathology, radiology, and spatial omics---where models guide target selection, biomarker discovery, and responder prediction.
We provide a technical--normative analysis of common evaluation metrics. We formalize metric families by their aggregation operations: expectation-based (expected loss; AUC as a U-statistic over pairs), quantile/tail-based (median, upper quantiles, CVaR), supremum-type (worst-group risk, minimax regret), thresholded confusion-matrix ratios (PPV/precision, sensitivity, F1), and ranking metrics (top-$k$, average precision). For each family we identify the implicit social norm: maximizing average benefit, protecting typical patients, guarding against worst-case harms, or prioritizing top-predicted benefit.
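To make the aggregation distinction concrete, one standard way to write the first three families is as follows; the notation ($f$ for the model score, $\ell$ for a loss, $G$ for a subgroup variable, $\alpha$ for the tail level) is an illustrative choice, not fixed by the abstract:

$$\mathrm{AUC}(f) = \Pr\big(f(X^{+}) > f(X^{-})\big) \approx \frac{1}{n_{+} n_{-}} \sum_{i:\, y_i = 1} \; \sum_{j:\, y_j = 0} \mathbf{1}\{ f(x_i) > f(x_j) \}$$

(the U-statistic form of the expectation family, averaging over positive--negative pairs);

$$\mathrm{CVaR}_{\alpha}(L) = \mathbb{E}\big[ L \mid L \ge \mathrm{VaR}_{\alpha}(L) \big]$$

(the quantile/tail family: mean loss over the worst $(1-\alpha)$ fraction of patients); and

$$R_{\sup}(f) = \max_{g \in \mathcal{G}} \; \mathbb{E}\big[ \ell(f(X), Y) \mid G = g \big]$$

(the supremum family: worst-group risk over the subgroups $\mathcal{G}$).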
We prove an incompatibility result showing that high AUC can coexist with very low worst-group sensitivity under deployment-relevant thresholding, and we illustrate further incompatibilities via clinical examples. We then propose a metric design framework: (i) explicit normative declarations, (ii) multi-objective evaluation with subgroup and tail-risk constraints, and (iii) deployment checklists tying thresholds to institutional responsibilities. The goal is not to replace ethical debate with formulas, but to make explicit and auditable the value judgments encoded by metric choices.
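A minimal synthetic sketch of the incompatibility (our construction, not the paper's proof): a model that ranks perfectly within and across groups, so global AUC is 1.0, yet whose minority-group positives all fall just below a fixed deployment threshold, driving that group's sensitivity to zero. Group sizes, score values, and the 0.5 threshold are hypothetical.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Majority group (900 patients): positives score well above, negatives
# well below, the deployment threshold.
y_maj = np.array([1] * 450 + [0] * 450)
s_maj = np.where(y_maj == 1, 0.90, 0.10) + rng.normal(0, 0.01, y_maj.size)

# Minority group (100 patients): positives still outrank its negatives
# (perfect within-group ranking), but every positive sits just below 0.5.
y_min = np.array([1] * 50 + [0] * 50)
s_min = np.where(y_min == 1, 0.45, 0.05) + rng.normal(0, 0.01, y_min.size)

y = np.concatenate([y_maj, y_min])
s = np.concatenate([s_maj, s_min])

threshold = 0.5
yhat = (s >= threshold).astype(int)

def sensitivity(y_true, y_pred):
    # True-positive rate: fraction of positives the threshold recovers.
    pos = y_true == 1
    return (y_pred[pos] == 1).mean()

print(f"global AUC:           {roc_auc_score(y, s):.3f}")              # ~1.000
print(f"majority sensitivity: {sensitivity(y_maj, yhat[:900]):.3f}")   # ~1.000
print(f"minority sensitivity: {sensitivity(y_min, yhat[900:]):.3f}")   # ~0.000

The construction works because AUC is threshold-free and pair-averaged: it rewards ranking positives above negatives anywhere on the score axis, and so is blind to where a single deployment threshold cuts each subgroup's score distribution.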
Submission Number: 19