GBP-Audit: AI Bias Intervention with Built-In Safety-First Guardrails

Published: 09 Mar 2026, Last Modified: 05 May 2026. OpenReview Archive Direct Upload. Readers: Everyone. License: CC BY 4.0
Abstract: We introduce GBP-Audit, an audit-then-act post-hoc fairness layer for frozen models when retraining is impractical. GBP-Audit first diagnoses whether observed disparities stem from removable proxies or are task-aligned, using geometric coherence (κ) between bias and task directions, a stress-test injection, and multi-seed instability. It applies an edit only when statistical guardrails certify safety: AUROC non-inferiority via TOST (δ=0.005), calibration preservation (|ΔECE| ≤ ϵ=0.003), and optional fixed-FPR drift caps. Otherwise it abstains and records the rationale. Across seven public tabular classification tasks, GBP-Audit delivered safe parity gains on 3/7 datasets while maintaining utility (e.g., Adult: −55% EO; Bank: −73%) and correctly abstained on 4/7 where counterfactual edits would harm performance (e.g., Employment: EO +0.043; Taiwan: AUROC degradation). Each decision emits a machine-readable Governance Packet (baselines, diagnostics, guardrail outcomes, expected impacts) to support compliance and auditability (e.g., NYC Local Law 144, EU AI Act, SR 11-7). Deployment is attribute-free at inference, fully reversible, and adds <1 ms CPU latency per 32 examples. GBP-Audit turns the question "can we fix this bias without retraining?" into a guarded decision: ship a calibrated, non-inferior mitigation, or return evidence that remediation requires upstream change.
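The apply-or-abstain rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the TOST check is approximated by requiring a percentile interval of paired AUROC differences (edited minus baseline) to lie inside (−δ, +δ), and the function name, field names, and packet layout are hypothetical. Only the thresholds δ=0.005 and ϵ=0.003 come from the abstract; the optional fixed-FPR drift cap is omitted.

```python
import json

DELTA = 0.005    # AUROC margin stated in the abstract
EPSILON = 0.003  # |ΔECE| tolerance stated in the abstract


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 1]."""
    xs = sorted(values)
    if not xs:
        raise ValueError("values must be non-empty")
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def guardrail_decision(auroc_deltas, ece_before, ece_after, alpha=0.05):
    """Return a hypothetical decision record: apply only if every guardrail passes.

    auroc_deltas: resampled AUROC differences (edited - baseline), an assumed input.
    """
    # Assumption: a TOST-style check via the (1 - 2*alpha) percentile interval.
    lower = percentile(auroc_deltas, alpha)
    upper = percentile(auroc_deltas, 1 - alpha)
    auroc_ok = lower > -DELTA and upper < DELTA

    delta_ece = ece_after - ece_before
    ece_ok = abs(delta_ece) <= EPSILON

    reasons = []
    if not auroc_ok:
        reasons.append("AUROC interval not within +/- delta")
    if not ece_ok:
        reasons.append("calibration shift exceeds epsilon")

    return {
        "action": "apply" if not reasons else "abstain",
        "auroc_interval": [lower, upper],
        "delta_ece": delta_ece,
        "guardrails": {"auroc": auroc_ok, "ece": ece_ok},
        "rationale": reasons,
    }


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    packet = guardrail_decision([0.001, 0.0005, 0.002, -0.001], 0.020, 0.021)
    print(json.dumps(packet, indent=2))
```

The record is a plain dictionary so that it serializes directly to JSON, which mirrors the idea of a machine-readable packet that stores the rationale alongside the outcome when the layer abstains.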