Sensitivity Auditing for Trustworthy Language Models

Published: 25 Jul 2025 · Last Modified: 20 Jan 2026 · University of Virginia · CC BY-SA 4.0
Abstract: Large language models (LLMs) have demonstrated impressive capabilities across a wide range of tasks. Yet, they remain unreliable and pose serious social and ethical risks, including reinforcing social stereotypes, spreading misinformation, and facilitating malicious uses. Despite their growing presence in high-stakes settings, current evaluation practices often fail to address these risks. This dissertation aims to advance the reliability of LLMs by developing rigorous, context-aware evaluation methodologies. We argue that a model's reliability should be assessed with respect to its intended uses (i.e., how it should operate and in what contexts) through fine-grained measurements that go beyond binary judgments. We propose to (1) improve evaluation reliability, (2) design mitigation strategies to control model behavior, and (3) develop auditing techniques for accountability. We begin by studying the fundamental gaps in current evaluation practices, showing that discrepancies between evaluations and their underlying goals can misrepresent both the capabilities of models and the effectiveness of mitigation strategies. To address these gaps, we develop a graph-based data augmentation method for improving dataset consistency and sensitivity analysis tools for examining the bias properties captured by common bias metrics. Next, we introduce mitigation strategies that enable balanced and precise control of models. We propose balanced adversarial training to address a shortcoming of conventional adversarial training, which often produces overly robust models that fail to respond to meaningful input changes. To control models efficiently, we also propose inference-time intervention using steering vectors, which manipulate model outputs through the model's internal representations. We show how these vectors can be used to mitigate gender bias and control model censorship without degrading overall model utility. Finally, we present auditing techniques to assess risks in high-stakes applications. We demonstrate that widely used bias metrics are ineffective for assessing the potential harms of LLMs in allocation contexts. Further, we develop white-box bias audits that use steering vectors to conduct internal model sensitivity tests, enabling more comprehensive assessments that uncover problematic behaviors overlooked by black-box, input-level tests. Ultimately, our work highlights the need for more reliable and robust evaluation measures to build trust in the use of LLMs and mitigate potential harms in critical deployment settings.
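To make the inference-time intervention idea concrete, the sketch below illustrates the general mechanism of adding a steering vector to a model's hidden activations during generation. It is not the dissertation's implementation: the model (GPT-2), the intervention layer, the steering strength, and the random placeholder direction are all illustrative assumptions; in practice the vector would be derived from model activations (e.g., a difference of mean activations between contrasting inputs).

```python
# Minimal sketch of inference-time intervention with a steering vector.
# Assumes GPT-2 via Hugging Face transformers; the steering direction is
# random here purely for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # illustrative choice, not the dissertation's model
LAYER_IDX = 6         # hypothetical intervention layer
ALPHA = 4.0           # hypothetical steering strength

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

hidden_size = model.config.hidden_size
steering_vector = torch.randn(hidden_size)   # placeholder direction
steering_vector = steering_vector / steering_vector.norm()

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states
    # of shape (batch, seq_len, hidden_size). Shift every position along
    # the steering direction.
    hidden = output[0] + ALPHA * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

# Attach the intervention to one transformer block at inference time.
handle = model.transformer.h[LAYER_IDX].register_forward_hook(add_steering)

prompt = "The nurse said that"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook so the model behaves normally again
```

Because the intervention is applied through a forward hook rather than by modifying weights, it can be toggled per request and its strength scaled, which is what makes this style of control efficient relative to retraining.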