Fairness principles across contexts: evaluating gender disparities of facts and opinions in large language models

Sofie Goethals, Lauren Rhue, Arun Sundararajan

Published: 2026 · Last Modified: 09 Apr 2026 · AI Ethics 2026 · CC BY-SA 4.0
Abstract: This paper examines how fairness principles differ when evaluating large language model (LLM) outputs in fact-based versus opinion-based contexts, focusing on gender disparities in responses related to notable individuals. Using prompts designed to elicit either factual information (identifying Nobel Prize winners) or subjective judgments (identifying the most accomplished figures in a field), we analyze responses from GPT-4, Claude, and Llama-3. For fact-based tasks, fairness is assessed through correctness and refusal rates, revealing minimal gender disparities when models achieve high accuracy, although refusal patterns can vary by model and gender. For opinion-based tasks, where no single correct answer exists, fairness is operationalized through representational metrics such as demographic parity and disparate impact. Results show substantial gender disparities in opinion-based outputs across all models, with representation shaped by prompt wording (e.g., “important” vs. “prestigious”), subject domain, and the inclusion of secondary answers. However, the highly skewed context makes a final fairness assessment challenging. Our findings highlight that fairness metrics and their interpretations must be contextualized by output type. Performance parity is an appropriate goal for fact-based outputs, whereas representational inclusivity is central for opinion-based outputs. Representational inclusivity alone may not be sufficient when the population relevant to the LLM’s task differs from the broader population. We discuss theoretical implications for fairness evaluation, noting that high performance can mitigate disparities in factual contexts but that opinion-based contexts require more nuanced, values-driven approaches.
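To make the representational metrics named in the abstract concrete, the sketch below shows one plausible way to compute a demographic parity gap and a disparate impact ratio from gender-labeled model answers. This is an illustrative example, not the paper's code: the column name `gender`, the label values, and the example data are assumptions introduced here.

```python
import pandas as pd

def representation_metrics(responses: pd.DataFrame, group_col: str = "gender") -> dict:
    """Compute representational fairness metrics from one row per model answer.

    Assumes each row records the perceived gender ("female"/"male") of the
    individual named in that answer; both the schema and labels are hypothetical.
    """
    shares = responses[group_col].value_counts(normalize=True)
    p_female = shares.get("female", 0.0)
    p_male = shares.get("male", 0.0)
    return {
        "share_female": p_female,
        "share_male": p_male,
        # Demographic parity difference: absolute gap between group representation rates.
        "demographic_parity_diff": abs(p_female - p_male),
        # Disparate impact ratio: smaller share divided by larger share
        # (values below 0.8 are often flagged under the "four-fifths" rule).
        "disparate_impact_ratio": (
            min(p_female, p_male) / max(p_female, p_male)
            if max(p_female, p_male) > 0 else float("nan")
        ),
    }

# Hypothetical usage with made-up answers to an opinion-based prompt:
df = pd.DataFrame({"gender": ["male", "male", "female", "male", "female"]})
print(representation_metrics(df))
```

As the abstract notes, such ratios are only a starting point: in skewed reference contexts (e.g., historical Nobel laureates), a low ratio does not by itself settle whether the model's output is unfair.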