Abstract: Large Language Models (LLMs) are widely used for tasks like translation and conversation but often reflect cultural biases from their training data. We introduce two frameworks, “The Life of X” and “RECAP”, to evaluate how LLMs handle culturally grounded prompts. The Life of X explores how identity cues (e.g., names) influence narrative framing, while RECAP tests model sensitivity to prompt specificity in recommendation tasks using semantic and manual metrics. Evaluating models such as GPT-4, Gemini, and Claude, we find consistent identity-driven variation and a tendency to default to Western norms unless explicitly guided. These findings highlight the need for culturally aware evaluation methods in LLM development.
External IDs: doi:10.1007/978-3-032-18477-1_45