From Style to Cultural Calibration: Evaluating Institutional Voice in LLM-Generated News
Keywords: Cultural AI, institutional voice, LLM evaluation, journalism, prompt engineering, framing, computational social science
Abstract: As large language models increasingly generate culturally situated text, evaluating their success requires moving beyond surface style to examine institutional voice. We propose that institutional voice consists of multiple layers, each with different sensitivity to prompt engineering. Using The New York Times China coverage (2020–2024) as a calibrated case, we generate matched articles with GPT-4o-mini and Gemini 3 Flash and compare them to real NYT reporting. We identify a layered pattern: models reproduce entity-level desk differentiation by default; prompting partly recovers desk-level affective ordering; but the Foreign-desk affective baseline remains resistant to calibration. A manual qualitative audit further shows that this miscalibration involves interpretive flattening: generated articles preserve plausible journalistic form while reducing reported particularity, narrative friction, and political-critical edge. We offer the reproduction–recovery–miscalibration framework as a diagnostic for evaluating the alignment and limits of cultural AI.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 21