Keywords: multi-agent debate, sociotechnical alignment, multi-turn evaluation
TL;DR: We used multi-agent deliberation of everyday dilemmas to evaluate the sociotechnical alignment of language models.
Abstract: As large language models are used increasingly in sensitive everyday contexts -- offering personal advice, mental health support, and moral guidance -- understanding the values they express when navigating complex moral reasoning becomes crucial. Many evaluations study sociotechnical alignment through single-turn prompts, but it is unclear whether these findings extend to multi-turn interactions, where values emerge through dialogue, revision, and consensus. We use multi-agent deliberation to assess value alignment in multi-turn settings by prompting subsets of three models (GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.0 Flash) to collectively assign blame in 1,000 everyday dilemmas from Reddit's "Am I the Asshole" community. We use both synchronous (parallel responses) and round-robin (sequential responses) formats to examine order effects and verdict revision rates. Our findings show striking differences in models' revision tendencies: GPT exhibited strong inertia (0.6-3.1% revision rates) while Claude and Gemini showed higher flexibility (28-41%). We identify distinct value patterns, with GPT emphasizing personal autonomy and direct communication, while Claude and Gemini prioritize empathetic dialogue. We further demonstrate that specific values are more effective at driving changes in verdicts. Round-robin deliberation substantially increased consensus rates relative to the synchronous setting through strong order effects. Using a multinomial logistic model, we quantify inertia and conformity effects, finding GPT to be 2-3x more resistant to change than other models. While system prompts increased flexibility, they often drove divergence rather than convergence. These results show how deliberation format and model-specific behaviors shape moral reasoning in multi-turn interactions, underscoring that sociotechnical alignment depends on how systems structure dialogue as much as on their outputs.
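The abstract's inertia/conformity analysis can be illustrated with a minimal multinomial logistic regression. This is a hedged sketch on synthetic data, not the authors' actual specification: the features (a per-model indicator and a peer-disagreement fraction), the outcome coding (keep / soften / flip), and all numbers here are illustrative assumptions.

```python
# Sketch: multinomial logistic model of verdict revision, in the spirit of the
# paper's inertia/conformity analysis. All features, labels, and data below
# are synthetic illustrations, not the authors' actual setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 600

# Which model produced the verdict (0=GPT, 1=Claude, 2=Gemini), and the
# fraction of peers currently disagreeing with that verdict.
model_id = rng.integers(0, 3, size=n)
peer_disagreement = rng.random(n)

# Synthetic outcome: 0 = keep verdict, 1 = soften, 2 = flip.
# GPT (model_id == 0) is made artificially "stickier" than the others.
logits_revise = peer_disagreement * 2.0 - (model_id == 0) * 1.5
p_revise = 1 / (1 + np.exp(-logits_revise))
revise = rng.random(n) < p_revise
flip = revise & (rng.random(n) < 0.5)
y = np.where(flip, 2, np.where(revise, 1, 0))

# Design matrix: model indicators (inertia) + peer disagreement (conformity).
X = np.column_stack([model_id == 0, model_id == 1, peer_disagreement])
clf = LogisticRegression(max_iter=1000).fit(X, y)

# The peer_disagreement coefficients estimate the conformity effect; the
# model-indicator coefficients capture per-model inertia.
print(clf.classes_)     # the three outcome labels
print(clf.coef_.shape)  # one coefficient row per outcome class
```

Comparing the fitted model-indicator coefficients across outcome classes is one way to express a claim like "GPT is 2-3x more resistant to change" as an odds ratio.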
Submission Number: 174