Steering LLMs for Multi-agent Decision-making using Representation Learning

Published: 02 Mar 2026, Last Modified: 03 Mar 2026
Venue: ICLR 2026 Workshop AIMS
License: CC BY 4.0
Keywords: representation learning, llm, steering
Abstract: Activation steering offers a lightweight mechanism for controlling large language models (LLMs), but existing approaches have yet to be integrated into strategic multi-agent decision-making settings. In this work, we propose a representation learning framework for activation steering tailored to multi-agent decision-making. The framework optimizes steering representations directly from interaction trajectories by grounding latent variables in multi-agent dynamics and enforcing latent self-consistency over time. Our approach disentangles latent factors underlying strategic interaction, enabling fine-grained behavioral control without modifying model parameters or relying on task-specific supervision; it relies instead on the nature of the multi-agent dynamics. We evaluate our method on $\gamma$-Bench, a diverse suite of cooperative, competitive, and mixed-motive games, and demonstrate consistent improvements in social and strategic performance across multiple open-source LLM families. These results suggest that representation learning provides a scalable and interpretable foundation for activation steering in multi-agent systems.
Track: Short Paper
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 59