Steering LLMs for Multi-agent Decision-making using Representation Learning

Dom Huh; Prasant Mohapatra

Published: 02 Mar 2026, Last Modified: 03 Mar 2026, ICLR 2026 Workshop AIMS, CC BY 4.0

Keywords: representation learning, llm, steering

Abstract: Activation steering offers a lightweight mechanism for controlling large language models (LLMs), but existing approaches have yet to be integrated into strategic multi-agent decision-making settings. In this work, we propose a representation learning framework for activation steering tailored to multi-agent decision-making. The framework optimizes steering representations directly from interaction trajectories by grounding latent variables in multi-agent dynamics and enforcing latent self-consistency over time. Our approach disentangles the latent factors underlying strategic interaction, enabling fine-grained behavioral control without modifying model parameters or relying on task-specific supervision; the learning signal comes instead from the structure of the multi-agent dynamics. We evaluate our method on $\gamma$-Bench, a diverse suite of cooperative, competitive, and mixed-motive games, and demonstrate consistent improvements in social and strategic performance across multiple open-source LLM families. These results suggest that representation learning provides a scalable and interpretable foundation for activation steering in multi-agent systems.
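
For readers unfamiliar with the term, the basic operation behind activation steering can be sketched as adding a scaled direction vector to a model's hidden state while leaving the weights untouched. The snippet below is a generic illustration only, not the framework proposed in this paper: the function name `steer`, the unit-normalisation step, and the toy vectors are assumptions made for the example, and the abstract does not specify how the steering direction is learned or where it is applied.

```python
from typing import List


def steer(hidden: List[float], direction: List[float], alpha: float) -> List[float]:
    """Return hidden + alpha * (direction / ||direction||).

    Hypothetical helper: a generic additive steering step, assumed here
    for illustration and not taken from the paper.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    # Normalise so that alpha alone controls the steering strength.
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]


# Toy hidden state and steering direction (made-up numbers).
hidden = [1.0, 2.0, 3.0]
direction = [0.0, 3.0, 4.0]  # Euclidean norm is 5
steered = steer(hidden, direction, alpha=5.0)
print(steered)
```

In a real model the same addition would typically be applied to the activations of a chosen layer at inference time, for example through a forward hook, which is why no model parameters need to change.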

Track: Short Paper

Submission Number: 59
