Machine Ecology and Interpretability-Informed Control

Published: 10 Jun 2026, Last Modified: 10 Jun 2026 | AITC 2026 Talk | Everyone | CC BY 4.0
Keywords: Machine Ecology, AI Transparency, Interpretability, AI Control, Multi-Agent Safety, Chain-of-Thought Monitoring, Language Drift
Abstract: We can inspect the reasoning trace of language models for signs of potential misalignment. But emergent communication research shows that training populations may quickly diverge from human-interpretable language. As large language models are increasingly deployed as interacting agents, such language drift threatens chain-of-thought monitoring techniques. More generally, monitors that rely on surface behavior are brittle against systems that adapt, whether through context or training. We argue that we must advance interpretability techniques from focusing on individual models to studying entire neural ecosystems. We propose a multi-scale framework spanning small ecosystems (a single model and its developmental environment), medium ecosystems (agentic and multi-agent systems), and large ecosystems (populations of agents in shared environments). The key argument presented here is that for effective control over AI systems, we should prioritize research on monitors that (A) are informed by interpretability and (B) are scalable across all ecosystem levels.
Presentation Format: We prefer to present our paper in a 15-20 minute talk
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
LaTeX Source Files: zip
LLM Policy: LLMs may be used to understand the paper and related work but may not be used to assess quality, identify strengths/weaknesses, draft the review structure, or write the review itself
Submission Number: 28