LatentGate: Low-Latency Semantic Routing via Frozen-Backbone Probing of Small Language Models

Published: 18 Apr 2026, Last Modified: 18 Apr 2026 · ACL 2026 Industry Track Poster · CC BY 4.0
Keywords: multi-agent systems, semantic routing, representation anisotropy, PCA-whitening, small language models, linear probing, agent orchestration, low-latency inference, enterprise AI
TL;DR: We diagnose representation anisotropy as the cause of embedding-based routing failure in multi-agent systems and fix it with PCA-whitening over frozen SLM hidden states, achieving 80% OOD accuracy at ~28ms across 100 agents.
Abstract: As Multi-Agent Systems (MAS) scale to hundreds of specialized agents, the routing layer becomes a critical bottleneck. Traditional approaches force a stark trade-off: prompt-based LLM routers offer strong semantic reasoning but incur prohibitive latency (~1500–2000ms) and cost that grows with agent count, while embedding-based routers operate at low latency (25–50ms on a T4 for cosine/centroid-style routing) but fail to capture nuanced functional intent, collapsing semantically similar but functionally distinct agents. We identify representation anisotropy, the geometric collapse of hidden-state vectors into a narrow cone, as a key mechanism underlying this embedding-based routing failure. We propose LatentGate, a non-generative routing architecture that extracts mean-pooled hidden states from a frozen small language model (SLM), applies PCA-whitening (decorrelation + variance normalization) to resolve the anisotropy, and trains a lightweight linear probe for agent classification. Experiments across 5 SLM backbones and 100 enterprise agents show that LatentGate achieves 98.8% in-domain accuracy and 80.0% out-of-distribution accuracy on natural queries, outperforming embedding-based routers by 13–22 absolute points. LatentGate operates at ~28ms on a T4 GPU; the SLM forward pass is independent of agent count, with classification adding an $O(Ck)$ term (agents $C$, whitened dimension $k$) that is negligible at $C = 100$ and small relative to the SLM forward pass. The lightweight linear probe additionally enables sub-10ms warm-start retraining from user feedback, offering a path toward self-healing routing in production. We further benchmark prompt-based routing with GPT-4.1, GPT-4.1-nano, and Gemini 2.5 Flash, demonstrating that they degrade to 70–77% accuracy at 100 agents while incurring 1500–2000ms latency, confirming the need for non-generative alternatives.
We categorize this work as Emerging, as it introduces a new routing primitive rather than reporting a completed long-term deployment study.
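The routing pipeline the abstract describes (mean-pooled frozen-SLM hidden states → PCA-whitening → linear probe) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: hidden states are simulated with synthetic Gaussian clusters, and all names (`whitener`, `probe`, `route`) and hyperparameters are assumptions chosen for demonstration.

```python
# Sketch of a LatentGate-style router: whitened frozen features + linear probe.
# Hidden states are simulated; in practice they would be mean-pooled last-layer
# states from a frozen small language model.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated mean-pooled hidden states for C agents (values are illustrative).
C, d, n_per_agent = 4, 64, 50                  # agents, hidden dim, queries/agent
centers = rng.normal(size=(C, d))              # one "functional intent" per agent
X = np.vstack([c + 0.3 * rng.normal(size=(n_per_agent, d)) for c in centers])
y = np.repeat(np.arange(C), n_per_agent)

# PCA-whitening: decorrelate dimensions and normalize their variance,
# counteracting the anisotropic (narrow-cone) geometry of raw hidden states.
whitener = PCA(n_components=32, whiten=True).fit(X)
Xw = whitener.transform(X)

# Lightweight linear probe for agent classification; scoring a query against
# all C agents in the k-dim whitened space costs O(C * k) per query.
probe = LogisticRegression(max_iter=1000).fit(Xw, y)

def route(h: np.ndarray) -> int:
    """Route one mean-pooled hidden-state vector to an agent index."""
    return int(probe.predict(whitener.transform(h.reshape(1, -1)))[0])
```

Warm-start retraining from feedback would amount to refitting only `probe` (a single linear layer) on the fixed whitened features, which is what makes sub-10ms updates plausible.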
Submission Type: Emerging
Submission Number: 534