Robust Noise Attenuation via Adaptive Pooling of Transformer Outputs

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Spotlight Poster · CC BY 4.0
TL;DR: Existing methods for aggregating transformer embeddings (MaxPool, AvgPool, ClsToken) are vulnerable to noisy inputs; we show that an attention-based pooling mechanism is provably robust across the spectrum of input signal-to-noise ratios.
Abstract: We investigate the design of pooling methods used to summarize the outputs of transformer embedding models, primarily motivated by reinforcement learning and vision applications. This work considers problems where a subset of the input vectors contains requisite information for a downstream task (signal) while the rest are distractors (noise). By framing pooling as vector quantization with the goal of minimizing signal loss, we demonstrate that the standard methods used to aggregate transformer outputs (AvgPool, MaxPool, and ClsToken) are vulnerable to performance collapse as the signal-to-noise ratio (SNR) of inputs fluctuates. We then show that an attention-based *adaptive pooling* method can approximate the signal-optimal vector quantizer within derived error bounds for any SNR. Our theoretical results are first validated by supervised experiments on a synthetic dataset designed to isolate the SNR problem, then generalized to standard relational reasoning, multi-agent reinforcement learning, and vision benchmarks with noisy observations, where transformers with adaptive pooling display superior robustness across tasks.
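To make the contrast concrete, here is a minimal NumPy sketch of average pooling versus a query-based attention pooling of the kind the abstract describes. This is not the authors' implementation (see the linked repository for that); the `query` vector is supplied by hand here, whereas in practice it would be a learned parameter.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def avg_pool(tokens):
    # AvgPool: every token contributes equally, so many
    # distractor tokens dilute the signal token's contribution
    return tokens.mean(axis=0)

def adaptive_pool(tokens, query):
    # Attention-based pooling: a (normally learned) query scores each
    # token, and the summary is the attention-weighted sum of tokens,
    # allowing the model to down-weight noise regardless of how many
    # distractors are present
    d = tokens.shape[-1]
    scores = tokens @ query / np.sqrt(d)   # (n,)
    weights = softmax(scores)              # attention weights over tokens
    return weights @ tokens                # (d,) pooled summary
```

With one high-signal token among many random distractors, the attention weights concentrate on the signal token, so the adaptive summary stays close to it while the average is diluted by the noise.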
Lay Summary: How do self-driving cars identify what’s important, like picking out a pedestrian from a sea of vehicles? Most modern AI systems use “transformer” models, which condense many pieces of data into one meaningful summary. This step, called pooling, is critical for making decisions. But in real-world tasks, where much of the data is distracting or noisy, common pooling methods like averaging or picking the strongest signal may not work as intended – causing our self-driving car to ignore that stray pedestrian. Our paper shows that these pooling methods do indeed fail when there’s a large number of distractions. We then predict that a lesser-known technique, called adaptive pooling, can prevent such a performance collapse, even in the presence of many distractors. It does this by using attention to learn which parts of the input matter and what to ignore – like tuning out a noisy crowd to listen to a single voice. Our findings demonstrate that adaptive pooling can closely match the best possible summary of the inputs. More importantly, we show that simply swapping out the old methods for adaptive pooling can significantly improve the reliability and trustworthiness of transformer-based models in many applications, even when their inputs are quite messy.
Link To Code: https://github.com/agbrothers/pooling
Primary Area: Deep Learning->Attention Mechanisms
Keywords: Pooling, Attention, Vector Quantization, AdaPool, Transformer, Robustness
Submission Number: 7325