Dynamically Choosing the Number of Heads in Multi-Head Attention

Published: 01 Jan 2024 · Last Modified: 17 Oct 2024 · ICAART (2) 2024 · License: CC BY-SA 4.0
Abstract: Deep Learning agents are known to be highly sensitive to the values of their hyperparameters. Attention-based Deep Reinforcement Learning agents further complicate this issue because of the additional parameterization associated with the computation of their attention function; one example is the number of attention heads to use in multi-head attention-based agents. Usually, these hyperparameters are set manually, which may be neither optimal nor efficient. This work addresses the problem of choosing the appropriate number of attention heads dynamically, by endowing the agent with a policy π_h trained with policy gradient. At each timestep of agent-environment interaction, π_h chooses the most suitable number of attention heads according to the contextual memory of the agent. This dynamic parameterization is compared to a static parameterization in terms of performance. The role of π_h is further assessed by providing additional analysis concerning the d…
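The abstract does not include implementation details, but the mechanism it describes (a policy π_h, trained with policy gradient, that selects a head count from the agent's contextual memory at each timestep) can be sketched as follows. This is a minimal, illustrative PyTorch sketch under stated assumptions, not the authors' code: the module names (HeadSelectionPolicy, DynamicMultiHeadAttention), the candidate head counts, the use of one nn.MultiheadAttention instance per candidate head count, and the plain REINFORCE update are all assumptions introduced here for illustration.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class HeadSelectionPolicy(nn.Module):
    """pi_h: maps the agent's contextual memory to a categorical
    distribution over candidate head counts."""
    def __init__(self, memory_dim, head_options):
        super().__init__()
        self.head_options = head_options
        self.net = nn.Sequential(
            nn.Linear(memory_dim, 64),
            nn.ReLU(),
            nn.Linear(64, len(head_options)),
        )

    def forward(self, memory):
        dist = Categorical(logits=self.net(memory))
        idx = dist.sample()
        # Return the chosen head count and its log-probability (for REINFORCE).
        return self.head_options[idx.item()], dist.log_prob(idx)

class DynamicMultiHeadAttention(nn.Module):
    """One attention module per candidate head count; pi_h decides
    which one is applied at each timestep (an implementation assumption)."""
    def __init__(self, embed_dim, head_options):
        super().__init__()
        self.attn = nn.ModuleDict({
            str(h): nn.MultiheadAttention(embed_dim, h, batch_first=True)
            for h in head_options
        })

    def forward(self, x, num_heads):
        out, _ = self.attn[str(num_heads)](x, x, x)
        return out

# Illustrative dimensions and head-count candidates (embed_dim must be
# divisible by every candidate head count).
memory_dim, embed_dim, head_options = 32, 64, [1, 2, 4, 8]
pi_h = HeadSelectionPolicy(memory_dim, head_options)
attn = DynamicMultiHeadAttention(embed_dim, head_options)
optimizer = torch.optim.Adam(pi_h.parameters(), lr=1e-3)

memory = torch.randn(memory_dim)         # agent's contextual memory at timestep t
tokens = torch.randn(1, 10, embed_dim)   # embedded observation sequence

num_heads, log_prob = pi_h(memory)       # pi_h picks a head count
attended = attn(tokens, num_heads)       # attention runs with that head count

# REINFORCE-style update: reinforce head choices that led to higher return.
episode_return = torch.tensor(1.0)       # placeholder return
loss = -log_prob * episode_return
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

One design note on this sketch: keeping a separate attention module per candidate head count sidesteps the fact that nn.MultiheadAttention fixes its head count at construction; an implementation that shares the query/key/value projections and merely reshapes them into a variable number of heads would stay closer in parameter count to a static baseline, and may be closer to what the paper actually does.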