Unveiling Markov heads in Pretrained Language Models for Offline Reinforcement Learning

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Recently, incorporating knowledge from pretrained language models (PLMs) into decision transformers (DTs) has attracted significant attention in offline reinforcement learning (RL). These PLMs perform well in RL tasks, raising an intriguing question: what kind of knowledge from PLMs is transferred to RL to achieve such strong results? This work investigates the question by quantitatively analyzing each attention head and identifies the Markov head, a crucial component found in the attention heads of PLMs. A Markov head places extreme attention on the last input token and performs well only in short-term environments. Furthermore, we prove that this extreme attention cannot be changed by re-training the embedding layer or by fine-tuning. Inspired by this analysis, we propose a general method, GPT-DTMA, which equips a pretrained DT with Mixture of Attention (MoA) to enable adaptive learning and accommodate diverse attention requirements during fine-tuning. Extensive experiments demonstrate the effectiveness of GPT-DTMA: it achieves superior performance in short-term environments compared to baselines and significantly reduces the performance gap of PLMs in long-term scenarios, and the experimental results further validate our theorems.
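The core architectural idea described above is to route token representations through several parallel attention modules and mix their outputs with a learned gate. Below is a minimal PyTorch sketch of such a Mixture-of-Attention layer; the class name, the per-token softmax gate, and the choice of experts (e.g., a frozen pretrained attention alongside freshly initialized ones) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a Mixture-of-Attention (MoA) layer, assuming a
# per-token gating network that softmax-weights the outputs of several
# parallel attention modules. Names and design details are hypothetical.
import torch
import torch.nn as nn


class MixtureOfAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int, num_experts: int):
        super().__init__()
        # Parallel attention "experts"; in a GPT-DTMA-style setup these
        # could include the pretrained (Markov-biased) attention and
        # freshly initialized ones (an assumption about the design).
        self.experts = nn.ModuleList(
            nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
            for _ in range(num_experts)
        )
        # Gate maps each token representation to mixture weights.
        self.gate = nn.Linear(embed_dim, num_experts)

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        outs = torch.stack(
            [att(x, x, x, attn_mask=attn_mask, need_weights=False)[0]
             for att in self.experts],
            dim=-1,
        )  # (batch, seq_len, embed_dim, num_experts)
        # Per-token mixture weights over the experts.
        w = torch.softmax(self.gate(x), dim=-1)  # (batch, seq_len, num_experts)
        return (outs * w.unsqueeze(2)).sum(dim=-1)
```

Under this reading, the gate lets fine-tuning down-weight Markov-style heads in long-term environments while retaining them where last-token attention helps, which matches the adaptive behavior the abstract describes.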
Lay Summary: There has been a lot of interest in integrating knowledge from pretrained language models (PLMs) into decision transformers for offline reinforcement learning (RL). These models excel at RL tasks, raising the question of which PLM knowledge actually enhances RL performance. Our key finding is the "Markov head" in PLMs: an attention head that focuses intensely on the last input token and is effective only in short-term environments. We also show that this focus cannot be changed through fine-tuning. Based on these insights, our proposed GPT-DTMA method equips decision transformers with Mixture of Attention, enabling adaptive learning. Our research highlights the limitations of different models in two representative types of environments and sheds light on how to design a model that is applicable in any scenario.
Primary Area: Reinforcement Learning->Planning
Keywords: decision transformer; cross-domain
Submission Number: 9905