Keywords: Transformers, RNN, HMM, representation learning, expressive power
Abstract: This paper investigates the capability of Transformers in learning a fundamental sequential model --- the Hidden Markov Model (HMM). We design various types of HMM examples and variants inspired by theory, and conduct extensive experiments testing and comparing the performance of Transformers and Recurrent Neural Networks (RNNs). Our experiments reveal three important findings: (1) Transformers can effectively learn a large number of HMMs, but this requires the depth of the Transformer to be at least logarithmic in the sequence length; (2) There are challenging HMMs that Transformers struggle to learn while RNNs succeed. We also consistently observe that Transformers underperform RNNs in both training speed and testing accuracy across all tested HMM models. (3) Long mixing times and the lack of access to intermediate latent states significantly degrade Transformers' performance, but have much less impact on RNNs' performance. To address the limitations of Transformers in modeling HMMs, we demonstrate that a variant of Chain-of-Thought (CoT) applied in the training phase, called \emph{block CoT}, helps Transformers reduce evaluation error and learn longer sequences, at the cost of increased training time. Finally, we complement our empirical findings with theoretical results proving the expressiveness of Transformers in approximating HMMs with logarithmic depth.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10495