CoTFormer: More Tokens With Attention Make Up For Less Depth

Published: 28 Oct 2023, Last Modified: 29 Nov 2023
Venue: WANT@NeurIPS 2023 (Oral)
Keywords: transformers, language models, efficient models
Abstract: The race to continually develop ever larger and deeper foundation models is underway. However, techniques like the Chain-of-Thought (CoT) method continue to play a pivotal role in achieving optimal downstream performance. In this study, we establish an approximate parallel between using chain-of-thought and employing a deeper transformer. Building on this insight, we introduce CoTFormer, a transformer variant that employs an implicit CoT-like mechanism to achieve performance comparable to that of a deeper model. Our empirical findings demonstrate the effectiveness of CoTFormers, which significantly outperform larger standard transformers.
Submission Number: 42
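
The abstract describes the mechanism only at a high level, so the following PyTorch sketch is offered purely to illustrate the stated intuition: instead of stacking more layers, a block is re-applied for several passes, and each pass can attend to token representations produced by earlier passes, much as chain-of-thought exposes intermediate tokens to attention. This is an assumption-laden toy, not the paper's architecture; the class name CoTFormerSketch, the n_passes parameter, and the append-to-context strategy are illustrative choices, and the paper itself should be consulted for the actual design.

```python
# Minimal, illustrative sketch (NOT the authors' implementation) of trading
# depth for extra attended tokens: the same weight-tied block is applied for
# several passes, and each pass attends to the representations produced by
# earlier passes. All names and hyperparameters below are assumptions.
import torch
import torch.nn as nn


class CoTFormerSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2, n_passes=3, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.block = nn.TransformerEncoder(layer, num_layers=n_layers)  # weight-tied across passes
        self.n_passes = n_passes
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)        # (batch, seq, d_model)
        context = x                      # running context of all passes
        for _ in range(self.n_passes):
            # Re-apply the same block over the growing context, so later
            # passes see intermediate representations, analogous to
            # attending over generated chain-of-thought tokens.
            out = self.block(context)
            # Keep the newest pass's outputs for the original positions and
            # append them to the attended context.
            x = out[:, -token_ids.size(1):, :]
            context = torch.cat([context, x], dim=1)
        return self.lm_head(x)


# Usage: a forward pass over a toy batch.
model = CoTFormerSketch()
logits = model(torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 1000])
```

The point of the sketch is only the parallel drawn in the abstract: extra forward passes whose outputs remain visible to attention can substitute for additional layers, at the cost of a longer attended sequence rather than more parameters.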