Adaptivity and Modularity for Efficient Generalization Over Task Complexity

Samira Abnar; Omid Saremi; Laurent Dinh; Shantel Wilson; Miguel Ángel Bautista; Chen Huang; Vimal Thilak; Etai Littwin; Jiatao Gu; Joshua M. Susskind; Samy Bengio

22 Sept 2023 (modified: 11 Feb 2024) | Submitted to ICLR 2024

Primary Area: general machine learning (i.e., none of the above)

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: multistep reasoning; generalization over example complexity; pointer value retrieval tasks; adaptive compute; modular compute

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: Adaptivity and modularity have complementary roles in improving the efficiency and generalization capabilities of transformers for multi-step reasoning tasks.

Abstract: Can transformers generalize efficiently on problems whose examples vary in difficulty? We introduce a new task tailored to assess generalization over different complexities and present results indicating that standard transformers face challenges in solving these tasks. These tasks are variations of pointer value retrieval, previously introduced by Zhang et al. (2021). We investigate how a mechanism for adaptive and modular computation in transformers facilitates learning tasks that demand generalization over the number of sequential computation steps (i.e., the depth of the computation graph). Based on our observations, we propose a transformer-based architecture called Hyper-UT, which combines dynamic function generation from hypernetworks with adaptive depth from Universal Transformers. This model demonstrates higher accuracy and a fairer allocation of computational resources when generalizing to higher numbers of computation steps. We conclude that mechanisms for adaptive depth and modularity complement each other in improving efficient generalization with respect to example complexity. Additionally, to emphasize the broad applicability of our findings, we illustrate that on a standard image recognition task, Hyper-UT matches the performance of a ViT model with considerably reduced computational demands (over 70% average savings, achieved by effectively using fewer layers).

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

Supplementary Material: pdf

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 6471
