Transformer-Based Models Are Not Yet Perfect At Learning to Emulate Structural Recursion

Dylan Zhang; Curt Tigges; Zory Zhang; Stella Biderman; Maxim Raginsky; Talia Ringer

Published: 24 Jul 2024, Last Modified: 17 Sept 2024. Accepted by TMLR. License: CC BY 4.0

Abstract: This paper investigates the ability of transformer-based models to learn structural recursion from examples. Recursion is a universal concept in both natural and formal languages. Structural recursion is central to the programming language and formal mathematics tasks where symbolic tools currently outperform neural models, such as inferring semantic relations between datatypes and emulating program behavior. We introduce a general framework that connects the abstract concepts of structural recursion in the programming language domain to concrete sequence modeling problems and the behavior of learned models. The framework includes a representation that captures the general syntax of structural recursion, coupled with two different frameworks for understanding its semantics: one that is more natural from a programming languages perspective, and one that helps bridge that perspective with a mechanistic understanding of the underlying transformer architecture. Using this framework as a conceptual tool, we identify different issues under various set-ups. Models trained to emulate recursive computations do not fully capture the recursion; instead, they fit short-cut algorithms and therefore fail on certain edge cases that are under-represented in the training distribution. In addition, state-of-the-art large language models (LLMs) struggle to mine recursive rules from in-context demonstrations, and they fail in interesting ways when emulating reduction (step-wise computation) of the recursive function.
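To make the two notions in the abstract concrete, the following is a minimal illustrative sketch, not the authors' experimental setup. It assumes a binary tree datatype and a preorder traversal (tree traversal is mentioned in the change list below); the names `Node`, `preorder`, and `preorder_steps` are hypothetical. The first function is defined by structural recursion (one case per constructor, recursing only on subtrees), and the second emulates the same computation as a step-wise reduction that exposes each intermediate state.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical binary tree node; an empty tree is represented by None."""
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def preorder(tree: Optional[Node]) -> List[int]:
    """Structural recursion: one case per constructor, recursing only on subtrees."""
    if tree is None:  # base case: empty tree
        return []
    return [tree.value] + preorder(tree.left) + preorder(tree.right)


def preorder_steps(tree: Optional[Node]) -> Iterator[List[int]]:
    """Step-wise reduction: process one pending subtree per step and
    yield a snapshot of the output produced so far."""
    output: List[int] = []
    stack: List[Optional[Node]] = [tree]
    while stack:
        current = stack.pop()
        if current is not None:
            output.append(current.value)
            stack.append(current.right)  # pushed first, so visited after the left subtree
            stack.append(current.left)
        yield list(output)


if __name__ == "__main__":
    example = Node(1, Node(2), Node(3))
    print(preorder(example))
    for state in preorder_steps(example):
        print(state)
```

Both functions produce the same final traversal; the second additionally exposes the intermediate states that a model would have to reproduce when asked to emulate the reduction one step at a time.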

Submission Length: Long submission (more than 12 pages of main content)

Changes Since Last Submission:

Clarifications and Typos

Clarified the number of training sequences used (Section 6.1.2): We used all $2^n$ sequences.
Fixed a typo in the example in Section 6.2.
Clarified settings related to positional encoding.
Attributed the performance increase in tree traversal to an increase in model capacity (Section 8.4).
Addressed possible points of confusion in Sections 4.2.3, 7.1, 7.4, 8.4, 10, and 11. (APjT)

Structure and Readability

Added more takeaway boxes across the empirical findings sections to highlight key points. Revised the result sections to state the takeaway messages as clearly as we could.
Modified wording for improved clarity in the description of our framework in Section 2.
Improved wording in Section 4 for better clarity.
Connected claims and statements to specific evidence in evaluations to strengthen arguments.

Added Content

Added a Generalization subsection in the Related Works section (Section 12).
Discussed future work on decomposition and included references (Section 13).

Claims and Evidence

Added references to support the discussion of abstract state machines (ASMs) and improved its clarity (Section 4.2), including the relationship between mechanistic interpretability and ASMs.
Clarified the motivations for using the ASM framework (Section 4.2.3).
Provided examples of prevailing failure patterns of GPT-4 (Section 10).
Removed or rephrased less-substantiated claims throughout the paper to make the arguments more concise and accurate.

Assigned Action Editor: Antonio Vergari

Submission Number: 2200