Keywords: Neural Learning Module, Deep Learning, Memory, Recurrent Neural Networks
Abstract: Over the last decades, developing more powerful neural architectures and simultaneously designing optimization algorithms to train them effectively have been at the core of research efforts to enhance the capability of machine learning models. Despite recent progress, particularly in developing Language Models (LMs), there are fundamental challenges and unanswered questions about how such models can continually learn/memorize, self-improve, and find ``effective solutions.'' In this paper, we present a new learning paradigm, called Nested Learning (NL), that coherently represents a model as a set of nested, multi-level, and/or parallel optimization problems, each with its own ``context flow.'' NL reveals that existing deep learning methods learn from data by \emph{compressing} their own context flow, and explains how in-context learning emerges in large models. NL suggests a path (a new dimension for deep learning) to design more expressive learning algorithms with more ``levels,'' resulting in higher-order in-context learning abilities. In addition to its neuroscientifically plausible and mathematically white-box nature, we advocate for its importance by presenting three core contributions: (1) Deep Optimizers: Based on NL, we show that well-known gradient-based optimizers (e.g., Adam and SGD with Momentum) are in fact associative memory modules that aim to compress gradients with gradient descent. Building on this insight, we present a set of more expressive optimizers with deep memory and/or more powerful learning rules; (2) Self-Modifying Titans: Taking advantage of NL's insights on learning algorithms, we present a novel sequence model that learns how to modify itself by learning its own update algorithm; and (3) Continuum Memory System: We present a new formulation for memory systems that generalizes the traditional ``long-term/short-term memory'' viewpoint. Combining our self-modifying sequence model with the continuum memory system, we present a learning module, called Hope, which shows promising results in language modeling, continual learning, and long-context reasoning tasks.
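To make the ``optimizers as associative memories'' claim concrete, here is a minimal worked sketch (our own illustrative choice of inner objective, not necessarily the formulation used in the submission). Viewing the momentum buffer $m$ as a memory, a single gradient-descent step (with step size 1) on the inner loss
\[
\ell_t(m) \;=\; -\langle m,\, g_t\rangle \;+\; \tfrac{1-\beta}{2}\,\lVert m\rVert^2,
\qquad
m_t \;=\; m_{t-1} - \nabla_m \ell_t(m_{t-1}) \;=\; \beta\, m_{t-1} + g_t,
\]
recovers exactly the standard momentum update: the memory $m$ is trained by gradient descent to accumulate, i.e. compress, the stream of gradients $g_1, g_2, \dots$, and the outer update $w_t = w_{t-1} - \eta\, m_t$ then reads this memory out.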
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 17710