Neural Networks and Solomonoff Induction

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: transfer learning, meta learning, and lifelong learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Universal prediction, CTW, in-context learning, Turing machines, Transformers, Meta-Learning, Chomsky hierarchy
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Solomonoff Induction (SI) is the most powerful universal predictor given unlimited computational resources. Naive approximations of SI are challenging, requiring a vast number of programs to be run for extremely long times. Here we explore an alternative path to SI: meta-training neural networks on universal data sources. We generate training data by feeding random programs to Universal Turing Machines (UTMs) and guarantee convergence in the limit to various SI variants (under simplifying assumptions). We provide novel results showing how a non-uniform distribution over programs still maintains the universality property. Experimentally, we investigate the effect of neural network architecture (e.g., LSTMs, Transformers) and size on performance on algorithmic data, which is crucial for SI. First, we consider variable-order Markov sources, for which the Bayes-optimal predictor is the well-known Context Tree Weighting (CTW) algorithm. Second, we evaluate on challenging algorithmic tasks from the Chomsky hierarchy that require different memory structures. Finally, we test on the UTM domain, following our theoretical results. We show that scaling network size consistently improves performance on all tasks, with Transformers outperforming all other architectures, even achieving optimality on par with CTW. Promisingly, large Transformers and LSTMs trained on UTM data exhibit transfer to the other domains.
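To make the data-generation recipe in the abstract concrete, below is a minimal, hypothetical sketch: sample a random program, run it on a machine for a bounded number of steps, and keep its output as a meta-training sequence. The Brainfuck-style interpreter standing in for a UTM, the uniform sampling over opcodes, and all names (run_program, sample_training_sequence, the step and tape bounds) are illustrative assumptions, not the authors' implementation.

```python
import random

# Stand-in for a UTM: a tiny Brainfuck-style interpreter. Random programs
# are run for a bounded number of steps and their output tapes become
# meta-training sequences, as the abstract describes.
OPS = "+-<>[]."

def run_program(program, max_steps=1000, tape_len=64):
    """Execute a Brainfuck-like program; return the bytes it prints."""
    tape = [0] * tape_len
    out, ptr, pc, steps = [], 0, 0, 0
    # Pre-match brackets so jumps are O(1); malformed programs yield nothing.
    stack, jumps = [], {}
    for i, op in enumerate(program):
        if op == "[":
            stack.append(i)
        elif op == "]":
            if not stack:
                return out
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    if stack:
        return out
    while pc < len(program) and steps < max_steps:
        op = program[pc]
        if op == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif op == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif op == ">":
            ptr = (ptr + 1) % tape_len
        elif op == "<":
            ptr = (ptr - 1) % tape_len
        elif op == ".":
            out.append(tape[ptr])
        elif op == "[" and tape[ptr] == 0:
            pc = jumps[pc]  # skip the loop body
        elif op == "]" and tape[ptr] != 0:
            pc = jumps[pc]  # jump back to the loop start
        pc += 1
        steps += 1
    return out

def sample_training_sequence(program_len=50, min_out=1):
    """Sample random programs until one emits output; return that output."""
    while True:
        program = "".join(random.choices(OPS, k=program_len))
        out = run_program(program)
        if len(out) >= min_out:
            return out

if __name__ == "__main__":
    for _ in range(3):
        print(sample_training_sequence()[:20])
```

A predictor meta-trained on many such sequences sees data drawn from a program-induced distribution; the abstract's theoretical results concern when this kind of sampling preserves universality in the limit.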
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7375