Abstract: Large language models excel at solving complex tasks, owing in part to their hierarchical architecture, which enables the implementation of sophisticated algorithms through layered computations. In this work, we study the interplay between model depth and data complexity using elementary cellular automata (ECA) datasets. We demonstrate empirically that, at a fixed parameter count, deeper networks consistently outperform shallower variants. Our findings reveal that more complex ECA rules require deeper models to emulate. Finally, an analysis of attention score patterns elucidates why shallower networks struggle to emulate complex rules effectively.
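For readers unfamiliar with the data domain: an elementary cellular automaton evolves a one-dimensional binary tape where each cell's next value is determined by its 3-cell neighborhood via an 8-entry lookup table encoded in the rule number (Wolfram's standard convention). The sketch below is illustrative only, not the paper's dataset-generation code; function names and the periodic-boundary choice are assumptions.

    import numpy as np

    def eca_step(state: np.ndarray, rule: int) -> np.ndarray:
        """Apply one ECA step with periodic boundary conditions.

        Bit i of the 8-bit rule number gives the next cell value for
        the neighborhood (left, self, right) read as a 3-bit integer i.
        """
        table = np.array([(rule >> i) & 1 for i in range(8)], dtype=np.uint8)
        left = np.roll(state, 1)
        right = np.roll(state, -1)
        idx = 4 * left + 2 * state + right  # encode each neighborhood as 0..7
        return table[idx]

    def eca_trajectory(init: np.ndarray, rule: int, steps: int) -> np.ndarray:
        """Stack steps+1 successive states into a (steps+1, width) array."""
        rows = [init]
        for _ in range(steps):
            rows.append(eca_step(rows[-1], rule))
        return np.stack(rows)

    # Example: 32 steps of Rule 110 (Turing-complete) from a single live cell.
    init = np.zeros(64, dtype=np.uint8)
    init[32] = 1
    traj = eca_trajectory(init, rule=110, steps=32)

A sequence model trained to emulate a rule would, under this setup, be asked to predict row t+1 of such a trajectory from row t, with rule complexity (e.g., Wolfram class) varied across datasets.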
Style Files: I have used the style files.
Submission Number: 66