Abstract: While Transformer-based models have demonstrated remarkable language modeling performance, their quadratic complexity leads to high costs when processing long contexts. In contrast, recurrent neural networks (RNNs) such as linear attention and state space models have gained popularity due to their constant per-token complexity. However, these recurrent models struggle with tasks that require accurate recall of contextual information from long contexts, because all contextual information must be compressed into a constant-size recurrent state. Previous works have shown that recall ability is positively correlated with recurrent state size, yet directly training RNNs with larger recurrent states incurs substantially higher training costs. In this paper, we introduce StateX, a training pipeline for efficiently expanding the states of pre-trained RNNs through post-training. For two popular classes of RNNs, linear attention and state space models, we design post-training architectural modifications that scale up the state size with no or negligible increase in model parameters. Experiments on models with up to 1.3B parameters demonstrate that StateX efficiently enhances the recall ability of RNNs without incurring high post-training costs or compromising other capabilities.
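For context, the sketch below is not the StateX method itself; it is a minimal illustration of the standard (unnormalized) linear-attention recurrence that the abstract refers to, with all function and variable names chosen for illustration. It shows why the recurrent state per head is a d_k x d_v matrix, so recall capacity is bounded by the state size rather than by the context length, which is the quantity StateX aims to expand.

```python
# Minimal sketch (assumptions, not the paper's implementation): vanilla
# linear-attention recurrence, showing the constant-size recurrent state.
import numpy as np

def linear_attention_recurrent(q, k, v):
    """q, k: (T, d_k); v: (T, d_v). Returns outputs of shape (T, d_v).

    The whole context is compressed into S, a (d_k, d_v) matrix, so the
    model's recall capacity is bounded by d_k * d_v regardless of T.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))            # constant-size recurrent state
    out = np.zeros((T, d_v))
    for t in range(T):
        S = S + np.outer(k[t], v[t])    # write: rank-1 outer-product update
        out[t] = q[t] @ S               # read: query the compressed context
    return out

# Widening d_k enlarges the state (and thus recall capacity) while the
# per-token cost stays O(d_k * d_v), i.e. constant in sequence length.
T, d_k, d_v = 16, 64, 64
q, k, v = (np.random.randn(T, d) for d in (d_k, d_k, d_v))
y = linear_attention_recurrent(q, k, v)  # state holds d_k * d_v = 4096 values per head
```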
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: continual learning, fine-tuning
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: No
A2 Elaboration: The trained model is intended solely for academic research and will not be deployed in any real-world application. As such, a detailed risk analysis was not deemed necessary.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 1: Introduction
B2 Discuss The License For Artifacts: No
B2 Elaboration: All resources are public and free to use for scientific research.
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: Our use of existing artifacts strictly follows common research practices and remains consistent with their intended use as originally specified.
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: Our training data is publicly available and widely used, but we did not take additional steps to remove possible sensitive content.
B5 Documentation Of Artifacts: No
B5 Elaboration: We did not provide detailed documentation of the artifacts used or created. However, the data were used within clearly defined research boundaries, and no artifacts were shared externally. If shared in the future, appropriate documentation will be added, including details on domains, languages, and demographic coverage.
B6 Statistics For Data: Yes
B6 Elaboration: Section 5 and Appendix B.1
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 5
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 5
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 5, Appendix B.2
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: We used AI assistants for code autocompletion and for checking typos and grammatical errors during paper writing.
Author Submission Checklist: Yes
Submission Number: 540