Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
Abstract: Large language models (LLMs) operate in two learning modes: fine-tuning (FT) and in-context learning (ICL). We ask which mode exhibits greater language proficiency, and whether their inductive biases in pattern recognition differ. We propose three desiderata for the comparison: (D1) a precise specification of the learning task, (D2) equal resource allocation to FT and ICL, and (D3) a comparable evaluation metric to identify the better mode. Several prior studies attempted to compare FT and ICL without satisfying all three desiderata, yielding mixed and inconclusive findings. To satisfy these desiderata, we propose a formal language learning task in which syntactic pattern recognition is the main focus. We also introduce a discriminative test for language proficiency, enabling a direct comparison of FT and ICL.
Empirically, we find that (a) FT achieves greater language proficiency than ICL on in-distribution generalization, but both perform equally well on out-of-distribution generalization; (b) their inductive biases, measured as the correlation of generated strings, are usually similar, but the similarity decreases as language learning improves; and (c) unlike FT, ICL performance differs substantially across models of varying sizes and families, and is sensitive to the tokens used in the languages. Thus, our controlled setup reveals subtle behavioral differences between FT and ICL that are difficult to capture in natural language datasets.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, metrics, evaluation
Contribution Types: Model analysis & interpretability, Reproduction study, Data resources, Data analysis
Languages Studied: Synthetic formal languages, English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
Software: zip
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Section: Limitations and Ethics Statement
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 2, 3, 5
B2 Discuss The License For Artifacts: No
B2 Elaboration: We use publicly available datasets and models, and additionally contribute datasets based on synthetic formal languages.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Section: Ethics Statement
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: Section 3 and Ethics Statement. We use synthetic formal languages with no semantics (and no personally identifying information) involved.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 3
B6 Statistics For Data: Yes
B6 Elaboration: Section 3 and Appendix B
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 3 and Appendix B
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 3, 5, and Appendix B
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 3, 5, and Appendix C
C4 Parameters For Packages: Yes
C4 Elaboration: Section 3 and Appendix B
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D1 Elaboration: Not applicable
D2 Recruitment And Payment: N/A
D2 Elaboration: Not applicable
D3 Data Consent: N/A
D3 Elaboration: Not applicable
D4 Ethics Review Board Approval: N/A
D4 Elaboration: Not applicable
D5 Characteristics Of Annotators: N/A
D5 Elaboration: Not applicable
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: We used AI assistants only for grammar and spelling correction during code and paper writing.
Author Submission Checklist: yes
Submission Number: 523