Scaling Law for Multimodal Large Language Model Supervised Fine-Tuning

ACL ARR 2025 July Submission473 Authors

28 Jul 2025 (modified: 20 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: The supervised fine-tuning (SFT) stage is crucial for multimodal large language models (MLLMs), yet a comprehensive scaling law to guide the optimal model-data configuration remains lacking. In this paper, we make an initial attempt to address this gap. First, we theoretically demonstrate that directly computing the optimal computation frontier for MLLM-SFT, as we can for traditional LLMs, is a challenging task. This complexity arises because MLLM-SFT is influenced by a broader range of factors, including model size, LLM pre-training tokens, and MLLM SFT tokens. To tackle this issue, we propose two scaling laws based on LLM paradigms: one applicable when training data volumes are well defined by researchers, and another for cases where models are sourced from open communities with unknown training data. Through theoretical modeling and approximations, we provide researchers with valuable recommendations for optimal resource allocation. Furthermore, we establish a strong correlation ($R^2$ = 0.98) between training loss and downstream performance, enabling accurate performance estimation without the need for exhaustive benchmarking. To validate our scaling laws, we construct a testbed of 60 models ranging from 50 million to 8 billion parameters, totaling 1,560 checkpoints. Each checkpoint is evaluated on more than 10 MLLM benchmarks, ensuring robust fitting of our formulations.
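As background for the abstract's fitting procedure: scaling laws of this kind are typically power laws in data (and model size), whose exponents can be recovered by a log-log linear fit. The sketch below is purely illustrative, not the paper's actual formulation; the functional form $L(D) = A \cdot D^{-\alpha}$, the constants, and the synthetic token counts are all assumptions for demonstration.

```python
import numpy as np

# Hypothetical single-variable scaling law: loss = A * D^(-alpha),
# where D is the number of SFT tokens. Taking logs linearizes it:
# log(loss) = log(A) - alpha * log(D), so a degree-1 polynomial fit
# on log-log data recovers the exponent and prefactor.
A_true, alpha_true = 5.0, 0.3
D = np.logspace(6, 9, 20)                # synthetic SFT token counts
loss = A_true * D ** (-alpha_true)       # noiseless synthetic losses

slope, intercept = np.polyfit(np.log(D), np.log(loss), 1)
alpha_hat, A_hat = -slope, np.exp(intercept)
print(alpha_hat, A_hat)
```

In practice one would fit a multi-term form (e.g. with separate terms for model size and pre-training tokens, plus an irreducible-loss constant) using nonlinear least squares on noisy checkpoints, but the log-linear case shows the core idea.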
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: MLLM, Supervised Fine-tuning, Scaling law
Contribution Types: Approaches to low-resource settings, Approaches to low compute settings-efficiency, Data analysis, Theory
Languages Studied: English, Chinese
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 4
B2 Discuss The License For Artifacts: No
B2 Elaboration: The models and data used are open source.
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: The models and data used are open source for research.
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: 4.3
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 4.2
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 4.1 and 5
C3 Descriptive Statistics: Yes
C3 Elaboration: 6 and 7
C4 Parameters For Packages: Yes
C4 Elaboration: 4
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: Yes
D3 Elaboration: 4.3
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: yes
Submission Number: 473