Abstract: Mixture-of-Experts (MoE) models benefit from a dynamic routing mechanism among their specialized experts, which existing Parameter-Efficient Fine-Tuning (PEFT) strategies fail to leverage. This motivates us to investigate whether adaptation modules themselves should incorporate routing mechanisms to align with MoE's multi-expert architecture. We analyze the dynamics of core components when applying PEFT to MoE language models and examine how different routing strategies affect adaptation effectiveness. Extensive experiments adapting OLMoE-1B-7B and Mixtral-8×7B on various commonsense and math reasoning tasks validate the performance and efficiency of our routed approach. We identify the optimal configurations for different scenarios and provide empirical analyses with practical insights to facilitate better PEFT and MoE applications.
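As context for the abstract's mention of routing-aware adaptation modules, below is a minimal illustrative sketch (PyTorch; the class name RoutedLoRAAdapter and all hyperparameters are hypothetical and not taken from the paper) of the general idea: a small set of low-rank adapter experts whose per-token updates are mixed by a lightweight top-k router, mirroring the MoE backbone's own routing. It is not the paper's exact method.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RoutedLoRAAdapter(nn.Module):
    """Hypothetical routed adapter: several low-rank (LoRA-style) experts
    whose outputs are combined by a lightweight top-k router per token."""

    def __init__(self, d_model: int, rank: int = 8, n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # One low-rank (A, B) pair per adapter expert; B starts at zero so the
        # adapter initially contributes no update, as in standard LoRA.
        self.lora_A = nn.Parameter(torch.randn(n_experts, d_model, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(n_experts, rank, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        logits = self.router(x)                          # (batch, seq, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # top-k adapter experts per token
        weights = F.softmax(weights, dim=-1)
        # Low-rank update from every expert: (batch, seq, n_experts, d_model)
        delta = torch.einsum('bsd,edr,erm->bsem', x, self.lora_A, self.lora_B)
        # Gather the chosen experts' updates and mix them with the router weights.
        idx_exp = idx.unsqueeze(-1).expand(*idx.shape, delta.size(-1))
        picked = torch.gather(delta, 2, idx_exp)         # (batch, seq, top_k, d_model)
        return x + (weights.unsqueeze(-1) * picked).sum(dim=2)

# Usage sketch (shapes only):
# adapter = RoutedLoRAAdapter(d_model=1024)
# y = adapter(torch.randn(2, 16, 1024))   # -> (2, 16, 1024)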
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Mixture-of-Experts, parameter-efficient-training, fine-tuning, LLM, efficiency
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: In the Limitations section we discuss potential risks, including bias, broader societal harms, and environmental impact.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: In Section 3 we cited the creators of all major artifacts used, including the two open-source LLMs and all 14 benchmark datasets used in evaluation.
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: All models and datasets are released under the Apache 2.0 license, which permits research and educational use consistent with this study's methodology.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: In Section 3 we describe our use of the aforementioned open-source models and datasets, which aligns with their licensed use for research and educational purposes.
B4 Data Contains Personally Identifying Info Or Offensive Content: Yes
B4 Elaboration: In the Limitations section we acknowledge that the training data may contain demographic and geographic skews inherited from web corpora. We did not implement special procedures to check for PII or offensive content beyond what is standard in the field. We used well-established, publicly available models (OLMoE-1B-7B, Mixtral-8×7B) and benchmark datasets that are widely adopted in NLP research. Following standard practice in parameter-efficient fine-tuning research, we utilized these artifacts as-is for educational and research purposes only. The models and datasets have undergone the data curation processes implemented by their original creators, and our work focuses on methodological improvements to fine-tuning techniques rather than data collection or curation.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: All open-source models and datasets have comprehensive documentation in their corresponding papers, which we cite in Section 3.
B6 Statistics For Data: Yes
B6 Elaboration: Section 3.1 provides relevant statistics, including the Commonsense170K and Math50K training sets, evaluation across 14 tasks, and details of model sizes and parameters.
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: In Section 3.1 and Appendix A.1 we report model parameters (active / trainable / total parameters) and computing infrastructure (single NVIDIA A100 GPU for OLMoE, 4×NVIDIA H100 GPUs for Mixtral).
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 3.1 describes the experimental setup including datasets, baselines, and evaluation methodology. Appendix A.1 and Table 2 provide detailed hyperparameter configurations including learning rates, batch sizes, epochs, and optimizer settings for both models.
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 3 and all major results report average performance calculated across individual benchmark test sets (8 commonsense reasoning and 6 arithmetic reasoning tasks), following established evaluation practices in the field.
C4 Parameters For Packages: Yes
C4 Elaboration: Section 3.1 mentions the benchmark suites and evaluation frameworks from Hu et al. (2023), and Appendix A.1 specifies the training frameworks used.
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 1465