FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models

Published: 18 Sept 2025, Last Modified: 30 Oct 2025 · NeurIPS 2025 Datasets and Benchmarks Track poster · CC BY 4.0
Keywords: Federated Learning, LLM Fine-tuning, Benchmark
Abstract: Large Language Models (LLMs) have achieved state-of-the-art results across diverse domains, yet their development remains reliant on vast amounts of publicly available data, raising concerns about data scarcity and the lack of access to domain-specific, sensitive information. Federated Learning (FL) presents a compelling framework to address these challenges by enabling decentralized fine-tuning of pre-trained LLMs without sharing raw data. However, the compatibility and performance of pre-trained LLMs in FL settings remain largely underexplored. We introduce the FlowerTune LLM Leaderboard, a first-of-its-kind benchmarking suite designed to evaluate federated fine-tuning of LLMs across four diverse domains: general NLP, finance, medical, and coding. Each domain includes federated instruction-tuning datasets and domain-specific evaluation metrics. Our results, obtained through a collaborative, open-source, community-driven approach, provide the first comprehensive comparison across 26 pre-trained LLMs with different aggregation and fine-tuning strategies under federated settings, offering actionable insights into model performance, resource constraints, and domain adaptation. This work lays the foundation for developing privacy-preserving, domain-specialized LLMs for real-world applications.
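The federated fine-tuning workflow the abstract describes can be illustrated with a minimal, framework-agnostic sketch: each client fine-tunes the model (or its adapter weights) on local data, and the server aggregates the resulting parameters, weighted by each client's number of training examples, without ever seeing raw data. This is a generic FedAvg-style sketch for illustration only; the function names are hypothetical and do not reflect the FlowerTune API.

```python
def fedavg(client_updates):
    """FedAvg-style aggregation: average parameter dicts weighted by
    each client's local example count.

    client_updates: list of (params, num_examples) pairs, where params
    maps a parameter name to a flat list of floats (e.g. LoRA adapter
    weights). Raw training data never leaves the clients; only these
    parameter updates are shared.
    """
    total = sum(n for _, n in client_updates)
    keys = client_updates[0][0].keys()
    aggregated = {}
    for k in keys:
        length = len(client_updates[0][0][k])
        aggregated[k] = [
            sum(params[k][i] * n / total for params, n in client_updates)
            for i in range(length)
        ]
    return aggregated


# Two clients with toy one-parameter "adapters" and unequal data sizes:
updates = [({"w": [1.0]}, 100), ({"w": [3.0]}, 300)]
print(fedavg(updates))  # → {'w': [2.5]} (weighted toward the larger client)
```

In practice, communicating only lightweight adapter parameters (such as LoRA matrices) rather than full model weights is what makes federated fine-tuning of LLMs tractable under the resource constraints the benchmark measures.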
Croissant File: zip
Dataset URL: https://huggingface.co/datasets/flwrlabs/alpaca-gpt4; https://huggingface.co/datasets/flwrlabs/fingpt-sentiment-train; https://huggingface.co/datasets/flwrlabs/medical-meadow-medical-flashcards; https://huggingface.co/datasets/flwrlabs/code-alpaca-20k
Code URL: https://github.com/yan-gao-GY/flowertune-benchmark
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 540