Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections

Max Kirchner; Alexander Caspar Jenke; Sebastian Bodenstedt; Fiona R. Kolbinger; Oliver L. Saldanha; Jakob Nikolas Kather; Martin Wagner; Stefanie Speidel

Published: 14 Feb 2026, Last Modified: 14 Feb 2026

Venue: MIDL 2026 Poster

Readers: Everyone

License: CC BY 4.0

Keywords: Endoscopic Video Analysis, Federated Learning, Foundation Models, Surgical Data Science, Vision Transformers

TL;DR: FL-EndoViT is a framework that utilizes federated learning with adaptive optimization to train robust, privacy-preserving surgical foundation models that achieve performance comparable to centralized baselines without sharing patient data.

Abstract: Purpose: Data privacy regulations hinder the creation of generalizable foundation models (FMs) for surgery by preventing multi-institutional data aggregation. This study investigates federated learning (FL) as a privacy-preserving solution to collaboratively train robust surgical FMs. Methods: We introduce Federated EndoViT (FL-EndoViT), which adapts the Masked Autoencoder (MAE) pretraining strategy for FL, enhanced with adaptive Sharpness-Aware Minimization (FedSAM) to manage surgical data heterogeneity. Pretrained on the large-scale Endo700k dataset, FL-EndoViT is evaluated against a centralized baseline on different tasks including scene segmentation, action recognition, and phase recognition. Results: FedSAM is critical for successful pretraining, overcoming the convergence failures of standard federated methods. The resulting FL-EndoViT performs comparably to its centralized counterpart, with significant advantages in data-scarce, high-resolution segmentation and generalization to new surgical events. We also establish that full, end-to-end fine-tuning is necessary for optimal performance. Conclusion: This work establishes FL with adaptive optimization as a viable paradigm for creating robust, privacy-preserving surgical FMs. Our findings provide a scalable framework for collaborative Surgical Data Science and underscore the optimizer's critical role in handling data heterogeneity. Future work should explore video-based models to incorporate spatiotemporal dynamics.
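Illustrative sketch: The abstract combines two ideas, federated averaging across sites and sharpness-aware local updates. The toy example below shows how those two pieces fit together under strong simplifying assumptions. It is not the authors' implementation: it uses plain (non-adaptive) SAM, a two-parameter linear model instead of an MAE-pretrained Vision Transformer, and synthetic data. All function names (`mse_grad`, `sam_step`, `local_update`, `fed_avg`, `run_federated`) and the hyperparameters `lr`, `rho`, `steps`, and `rounds` are hypothetical choices made for illustration.

```python
import math

def mse_grad(params, data):
    """Gradient of the mean squared error for the toy model y = w*x + b."""
    w, b = params
    n = len(data)
    gw = sum(2 * (w * x + b - y) * x for x, y in data) / n
    gb = sum(2 * (w * x + b - y) for x, y in data) / n
    return [gw, gb]

def sam_step(params, data, lr=0.05, rho=0.05):
    """One plain SAM update (assumed variant, not the adaptive one)."""
    g = mse_grad(params, data)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    # Move to the approximate worst-case point within an L2 ball of radius rho.
    perturbed = [p + rho * gi / norm for p, gi in zip(params, g)]
    # Descend from the original point using the gradient at the perturbed point.
    g_sam = mse_grad(perturbed, data)
    return [p - lr * gi for p, gi in zip(params, g_sam)]

def local_update(global_params, data, steps=5):
    """Client side: start from the global weights and run a few SAM steps."""
    params = list(global_params)
    for _ in range(steps):
        params = sam_step(params, data)
    return params

def fed_avg(client_params, client_sizes):
    """Server side: average client weights, weighted by local sample count."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(p[i] * n for p, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]

def run_federated(clients, rounds=50):
    """Alternate local SAM training and server averaging; raw data stays local."""
    global_params = [0.0, 0.0]
    for _ in range(rounds):
        updates = [local_update(global_params, data) for data in clients]
        global_params = fed_avg(updates, [len(d) for d in clients])
    return global_params

# Two synthetic "sites" that share the target y = 2x + 1 but sample
# different input ranges, a crude stand-in for data heterogeneity.
clients = [
    [(x, 2.0 * x + 1.0) for x in (-1.0, 0.0, 1.0, 2.0)],
    [(x, 2.0 * x + 1.0) for x in (0.5, 1.5, 2.5)],
]

if __name__ == "__main__":
    print(run_federated(clients))
```

Only model parameters cross the client boundary in this sketch, which is the privacy argument the abstract relies on. Because the SAM perturbation has a fixed radius, the toy model settles in a small neighborhood of the shared optimum rather than exactly on it.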

Primary Subject Area: Federated Learning

Secondary Subject Area: Foundation Models

Registration Requirement: Yes

Reproducibility: https://github.com/KirchnerMax/FL-EndoViT

Visa & Travel: Yes

Read CFP & Author Instructions: Yes

Originality Policy: Yes

Single-blind & Not Under Review Elsewhere: Yes

LLM Policy: Yes

Submission Number: 148
