MFT: Memory-Aware Fine-Tuning of SAM2 for Efficient Long-Sequence Video Object Segmentation

Hao Yuan

Published: 09 Feb 2026, Last Modified: 17 Mar 2026, IEEE Signal Processing Letters, Everyone, CC BY-NC-ND 4.0

Abstract: Adapting large foundation segmentation models such as SAM2 to video object segmentation (VOS), especially for long sequences, is often limited by the high training cost of full fine-tuning. We present Memory-Aware Fine-Tuning (MFT), a parameter-efficient adaptation strategy that freezes the image encoder and updates only the memory-related modules responsible for temporal reasoning: the memory attention, the mask decoder, and the memory encoder. To improve robustness on extended sequences, we further introduce a lightweight Memory Compression (MC) network that periodically condenses short-term memory embeddings into a compact long-term representation, keeping the memory bank bounded while preserving historical context. Extensive experiments demonstrate that MFT achieves state-of-the-art results on long-sequence benchmarks while maintaining efficient resource utilization, offering an accessible approach for fine-tuning SAM2 on resource-constrained hardware.
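
The bounded memory bank described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `BoundedMemoryBank`, the capacities, and the eviction of the oldest long-term entry are hypothetical, and plain mean pooling stands in for the learned MC network, whose architecture the abstract does not specify.

```python
from collections import deque
from typing import Deque, List

Vector = List[float]


class BoundedMemoryBank:
    """Toy memory bank: short-term entries are periodically condensed
    into one long-term entry so the total size stays bounded.

    All names and capacities are illustrative assumptions."""

    def __init__(self, short_capacity: int = 4, long_capacity: int = 2) -> None:
        self.short_capacity = short_capacity
        self.short_term: List[Vector] = []
        # deque(maxlen=...) silently drops the oldest entry when full
        # (an assumed eviction policy, not taken from the paper).
        self.long_term: Deque[Vector] = deque(maxlen=long_capacity)

    @staticmethod
    def compress(embeddings: List[Vector]) -> Vector:
        # Stand-in for the learned MC network: element-wise mean.
        n = len(embeddings)
        return [sum(column) / n for column in zip(*embeddings)]

    def add(self, embedding: Vector) -> None:
        self.short_term.append(list(embedding))
        # Once the short-term buffer is full, condense it into a single
        # long-term entry and start a fresh short-term window.
        if len(self.short_term) == self.short_capacity:
            self.long_term.append(self.compress(self.short_term))
            self.short_term.clear()

    def __len__(self) -> int:
        return len(self.short_term) + len(self.long_term)


# Feed 20 frames of 2-dimensional dummy embeddings and track the bank size.
bank = BoundedMemoryBank(short_capacity=4, long_capacity=2)
sizes = []
for t in range(20):
    bank.add([float(t), float(2 * t)])
    sizes.append(len(bank))
```

In this sketch the bank never holds more than `short_capacity - 1 + long_capacity` entries, regardless of how many frames are processed, which is the property that matters for long sequences.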