MFT: Memory-Aware Fine-Tuning of SAM2 for Efficient Long-Sequence Video Object Segmentation
Abstract: Adapting large foundation segmentation models such as SAM2 to video object segmentation (VOS), especially on long sequences, is often limited by the high training cost of full fine-tuning. We present Memory-Aware Fine-Tuning (MFT), a parameter-efficient adaptation strategy that freezes the image encoder and updates only the memory-related modules responsible for temporal reasoning: the memory attention, the mask decoder, and the memory encoder. To improve robustness on extended sequences, we further introduce a lightweight Memory Compression (MC) network that periodically condenses short-term memory embeddings into a compact long-term representation, which keeps the memory bank bounded while preserving historical context. Extensive experiments demonstrate that MFT achieves state-of-the-art results on long-sequence benchmarks while using resources efficiently, offering an accessible way to fine-tune SAM2 on resource-constrained hardware.
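The bounded memory bank described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the MC network's architecture or the compression interval, so this example makes loud assumptions: element-wise mean pooling stands in for the learned MC network, compression fires whenever the short-term buffer fills, and the names `BoundedMemoryBank`, `mean_pool`, and `short_capacity` are hypothetical. It shows only the bookkeeping idea (periodic condensation into one long-term vector), not the authors' implementation.

```python
from typing import List, Optional

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Placeholder compressor: element-wise mean of the input vectors.

    Assumption: this stands in for the learned MC network, whose actual
    design is not given in the abstract.
    """
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


class BoundedMemoryBank:
    """Hypothetical memory bank whose size never exceeds `short_capacity`."""

    def __init__(self, short_capacity: int = 4) -> None:
        self.short_capacity = short_capacity
        self.short_term: List[Vector] = []        # recent per-frame embeddings
        self.long_term: Optional[Vector] = None   # single condensed summary

    def add(self, embedding: Vector) -> None:
        """Store a new frame embedding; compress once the buffer is full."""
        self.short_term.append(embedding)
        if len(self.short_term) >= self.short_capacity:
            self._compress()

    def _compress(self) -> None:
        """Fold the short-term buffer (and any prior summary) into one vector."""
        pool = list(self.short_term)
        if self.long_term is not None:
            pool.insert(0, self.long_term)  # keep earlier history in the mix
        self.long_term = mean_pool(pool)
        self.short_term.clear()

    def size(self) -> int:
        """Number of vectors currently held (short-term plus summary)."""
        return len(self.short_term) + (1 if self.long_term is not None else 0)
```

Under these assumptions, feeding the bank an arbitrarily long stream of frame embeddings keeps `size()` at or below `short_capacity`, while the long-term vector still carries a (here, crudely averaged) trace of every earlier frame. A learned compressor would replace `mean_pool` with a trainable module.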