Abstract: Real-time makeup virtual try-on (VTO) on resource-constrained platforms like mobile devices and web browsers demands a delicate balance: models must be accurate enough for realistic results yet lightweight and fast enough for smooth performance. Existing approaches often rely on separate models for facial landmark detection and occlusion-aware segmentation, increasing complexity and hindering real-time performance. To address this, we propose a novel, unified model that performs both tasks within a single, highly efficient architecture. Specifically designed for VTO, our model offers enhanced accuracy around critical areas like the eyes and lips. We further optimize for real-time performance by leveraging temporal information: predictions from previous video frames guide current predictions, increasing parallelism and reducing inference time to as little as 16ms on an iPhone 14. Trained with a simplified pipeline, our unified model achieves accuracy comparable to state-of-the-art lightweight alignment models while maintaining a small footprint.