An Efficient Multi-Task Transformer for 3D Face Alignment

22 Sept 2023 (modified: 11 Feb 2024), submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: vision transformer, facial landmark detection, 3D face alignment
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: The proposed Trans3DHead is an efficient multi-task transformer for 3D face alignment that removes the dependence on high-resolution feature maps and enables effective information exchange among vertices or 3DMM parameters.
Abstract: In 3D face alignment, few prior works address information exchange among vertices or 3DMM parameters during regression. In addition, reliance on high-resolution feature maps makes existing algorithms memory-intensive and inefficient. To address these issues, we first propose a multi-task model with two transformer-based branches, which enhances information exchange among elements through self-attention and cross-attention. To avoid the inefficiency of high-resolution feature maps while improving the accuracy of facial landmark detection, we design a lightweight module named query-aware memory (QAMem), which enhances the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one. With QAMem, our model is efficient because it no longer depends on high-resolution feature maps, yet it still achieves superior accuracy. To further improve the robustness of the predicted landmarks, we introduce a multi-layer additive residual regression (MARR) module that provides a more stable and reliable reference based on the average face model. Furthermore, we propose a multi-information loss function with an Euler Angles Loss to supervise the network with more effective information, making the model more robust to atypical head poses. Extensive experiments on two public benchmarks show that our approach achieves state-of-the-art performance. Visualization results and ablation experiments further verify the effectiveness of the proposed model.
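The contrast the abstract draws between a shared memory value and separate per-query memory values can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify how QAMem builds its per-query values, so the per-query value projection `w_v_per_query`, the function names, and all tensor sizes below are assumptions chosen only to show the general idea.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def shared_value_attention(queries, memory, w_k, w_v):
    """Standard cross-attention: every query reads the same values.

    queries: (Q, d) one embedding per landmark or parameter query
    memory:  (N, d) flattened low-resolution feature map
    w_k, w_v: (d, d) key and value projections
    """
    keys = memory @ w_k                      # (N, d)
    values = memory @ w_v                    # (N, d), shared by all queries
    scale = np.sqrt(queries.shape[-1])
    attn = softmax(queries @ keys.T / scale)  # (Q, N)
    return attn @ values                     # (Q, d)


def query_aware_attention(queries, memory, w_k, w_v_per_query):
    """Hypothetical query-aware variant: each query has its own values.

    w_v_per_query: (Q, d, d), one value projection per query. This
    parameterization is an assumption, not taken from the paper.
    """
    keys = memory @ w_k                      # (N, d)
    scale = np.sqrt(queries.shape[-1])
    attn = softmax(queries @ keys.T / scale)  # (Q, N)
    # Project the memory once per query: (Q, N, d)
    values = np.einsum("nd,qde->qne", memory, w_v_per_query)
    # Each query aggregates only its own value tensor: (Q, d)
    return np.einsum("qn,qne->qe", attn, values)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_queries, num_tokens, dim = 4, 49, 8  # illustrative sizes only
    queries = rng.normal(size=(num_queries, dim))
    memory = rng.normal(size=(num_tokens, dim))
    w_k = rng.normal(size=(dim, dim))
    w_v_per_query = rng.normal(size=(num_queries, dim, dim))
    out = query_aware_attention(queries, memory, w_k, w_v_per_query)
    print(out.shape)
```

In this sketch the attention weights are computed identically in both variants; only the values being aggregated differ. When every per-query projection equals the shared one, the two functions return the same result, which makes the shared-value case a special case of the query-aware one.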
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5361