Abstract: Accurate image matting requires exploring both the contextual information and the fine-grained details of the input image. To this end, recent transformer-based matting methods incorporate three context tokens (the triple-token), which represent the three trimap regions, into the transformer structure. However, the triple-token has limited information capacity and might not adequately capture the global context of the image, particularly for high-resolution inputs. In this paper, we introduce ProxyMatting, a transformer-based image matting model that addresses this limitation by integrating context into the transformer block through region proxies. A region proxy is the representation of a region of predefined size, and region proxies differ from the triple-token in two respects: (1) they offer richer contextual information, because the number of generated proxies (context tokens) scales proportionally with the resolution of the feature map; (2) each proxy gathers the contextual information needed for its designated region and delivers it exclusively to the spatial tokens within that region. Each spatial token can therefore extract contextual information efficiently from its corresponding proxy token. Because each spatial token queries only one proxy token, ProxyMatting remains computationally efficient relative to triple-token approaches. Experiments show that ProxyMatting performs strongly across standard matting datasets.
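The two properties of region proxies described in the abstract (one proxy per fixed-size region, and context delivered only to the tokens of that region) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how proxies are computed or fused, so the mean pooling, the additive fusion, the scalar-valued tokens, and the function names `region_proxies` and `deliver_context` are all assumptions made here for illustration.

```python
def region_proxies(feat, region):
    """Summarize each region x region window of an H x W grid as one proxy.

    Mean pooling is an assumption of this sketch, not the paper's method.
    Tokens are scalars here for brevity; real tokens would be vectors.
    """
    h, w = len(feat), len(feat[0])
    if h % region or w % region:
        raise ValueError("sketch assumes H and W are multiples of region")
    proxies = []
    for i in range(0, h, region):
        row = []
        for j in range(0, w, region):
            window = [feat[y][x]
                      for y in range(i, i + region)
                      for x in range(j, j + region)]
            row.append(sum(window) / len(window))
        proxies.append(row)
    return proxies


def deliver_context(feat, proxies, region):
    """Give every spatial token the context of its own region's proxy only.

    Additive fusion is assumed; each token reads exactly one proxy.
    """
    return [
        [feat[y][x] + proxies[y // region][x // region]
         for x in range(len(feat[0]))]
        for y in range(len(feat))
    ]


# A 4 x 4 feature map with 2 x 2 regions yields a 2 x 2 grid of proxies.
feat = [[float(y * 4 + x) for x in range(4)] for y in range(4)]
proxies = region_proxies(feat, region=2)
out = deliver_context(feat, proxies, region=2)
```

With the region size fixed, an 8 x 8 feature map would produce 16 proxies instead of 4, which is the sense in which the number of context tokens grows with resolution, while each spatial token still reads from a single proxy.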
External IDs: dblp:journals/kbs/LiYWYYL25