ECHOVIT: Vision Transformers Using Fast-And-Slow Time Embeddings

Published: 01 Jan 2023 · Last Modified: 08 May 2025 · IGARSS 2023 · CC BY-SA 4.0
Abstract: This paper details preliminary efforts to apply the deep learning transformer architecture to automatically track annual layer stratigraphy in echogram images obtained by mapping near-surface ice layers with airborne radars. Following the success of the transformer architecture in the natural language processing and computer vision communities, we explore a variant termed Echogram Vision Transformer (EchoViT) on the radar echogram layer tracking (RELT) problem. The proposed approach divides the echogram images into patches using different schemes inspired by tokenization methods in natural language processing. We then apply a soft-attention mechanism to model interdependencies between the patches, capturing spatiotemporal stratigraphic information. Experiments conducted on the CREED dataset demonstrate the superiority of transformer-based architectures over existing convolutional-based architectures. Furthermore, the EchoViT fast-time and EchoViT slow-time patchifying schemes achieved precise, submeter-accurate tracking of the layers, with MAEs of 3.39 and 3.55, respectively, while the use of cropped patches led to suboptimal results.
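The abstract contrasts fast-time and slow-time patchifying schemes against cropped patches. As a rough illustration of what strip-based patchifying might look like, the sketch below splits a 2D echogram array (fast-time/depth rows by slow-time/along-track columns) into 1D strip tokens along either axis. This is a minimal assumption-laden sketch, not the paper's implementation: the function name, strip sizes, and the exact mapping of "fast-time" and "slow-time" to array axes are illustrative guesses.

```python
import numpy as np

def patchify_echogram(echogram: np.ndarray, scheme: str, size: int) -> np.ndarray:
    """Split a 2D echogram into strip-shaped tokens (illustrative sketch only).

    echogram: (fast-time/depth, slow-time/along-track) array.
    scheme:   "fast-time" groups `size` depth rows per token, each spanning
              the full along-track extent; "slow-time" groups `size`
              along-track columns per token, each spanning the full depth.
    Returns a (num_tokens, token_length) array, as a ViT expects.
    """
    H, W = echogram.shape
    if scheme == "fast-time":
        assert H % size == 0, "depth must divide evenly into strips"
        return echogram.reshape(H // size, size * W)
    elif scheme == "slow-time":
        assert W % size == 0, "along-track extent must divide evenly"
        return echogram.T.reshape(W // size, size * H)
    raise ValueError(f"unknown scheme: {scheme}")

# Toy 8x8 "echogram": both schemes yield 4 tokens of length 16.
demo = np.arange(64, dtype=np.float32).reshape(8, 8)
fast = patchify_echogram(demo, "fast-time", 2)
slow = patchify_echogram(demo, "slow-time", 2)
```

Each token would then be linearly projected and fed to the transformer's attention layers; square cropped patches (as in a standard ViT) would instead tile both axes at once.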