Abstract: Highlights•A self-supervised approach based on vision transformers for keyword spotting.•Learning useful representations from unlabeled data using a masked auto-encoder.•Improving feature embedding using a two-stage downstream task.•The proposed approach outperforms state-of-the-art approaches on three datasets.
Loading