DFT-Trans: A Bidirectional Encoder for Efficient Fusion of Time-Frequency Domain Textual Features

ACL ARR 2025 February Submission 2926 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Despite the remarkable achievements of BERT-style encoder models in NLP research, their high computational cost makes it challenging to pretrain task-specific BERTs from scratch. This work proposes a novel BERT-style encoder model, DFT-Trans, which addresses the question of how to improve performance while reducing training costs. DFT-Trans consists primarily of a trainable Fourier operator and an attention operator. Because frequency-domain features are seldom exploited in text representation extraction, we develop a novel trainable Fourier operator built on two components, the Blending Token and Mixing Token methods. This operator uses the fast Fourier transform (FFT) to capture frequency-domain features of the data, integrating frequency information into the network's structure and computations and enabling more robust feature extraction. The attention operator combines FlashAttention with Attention with Linear Biases to mitigate the quadratic time and memory complexity of self-attention while efficiently extracting features from time-domain data. When pretrained from scratch on large-scale corpora, DFT-Trans reaches an average downstream GLUE (dev) score of 80.6\% in one day on a single RTX 4090 GPU, at a cost of approximately \$5. We further evaluate on the Long-Range Arena (LRA) benchmark, where DFT-Trans achieves an average task score of 75.94\%, demonstrating its effectiveness in long-text scenarios. Code is available at this repository: https://anonymous.4open.science/r/DFT-Trans-3FDD.
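For readers who want a concrete picture of the two operators described above, the following is a minimal sketch, not the authors' released code, of how an FFT-based token-mixing operator and a linear-bias attention operator could be combined in a BERT-style encoder block. The module names (`FourierMixing`, `EncoderBlock`), the learnable per-frequency filter, the symmetric ALiBi-style distance penalty, and the use of `nn.MultiheadAttention` in place of a fused FlashAttention kernel are all illustrative assumptions; the paper's actual Blending Token and Mixing Token methods and its attention implementation may differ.

```python
# Hypothetical sketch of a DFT-Trans-style encoder block (not the authors' code).
import torch
import torch.nn as nn


class FourierMixing(nn.Module):
    """Mix token information in the frequency domain via an FFT along the
    sequence axis, with a learnable complex filter standing in for the
    paper's trainable Fourier operator (assumed form)."""

    def __init__(self, seq_len: int, hidden: int):
        super().__init__()
        # Learnable complex weights over the rFFT frequency bins.
        self.filter = nn.Parameter(
            torch.ones(seq_len // 2 + 1, hidden, dtype=torch.cfloat)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, hidden)
        freq = torch.fft.rfft(x, dim=1)                   # to frequency domain
        freq = freq * self.filter                         # trainable frequency weighting
        return torch.fft.irfft(freq, n=x.size(1), dim=1)  # back to the time domain


def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Attention-with-Linear-Biases penalty: per-head linear distance bias,
    made symmetric here for a bidirectional encoder (an assumption)."""
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    dist = torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]
    return slopes[:, None, None] * -dist.abs().float()    # (heads, seq, seq)


class EncoderBlock(nn.Module):
    """One encoder layer: frequency-domain mixing followed by biased attention."""

    def __init__(self, seq_len: int = 512, hidden: int = 768, heads: int = 12):
        super().__init__()
        self.fourier = FourierMixing(seq_len, hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.register_buffer("bias", alibi_bias(heads, seq_len))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, hidden)
        x = self.norm1(x + self.fourier(x))                # frequency-domain features
        mask = self.bias.repeat(x.size(0), 1, 1)           # (batch*heads, seq, seq)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)   # time-domain features
        return self.norm2(x + attn_out)
```

In a production setting the `nn.MultiheadAttention` call would typically be replaced by a fused FlashAttention kernel (e.g. PyTorch's scaled-dot-product attention backends) to obtain the memory savings the abstract refers to; the sketch uses the standard module only for clarity.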
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Trainable Fourier operators, Attention operators, BERT-style, Training costs
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings / efficiency
Languages Studied: English
Submission Number: 2926