Abstract: Learning sparse representations using pretrained language models enhances monolingual ranking effectiveness. Such representations are sparse vectors over the vocabulary of a language model, projected from document terms. Extending such approaches to Cross-Language Information Retrieval (CLIR) using multilingual pretrained language models poses two challenges. First, the larger vocabularies of multilingual models affect both training and inference efficiency. Second, the representations of terms from different languages with similar meanings might not be sufficiently similar. To address these issues, we propose a learned sparse representation model, BLADE, combining vocabulary pruning with intermediate pre-training based on cross-language supervision. Our experiments reveal that BLADE significantly reduces indexing time compared to its monolingual counterpart, SPLADE, on machine-translated documents, and that it generates rankings with strengths complementary to those of other efficient CLIR methods.
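For intuition, the following is a minimal sketch of how a SPLADE-style learned sparse representation is produced from a pretrained masked language model: masked-LM logits are saturated and max-pooled over token positions to yield one weight per vocabulary entry. The checkpoint name and pooling details here are illustrative assumptions, not BLADE's exact configuration, which additionally applies vocabulary pruning and cross-language supervision.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative multilingual checkpoint; any masked-LM checkpoint follows the same pattern.
model_name = "bert-base-multilingual-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

def sparse_representation(text: str) -> torch.Tensor:
    """Project a text's terms into a sparse vector over the LM vocabulary
    (SPLADE-style: log-saturated, max-pooled MLM logits)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits           # (1, seq_len, vocab_size)
    # log-saturation keeps weights non-negative and encourages sparsity
    weights = torch.log1p(torch.relu(logits))
    # max-pool over token positions -> one weight per vocabulary entry
    return weights.max(dim=1).values.squeeze(0)   # (vocab_size,)

doc_vec = sparse_representation("Learned sparse retrieval with BLADE.")
nonzero = int((doc_vec > 0).sum())
print(f"non-zero vocabulary entries: {nonzero} / {doc_vec.numel()}")
```

With a full multilingual vocabulary every document activates a large number of entries, which is the efficiency concern the abstract raises; pruning the vocabulary shrinks both the projection and the resulting inverted index.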