Extremely Fast Fine-Tuning for Cross Language Information Retrieval via Generalized Canonical Correlation
Abstract: Recent work on language-agnostic transformer sentence embeddings has shown promise for robust multilingual sentence representations. Our TREC submission tests how cheaply these embeddings can be fine-tuned to perform cross-lingual information retrieval. We use the MS MARCO dataset with machine translations as a model problem and demonstrate that a single generalized canonical correlation analysis (GCCA) model trained on previous queries significantly improves the embeddings' ability to find relevant passages. The dominant computational cost of training is computing dense singular value decompositions (SVDs) of matrices derived from the fine-tuning data: one SVD per language retrieval view and per query view, plus one more. This shows that GCCA can serve as a rapid training alternative to fine-tuning a neural network, allowing a model to be re-fit frequently on a user's previous queries. The resulting model was used to prepare our submissions to the NeuCLIR re-ranking task.
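The abstract only names the technique, but the "one SVD per view, plus one" cost structure it describes matches standard MAX-VAR GCCA computed via SVDs. The sketch below is a minimal NumPy illustration of that pattern, not the authors' implementation: the function name, the rank-truncation tolerance, and the per-view mean centering are all our assumptions.

```python
import numpy as np

def gcca_via_svd(views, k, eps=1e-8):
    """MAX-VAR GCCA fit with one thin SVD per view plus one final SVD
    of the stacked left singular vectors (J views -> J + 1 SVDs).

    views : list of row-aligned (n, d_i) embedding matrices
    k     : dimension of the shared space
    Returns the shared representation G (n, k), per-view projection
    maps W_i (d_i, k), and the per-view means used for centering.
    """
    Us, factors, means = [], [], []
    for X in views:
        mu = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
        keep = s > eps * s[0]              # drop near-null directions
        Us.append(U[:, keep])
        factors.append((s[keep], Vt[keep]))
        means.append(mu)
    # The "+1" SVD: the shared space is spanned by the top-k left
    # singular vectors of the concatenated per-view bases.
    G = np.linalg.svd(np.hstack(Us), full_matrices=False)[0][:, :k]
    # Per-view linear maps into the shared space: W_i = pinv(X_i) @ G.
    maps = [Vt.T @ ((U.T @ G) / s[:, None])
            for (s, Vt), U in zip(factors, Us)]
    return G, maps, means

# Toy usage (synthetic data): two aligned views, e.g. query embeddings
# and embeddings of machine-translated passages; after fitting, a new
# query is centered and projected into the shared space for ranking.
rng = np.random.default_rng(0)
n, d, k = 500, 64, 16
Q = rng.normal(size=(n, d))                       # stand-in query view
P = Q @ rng.normal(size=(d, d)) + 0.1 * rng.normal(size=(n, d))
G, maps, means = gcca_via_svd([Q, P], k)
q_shared = (Q[:1] - means[0]) @ maps[0]           # project a query
```

Since fitting reduces to a handful of thin SVDs and retrieval reduces to nearest-neighbor search between projected queries and passages, this structure is consistent with the abstract's claim that the model is cheap enough to re-fit frequently on a user's query history.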