Extremely Fast Fine-Tuning for Cross Language Information Retrieval via Generalized Canonical Correlation
Abstract: Recent work on language-agnostic transformer sentence embeddings has shown promise for robust multilingual sentence representations. Our TREC submission tests how cheaply these embeddings can be fine-tuned to perform cross-lingual information retrieval. We use the MS MARCO dataset with machine translations as a model problem and demonstrate that a single generalized canonical correlation analysis (GCCA) model trained on previous queries significantly improves the embeddings' ability to find relevant passages. The dominant computational cost of training is computing dense singular value decompositions (SVDs) of matrices derived from the fine-tuning data: one SVD per language retrieval view and per query view, plus one more. This shows that GCCA can serve as a rapid training alternative to fine-tuning a neural network, allowing a model to be re-fit frequently on a user's previous queries. The resulting model was used to prepare our submissions to the NeuCLIR re-ranking task.
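The abstract only names the technique, but the "one SVD per view, plus one" cost structure it describes matches standard MAX-VAR GCCA computed via SVDs. The sketch below is a minimal NumPy illustration of that pattern, not the authors' implementation: the function name, the rank-truncation tolerance, and the per-view mean centering are all our assumptions.

```python
import numpy as np

def gcca_via_svd(views, k, eps=1e-8):
    """MAX-VAR GCCA fit with one thin SVD per view plus one final SVD
    of the stacked left singular vectors (J views -> J + 1 SVDs).

    views : list of row-aligned (n, d_i) embedding matrices
    k     : dimension of the shared space
    Returns the shared representation G (n, k), per-view projection
    maps W_i (d_i, k), and the per-view means used for centering.
    """
    Us, factors, means = [], [], []
    for X in views:
        mu = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
        keep = s > eps * s[0]              # drop near-null directions
        Us.append(U[:, keep])
        factors.append((s[keep], Vt[keep]))
        means.append(mu)
    # The "+1" SVD: the shared space is spanned by the top-k left
    # singular vectors of the concatenated per-view bases.
    G = np.linalg.svd(np.hstack(Us), full_matrices=False)[0][:, :k]
    # Per-view linear maps into the shared space: W_i = pinv(X_i) @ G.
    maps = [Vt.T @ ((U.T @ G) / s[:, None])
            for (s, Vt), U in zip(factors, Us)]
    return G, maps, means

# Toy usage (synthetic data): two aligned views, e.g. query embeddings
# and embeddings of machine-translated passages; after fitting, a new
# query is centered and projected into the shared space for ranking.
rng = np.random.default_rng(0)
n, d, k = 500, 64, 16
Q = rng.normal(size=(n, d))                       # stand-in query view
P = Q @ rng.normal(size=(d, d)) + 0.1 * rng.normal(size=(n, d))
G, maps, means = gcca_via_svd([Q, P], k)
q_shared = (Q[:1] - means[0]) @ maps[0]           # project a query
```

Since fitting reduces to a handful of thin SVDs and retrieval reduces to nearest-neighbor search between projected queries and passages, this structure is consistent with the abstract's claim that the model is cheap enough to re-fit frequently on a user's query history.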