Abstract: High-dimensional embeddings produced by dense retrieval models pose challenges for information retrieval and reranking because storing and processing them demands substantial memory and computation. The recently introduced TCT-ColBERT model has made retrieval systems more efficient, particularly for offline document indexing; however, because it stores an embedding for every token of a document, the resulting index becomes large. Dimensionality reduction can help by lowering storage requirements, suppressing noise, and simplifying similarity computations. Such methods are widely used across fields, especially as datasets grow, so it is worth exploring them for dense retrieval models. By applying suitable dimensionality reduction techniques, we can prune the number of dimensions and retain the components of the text embeddings that contribute most to similarity calculations. This research explores two linear dimensionality reduction methods: PCA (Principal Component Analysis) and truncated SVD (truncated Singular Value Decomposition). The main objective was to identify the components of the embeddings that matter most for determining the similarity between documents and queries. Experiments were conducted on the MS MARCO passage collection (test-2019). The results show that token-embedding dimensionality can be reduced by up to 95% without appreciable loss in recall or precision.
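As a rough illustration of the kind of reduction described above (not the paper's exact pipeline), the sketch below compresses token embeddings with scikit-learn's PCA and TruncatedSVD. The 768-dimensional input, the synthetic data, and the choice of 38 components are assumptions made only for this example.

```python
# Illustrative sketch only: compress dense token embeddings with PCA / truncated SVD.
# The 768-dim input and random data are stand-ins; the paper's actual pipeline
# (TCT-ColBERT token embeddings from MS MARCO) is not reproduced here.
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 768)).astype(np.float32)  # stand-in for token embeddings

# Keep roughly 5% of the original dimensions (768 -> 38), mirroring the ~95% reduction reported.
n_components = 38

pca = PCA(n_components=n_components)
reduced_pca = pca.fit_transform(embeddings)          # shape: (10_000, 38)

svd = TruncatedSVD(n_components=n_components)
reduced_svd = svd.fit_transform(embeddings)          # shape: (10_000, 38)

# At query time, project the query embedding with the same fitted transform
# before computing dot-product similarity against the reduced index.
query = rng.normal(size=(1, 768)).astype(np.float32)
query_pca = pca.transform(query)
scores = reduced_pca @ query_pca.T                   # similarity scores in the reduced space
```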