Track: Search and retrieval-augmented AI
Keywords: Text Retrieval, Re-Ranking, Lightweighting, Flexibility
Abstract: Large language models (LLMs) provide powerful foundations for fine-grained text re-ranking. However, they are often prohibitively expensive in practice due to computational constraints. In this work, we propose a flexible architecture called Matryoshka Re-Ranker, which is designed to facilitate runtime customization of model layers and sequence lengths at each layer based on users' configurations. Consequently, LLM-based re-rankers can be made applicable across various real-world situations.
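To make the idea of runtime customization concrete, below is a minimal sketch in Python. All names here (CompressionConfig, MatryoshkaReRankerSketch) are illustrative assumptions, as is the plain truncation used for per-layer sequence compression; this is not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List

import torch
import torch.nn as nn


@dataclass
class CompressionConfig:
    """User-specified runtime configuration (hypothetical names)."""
    num_layers: int         # how many transformer layers to keep
    seq_lengths: List[int]  # token budget at each retained layer


class MatryoshkaReRankerSketch(nn.Module):
    def __init__(self, layers: nn.ModuleList, score_head: nn.Linear):
        super().__init__()
        self.layers = layers          # full stack of transformer layers
        self.score_head = score_head  # maps a hidden state to a relevance score

    def forward(self, hidden: torch.Tensor, cfg: CompressionConfig) -> torch.Tensor:
        # hidden: [batch, seq_len, dim] token states of the query-passage pair
        for layer, budget in zip(self.layers[:cfg.num_layers], cfg.seq_lengths):
            hidden = layer(hidden)
            # Compress the sequence at this layer. A real system would select
            # salient tokens; plain truncation stands in for that here.
            hidden = hidden[:, :budget, :]
        # Score from the final token's representation (an assumption).
        return self.score_head(hidden[:, -1, :]).squeeze(-1)
```

A user could then trade precision for latency at serving time, e.g. CompressionConfig(num_layers=8, seq_lengths=[512] * 4 + [256] * 4).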
The increased flexibility may come at the cost of precision. To address this problem, we introduce a suite of techniques to optimize performance. First, we propose cascaded self-distillation, where each sub-architecture learns to preserve the precise re-ranking capability of its super-architectures, whose predictions can be exploited as smooth and informative teacher signals.
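As an illustration of such a cascaded objective, the sketch below distills each smaller sub-architecture from its next-larger sibling using a temperature-softened KL divergence over candidate scores. The function name and the exact KL formulation are assumptions for exposition, not the paper's loss.

```python
import torch
import torch.nn.functional as F


def cascaded_self_distillation_loss(
    scores_by_scale: list,   # scores from largest (index 0) to smallest sub-architecture
    temperature: float = 1.0,
) -> torch.Tensor:
    """Each smaller sub-architecture distills from its next-larger "super" sibling.

    scores_by_scale[i]: [batch, num_candidates] relevance scores for one scale.
    """
    loss = torch.zeros((), device=scores_by_scale[0].device)
    for i in range(1, len(scores_by_scale)):
        # Teacher: the next-larger sub-architecture; detached so gradients
        # do not flow into the teacher.
        teacher = F.softmax(scores_by_scale[i - 1].detach() / temperature, dim=-1)
        student = F.log_softmax(scores_by_scale[i] / temperature, dim=-1)
        loss = loss + F.kl_div(student, teacher, reduction="batchmean")
    return loss / max(len(scores_by_scale) - 1, 1)
```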
Second, we design a factorized compensation mechanism, where two collaborative Low-Rank Adaptation (LoRA) modules, vertical and horizontal, are jointly employed to compensate for the precision loss resulting from arbitrary combinations of layer and sequence compression.
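A hedged sketch of what such a factorized adapter could look like: a frozen base linear layer augmented with two LoRA branches, one nominally compensating depth (layer) compression and one width (sequence) compression. The class name, the rank, and how each branch maps to a compression axis are assumptions of this sketch.

```python
import torch
import torch.nn as nn


class FactorizedLoRALinear(nn.Module):
    """Frozen linear layer plus two collaborative LoRA branches (sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the adapters are trained
        in_f, out_f = base.in_features, base.out_features
        # Standard LoRA initialization: A random, B zero, so the
        # adapters start as an identity perturbation.
        self.vert_A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.vert_B = nn.Parameter(torch.zeros(out_f, rank))
        self.horiz_A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.horiz_B = nn.Parameter(torch.zeros(out_f, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_v = x @ self.vert_A.T @ self.vert_B.T    # depth-compensation branch
        delta_h = x @ self.horiz_A.T @ self.horiz_B.T  # length-compensation branch
        return self.base(x) + delta_v + delta_h
```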
We perform comprehensive experiments on the passage and document retrieval datasets from MSMARCO, along with all public datasets from the BEIR benchmark. In our experiments, Matryoshka Re-Ranker substantially outperforms existing methods, while effectively preserving its superior performance across various forms of compression and different application scenarios. Our source code has been uploaded to this anonymous repository.
Submission Number: 1519