A Length Enhanced B-Tree Based Index for Efficient Set Similarity Query

Published: 2025, Last Modified: 15 Jan 2026, ICDE 2025, CC BY-SA 4.0
Abstract: Set Similarity Query (SSQ) is widely applied in various fields. Existing B+-tree-based SSQ approaches fail to fully exploit length filtering and must calculate similarity bounds node by node, leading to low efficiency. To address these issues, we propose LeB, a novel length-enhanced B+-tree index whose keys integrate set lengths and bucket mapping, enabling the direct pruning of sets that do not meet the length requirements. Building upon LeB, we present an efficient algorithm, LeBQ, which leverages length filtering and symmetric difference allocation to determine the key bounds for a query. The key bounds are therefore computed only once for each query $Q$, avoiding costly node-wise computation of similarity bounds. Efficient key filtering strategies are proposed to prune sets that cannot be similar, significantly reducing the number of candidates. Based on LeBQ, LeBQ+ further reduces the number of candidates by introducing length-independent key bounds. Experimental results on four real datasets demonstrate that LeBQ+ has higher node access efficiency, accessing only 3.08% to 27.47% of the nodes accessed by the existing B+-tree-based SSQ algorithm. LeBQ+ is up to 99.8× faster than the state-of-the-art algorithms.
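The length filtering mentioned in the abstract can be illustrated with a minimal sketch. It is not the LeB index or the LeBQ algorithm: the abstract does not state the similarity measure or the key layout, so the sketch assumes Jaccard similarity and replaces the B+-tree with a list sorted by set length. The names `length_bounds` and `range_query` are hypothetical. Under Jaccard with threshold $t$, a set $S$ can only match a query $Q$ if $t|Q| \le |S| \le |Q|/t$, so sets outside that length range are skipped without computing any similarity.

```python
import math
from bisect import bisect_left, bisect_right


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 1.0 for two empty sets."""
    union = len(a | b)
    return 1.0 if union == 0 else len(a & b) / union


def length_bounds(query_len: int, threshold: float) -> tuple[int, int]:
    """Inclusive range of set lengths that can reach Jaccard >= threshold.

    Assumes 0 < threshold <= 1. The small epsilon guards against
    floating-point rounding when the bound is an exact integer.
    """
    lo = math.ceil(threshold * query_len - 1e-9)
    hi = math.floor(query_len / threshold + 1e-9)
    return lo, hi


def range_query(sets_by_length: list[set], query: set, threshold: float) -> list[set]:
    """Return sets with Jaccard >= threshold.

    sets_by_length must be sorted by len(); this stands in for an index
    whose keys are ordered by set length (an assumption of this sketch).
    """
    lengths = [len(s) for s in sets_by_length]
    lo, hi = length_bounds(len(query), threshold)
    # Binary search the admissible length range once per query.
    start = bisect_left(lengths, lo)
    end = bisect_right(lengths, hi)
    # Only the surviving candidates are verified with the exact similarity.
    return [s for s in sets_by_length[start:end] if jaccard(s, query) >= threshold]
```

For a query `{1, 2, 3, 4}` with threshold 0.5, the admissible lengths are 2 through 8, so a set of length 1 or length 9 is pruned before any similarity computation takes place.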