A Boosting-Driven Model for Updatable Learned Indexes

ICLR 2026 Conference Submission14099 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Model-based Index, ML for Systems, Information Retrieval, Learned Index, Neural Networks
TL;DR: A sigmoid-based dynamic learned index that reduces retraining costs by 20× through sigmoid-boosting approximation and proactive workload modeling, achieving 3× higher throughput and 1000× lower memory usage
Abstract: Learned Indexes (LIs) represent a paradigm shift from traditional index structures by employing machine learning models to approximate the Cumulative Distribution Function (CDF) of sorted data. While LIs achieve remarkable efficiency for static datasets, their performance degrades under dynamic updates: maintaining the CDF invariant ($\sum F(k) = 1$) requires global model retraining, which blocks queries and limits the Queries-per-Second (QPS) metric. Current approaches fail to address these retraining costs effectively, rendering them unsuitable for real-world workloads with frequent updates. In this paper, we present Sigmoid-based model, an efficient and adaptive learned index that minimizes retraining cost through three key techniques: (1) a sigmoid boosting approximation technique that dynamically adjusts the index model by approximating update-induced shifts in the data distribution with localized sigmoid functions, preserving the model's $\epsilon$-bounded error guarantees while deferring full retraining; (2) proactive update training via Gaussian Mixture Models (GMMs), which identifies high-update-probability regions for strategic placeholder allocation, accelerating updates that arrive in those slots; and (3) a neural joint optimization framework that continuously refines both the sigmoid ensemble and the GMM parameters via gradient-based learning. We rigorously evaluate our model against state-of-the-art updatable LIs on real-world and synthetic workloads, and show that it reduces retraining cost by $20\times$ while delivering up to $3\times$ higher QPS and $1000\times$ lower memory usage.
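The core idea of technique (1) can be illustrated with a minimal sketch. The code below is not the authors' implementation; it assumes a simple linear key-to-position model and shows how a single localized sigmoid term can absorb the position shift caused by a batch of inserts, keeping prediction error bounded without retraining the base model. All names and parameter values here are illustrative.

```python
import numpy as np

# Hypothetical sketch: a linear model approximates key -> position over
# sorted data; after a batch of inserts shifts positions in one region,
# a localized sigmoid correction absorbs the shift instead of retraining.

rng = np.random.default_rng(0)
keys = np.sort(rng.uniform(0, 100, 1000))
pos = np.arange(len(keys))

# Fit the base model: position ~ a*key + b.
a, b = np.polyfit(keys, pos, 1)

# Insert 200 new keys in [40, 60]; positions at and after that region shift.
new_keys = np.sort(np.concatenate([keys, rng.uniform(40, 60, 200)]))
new_pos = np.arange(len(new_keys))

def predict(k, corrections=()):
    """Base prediction plus any localized sigmoid correction terms."""
    y = a * k + b
    for amp, center, steep in corrections:
        # A sigmoid step of height `amp` centered on the insert region.
        y += amp / (1.0 + np.exp(-steep * (k - center)))
    return y

# One sigmoid (amplitude = number of inserts) absorbs the shift locally.
corr = [(200.0, 50.0, 0.5)]

err_before = np.max(np.abs(predict(new_keys) - new_pos))
err_after = np.max(np.abs(predict(new_keys, corr) - new_pos))
print(err_after < err_before)  # the correction tightens the error bound
```

In this toy setting the uncorrected model is off by up to the full insert count (200 positions) for keys beyond the insert region, while the sigmoid-corrected model stays within a small residual, which is the kind of deferred-retraining behavior the abstract describes.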
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 14099