A Pragmatic Approach to Learned Indexing in RocksDB: Targeted Optimizations with Minimal System Modification
AIPR assessment
This is a hard and competitive systems problem, not an uncrowded niche, because learned indexes in storage engines have been explored by multiple groups and the baseline systems are nontrivial. The strengths reinforce each other: the method is simple enough to integrate, the gains are measured on realistic workloads, and the overhead analysis supports the claim that the added machinery is not disruptive. The main weaknesses also interact: no public artifact in the paper, no significance analysis
Abstract
Learned indexes have emerged as a promising alternative to traditional index structures, offering higher throughput and lower memory usage by approximating the cumulative key distribution function with lightweight models. Despite these benefits, adoption in production systems remains limited, partly because learned indexes that support concurrency and persistence as effectively as, e.g., the B+-Tree, do not yet exist, while many research prototypes introduce substantial complexity. In this paper, we investigate whether off-the-shelf learned indexes can be integrated into a production database with minimal storage-engine redesign. Using RocksDB as a case study, we exploit its separation between in-memory Memtables and immutable on-disk files to deploy specialized indexes at each level. We show that directly applying existing learned indexes is insufficient under write-heavy workloads because frequent Memtable replacement prevents models from fully adapting. To address this, we introduce a reuse mechanism that preserves structural knowledge across Memtable instances. At the storage level, we replace RocksDB's disk index with a learned index without modifying the storage layer or read path. We further adapt a read-only learned index to be block-aware, enabling worst-case single-I/O lookups. We implement these techniques in MountDB, an extension of RocksDB. Experiments on large-scale workloads with diverse data distributions and access patterns show up to 1.5X higher write throughput and 2.1X higher read throughput than state-of-the-art systems, demonstrating that established learned indexes can be integrated into production systems with minimal overhead and substantial performance benefits.
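To make the abstract's core idea concrete, the following is a minimal sketch of a learned index over a sorted key array: a single linear model approximates the cumulative key distribution, and the largest training error bounds the final search window. It is an illustration only, not MountDB's or RocksDB's implementation. The class name `LinearLearnedIndex`, the `keys_per_block` parameter, and the `blocks_touched` helper are hypothetical, and keys are assumed to be sorted, distinct, and numeric. The helper shows why block-awareness matters: when the error window crosses a block boundary, a lookup may need more than one block read.

```python
from bisect import bisect_left


class LinearLearnedIndex:
    """Toy learned index: one linear model over a sorted array of distinct keys."""

    def __init__(self, keys):
        self.keys = list(keys)  # assumed sorted, distinct, and non-empty
        n = len(self.keys)
        first, last = self.keys[0], self.keys[-1]
        # Linear approximation of the key CDF, scaled to array positions.
        self.slope = (n - 1) / (last - first) if last != first else 0.0
        self.intercept = -self.slope * first
        # The largest prediction error on the stored keys bounds the search window.
        self.max_err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        pos = int(self.slope * key + self.intercept)
        return min(max(pos, 0), len(self.keys) - 1)

    def _window(self, key):
        # Inclusive position range that must contain the key if it is present.
        p = self._predict(key)
        return max(p - self.max_err, 0), min(p + self.max_err, len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key, or None if it is absent."""
        lo, hi = self._window(key)
        i = bisect_left(self.keys, key, lo, hi + 1)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None

    def blocks_touched(self, key, keys_per_block):
        """Block ids covered by the error window (hypothetical fixed-size blocks)."""
        lo, hi = self._window(key)
        return range(lo // keys_per_block, hi // keys_per_block + 1)


if __name__ == "__main__":
    index = LinearLearnedIndex([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
    print(index.lookup(11), index.lookup(4))
    print(list(index.blocks_touched(11, keys_per_block=4)))
```

In this sketch, a block-aware variant would fit models (or split segments) so that `blocks_touched` always yields exactly one block id. How the paper achieves its worst-case single-I/O guarantee is not described in the abstract.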