jdb_idx

Crates.iojdb_idx
lib.rsjdb_idx
version0.1.1
created_at2025-12-23 21:04:38.867078+00
updated_at2025-12-23 21:04:38.867078+00
descriptionHigh-performance SSTable index engine / 高性能 SSTable 索引引擎
homepagehttps://github.com/js0/jdb/tree/main/jdb_idx
repositoryhttps://github.com/js0/jdb.git
max_upload_size
id2002340
size97,732
i18n.site (i18nsite)

documentation

README

English | 中文


jdb_idx : High-Performance SSTable Index for Key-Value Storage

Table of Contents

Introduction

jdb_idx is the SSTable index module for JDB (a key-value database). It provides efficient sorted key-value storage with sparse indexing, block-based organization, and LRU caching for high-performance read/write operations.

SSTable (Sorted String Table) is the core storage format used by LSM-Tree based databases. jdb_idx implements:

  • 4KB aligned data blocks for Direct I/O compatibility
  • Sparse index for O(log n) block location
  • Binary search within blocks for key lookup
  • Zero-copy deserialization using bytes::Bytes

Features

  • Block-based storage with 4KB alignment
  • Sparse index with binary search
  • LRU block cache with thread-safe access
  • Async I/O via compio runtime
  • Zero-copy data access
  • Big-endian serialization for portability

Installation

cargo add jdb_idx

Usage

Writing SST File

use jdb_idx::SstWriter;
use jdb_proto::ValPtr;

async fn write_sst() -> jdb_idx::Result<()> {
  let mut writer = SstWriter::create("data.sst").await?;

  // Keys must be added in sorted order
  for i in 0u32..1000 {
    let key = format!("key{i:05}");
    let ptr = ValPtr::new(1, i as u64 * 100, 50);
    writer.add(key.as_bytes(), &ptr).await?;
  }

  let footer = writer.finish().await?;
  println!("Wrote {} entries in {} blocks", footer.num_entries, footer.num_blocks);
  Ok(())
}

Reading SST File

use jdb_idx::{SstReader, BlockCache};

async fn read_sst() -> jdb_idx::Result<()> {
  let reader = SstReader::open(1, "data.sst").await?;
  let cache = BlockCache::new(1024);

  // Point query
  if let Some(ptr) = reader.get(b"key00500", &cache).await? {
    println!("Found: offset={}, len={}", ptr.offset, ptr.len);
  }

  // Full scan
  let mut iter = reader.iter();
  while let Some((key, ptr)) = iter.next().await? {
    println!("{:?} -> {:?}", key, ptr);
  }

  Ok(())
}

Architecture

SST File Layout

┌─────────────────────────────────────────┐
│         Data Blocks (4KB each)          │
│  ┌─────────────────────────────────┐    │
│  │ Block 0: [Count][Records...]    │    │
│  ├─────────────────────────────────┤    │
│  │ Block 1: [Count][Records...]    │    │
│  ├─────────────────────────────────┤    │
│  │ ...                             │    │
│  └─────────────────────────────────┘    │
├─────────────────────────────────────────┤
│         Index Block                     │
│  [Count:u32][KeyLen:u16][Key]...        │
├─────────────────────────────────────────┤
│         Footer (64 bytes)               │
│  [IndexOffset][IndexLen][NumBlocks]     │
│  [NumEntries][Padding][Magic:JDBX]      │
└─────────────────────────────────────────┘

Data Flow

graph TD
  A[SstWriter] -->|add key-ptr| B[BlockBuilder]
  B -->|block full| C[Flush to File]
  B -->|record first key| D[IndexBuilder]
  C --> E[Next Block]
  D -->|finish| F[Write Index]
  F --> G[Write Footer]

  H[SstReader] -->|open| I[Read Footer]
  I --> J[Load Index]
  J --> K[IndexSearcher]

  L[Query Key] --> K
  K -->|binary search| M[Block Index]
  M --> N[BlockCache]
  N -->|miss| O[Load from Disk]
  N -->|hit| P[BlockIter]
  O --> P
  P -->|linear search| Q[Result]

API Reference

Core Types

Type Description
SstWriter Async SST file writer, accepts sorted key-ptr pairs
SstReader Async SST file reader with point query and iteration
SstIter Lazy async iterator for full table scan
BlockBuilder Builds 4KB data blocks with key-ptr records
BlockIter Iterator for reading entries from block
IndexBuilder Builds sparse index from first keys
IndexSearcher Binary search over sparse index
BlockCache Thread-safe LRU cache for data blocks
Footer SST file metadata (64 bytes)

Constants

Constant Value Description
BLOCK_SIZE 4096 Data block size in bytes
FOOTER_SIZE 64 Footer size in bytes

Error Types

Error Description
InvalidFooter Footer format invalid
MagicMismatch Magic number mismatch
InvalidIndex Index format invalid
EmptySst SST file has no entries
FileTooSmall File smaller than minimum size
KeyTooLarge Key exceeds block capacity
IndexTooLarge Index entry count overflow

Directory Structure

jdb_idx/
├── src/
│   ├── lib.rs      # Module exports
│   ├── block.rs    # BlockBuilder, BlockIter
│   ├── cache.rs    # BlockCache (LRU)
│   ├── error.rs    # Error types
│   ├── index.rs    # IndexBuilder, IndexSearcher
│   ├── layout.rs   # Footer, constants
│   ├── reader.rs   # SstReader, SstIter
│   └── writer.rs   # SstWriter
├── tests/
│   └── main.rs     # Integration tests
└── Cargo.toml

Tech Stack

Component Library Purpose
Async I/O compio io_uring based async runtime
Buffer bytes Zero-copy byte buffer
Cache lru + parking_lot Thread-safe LRU cache
Error thiserror Error type derivation
Alloc jdb_alloc Page-aligned buffer allocation
File jdb_fs Direct I/O file operations

History

The SSTable format originated from Google's Bigtable paper (2006), which introduced the LSM-Tree architecture for handling write-heavy workloads. The key insight was separating writes (sequential, fast) from reads (random, slower) by buffering writes in memory and periodically flushing sorted runs to disk.

LevelDB (2011) brought SSTable to the open-source world, followed by RocksDB (2012) which added numerous optimizations. The 4KB block size aligns with filesystem page size and SSD erase blocks, enabling Direct I/O to bypass kernel page cache for predictable latency.

The sparse index technique trades memory for disk seeks: instead of indexing every key, only the first key of each block is stored. This reduces index size by ~100x while adding at most one extra disk read per query.


About

This project is an open-source component of js0.site ⋅ Refactoring the Internet Plan.

We are redefining the development paradigm of the Internet in a componentized way. Welcome to follow us:


jdb_idx : 高性能 SSTable 索引引擎

目录

简介

jdb_idx 是 JDB 键值数据库的 SSTable 索引模块。提供高效的有序键值存储,支持稀疏索引、块组织和 LRU 缓存,实现高性能读写操作。

SSTable(Sorted String Table)是 LSM-Tree 数据库的核心存储格式。jdb_idx 实现了:

  • 4KB 对齐数据块,兼容 Direct I/O
  • 稀疏索引,O(log n) 块定位
  • 块内二分查找
  • 基于 bytes::Bytes 的零拷贝反序列化

特性

  • 4KB 对齐的块存储
  • 稀疏索引 + 二分查找
  • 线程安全的 LRU 块缓存
  • 基于 compio 的异步 I/O
  • 零拷贝数据访问
  • 大端序列化,跨平台兼容

安装

cargo add jdb_idx

使用示例

写入 SST 文件

use jdb_idx::SstWriter;
use jdb_proto::ValPtr;

async fn write_sst() -> jdb_idx::Result<()> {
  let mut writer = SstWriter::create("data.sst").await?;

  // Key 必须按序添加
  for i in 0u32..1000 {
    let key = format!("key{i:05}");
    let ptr = ValPtr::new(1, i as u64 * 100, 50);
    writer.add(key.as_bytes(), &ptr).await?;
  }

  let footer = writer.finish().await?;
  println!("写入 {} 条记录,共 {} 块", footer.num_entries, footer.num_blocks);
  Ok(())
}

读取 SST 文件

use jdb_idx::{SstReader, BlockCache};

async fn read_sst() -> jdb_idx::Result<()> {
  let reader = SstReader::open(1, "data.sst").await?;
  let cache = BlockCache::new(1024);

  // 点查询
  if let Some(ptr) = reader.get(b"key00500", &cache).await? {
    println!("找到: offset={}, len={}", ptr.offset, ptr.len);
  }

  // 全表扫描
  let mut iter = reader.iter();
  while let Some((key, ptr)) = iter.next().await? {
    println!("{:?} -> {:?}", key, ptr);
  }

  Ok(())
}

架构设计

SST 文件布局

┌─────────────────────────────────────────┐
│         数据块 (每块 4KB)                │
│  ┌─────────────────────────────────┐    │
│  │ Block 0: [Count][Records...]    │    │
│  ├─────────────────────────────────┤    │
│  │ Block 1: [Count][Records...]    │    │
│  ├─────────────────────────────────┤    │
│  │ ...                             │    │
│  └─────────────────────────────────┘    │
├─────────────────────────────────────────┤
│         索引块                          │
│  [Count:u32][KeyLen:u16][Key]...        │
├─────────────────────────────────────────┤
│         Footer (64 字节)                │
│  [IndexOffset][IndexLen][NumBlocks]     │
│  [NumEntries][Padding][Magic:JDBX]      │
└─────────────────────────────────────────┘

数据流程

graph TD
  A[SstWriter] -->|add key-ptr| B[BlockBuilder]
  B -->|块满| C[刷盘]
  B -->|记录首 key| D[IndexBuilder]
  C --> E[下一块]
  D -->|finish| F[写入索引]
  F --> G[写入 Footer]

  H[SstReader] -->|open| I[读取 Footer]
  I --> J[加载索引]
  J --> K[IndexSearcher]

  L[查询 Key] --> K
  K -->|二分查找| M[块索引]
  M --> N[BlockCache]
  N -->|未命中| O[从磁盘加载]
  N -->|命中| P[BlockIter]
  O --> P
  P -->|线性查找| Q[结果]

API 参考

核心类型

类型 说明
SstWriter 异步 SST 写入器,接收有序 key-ptr 对
SstReader 异步 SST 读取器,支持点查和迭代
SstIter 惰性异步迭代器,用于全表扫描
BlockBuilder 构建 4KB 数据块
BlockIter 块内条目迭代器
IndexBuilder 从首 key 构建稀疏索引
IndexSearcher 稀疏索引二分查找
BlockCache 线程安全 LRU 块缓存
Footer SST 文件元数据 (64 字节)

常量

常量 说明
BLOCK_SIZE 4096 数据块大小 (字节)
FOOTER_SIZE 64 Footer 大小 (字节)

错误类型

错误 说明
InvalidFooter Footer 格式无效
MagicMismatch 魔数不匹配
InvalidIndex 索引格式无效
EmptySst SST 文件为空
FileTooSmall 文件小于最小尺寸
KeyTooLarge Key 超出块容量
IndexTooLarge 索引条目数溢出

目录结构

jdb_idx/
├── src/
│   ├── lib.rs      # 模块导出
│   ├── block.rs    # BlockBuilder, BlockIter
│   ├── cache.rs    # BlockCache (LRU)
│   ├── error.rs    # 错误类型
│   ├── index.rs    # IndexBuilder, IndexSearcher
│   ├── layout.rs   # Footer, 常量
│   ├── reader.rs   # SstReader, SstIter
│   └── writer.rs   # SstWriter
├── tests/
│   └── main.rs     # 集成测试
└── Cargo.toml

技术栈

组件 用途
异步 I/O compio 基于 io_uring 的异步运行时
缓冲区 bytes 零拷贝字节缓冲
缓存 lru + parking_lot 线程安全 LRU 缓存
错误 thiserror 错误类型派生
内存分配 jdb_alloc 页对齐缓冲区分配
文件 jdb_fs Direct I/O 文件操作

历史故事

SSTable 格式源自 Google 2006 年发表的 Bigtable 论文,该论文提出了 LSM-Tree 架构来处理写密集型负载。核心思想是将写入(顺序、快速)与读取(随机、较慢)分离:写入先缓存在内存中,定期刷新为磁盘上的有序文件。

LevelDB(2011)将 SSTable 带入开源世界,随后 RocksDB(2012)添加了大量优化。4KB 块大小与文件系统页大小和 SSD 擦除块对齐,使 Direct I/O 能够绕过内核页缓存,获得可预测的延迟。

稀疏索引技术以内存换磁盘寻道:不索引每个 key,只存储每块的首 key。这将索引大小减少约 100 倍,同时每次查询最多增加一次磁盘读取。

有趣的是,"SSTable" 这个名字中的 "String" 其实是历史遗留。在 Bigtable 中,所有数据都被视为字节串(byte string),因此得名。现代实现中,SSTable 存储的是任意二进制键值对,早已超越了字符串的范畴。


关于

本项目为 js0.site ⋅ 重构互联网计划 的开源组件。

我们正在以组件化的方式重新定义互联网的开发范式,欢迎关注:

Commit count: 0

cargo fmt