jdb_idx

Crates.io	jdb_idx
lib.rs	jdb_idx
version	0.1.1
created_at	2025-12-23 21:04:38.867078+00
updated_at	2025-12-23 21:04:38.867078+00
description	High-performance SSTable index engine / 高性能 SSTable 索引引擎
homepage	https://github.com/js0/jdb/tree/main/jdb_idx
repository	https://github.com/js0/jdb.git
max_upload_size
id	2002340
size	97,732

i18n.site (i18nsite)

documentation

README

English | 中文

jdb_idx : High-Performance SSTable Index for Key-Value Storage

Introduction
Features
Installation
Usage
Architecture
API Reference
Directory Structure
Tech Stack
History

Introduction

jdb_idx is the SSTable index module for JDB (a key-value database). It provides efficient sorted key-value storage with sparse indexing, block-based organization, and LRU caching for high-performance read/write operations.

SSTable (Sorted String Table) is the core storage format used by LSM-Tree based databases. jdb_idx implements:

4KB aligned data blocks for Direct I/O compatibility
Sparse index for O(log n) block location
Binary search within blocks for key lookup
Zero-copy deserialization using bytes::Bytes

Features

Block-based storage with 4KB alignment
Sparse index with binary search
LRU block cache with thread-safe access
Async I/O via compio runtime
Zero-copy data access
Big-endian serialization for portability

Installation

cargo add jdb_idx

Usage

Writing SST File

use jdb_idx::SstWriter;
use jdb_proto::ValPtr;

async fn write_sst() -> jdb_idx::Result<()> {
  let mut writer = SstWriter::create("data.sst").await?;

  // Keys must be added in sorted order
  for i in 0u32..1000 {
    let key = format!("key{i:05}");
    let ptr = ValPtr::new(1, i as u64 * 100, 50);
    writer.add(key.as_bytes(), &ptr).await?;
  }

  let footer = writer.finish().await?;
  println!("Wrote {} entries in {} blocks", footer.num_entries, footer.num_blocks);
  Ok(())
}

Reading SST File

use jdb_idx::{SstReader, BlockCache};

async fn read_sst() -> jdb_idx::Result<()> {
  let reader = SstReader::open(1, "data.sst").await?;
  let cache = BlockCache::new(1024);

  // Point query
  if let Some(ptr) = reader.get(b"key00500", &cache).await? {
    println!("Found: offset={}, len={}", ptr.offset, ptr.len);
  }

  // Full scan
  let mut iter = reader.iter();
  while let Some((key, ptr)) = iter.next().await? {
    println!("{:?} -> {:?}", key, ptr);
  }

  Ok(())
}

Architecture

SST File Layout

┌─────────────────────────────────────────┐
│         Data Blocks (4KB each)          │
│  ┌─────────────────────────────────┐    │
│  │ Block 0: [Count][Records...]    │    │
│  ├─────────────────────────────────┤    │
│  │ Block 1: [Count][Records...]    │    │
│  ├─────────────────────────────────┤    │
│  │ ...                             │    │
│  └─────────────────────────────────┘    │
├─────────────────────────────────────────┤
│         Index Block                     │
│  [Count:u32][KeyLen:u16][Key]...        │
├─────────────────────────────────────────┤
│         Footer (64 bytes)               │
│  [IndexOffset][IndexLen][NumBlocks]     │
│  [NumEntries][Padding][Magic:JDBX]      │
└─────────────────────────────────────────┘

Data Flow

graph TD
  A[SstWriter] -->|add key-ptr| B[BlockBuilder]
  B -->|block full| C[Flush to File]
  B -->|record first key| D[IndexBuilder]
  C --> E[Next Block]
  D -->|finish| F[Write Index]
  F --> G[Write Footer]

  H[SstReader] -->|open| I[Read Footer]
  I --> J[Load Index]
  J --> K[IndexSearcher]

  L[Query Key] --> K
  K -->|binary search| M[Block Index]
  M --> N[BlockCache]
  N -->|miss| O[Load from Disk]
  N -->|hit| P[BlockIter]
  O --> P
  P -->|linear search| Q[Result]

API Reference

Core Types

Type	Description
`SstWriter`	Async SST file writer, accepts sorted key-ptr pairs
`SstReader`	Async SST file reader with point query and iteration
`SstIter`	Lazy async iterator for full table scan
`BlockBuilder`	Builds 4KB data blocks with key-ptr records
`BlockIter`	Iterator for reading entries from block
`IndexBuilder`	Builds sparse index from first keys
`IndexSearcher`	Binary search over sparse index
`BlockCache`	Thread-safe LRU cache for data blocks
`Footer`	SST file metadata (64 bytes)

Constants

Constant	Value	Description
`BLOCK_SIZE`	4096	Data block size in bytes
`FOOTER_SIZE`	64	Footer size in bytes

Error Types

Error	Description
`InvalidFooter`	Footer format invalid
`MagicMismatch`	Magic number mismatch
`InvalidIndex`	Index format invalid
`EmptySst`	SST file has no entries
`FileTooSmall`	File smaller than minimum size
`KeyTooLarge`	Key exceeds block capacity
`IndexTooLarge`	Index entry count overflow

Directory Structure

jdb_idx/
├── src/
│   ├── lib.rs      # Module exports
│   ├── block.rs    # BlockBuilder, BlockIter
│   ├── cache.rs    # BlockCache (LRU)
│   ├── error.rs    # Error types
│   ├── index.rs    # IndexBuilder, IndexSearcher
│   ├── layout.rs   # Footer, constants
│   ├── reader.rs   # SstReader, SstIter
│   └── writer.rs   # SstWriter
├── tests/
│   └── main.rs     # Integration tests
└── Cargo.toml

Tech Stack

Component	Library	Purpose
Async I/O	compio	io_uring based async runtime
Buffer	bytes	Zero-copy byte buffer
Cache	lru + parking_lot	Thread-safe LRU cache
Error	thiserror	Error type derivation
Alloc	jdb_alloc	Page-aligned buffer allocation
File	jdb_fs	Direct I/O file operations

History

The SSTable format originated from Google's Bigtable paper (2006), which introduced the LSM-Tree architecture for handling write-heavy workloads. The key insight was separating writes (sequential, fast) from reads (random, slower) by buffering writes in memory and periodically flushing sorted runs to disk.

LevelDB (2011) brought SSTable to the open-source world, followed by RocksDB (2012) which added numerous optimizations. The 4KB block size aligns with filesystem page size and SSD erase blocks, enabling Direct I/O to bypass kernel page cache for predictable latency.

The sparse index technique trades memory for disk seeks: instead of indexing every key, only the first key of each block is stored. This reduces index size by ~100x while adding at most one extra disk read per query.

About

This project is an open-source component of js0.site ⋅ Refactoring the Internet Plan.

We are redefining the development paradigm of the Internet in a componentized way. Welcome to follow us:

jdb_idx : 高性能 SSTable 索引引擎

简介

jdb_idx 是 JDB 键值数据库的 SSTable 索引模块。提供高效的有序键值存储，支持稀疏索引、块组织和 LRU 缓存，实现高性能读写操作。

SSTable（Sorted String Table）是 LSM-Tree 数据库的核心存储格式。jdb_idx 实现了：

4KB 对齐数据块，兼容 Direct I/O
稀疏索引，O(log n) 块定位
块内二分查找
基于 bytes::Bytes 的零拷贝反序列化

特性

4KB 对齐的块存储
稀疏索引 + 二分查找
线程安全的 LRU 块缓存
基于 compio 的异步 I/O
零拷贝数据访问
大端序列化，跨平台兼容

安装

cargo add jdb_idx

使用示例

写入 SST 文件

use jdb_idx::SstWriter;
use jdb_proto::ValPtr;

async fn write_sst() -> jdb_idx::Result<()> {
  let mut writer = SstWriter::create("data.sst").await?;

  // Key 必须按序添加
  for i in 0u32..1000 {
    let key = format!("key{i:05}");
    let ptr = ValPtr::new(1, i as u64 * 100, 50);
    writer.add(key.as_bytes(), &ptr).await?;
  }

  let footer = writer.finish().await?;
  println!("写入 {} 条记录，共 {} 块", footer.num_entries, footer.num_blocks);
  Ok(())
}

读取 SST 文件

use jdb_idx::{SstReader, BlockCache};

async fn read_sst() -> jdb_idx::Result<()> {
  let reader = SstReader::open(1, "data.sst").await?;
  let cache = BlockCache::new(1024);

  // 点查询
  if let Some(ptr) = reader.get(b"key00500", &cache).await? {
    println!("找到: offset={}, len={}", ptr.offset, ptr.len);
  }

  // 全表扫描
  let mut iter = reader.iter();
  while let Some((key, ptr)) = iter.next().await? {
    println!("{:?} -> {:?}", key, ptr);
  }

  Ok(())
}

架构设计

SST 文件布局

┌─────────────────────────────────────────┐
│         数据块 (每块 4KB)                │
│  ┌─────────────────────────────────┐    │
│  │ Block 0: [Count][Records...]    │    │
│  ├─────────────────────────────────┤    │
│  │ Block 1: [Count][Records...]    │    │
│  ├─────────────────────────────────┤    │
│  │ ...                             │    │
│  └─────────────────────────────────┘    │
├─────────────────────────────────────────┤
│         索引块                          │
│  [Count:u32][KeyLen:u16][Key]...        │
├─────────────────────────────────────────┤
│         Footer (64 字节)                │
│  [IndexOffset][IndexLen][NumBlocks]     │
│  [NumEntries][Padding][Magic:JDBX]      │
└─────────────────────────────────────────┘

数据流程

graph TD
  A[SstWriter] -->|add key-ptr| B[BlockBuilder]
  B -->|块满| C[刷盘]
  B -->|记录首 key| D[IndexBuilder]
  C --> E[下一块]
  D -->|finish| F[写入索引]
  F --> G[写入 Footer]

  H[SstReader] -->|open| I[读取 Footer]
  I --> J[加载索引]
  J --> K[IndexSearcher]

  L[查询 Key] --> K
  K -->|二分查找| M[块索引]
  M --> N[BlockCache]
  N -->|未命中| O[从磁盘加载]
  N -->|命中| P[BlockIter]
  O --> P
  P -->|线性查找| Q[结果]

API 参考

核心类型

类型	说明
`SstWriter`	异步 SST 写入器，接收有序 key-ptr 对
`SstReader`	异步 SST 读取器，支持点查和迭代
`SstIter`	惰性异步迭代器，用于全表扫描
`BlockBuilder`	构建 4KB 数据块
`BlockIter`	块内条目迭代器
`IndexBuilder`	从首 key 构建稀疏索引
`IndexSearcher`	稀疏索引二分查找
`BlockCache`	线程安全 LRU 块缓存
`Footer`	SST 文件元数据 (64 字节)

常量

常量	值	说明
`BLOCK_SIZE`	4096	数据块大小 (字节)
`FOOTER_SIZE`	64	Footer 大小 (字节)

错误类型

错误	说明
`InvalidFooter`	Footer 格式无效
`MagicMismatch`	魔数不匹配
`InvalidIndex`	索引格式无效
`EmptySst`	SST 文件为空
`FileTooSmall`	文件小于最小尺寸
`KeyTooLarge`	Key 超出块容量
`IndexTooLarge`	索引条目数溢出

目录结构

jdb_idx/
├── src/
│   ├── lib.rs      # 模块导出
│   ├── block.rs    # BlockBuilder, BlockIter
│   ├── cache.rs    # BlockCache (LRU)
│   ├── error.rs    # 错误类型
│   ├── index.rs    # IndexBuilder, IndexSearcher
│   ├── layout.rs   # Footer, 常量
│   ├── reader.rs   # SstReader, SstIter
│   └── writer.rs   # SstWriter
├── tests/
│   └── main.rs     # 集成测试
└── Cargo.toml

技术栈

组件	库	用途
异步 I/O	compio	基于 io_uring 的异步运行时
缓冲区	bytes	零拷贝字节缓冲
缓存	lru + parking_lot	线程安全 LRU 缓存
错误	thiserror	错误类型派生
内存分配	jdb_alloc	页对齐缓冲区分配
文件	jdb_fs	Direct I/O 文件操作

历史故事

SSTable 格式源自 Google 2006 年发表的 Bigtable 论文，该论文提出了 LSM-Tree 架构来处理写密集型负载。核心思想是将写入（顺序、快速）与读取（随机、较慢）分离：写入先缓存在内存中，定期刷新为磁盘上的有序文件。

LevelDB（2011）将 SSTable 带入开源世界，随后 RocksDB（2012）添加了大量优化。4KB 块大小与文件系统页大小和 SSD 擦除块对齐，使 Direct I/O 能够绕过内核页缓存，获得可预测的延迟。

稀疏索引技术以内存换磁盘寻道：不索引每个 key，只存储每块的首 key。这将索引大小减少约 100 倍，同时每次查询最多增加一次磁盘读取。

有趣的是，"SSTable" 这个名字中的 "String" 其实是历史遗留。在 Bigtable 中，所有数据都被视为字节串（byte string），因此得名。现代实现中，SSTable 存储的是任意二进制键值对，早已超越了字符串的范畴。

关于

本项目为 js0.site ⋅ 重构互联网计划的开源组件。

我们正在以组件化的方式重新定义互联网的开发范式，欢迎关注：

Commit count: 0