sbitmap

Crates.io	sbitmap
lib.rs	sbitmap
version	0.1.1
created_at	2025-10-01 04:20:47.870786+00
updated_at	2025-10-07 03:25:21.497278+00
description	Fast and scalable bitmap implementation based on Linux kernel's sbitmap
homepage
repository	https://github.com/ublk-org/sbitmap
max_upload_size
id	1862067
size	103,138

Ming Lei (ming1)

documentation

README

sbitmap - Scalable Bitmap Allocator

A fast and scalable lock-free bitmap implementation based on the Linux kernel's sbitmap.

Overview

sbitmap provides a high-performance, cache-line optimized bitmap for concurrent bit allocation across multiple threads. It's designed for scenarios where many threads need to allocate and free bits from a shared pool efficiently.

Features

Lock-free: All operations use atomic instructions without locks
Cache-line aligned: Each bitmap word is on its own cache line to prevent false sharing
Lightweight hints: Callers pass allocation hints by reference - no thread-local overhead
Scalable: Tested with high concurrency workloads
Memory efficient: Bit-level granularity with minimal overhead

Design

This implementation is based on the Linux kernel's sbitmap (from lib/sbitmap.c), specifically designed for:

High-concurrency scenarios (multiple queues, multiple threads)
Efficient resource allocation (journal entries, tags, etc.)
Low-latency allocation and deallocation

Key Optimizations

Cache-line separation: Each SbitmapWord is aligned to 64 bytes
Per-task allocation hints: Caller-provided hints reduce contention without thread-local overhead
Atomic operations: Acquire/Release semantics for correctness
No deferred clearing: Direct atomic bit clearing for simplicity

Usage

Add to your Cargo.toml:

[dependencies]
sbitmap = "0.1"

Basic Example

use sbitmap::Sbitmap;

// Create a bitmap with 1024 bits (non-round-robin mode)
let sb = Sbitmap::new(1024, None, false);

// Each caller maintains its own allocation hint
let mut hint = 0;

// Allocate a bit
if let Some(bit) = sb.get(&mut hint) {
    // Use the allocated bit
    println!("Allocated bit: {}", bit);

    // Free it when done
    sb.put(bit, &mut hint);
}

Concurrent Usage

use sbitmap::Sbitmap;
use std::sync::Arc;
use std::thread;

let sb = Arc::new(Sbitmap::new(1024, None, false));
let mut handles = vec![];

for _ in 0..8 {
    let sb = Arc::clone(&sb);
    handles.push(thread::spawn(move || {
        // Each thread maintains its own hint in local context
        let mut hint = 0;

        // Each thread can safely allocate/free bits
        if let Some(bit) = sb.get(&mut hint) {
            // Do work...
            sb.put(bit, &mut hint);
        }
    }));
}

for h in handles {
    h.join().unwrap();
}

Batch Allocation

use sbitmap::Sbitmap;

// Create a bitmap with 1024 bits
let sb = Sbitmap::new(1024, None, false);
let mut hint = 0;

// Allocate 4 consecutive bits atomically
if let Some(start_bit) = sb.get_batch(4, &mut hint) {
    // Use bits: start_bit, start_bit+1, start_bit+2, start_bit+3
    println!("Allocated bits {}-{}", start_bit, start_bit + 3);

    // Process consecutive resources...
    for i in 0..4 {
        println!("Using bit {}", start_bit + i);
    }

    // Free all 4 bits atomically when done
    sb.put_batch(start_bit, 4, &mut hint);
}

Note: Batch operations require nr_bits <= bits_per_word(). All consecutive bits are guaranteed to be within the same word (no spanning across word boundaries).

API

`Sbitmap::new(depth: usize, shift: Option<u32>, round_robin: bool) -> Self`

Create a new sbitmap with depth bits. The shift parameter controls how many bits per word (2^shift bits per word) and is critical for performance - it determines how bits are spread across multiple cache-line aligned words. When None, the shift is auto-calculated for optimal cache usage. The round_robin parameter enables strict round-robin allocation order (usually false for better performance).

Understanding the shift parameter:

The shift value spreads bits among multiple words, which is key to sbitmap performance
Each word is on a separate cache line (64 bytes), reducing contention between CPUs
Smaller shift = more words = better spreading = less contention (but more memory overhead)
Larger shift = fewer words = more contention (but better memory efficiency)

`get(&self, hint: &mut usize) -> Option<usize>`

Allocate a free bit. The hint parameter is a mutable reference to the caller's allocation hint, which helps reduce contention by spreading allocations across different parts of the bitmap. Returns Some(bit_number) on success or None if no free bits are available.

`put(&self, bitnr: usize, hint: &mut usize)`

Free a previously allocated bit. The hint parameter is updated to improve cache locality for subsequent allocations.

`get_batch(&self, nr_bits: usize, hint: &mut usize) -> Option<usize>`

Allocate nr_bits consecutive free bits from the bitmap atomically. This operation provides acquire barrier semantics on success. Only supports nr_bits <= bits_per_word() to ensure all bits are within the same word (no spanning across word boundaries).

Returns Some(start_bit) where start_bit is the first bit of the allocated consecutive range, or None if no consecutive nr_bits are available or nr_bits > bits_per_word().

Use cases:

Allocating contiguous resource ranges (e.g., multiple consecutive I/O tags)
Batch resource allocation for improved efficiency
DMA buffer allocation requiring consecutive indices

`put_batch(&self, bitnr: usize, nr_bits: usize, hint: &mut usize)`

Free nr_bits consecutive previously allocated bits starting from bitnr. This operation provides release barrier semantics, ensuring that all writes to data associated with these bits are visible before the bits are freed. Only supports nr_bits <= bits_per_word() to ensure all bits are within the same word.

The hint parameter is updated for better cache locality in subsequent allocations.

`test_bit(&self, bitnr: usize) -> bool`

Check if a bit is currently allocated.

`weight(&self) -> usize`

Count the number of currently allocated bits.

`depth(&self) -> usize`

Get the total number of bits in the bitmap.

Use Cases

Tag allocation: I/O tag allocation for block devices
Resource pools: Any scenario requiring efficient concurrent resource allocation
Lock-free data structures: Building block for concurrent algorithms
Batch resource allocation: Allocating multiple consecutive I/O tags, DMA buffers, or contiguous resource ranges
NUMA machine: improvement on NUMA machines is obvious

Performance Characteristics

Allocation: O(n) worst case, O(1) average with hints
Deallocation: O(1)
Batch allocation: O(n * nr_bits) worst case, finds consecutive bits within single word
Batch deallocation: O(1), atomic clear of consecutive bits
Memory overhead: ~56 bytes per word (64 bits) due to cache-line alignment
Thread safety: Lock-free with atomic operations
Scalability: Linear scaling with number of CPUs up to bitmap depth

Performance Tuning

The shift parameter is crucial for tuning sbitmap performance based on your workload:

When to use a smaller shift:

High contention: When many threads are competing heavily for bit allocation and release, use a smaller shift to spread bits across more words and reduce contention on individual cache lines
NUMA systems: Machines with multiple NUMA nodes benefit significantly from smaller shift values, as this distributes memory accesses across more cache lines and reduces cross-node traffic
Many concurrent allocators: Systems with a high CPU count see better scalability with smaller shift values

Examples:

// High contention scenario (32-core NUMA system)
let sb = Sbitmap::new(1024, Some(4), false);  // 2^4 = 16 bits per word, 64 words

// Low contention scenario (4-core system)
let sb = Sbitmap::new(1024, Some(6), false);  // 2^6 = 64 bits per word, 16 words

// Let sbitmap decide (recommended starting point)
let sb = Sbitmap::new(1024, None, false);     // Auto-calculated based on depth

Trade-offs:

Smaller shift improves performance under contention but uses more memory (each word needs 64 bytes for cache-line alignment)
Larger shift reduces memory overhead but increases contention when many threads compete
The auto-calculated shift (when None) provides a balanced default suitable for most workloads

Memory Ordering

get(): Acquire semantics - ensures allocated bit is visible before use
put(): Release semantics - ensures all writes complete before bit is freed
get_batch(): Acquire semantics - ensures all allocated bits are visible before use
put_batch(): Release semantics - ensures all writes complete before bits are freed

Comparison with Alternatives

Feature	sbitmap	Mutex + BitVec	AtomicBitSet
Lock-free	✅	❌	✅
Cache-optimized	✅	❌	❌
Per-thread hints	✅	❌	❌
Kernel-proven design	✅	❌	❌

Benchmarks

To compare sbitmap performance against a simple lockless bitmap:

# Run with defaults (32 bits, auto shift, 10 seconds, N-1 tasks)
cargo run --bin bench_compare --release

# Specify bitmap depth and duration
cargo run --bin bench_compare --release -- --depth 1024 --time 5

# Specify bitmap depth, shift, and duration
cargo run --bin bench_compare --release -- --depth 512 --shift 5 --time 10

# Benchmark batch operations (allocating 4 consecutive bits)
cargo run --bin bench_compare --release -- --depth 128 --batch 4 --time 5

# Show help
cargo run --bin bench_compare --release -- --help

This benchmark:

Auto-detects available CPUs and spawns N-1 concurrent tasks
Measures operations per second (get + put pairs for single-bit mode, get_batch + put_batch pairs for batch mode)
Compares sbitmap vs a baseline lockless implementation (single-bit mode only)
Defaults: 32 bits, auto-calculated shift, 10 seconds, N-1 tasks (where N is total CPU count)

Options:

--depth DEPTH - Bitmap depth in bits (default: 32)
--shift SHIFT - log2(bits per word), auto-calculated if not specified
--time TIME - Benchmark duration in seconds (default: 10)
--tasks TASKS - Number of concurrent tasks (default: NUM_CPUS - 1)
--batch NR_BITS - Use get_batch/put_batch with NR_BITS (default: 1, single bit mode)
--round-robin - Enable round-robin allocation mode (default: disabled)

See benches/README.md for more details.

Example output on a 32-CPU system:

System: 32 CPUs detected, 2 NUMA nodes, using 31 tasks for benchmark
Bitmap depth: 32 bits
Shift: auto-calculated (bits per word: 8)
Duration: 10 seconds


=== Sbitmap (Optimized) Benchmark ===
Configuration:
  - Duration: 10s
  - Tasks: 31
  - Bitmap depth: 32 bits

Results:
  Task 0: 3101117 ops, 310111 ops/sec (0.3101 Mops/sec)
  ...
  Task 30: 3169582 ops, 316958 ops/sec (0.3170 Mops/sec)
  Total: 93604448 ops, 9360444 ops/sec (9.3604 Mops/sec)

=== SimpleBitmap (Baseline) Benchmark ===
Configuration:
  - Duration: 10s
  - Tasks: 31
  - Bitmap depth: 32 bits

Results:
  Task 0: 1998241 ops, 199824 ops/sec (0.1998 Mops/sec)
  ...
  Task 30: 1835360 ops, 183536 ops/sec (0.1835 Mops/sec)
  Total: 62530560 ops, 6253056 ops/sec (6.2531 Mops/sec)

License

Licensed under either of:

MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)

at your option.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Credits

Based on the Linux kernel's sbitmap implementation by Jens Axboe and other contributors.

Commit count: 0