dataspool-rs

Version: 0.2.0
Created: 2025-12-22
Updated: 2026-01-08
Description: Efficient data bundling system with indexed .spool files and SQLite vector database
Repository: https://github.com/Blackfall-Labs/dataspool-rs
Owner: Magnus Trent (magnus-trent)

Documentation: https://docs.rs/dataspool-rs

README

DataSpool - Efficient Data Bundling System

DataSpool is a high-performance data bundling library that eliminates per-file filesystem overhead by concatenating multiple items (cards, images, binary blobs) into a single indexed .spool file with SQLite-based metadata and vector embeddings.

Features

  • 📦 Efficient Bundling - Single file storage with byte-offset index
  • 🚀 Random Access - Direct seeks to any item without scanning
  • 🔍 Vector Search - SQLite-backed embeddings for semantic retrieval
  • 📊 Metadata Storage - Rich metadata with full-text search (FTS5)
  • 🔄 Multiple Variants - Cards (compressed CML), images, binary blobs
  • 💾 Compact Format - Minimal overhead, optimal for thousands of items
  • 🔐 Type-Safe - Rust type safety with serde serialization

Quick Start

Writing a Spool

use dataspool::{SpoolBuilder, SpoolEntry};

// Create spool builder
let mut builder = SpoolBuilder::new();

// Add entries
builder.add_entry(SpoolEntry {
    id: "item1".to_string(),
    data: b"Item 1 data".to_vec(),
});

builder.add_entry(SpoolEntry {
    id: "item2".to_string(),
    data: b"Item 2 data".to_vec(),
});

// Write to file
builder.write_to_file("data.spool")?;

Reading from a Spool

use dataspool::SpoolReader;

// Open spool
let reader = SpoolReader::open("data.spool")?;

// Read specific entry
let data = reader.read_entry(0)?; // Read first entry
println!("Item 0: {} bytes", data.len());

// Iterate entries
for (index, entry) in reader.iter_entries().enumerate() {
    let data = entry?;
    println!("Item {}: {} bytes", index, data.len());
}

Persistent Vector Store

use dataspool::{PersistentVectorStore, DocumentRef};

// Create persistent store
let mut store = PersistentVectorStore::new("vectors.db")?;

// Add document with embedding
let doc_ref = DocumentRef {
    id: "doc1".to_string(),
    file_path: "data.spool".to_string(),
    source: "web-scrape".to_string(),
    metadata: Some(r#"{"title": "Example"}"#.to_string()),
    spool_offset: Some(0),
    spool_length: Some(1024),
};

let embedding = vec![0.1, 0.2, 0.3, 0.4]; // Example embedding vector
store.add_document_ref(&doc_ref, &embedding)?;

// Search by vector similarity
let query_vector = vec![0.15, 0.25, 0.35, 0.45];
let results = store.search(&query_vector, 10)?;

for result in results {
    println!("ID: {}, Score: {:.3}", result.id, result.score);
}
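
For context, the Performance section below attributes search scoring to cosine similarity. As a point of reference only (a standalone sketch, not the crate's internal code), that metric looks like:

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    // Dot product over the element pairs.
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    // Euclidean norms of both vectors.
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    // Guard against zero vectors before normalizing.
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}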

Spool Format

File Structure

.spool file:
┌──────────────────────────────┐
│ Magic: "SP01"       (4 bytes)│
│ Version: 1          (1 byte) │
│ Card Count          (4 bytes)│
│ Index Offset        (8 bytes)│
├──────────────────────────────┤
│ Card 0 Data                  │
│ Card 1 Data                  │
│ ...                          │
│ Card N Data                  │
├──────────────────────────────┤
│ Index:                       │
│   [offset0, len0]            │
│   [offset1, len1]            │
│   ...                        │
│   [offsetN, lenN]            │
└──────────────────────────────┘

.db file (SQLite):
┌──────────────────────────────┐
│ documents table:             │
│   - id                       │
│   - file_path                │
│   - source                   │
│   - metadata (JSON)          │
│   - spool_offset             │
│   - spool_length             │
├──────────────────────────────┤
│ embeddings table:            │
│   - doc_id                   │
│   - vector (BLOB)            │
└──────────────────────────────┘
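
The exact DDL is not published in this README. The rusqlite snippet below is a plausible reconstruction of the diagrammed schema; the column types, the FTS5 virtual table, and the BLOB encoding are assumptions, and the crate manages its own schema internally.

use rusqlite::Connection;

// Hypothetical schema matching the diagram above; types are assumed.
fn create_schema(conn: &Connection) -> rusqlite::Result<()> {
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS documents (
             id            TEXT PRIMARY KEY,
             file_path     TEXT NOT NULL,
             source        TEXT,
             metadata      TEXT,    -- JSON string
             spool_offset  INTEGER,
             spool_length  INTEGER
         );
         CREATE TABLE IF NOT EXISTS embeddings (
             doc_id  TEXT REFERENCES documents(id),
             vector  BLOB NOT NULL  -- e.g. f32 values, byte order assumed
         );
         -- The feature list mentions FTS5; one plausible shape:
         CREATE VIRTUAL TABLE IF NOT EXISTS documents_fts
             USING fts5(id, metadata);",
    )
}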

Format Details

  • Magic Number: SP01 (4 bytes) - Identifies spool format
  • Version: 1 (1 byte) - Format version
  • Card Count: Number of entries in spool (u32)
  • Index Offset: Byte offset where index starts (u64)
  • Index: Array of [offset: u64, length: u64] pairs
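
A minimal sketch of parsing this header by hand, assuming little-endian integer encoding (the byte order is not stated here, so treat that as an assumption):

use std::convert::TryInto;
use std::fs::File;
use std::io::Read;

// Hypothetical header parser for the layout above.
fn read_header(path: &str) -> std::io::Result<(u32, u64)> {
    let mut file = File::open(path)?;
    // 4 (magic) + 1 (version) + 4 (card count) + 8 (index offset) = 17 bytes
    let mut header = [0u8; 17];
    file.read_exact(&mut header)?;
    assert_eq!(&header[0..4], b"SP01", "not a spool file");
    assert_eq!(header[4], 1, "unsupported format version");
    let card_count = u32::from_le_bytes(header[5..9].try_into().unwrap());
    let index_offset = u64::from_le_bytes(header[9..17].try_into().unwrap());
    Ok((card_count, index_offset))
}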

Architecture

┌─────────────┐
│  DataCard   │ (compressed CML)
└──────┬──────┘
       │
       v
┌─────────────┐     ┌──────────────┐
│ SpoolBuilder│────>│  .spool file │
└─────────────┘     └──────────────┘
                           │
                           v
                    ┌──────────────┐
                    │ SpoolReader  │
                    └──────┬───────┘
                           │
       ┌───────────────────┴───────────────────┐
       v                                       v
┌──────────────────┐                  ┌────────────────┐
│ PersistentVector │                  │  .db (SQLite)  │
│      Store       │<─────────────────│  - documents   │
└──────────────────┘                  │  - embeddings  │
                                      └────────────────┘
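
Putting the pieces together, here is a condensed end-to-end flow using only the APIs shown above; the identifier values, the placeholder embedding, and the spool_length computation are illustrative:

use dataspool::{DocumentRef, PersistentVectorStore, SpoolBuilder, SpoolEntry, SpoolReader};

fn pipeline() -> Result<(), Box<dyn std::error::Error>> {
    // Build the spool.
    let mut builder = SpoolBuilder::new();
    builder.add_entry(SpoolEntry { id: "card0".to_string(), data: b"payload".to_vec() });
    builder.write_to_file("kb.spool")?;

    // Read it back.
    let reader = SpoolReader::open("kb.spool")?;
    let data = reader.read_entry(0)?;

    // Register the entry in the vector store (placeholder embedding).
    let mut store = PersistentVectorStore::new("kb.db")?;
    let embedding = vec![0.1, 0.2, 0.3];
    store.add_document_ref(
        &DocumentRef {
            id: "card0".to_string(),
            file_path: "kb.spool".to_string(),
            source: "example".to_string(),
            metadata: None,
            spool_offset: Some(0),
            spool_length: Some(data.len() as u64),
        },
        &embedding,
    )?;
    Ok(())
}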

Use Cases

1. Knowledge Base Archival

Bundle thousands of documentation cards into a single file:

// Build spool from cards
let mut builder = SpoolBuilder::new();
for card in documentation_cards {
    builder.add_entry(SpoolEntry {
        id: card.id,
        data: card.compressed_data,
    });
}
builder.write_to_file("rust-stdlib.spool")?;

// Create vector index (offsets/lengths as recorded while building the spool;
// source and metadata values here are illustrative)
let mut store = PersistentVectorStore::new("rust-stdlib.db")?;
for (i, embedding) in embeddings.iter().enumerate() {
    store.add_document_ref(&DocumentRef {
        id: format!("card_{}", i),
        file_path: "rust-stdlib.spool".to_string(),
        source: "rust-stdlib".to_string(),
        metadata: None,
        spool_offset: Some(offsets[i]),
        spool_length: Some(lengths[i]),
    }, embedding)?;
}

2. Image Dataset Storage

Store image collections with metadata:

let mut builder = SpoolBuilder::new();

for image_path in image_paths {
    let data = std::fs::read(&image_path)?;
    builder.add_entry(SpoolEntry {
        id: image_path.file_stem().unwrap().to_string_lossy().to_string(),
        data,
    });
}

builder.write_to_file("images.spool")?;

3. Binary Blob Archival

Archive arbitrary binary data with fast random access:

// Write blobs
let mut builder = SpoolBuilder::new();
builder.add_entry(SpoolEntry { id: "blob1".into(), data: blob1 });
builder.add_entry(SpoolEntry { id: "blob2".into(), data: blob2 });
builder.write_to_file("blobs.spool")?;

// Random access read
let reader = SpoolReader::open("blobs.spool")?;
let blob1_data = reader.read_entry(0)?; // Direct access, no scan

Performance

Benchmark results (3,309 items, Rust stdlib documentation):

| Operation                  | Time   | Notes                     |
|----------------------------|--------|---------------------------|
| Build spool                | ~200ms | Writing all items + index |
| Read single item           | <1ms   | Direct byte offset seek   |
| Read all items             | ~50ms  | Sequential read           |
| SQLite insert (1 doc)      | ~0.5ms | With embedding            |
| Vector search (10 results) | ~5ms   | Cosine similarity + index |

Comparison to Alternatives

| Approach         | Read Speed          | Storage Overhead | Random Access |
|------------------|---------------------|------------------|---------------|
| Individual files | Slow (3,309 inodes) | High (4KB/file)  | Yes           |
| tar archive      | Slow (must scan)    | Low              | No            |
| zip archive      | Fast                | Medium           | Yes           |
| DataSpool        | Fast                | Minimal          | Yes           |

DataSpool Advantages

  • No compression overhead - Items pre-compressed by BytePunch
  • Instant random access - Direct byte offset, no central directory scan
  • Integrated vector DB - Semantic search without external tools
  • Minimal format - Simple binary format, easy to parse
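
To make the "instant random access" point concrete: against the format described above, a hand-rolled read costs two seeks and two reads regardless of item count. This is a sketch assuming little-endian encoding; SpoolReader does the equivalent for you:

use std::convert::TryInto;
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

// Hypothetical raw random read; index records are [offset: u64, len: u64].
fn read_item(file: &mut File, index_offset: u64, i: u64) -> std::io::Result<Vec<u8>> {
    // Seek straight to the i-th 16-byte index record.
    file.seek(SeekFrom::Start(index_offset + i * 16))?;
    let mut rec = [0u8; 16];
    file.read_exact(&mut rec)?;
    let offset = u64::from_le_bytes(rec[0..8].try_into().unwrap());
    let length = u64::from_le_bytes(rec[8..16].try_into().unwrap());
    // Then straight to the payload itself.
    file.seek(SeekFrom::Start(offset))?;
    let mut data = vec![0u8; length as usize];
    file.read_exact(&mut data)?;
    Ok(data)
}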

Dependencies

[dependencies]
dataspool = "0.2.0"
bytepunch = "0.1.0"  # Decompresses items stored in compressed form

Dependency Graph

dataspool
β”œβ”€β”€ bytepunch (compression)
β”œβ”€β”€ rusqlite (SQLite database)
β”œβ”€β”€ serde (serialization)
└── thiserror (error handling)

Features

Default

Basic spool read/write and persistent vector store.

Optional: async

Async APIs for non-blocking I/O:

[dependencies]
dataspool = { version = "0.2.0", features = ["async"] }

use dataspool::async_api::AsyncSpoolReader;

let reader = AsyncSpoolReader::open("data.spool").await?;
let data = reader.read_entry(0).await?;

Installation

Add to Cargo.toml:

[dependencies]
dataspool = "0.2.0"

Or with async support:

[dependencies]
dataspool = { version = "0.2.0", features = ["async"] }

Testing

# Run all tests
cargo test

# Run with logging
RUST_LOG=debug cargo test

# Test specific module
cargo test spool
cargo test persistent_store

Examples

See examples/ directory:

  • build_spool.rs - Build a spool from files
  • read_spool.rs - Read entries from a spool
  • vector_search.rs - Semantic search with embeddings

Run with:

cargo run --example build_spool
cargo run --example read_spool
cargo run --example vector_search

Roadmap

  • Image-based spools with EXIF metadata
  • Audio/video spool variants
  • Compression statistics per entry
  • Incremental spool updates (append-only mode)
  • Multi-threaded indexing
  • Memory-mapped I/O for large spools
  • Network streaming protocol

History

Extracted from the SAM (Societal Advisory Module) project, where it provides the spool bundling system for knowledge base archival.

License

MIT - See LICENSE for details.

Author

Magnus Trent <magnus@blackfall.dev>
