casq_core

Crates.iocasq_core
lib.rscasq_core
version0.8.2
created_at2026-01-22 21:45:55.370508+00
updated_at2026-01-25 18:51:26.05737+00
descriptionA minimal content-addressed file store using BLAKE3. (library)
homepage
repositoryhttps://github.com/roobie/casq
max_upload_size
id2062683
size189,888
Björn Roberg (roobie)

documentation

README

casq_core

A production-ready content-addressed file store (CAS) library with compression and chunking (v0.4.0).

Overview

casq_core is a Rust library that provides the core functionality for casq, a content-addressed storage system with modern efficiency features. It stores files and directories by their cryptographic hash, ensuring immutable, deduplicated storage with transparent compression, content-defined chunking, and built-in garbage collection.

Think of it as a minimal git object store or restic backend, but with compression and chunking for storage efficiency.

Features

  • Content-Addressed Storage - Files and directories stored by BLAKE3 hash
  • Transparent Compression - 3-5x storage reduction with zstd (files ≥ 4KB)
  • Content-Defined Chunking - FastCDC for incremental backups (files ≥ 1MB)
  • Automatic Deduplication - Identical content stored only once (including chunk-level)
  • Tree-Based Directories - Canonical ordering ensures stable hashes
  • Atomic Operations - Tempfile-based writes prevent corruption
  • Garbage Collection - Mark & sweep algorithm handles all object types
  • Corruption Detection - Hash verification on all reads
  • Named References - GC roots for preserving important snapshots
  • Full Round-Trip - Add → Store → GC → Materialize
  • Cross-Platform - Unix permissions preserved, Windows supported
  • Gitignore Support - Respects .gitignore during filesystem walks

Quick Start

use casq_core::{Store, Algorithm};
use std::path::Path;

// Initialize a new store
let store = Store::init("./my-store", Algorithm::Blake3)?;

// Add a file or directory
let hash = store.add_path(Path::new("./my-data"))?;

// Create a named reference (GC root)
store.refs().add("backup-2024", &hash)?;

// Garbage collect unreferenced objects
let stats = store.gc(false)?;
println!("Deleted {} objects, freed {} bytes",
         stats.objects_deleted, stats.bytes_freed);

// Materialize back to filesystem
store.materialize(&hash, Path::new("./restored"))?;

Architecture

Storage Format

Objects are stored with a 16-byte header followed by the payload.

Object format:

0x00  4   "CAFS" magic
0x04  1   version (u8) = 2
0x05  1   type: 1=blob, 2=tree, 3=chunk_list
0x06  1   algo: 1=blake3-256
0x07  1   compression: 0=none, 1=zstd
0x08  8   payload_len (u64 LE) - compressed size if compressed
0x10  ... payload (possibly compressed)

Directory Structure

$STORE_ROOT/
  config               # Store configuration (version, algorithm)
  objects/
    blake3-256/        # Algorithm-specific directory
      ab/              # First 2 hex chars (shard)
        abcd...ef      # Remaining 62 hex chars (object file)
  refs/                # Named references (GC roots)
    backup-name

Object Types

  1. Blob - Raw file content (automatically compressed if ≥ 4KB, hash of uncompressed payload)
  2. Tree - Directory structure (sorted entries by name for canonical hashing)
  3. ChunkList - Large file metadata (files ≥ 1MB split into variable-size chunks using FastCDC)

Module Structure

casq_core/src/
├── lib.rs       - Public API and documentation
├── error.rs     - Error types with thiserror
├── hash.rs      - BLAKE3 hashing (32-byte digests)
├── object.rs    - Binary object encoding/decoding
├── chunking.rs  - Content-defined chunking with FastCDC (v0.4.0+)
├── store.rs     - Store management with compression/chunking
├── tree.rs      - Tree entry encoding with canonical sorting
├── walk.rs      - Filesystem traversal with gitignore
├── gc.rs        - Garbage collection (mark & sweep, handles all object types)
├── refs.rs      - Reference management
└── journal.rs   - Operation journal

Building and Testing

# Build the library
cargo build --release -p casq_core

# Run all tests (121 unit tests + 23 property tests + 1 doctest)
cargo test -p casq_core

# Run only property tests
cargo test -p casq_core prop

# Run with output
cargo test -p casq_core -- --nocapture

# Check code quality
cargo clippy -p casq_core -- -D warnings

# Format code
cargo fmt -p casq_core

API Overview

Core Types

  • Store - Main store interface
  • Hash - 32-byte BLAKE3 hash wrapper
  • TreeEntry - File/directory entry in a tree
  • RefManager - Manages named references
  • GcStats - Garbage collection statistics

Main Operations

// Store initialization
let store = Store::init(path, Algorithm::Blake3)?;
let store = Store::open(path)?;

// Object storage
let hash = store.put_blob(reader)?;
let hash = store.put_tree(entries)?;
let hash = store.add_path(path)?;  // Recursively add file/dir

// Object retrieval
let data = store.get_blob(&hash)?;
let entries = store.get_tree(&hash)?;
store.cat_blob(&hash, writer)?;

// Materialization
store.materialize(&hash, dest_path)?;

// References
store.refs().add(name, &hash)?;
let hash = store.refs().get(name)?;
let all_refs = store.refs().list()?;
store.refs().remove(name)?;

// Garbage collection
let stats = store.gc(dry_run)?;

Design Principles

  1. Content-Addressed - Objects are immutable and identified by hash
  2. Canonical Hashing - Tree entries sorted by name for stable hashes
  3. Transparent Optimization - Compression and chunking automatic, invisible to API consumers
  4. Atomic Writes - Use tempfile for corruption-free operations
  5. Simple Format - Binary format with clear headers, human-inspectable paths
  6. Efficient Storage - 3-5x compression typical, incremental backups via chunking
  7. Local-Only - Single-user design, no network features

Hashing Rules

  • Blob hash: hash = blake3(uncompressed_payload_bytes) (payload only, not header)
  • ChunkList hash: hash = blake3(original_file_bytes) (not the ChunkList metadata)
  • Tree hash: Hash of canonicalized entries (sorted by name, bytewise UTF-8)
  • Object path: objects/<algo>/<prefix>/<suffix> where prefix is first 2 hex chars
  • Important: Hashes are stable regardless of compression/chunking

Garbage Collection

  • Refs are GC roots stored in refs/ directory
  • Mark phase traverses from all refs, recursively following tree entries
  • Sweep phase deletes objects not in the reachable set
  • Dry-run mode available for safe preview before deletion

Test Coverage

✓ 121 unit tests passing (100% pass rate)
✓ 23 property tests (generative invariant verification)
✓ 1 doctest passing
✓ 100% core functionality coverage
✓ Edge cases: corruption, empty files/dirs, large files, permissions
✓ Round-trip testing: add → store → materialize → verify
✓ Compression/chunking: thresholds, boundaries, deduplication

Test Categories

Unit Tests:

  • Hash operations - Encoding, decoding, validation
  • Object encoding - Headers, payload, compression types
  • Chunking - FastCDC boundaries, deterministic chunking, small files
  • Store operations - Init, open, blob/tree/chunklist storage, compression
  • Tree operations - Canonical ordering, nested structures
  • Filesystem walking - Files, directories, permissions, gitignore
  • References - CRUD operations, validation
  • Garbage collection - Mark, sweep, dry-run, all object types including chunks
  • Materialization - Blobs, trees, chunked files, nested structures, permissions

Property Tests:

  • Hash determinism - Hashing same data always produces same result
  • Serialization round-trips - All binary formats (headers, chunks, trees)
  • Compression identity - Compress/decompress preserves data
  • Chunking invariants - Size bounds, determinism, total size preservation
  • Tree canonicalization - Order-independent hashing
  • GC correctness - Preserves referenced, deletes unreferenced, idempotent
  • Ref validation - Valid names accepted

Limitations (By Design)

The following are intentionally not supported:

  • ❌ Network operations (remote stores)
  • ❌ Multi-user/concurrent access
  • ❌ Encryption (planned for future)
  • ❌ Parallel operations (single-threaded)
  • ❌ Symbolic links
  • ❌ Special file types (devices, sockets, etc.)
  • ❌ Extended attributes or ACLs beyond basic POSIX permissions

Note: Compression and chunking are now supported in v0.4.0+

Performance Characteristics

  • Hash algorithm: BLAKE3 (fast, cryptographically secure)
  • Compression: Zstd level 3 (~500 MB/s compression/decompression)
  • Chunking: FastCDC (~1 GB/s processing)
  • I/O: Streaming for large files (no full buffering)
  • Deduplication: Automatic via content addressing (including chunk-level)
  • GC: Mark & sweep with efficient hash set operations (handles all object types)
  • Directory sharding: First 2 hex chars prevent filesystem bottlenecks

Storage Efficiency:

  • Compression: 3-5x reduction for text files, 2-3x for mixed data
  • Chunking: Change 1 byte in 1GB file → store only ~512KB (changed chunk)
  • Cross-file deduplication: Shared content stored only once

Error Handling

All operations return Result<T, Error> with detailed error types:

  • IoError - File system operations
  • CorruptedObject - Hash mismatch or invalid format
  • InvalidHash - Malformed hash string
  • ObjectNotFound - Missing object in store
  • InvalidStore - Store not initialized or corrupted config
  • InvalidRef - Bad reference name or format
  • PathExists - Destination already exists (materialization)

Dependencies

blake3 = "1.5"      # BLAKE3 hashing
hex = "0.4"         # Hash hex encoding/decoding
tempfile = "3.0"    # Atomic object writes
ignore = "0.4"      # Filesystem walking with .gitignore support
thiserror = "2.0"   # Error handling
zstd = "0.13"       # Transparent compression (v0.4.0+)
fastcdc = "3.1"     # Content-defined chunking (v0.4.0+)

Contributing

This library is part of the casq project. When contributing:

  1. Ensure all tests pass: cargo test -p casq_core
  2. Maintain clippy cleanliness: cargo clippy -p casq_core -- -D warnings
  3. Format code: cargo fmt -p casq_core
  4. Add tests for new functionality
  5. Update documentation

License

Apache-2.0

See Also

  • casq - CLI binary using this library
  • NOTES.md - Detailed design and specification
  • CLAUDE.md - Development guidelines for AI assistants
Commit count: 68

cargo fmt