SWARC - Small World Approximate Recall Crate

A high-performance implementation of the Hierarchical Navigable Small World (HNSW) algorithm in Rust. SWARC provides state-of-the-art approximate nearest neighbor search for high-dimensional vector similarity workloads.

Features

  • Fast Approximate Nearest Neighbor Search: Efficient k-NN search with logarithmic time complexity
  • Document Linking: Associate embeddings with external data using a flexible document structure
  • Dynamic Operations: Insert, remove, and rebalance the index at runtime
  • Modular Architecture: Clean separation of concerns with dedicated modules for different operations
  • Type Safety: Generic implementation that works with any data type
  • Serialization Support: Built-in JSON serialization/deserialization for document data (the index graph itself is not yet persistable; see Limitations)

Algorithm Overview

HNSW constructs a multi-layer graph where:

  • Higher layers contain fewer nodes and provide long-range connections
  • Lower layers contain more nodes and provide fine-grained search
  • Search starts from the top layer and navigates down to find nearest neighbors
  • Insertion uses probabilistic level assignment and intelligent neighbor selection (see the sketch below)
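
As an illustration of the probabilistic level assignment mentioned above, here is a minimal sketch of the standard HNSW level draw. random_level is a hypothetical helper shown for explanation only; SWARC's actual logic lives in insert.rs and may differ in detail.

use rand::Rng;

// Draw the layer for a new node from an exponentially decaying
// distribution: most nodes land only on layer 0, and a few reach the
// upper layers. The 1/ln(m) normalization factor comes from the HNSW
// paper, where m is the max-connections parameter.
fn random_level(m: usize) -> usize {
    let ml = 1.0 / (m as f64).ln();
    let u: f64 = rand::thread_rng().gen_range(f64::EPSILON..1.0);
    (-u.ln() * ml).floor() as usize
}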

Installation

Add this to your Cargo.toml:

[dependencies]
swarc = { path = "path/to/swarc" }

Or from crates.io:

[dependencies]
swarc = "0.1.0"

Quick Start

use swarc::{HNSWIndex, Document};
use rand::Rng;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a new HNSW index
    // Parameters: dimension, max_connections, ef_construction
    let mut index = HNSWIndex::new(128, 16, 200);
    
    // Create a document with external data
    let document = Document {
        id: "doc1".to_string(),
        data: "This is my document content".to_string(),
    };
    
    // Generate a random embedding (in practice, use your embedding model)
    let mut rng = rand::thread_rng();
    let embedding: Vec<f32> = (0..128).map(|_| rng.gen_range(-1.0..1.0)).collect();
    
    // Insert the document into the index
    index.insert("node1".to_string(), embedding, Some(document))?;
    
    // Search for nearest neighbors
    let query: Vec<f32> = (0..128).map(|_| rng.gen_range(-1.0..1.0)).collect();
    let results = index.search(&query, 5); // Find 5 nearest neighbors
    
    for (id, distance, document) in results {
        println!("ID: {}, Distance: {:.4}, Document: {:?}", id, distance, document);
    }
    
    // Remove a document
    let removed_doc = index.remove("node1")?;
    println!("Removed: {:?}", removed_doc);
    
    Ok(())
}

API Reference

Core Types

Document<T>

pub struct Document<T> {
    pub id: String,
    pub data: T,
}

A wrapper for external data associated with embeddings.
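
Since the feature list advertises JSON serialization support, a document can be round-tripped through serde_json. A minimal sketch, assuming Document<T> (and T) derive Serde's Serialize/Deserialize as the feature list suggests, and that serde_json is in your dependencies:

use swarc::Document;

// Round-trip a document through JSON. This assumes Document<T>
// derives Serialize/Deserialize, as the feature list states, and
// that T implements them as well.
let doc = Document { id: "doc1".to_string(), data: "hello".to_string() };
let json = serde_json::to_string(&doc)?;
let back: Document<String> = serde_json::from_str(&json)?;
assert_eq!(back.id, "doc1");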

HNSWIndex<T>

The main index structure that provides all HNSW operations.

Main Methods

new(dim: usize, m: usize, ef_construction: usize) -> HNSWIndex<T>

Creates a new HNSW index.

  • dim: Dimensionality of the embedding vectors
  • m: Maximum number of connections per node (except layer 0)
  • ef_construction: Size of dynamic candidate list during construction

insert(id: String, embedding: Vec<f32>, document: Option<Document<T>>) -> Result<(), String>

Inserts a new node into the index.

  • id: Unique identifier for the node
  • embedding: The vector embedding
  • document: Optional associated document data
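
Because insert returns Result<(), String>, failures can be handled explicitly. The exact error conditions are implementation-defined; a vector whose length differs from the index dimension is one plausible cause, sketched here:

// Handle an insertion error. The concrete error conditions are
// implementation-defined; a 64-dimensional vector inserted into a
// 128-dimensional index is a plausible one.
if let Err(e) = index.insert("bad".to_string(), vec![0.0; 64], None) {
    eprintln!("insert failed: {}", e);
}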

search(query: &[f32], k: usize) -> Vec<(String, f32, Option<&Document<T>>)>

Searches for k nearest neighbors.

  • query: The query vector
  • k: Number of nearest neighbors to return
  • Returns: Vector of (node_id, distance, document) tuples

remove(id: &str) -> Result<Option<Document<T>>, String>

Removes a node from the index.

  • id: The node identifier to remove
  • Returns: The removed document if it existed

rebalance() -> Result<(), String>

Rebalances the index structure (currently a placeholder for future enhancements).

Utility Methods

  • len() -> usize: Get the number of nodes in the index
  • is_empty() -> bool: Check if the index is empty
  • get_node(id: &str) -> Option<&HNSWNode<T>>: Get a node by ID
  • get_all_ids() -> Vec<String>: Get all node IDs
  • contains(id: &str) -> bool: Check if a node exists
  • clear(): Remove all nodes from the index
  • remove_multiple(ids: &[&str]) -> Result<Vec<Option<Document<T>>>, String>: Remove multiple nodes
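
A quick tour of these helpers, assuming index was populated as in the Quick Start example:

// Inspect and clean up an index populated as in the Quick Start.
println!("nodes: {}", index.len());
assert!(!index.is_empty());
assert!(index.contains("node1"));
for id in index.get_all_ids() {
    println!("stored id: {}", id);
}
index.clear(); // drop every node
assert!(index.is_empty());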

Architecture

The implementation is organized into several modules:

  • types.rs: Core data structures (Document, HNSWNode)
  • index.rs: Main index structure and basic operations
  • insert.rs: Insertion and rebalancing logic
  • search.rs: Search and nearest neighbor algorithms
  • remove.rs: Node removal and cleanup operations

Performance Characteristics

  • Search Complexity: O(log N) for approximate nearest neighbor search
  • Insertion Complexity: O(log N) for adding new nodes
  • Memory Usage: O(N × M) where N is the number of nodes and M is the average connections per node (see the estimate below)
  • Distance Metric: Currently uses Euclidean distance (L2 norm)
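
To make the O(N × M) figure concrete, here is a back-of-envelope estimate assuming 4-byte f32 components and 8-byte node references; real overhead from hash maps and per-layer link lists will be higher.

// Rough lower bound on index memory: raw vectors plus layer-0 links.
fn approx_index_bytes(n: usize, dim: usize, m: usize) -> usize {
    let vectors = n * dim * std::mem::size_of::<f32>(); // embeddings
    let links = n * m * std::mem::size_of::<u64>();     // graph edges
    vectors + links
}

// 1,000,000 nodes of 3072-dimensional f32 embeddings with m = 16:
// ~12.3 GB of vectors plus ~128 MB of links.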

Performance Benchmarks

SWARC has been benchmarked with high-dimensional embeddings (3072 dimensions) across various dataset sizes. See the performance tests directory for detailed benchmarks and visualizations.

Key Performance Metrics:

  • Insertion Throughput: Up to millions of embeddings per hour
  • Search Time: Sub-millisecond query times for large datasets
  • Memory Efficiency: Linear scaling with dataset size
  • Scalability: Tested up to 5 million 3072-dimensional embeddings

Benchmark plots (see the performance tests directory):

  • Insertion Performance: insertion time scales logarithmically with dataset size
  • Search Performance: search time remains relatively constant across dataset sizes
  • Memory Usage: memory usage scales linearly with dataset size
  • Throughput: insertion throughput across different dataset sizes

Configuration Parameters

m (Maximum Connections)

  • Controls the maximum number of connections per node
  • Higher values: Better recall, more memory usage, slower search
  • Lower values: Faster search, less memory, potentially lower recall
  • Typical range: 8-32

ef_construction (Construction Parameter)

  • Size of dynamic candidate list during index construction
  • Higher values: Better index quality, slower construction
  • Lower values: Faster construction, potentially lower search quality
  • Typical range: 100-500

max_layers

  • Automatically calculated based on expected dataset size
  • Higher layers provide long-range navigation
  • Lower layers provide fine-grained search
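
Putting m and ef_construction together, here are two illustrative configurations; the values are hypothetical starting points, not benchmarked recommendations for SWARC.

use swarc::HNSWIndex;

// Lower m/ef_construction: faster build and search, lighter memory,
// potentially lower recall.
let fast: HNSWIndex<String> = HNSWIndex::new(768, 8, 100);

// Higher m/ef_construction: better recall and index quality, at the
// cost of memory and construction time.
let accurate: HNSWIndex<String> = HNSWIndex::new(768, 32, 400);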

Examples

Basic Usage

// Dimension must match the inserted vectors (3 here).
let mut index = HNSWIndex::new(3, 16, 200);
index.insert("node1".to_string(), vec![0.1, 0.2, 0.3], None)?;
let results = index.search(&[0.1, 0.2, 0.3], 1);

Performance Testing

# Run benchmarks with millions of 3072-dimensional embeddings
cargo run --bin benchmark

# Generate performance plots
cargo run --bin plot_results

With Documents

let doc = Document {
    id: "article_1".to_string(),
    data: Article { title: "AI Research", content: "..." },
};
index.insert("node1".to_string(), embedding, Some(doc))?;

Batch Operations

// Insert multiple documents
for (i, embedding) in embeddings.iter().enumerate() {
    let doc = Document {
        id: format!("doc_{}", i),
        data: documents[i].clone(),
    };
    index.insert(format!("node_{}", i), embedding.clone(), Some(doc))?;
}

// Remove multiple documents
let ids = ["node_1", "node_2", "node_3"];
let removed = index.remove_multiple(&ids)?;

Limitations and Future Work

Current Limitations

  • Only supports Euclidean distance (L2 norm); no alternative distance metrics yet
  • Rebalancing is a placeholder implementation
  • No persistence/serialization of the index structure itself

Planned Enhancements

  • Support for multiple distance metrics (cosine, inner product, etc.)
  • Index persistence and loading
  • Advanced rebalancing strategies
  • Parallel search and insertion
  • Memory-mapped storage for large datasets
  • Benchmarking and performance optimization tools

Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

License

This project is licensed under the MIT License - see the LICENSE file for details.
