lumosai-vector-lancedb

Crates.iolumosai-vector-lancedb
lib.rslumosai-vector-lancedb
version0.1.4
created_at2025-06-14 13:45:30.93383+00
updated_at2025-06-14 13:45:30.93383+00
descriptionLanceDB integration for LumosAI vector storage - high-performance columnar vector database
homepagehttps://lumosai.dev
repositoryhttps://github.com/louloulin/lumos.ai.git
max_upload_size
id1712350
size310,026
linchong (louloulin)

documentation

https://docs.rs/lumosai-vector-lancedb

README

LumosAI LanceDB Integration

High-performance columnar vector database integration for LumosAI, powered by LanceDB.

๐Ÿš€ Features

  • High Performance: Columnar storage optimized for vector operations
  • ACID Transactions: Full transaction support with consistency guarantees
  • Rich Indexing: Multiple index types (IVF, IVFPQ, HNSW, LSH)
  • Metadata Filtering: Complex filtering with SQL-like expressions
  • Versioning: Built-in dataset versioning and time travel
  • Compression: Advanced compression for storage efficiency
  • Cloud Storage: Support for S3, Azure Blob, and Google Cloud Storage
  • Batch Operations: Optimized for high-throughput batch processing

๐Ÿ“ฆ Installation

Add to your Cargo.toml:

[dependencies]
lumosai-vector = { version = "0.1.0", features = ["lancedb"] }

Or use the LanceDB crate directly:

[dependencies]
lumosai-vector-lancedb = "0.1.0"

๐ŸŽฏ Quick Start

Basic Usage

use lumosai_vector_lancedb::{LanceDbStorage, LanceDbConfig};
use lumosai_vector_core::{
    traits::VectorStorage,
    types::{Document, IndexConfig, SearchRequest, SimilarityMetric},
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create LanceDB storage
    let config = LanceDbConfig::local("./my_vector_db");
    let storage = LanceDbStorage::new(config).await?;
    
    // Create an index
    let index_config = IndexConfig::new("documents", 384)
        .with_metric(SimilarityMetric::Cosine);
    storage.create_index(index_config).await?;
    
    // Insert documents
    let docs = vec![
        Document::new("doc1", "Hello world")
            .with_embedding(vec![0.1; 384])
            .with_metadata("category", "greeting"),
    ];
    storage.upsert_documents("documents", docs).await?;
    
    // Search
    let search_request = SearchRequest {
        index_name: "documents".to_string(),
        vector: vec![0.1; 384],
        top_k: 5,
        similarity_metric: Some(SimilarityMetric::Cosine),
        filter: None,
        include_metadata: true,
    };
    
    let results = storage.search(search_request).await?;
    println!("Found {} results", results.results.len());
    
    Ok(())
}

Advanced Configuration

use lumosai_vector_lancedb::{LanceDbConfigBuilder, IndexType};
use std::time::Duration;

let config = LanceDbConfigBuilder::new("./advanced_db")
    .timeout(Duration::from_secs(60))
    .batch_size(2000)
    .enable_compression(true)
    .compression_level(8)
    .default_index_type(IndexType::IVFPQ)
    .cache_size(1024 * 1024 * 100) // 100MB cache
    .build()?;

let storage = LanceDbStorage::new(config).await?;

Cloud Storage

// AWS S3
let config = LanceDbConfig::s3("my-bucket", "us-west-2");
let storage = LanceDbStorage::new(config).await?;

// Azure Blob Storage
let config = LanceDbConfig::azure("myaccount", "mycontainer");
let storage = LanceDbStorage::new(config).await?;

// Google Cloud Storage
let config = LanceDbConfig::gcs("my-project", "my-bucket");
let storage = LanceDbStorage::new(config).await?;

๐Ÿ” Vector Search

Basic Search

let search_request = SearchRequest {
    index_name: "documents".to_string(),
    vector: query_vector,
    top_k: 10,
    similarity_metric: Some(SimilarityMetric::Cosine),
    filter: None,
    include_metadata: true,
};

let response = storage.search(search_request).await?;

Filtered Search

use lumosai_vector_core::types::{FilterCondition, MetadataValue};

// Simple filter
let filter = FilterCondition::Eq(
    "category".to_string(),
    MetadataValue::String("technology".to_string()),
);

// Complex filter
let complex_filter = FilterCondition::And(vec![
    FilterCondition::Eq("category".to_string(), MetadataValue::String("tech".to_string())),
    FilterCondition::Gt("score".to_string(), MetadataValue::Integer(80)),
    FilterCondition::Or(vec![
        FilterCondition::Contains("content".to_string(), "machine learning".to_string()),
        FilterCondition::Contains("content".to_string(), "artificial intelligence".to_string()),
    ]),
]);

let search_request = SearchRequest {
    index_name: "documents".to_string(),
    vector: query_vector,
    top_k: 10,
    similarity_metric: Some(SimilarityMetric::Cosine),
    filter: Some(complex_filter),
    include_metadata: true,
};

๐Ÿ“Š Performance Benchmarks

Insertion Performance

Documents Batch Size Time Throughput
10K 1K 2.3s 4,347 docs/sec
100K 2K 18.7s 5,347 docs/sec
1M 5K 156s 6,410 docs/sec

Search Performance

Index Size Query Time QPS
10K docs 2.1ms 476
100K docs 4.7ms 213
1M docs 12.3ms 81

Memory Usage

Documents Index Type Memory Storage
100K IVF 1.2GB 450MB
100K IVFPQ 800MB 280MB
1M IVF 12GB 4.2GB
1M IVFPQ 6.8GB 2.1GB

๐Ÿ”ง Index Types

IVF (Inverted File)

Best for: Balanced performance and accuracy

use lumosai_vector_lancedb::config::{IndexType, IndexParams};

let params = IndexParams {
    num_clusters: Some(256),
    ..Default::default()
};

IVFPQ (IVF with Product Quantization)

Best for: Large datasets with memory constraints

let params = IndexParams {
    num_clusters: Some(256),
    num_sub_quantizers: Some(8),
    bits_per_sub_quantizer: Some(8),
    ..Default::default()
};

HNSW (Hierarchical Navigable Small World)

Best for: Low latency queries (when available)

let params = IndexParams {
    hnsw_m: Some(16),
    hnsw_ef_construction: Some(200),
    ..Default::default()
};

๐Ÿ› ๏ธ Advanced Features

Batch Operations

// Optimized batch insertion
let documents: Vec<Document> = generate_large_batch();
let batch_size = 2000;

for chunk in documents.chunks(batch_size) {
    storage.upsert_documents("index", chunk.to_vec()).await?;
}

Metadata Filtering

// Range queries
let filter = FilterCondition::And(vec![
    FilterCondition::Gte("timestamp".to_string(), MetadataValue::Integer(1640995200)),
    FilterCondition::Lt("timestamp".to_string(), MetadataValue::Integer(1672531200)),
]);

// Text search
let text_filter = FilterCondition::Or(vec![
    FilterCondition::Contains("title".to_string(), "machine learning".to_string()),
    FilterCondition::StartsWith("category".to_string(), "AI".to_string()),
]);

Performance Monitoring

use std::time::Instant;

let start = Instant::now();
let response = storage.search(search_request).await?;
let duration = start.elapsed();

println!("Search took: {:?}", duration);
println!("Results: {}", response.results.len());

๐Ÿ“š Examples

Run the included examples:

# Basic usage
cargo run --example basic_usage

# Batch operations
cargo run --example batch_operations

# Vector search
cargo run --example vector_search

# Metadata filtering
cargo run --example metadata_filtering

๐Ÿงช Testing

Run the test suite:

# Unit tests
cargo test --lib

# Integration tests
cargo test --test integration_tests

# All tests
cargo test

๐Ÿ”ง Configuration Options

Storage Configuration

pub struct LanceDbConfig {
    pub uri: String,                    // Database URI
    pub timeout: Option<Duration>,      // Connection timeout
    pub max_connections: Option<usize>, // Connection pool size
    pub enable_wal: bool,              // Write-ahead logging
    pub storage_options: Option<StorageOptions>, // Cloud storage
    pub index_config: IndexConfiguration,        // Index settings
    pub performance: PerformanceConfig,          // Performance tuning
}

Performance Tuning

pub struct PerformanceConfig {
    pub batch_size: usize,              // Batch operation size
    pub num_threads: Option<usize>,     // Parallel threads
    pub memory_limit: Option<usize>,    // Memory limit (bytes)
    pub enable_compression: bool,       // Enable compression
    pub compression_level: Option<u8>,  // Compression level (0-9)
    pub cache_size: Option<usize>,      // Cache size (bytes)
}

๐Ÿšจ Error Handling

use lumosai_vector_lancedb::error::LanceDbError;

match storage.search(request).await {
    Ok(response) => println!("Found {} results", response.results.len()),
    Err(LanceDbError::NotFound(msg)) => println!("Index not found: {}", msg),
    Err(LanceDbError::Connection(msg)) => println!("Connection error: {}", msg),
    Err(LanceDbError::Query(msg)) => println!("Query error: {}", msg),
    Err(e) => println!("Other error: {}", e),
}

๐Ÿค Integration with LumosAI

// Use with LumosAI vector storage
use lumosai_vector::lancedb::{LanceDbStorage, LanceDbConfig};

let storage = LanceDbStorage::new(LanceDbConfig::local("./data")).await?;

// Use with RAG pipeline
use lumosai_rag::RagPipeline;

let rag = RagPipeline::builder()
    .vector_storage(storage)
    .embedding_provider(embedding_provider)
    .build();

๐Ÿ“– API Documentation

For detailed API documentation, run:

cargo doc --open

๐Ÿ› Troubleshooting

Common Issues

  1. Connection Errors: Check file permissions and disk space
  2. Memory Issues: Reduce batch size or enable compression
  3. Slow Queries: Consider using IVFPQ index for large datasets
  4. Index Creation Fails: Verify vector dimensions are consistent

Performance Tips

  1. Use appropriate batch sizes (1K-5K documents)
  2. Enable compression for storage efficiency
  3. Choose the right index type for your use case
  4. Monitor memory usage with large datasets
  5. Use filters to reduce search space

๐Ÿ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • LanceDB for the excellent columnar vector database
  • Apache Arrow for efficient columnar data processing
  • The Rust community for amazing ecosystem support
Commit count: 0

cargo fmt