| Crates.io | distx-similarity |
| lib.rs | distx-similarity |
| version | 0.2.5 |
| created_at | 2025-12-19 19:23:09.770927+00 |
| updated_at | 2025-12-19 19:23:09.770927+00 |
| description | Schema-driven similarity engine for tabular rows - part of DistX |
| homepage | |
| repository | https://github.com/antonellof/distx |
| max_upload_size | |
| id | 1995356 |
| size | 89,368 |
A high-performance vector database with schema-driven similarity search.
DistX combines the speed of a Rust-native vector database with an innovative Similarity Engine that enables structured queries on tabular data — with full explainability and without external ML dependencies.
┌──────────────────────────────────────────────────────────────────────────┐
│ Traditional Vector Database │
│ ─────────────────────────── │
│ Your Data → External ML API → Embeddings → Vector DB → Score: 0.87 │
│ (cost per call) (black box) (unexplained) │
├──────────────────────────────────────────────────────────────────────────┤
│ DistX Similarity Engine │
│ ─────────────────────────── │
│ Your Data → Schema (JSON) → Auto-Embedding → Explainable Results │
│ (declarative) (deterministic) (name: 0.25, price: 0.22) │
└──────────────────────────────────────────────────────────────────────────┘
The first schema-driven similarity engine with built-in explainability.
Define what fields matter, insert your data, and query by example — vectors are generated automatically. No external ML services, no embedding pipelines, no black-box scores.
# 1. Define similarity schema
curl -X PUT http://localhost:6333/collections/products -H "Content-Type: application/json" -d '{
"similarity_schema": {
"fields": {
"name": {"type": "text", "weight": 0.4},
"price": {"type": "number", "distance": "relative", "weight": 0.3},
"category": {"type": "categorical", "weight": 0.2},
"brand": {"type": "categorical", "weight": 0.1}
}
}
}'
# 2. Insert data (vectors auto-generated)
curl -X PUT http://localhost:6333/collections/products/points -H "Content-Type: application/json" -d '{
"points": [
{"id": 1, "payload": {"name": "Prosciutto di Parma DOP", "price": 8.99, "category": "salumi", "brand": "Parma"}},
{"id": 2, "payload": {"name": "Prosciutto cotto", "price": 4.99, "category": "salumi", "brand": "Negroni"}},
{"id": 3, "payload": {"name": "iPhone 15 Pro", "price": 1199, "category": "electronics", "brand": "Apple"}}
]
}'
# 3. Query by example
curl -X POST http://localhost:6333/collections/products/similar -H "Content-Type: application/json" \
-d '{"example": {"name": "prosciutto crudo", "price": 8.0, "category": "salumi"}, "limit": 3}'
Response includes per-field explainability:
┌──────┬─────────────────────────────┬───────┬──────────────────────────────────┐
│ Rank │ Product │ Score │ Contribution Breakdown │
├──────┼─────────────────────────────┼───────┼──────────────────────────────────┤
│ 1 │ Prosciutto di Parma DOP │ 0.71 │ name: 0.22, price: 0.22 │
│ 2 │ Prosciutto cotto │ 0.68 │ name: 0.25, category: 0.20 │
│ 3 │ Coppa di Parma │ 0.53 │ category: 0.20, price: 0.25 │
└──────┴─────────────────────────────┴───────┴──────────────────────────────────┘
| Capability | Description |
|---|---|
| Schema-Driven | Declarative field definitions with typed similarity (text, number, categorical, boolean) |
| Auto-Embedding | Deterministic vector generation from structured payloads |
| Query by Example | Natural JSON queries instead of raw vectors |
| Explainable Scoring | Per-field contribution breakdown for every result |
| Dynamic Weights | Override field importance at query time without re-indexing |
| Zero External Dependencies | Fully self-contained, works offline and air-gapped |
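The Auto-Embedding row above promises deterministic vectors, but this README does not describe how they are built. As a rough illustration only, the sketch below shows one well-known way to get a deterministic vector from text: feature hashing with a hand-written FNV-1a hash. The names `fnv1a` and `hash_embed`, the whitespace tokenization, and the choice of hashing are all assumptions made for this example and are not confirmed to match DistX's embedder.

```rust
/// 64-bit FNV-1a hash, written out by hand so the result does not depend
/// on the standard library's unspecified default hasher.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Hypothetical text embedder (not DistX's API): lowercases the input,
/// splits on whitespace, counts tokens into `dim` hashed buckets, and
/// L2-normalizes the result. The same text always yields the same vector.
pub fn hash_embed(text: &str, dim: usize) -> Vec<f32> {
    let mut v = vec![0.0f32; dim];
    if dim == 0 {
        return v;
    }
    let lower = text.to_lowercase();
    for token in lower.split_whitespace() {
        let bucket = (fnv1a(token.as_bytes()) % dim as u64) as usize;
        v[bucket] += 1.0;
    }
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
    v
}
```

Because nothing here depends on a trained model or a random seed, calling `hash_embed("Prosciutto di Parma DOP", 16)` twice returns identical vectors, which is the property the table row refers to.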
# Override weights at query time
curl -X POST http://localhost:6333/collections/products/similar \
-H "Content-Type: application/json" \
-d '{"example": {"name": "iPhone"}, "weights": {"price": 0.7, "name": 0.1}}'
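The request above overrides only `price` and `name`, and this README does not say what happens to the remaining schema fields or whether the weights are rescaled. The sketch below shows one possible interpretation, under stated assumptions: overrides replace the schema weight for the fields they name, other fields keep their schema weight, and the result is normalized to sum to 1. The function `effective_weights` is hypothetical and not part of the DistX API.

```rust
use std::collections::BTreeMap;

/// Hypothetical merge of schema weights with query-time overrides.
/// Assumption: unknown override fields are ignored and the merged
/// weights are rescaled so they sum to 1.0.
pub fn effective_weights(
    schema: &BTreeMap<String, f64>,
    overrides: &BTreeMap<String, f64>,
) -> BTreeMap<String, f64> {
    let mut merged = schema.clone();
    for (field, weight) in overrides {
        // Only fields that exist in the schema can be overridden here.
        if let Some(slot) = merged.get_mut(field) {
            *slot = *weight;
        }
    }
    let total: f64 = merged.values().sum();
    if total > 0.0 {
        for weight in merged.values_mut() {
            *weight /= total;
        }
    }
    merged
}
```

With the `products` schema (name 0.4, price 0.3, category 0.2, brand 0.1) and the overrides from the request, the merged weights before rescaling are name 0.1, price 0.7, category 0.2, brand 0.1, so under this assumption price ends up with 0.7 / 1.1 of the total.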
DistX maintains full compatibility with the Qdrant API.
The Similarity Engine is additive — all standard vector operations work exactly like Qdrant:
# Standard Qdrant-compatible vector search still works!
curl -X POST http://localhost:6333/collections/my_collection/points/search \
-H "Content-Type: application/json" \
-d '{"vector": [0.1, 0.2, 0.3, ...], "limit": 10}'
# Pull and run (with persistent storage)
docker run -d --name distx \
-p 6333:6333 -p 6334:6334 \
-v distx_data:/qdrant/storage \
distx/distx:latest
# Or with docker-compose
docker-compose up -d
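DistX is now running. The REST API is served at http://localhost:6333, the address used by the curl examples in this README.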
curl -X PUT http://localhost:6333/collections/products \
-H "Content-Type: application/json" \
-d '{
"similarity_schema": {
"fields": {
"name": {"type": "text", "weight": 0.4},
"price": {"type": "number", "distance": "relative", "weight": 0.3},
"category": {"type": "categorical", "weight": 0.2},
"in_stock": {"type": "boolean", "weight": 0.1}
}
}
}'
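The schema above marks `price` with `"distance": "relative"`, but this README does not spell out the formula behind it. The following is a minimal sketch of one common way to define a relative numeric similarity, offered as an assumption rather than as DistX's implementation: one minus the absolute difference divided by the larger magnitude. The function name `relative_similarity` is hypothetical.

```rust
/// Hypothetical relative similarity between two numbers, in [0.0, 1.0].
/// Assumption: similarity = 1 - |a - b| / max(|a|, |b|). This is an
/// illustrative definition, not the formula DistX is confirmed to use.
pub fn relative_similarity(a: f64, b: f64) -> f64 {
    let scale = a.abs().max(b.abs());
    if scale == 0.0 {
        // Both values are zero, so they are identical.
        return 1.0;
    }
    (1.0 - (a - b).abs() / scale).clamp(0.0, 1.0)
}
```

Under this definition a query price of 8.0 is far closer to 8.99 than to 1199, and the comparison does not depend on the unit the prices are expressed in.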
curl -X PUT http://localhost:6333/collections/products/points \
-H "Content-Type: application/json" \
-d '{
"points": [
{"id": 1, "payload": {"name": "Prosciutto di Parma DOP", "price": 8.99, "category": "salumi", "in_stock": true}},
{"id": 2, "payload": {"name": "Prosciutto cotto", "price": 4.99, "category": "salumi", "in_stock": true}},
{"id": 3, "payload": {"name": "Mortadella Bologna", "price": 3.99, "category": "salumi", "in_stock": false}},
{"id": 4, "payload": {"name": "Parmigiano Reggiano", "price": 18.99, "category": "cheese", "in_stock": true}},
{"id": 5, "payload": {"name": "Grana Padano", "price": 14.99, "category": "cheese", "in_stock": true}}
]
}'
# Find products similar to "prosciutto crudo around $8"
curl -X POST http://localhost:6333/collections/products/similar \
-H "Content-Type: application/json" \
-d '{
"example": {
"name": "prosciutto crudo",
"price": 8.0,
"category": "salumi"
},
"limit": 3
}'
Response with explainable scores:
{
"result": [
{
"id": 1,
"score": 0.71,
"payload": {"name": "Prosciutto di Parma DOP", "price": 8.99, "category": "salumi"},
"explain": {"name": 0.22, "price": 0.24, "category": 0.20, "in_stock": 0.05}
},
{
"id": 2,
"score": 0.65,
"payload": {"name": "Prosciutto cotto", "price": 4.99, "category": "salumi"},
"explain": {"name": 0.25, "price": 0.15, "category": 0.20, "in_stock": 0.05}
}
]
}
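In the sample response above, the values under `explain` add up to the reported `score` for each result (0.22 + 0.24 + 0.20 + 0.05 = 0.71 and 0.25 + 0.15 + 0.20 + 0.05 = 0.65). That is consistent with a weighted-sum model, which the sketch below illustrates. It assumes each contribution is the field weight times a per-field similarity in [0, 1]; the `Contribution` type and `score_with_explain` function are hypothetical names, not the crate's API.

```rust
/// One field's input to the score (hypothetical type for illustration).
pub struct Contribution {
    pub field: &'static str,
    pub weight: f64,     // schema or query-time weight
    pub similarity: f64, // per-field similarity in [0, 1]
}

/// Returns (total score, per-field breakdown), assuming the score is a
/// plain weighted sum of per-field similarities.
pub fn score_with_explain(parts: &[Contribution]) -> (f64, Vec<(&'static str, f64)>) {
    let explain: Vec<(&'static str, f64)> = parts
        .iter()
        .map(|c| (c.field, c.weight * c.similarity))
        .collect();
    let total: f64 = explain.iter().map(|(_, value)| value).sum();
    (total, explain)
}
```

For instance, with the schema weights 0.4, 0.3, 0.2 and 0.1 and per-field similarities of 0.55, 0.8, 1.0 and 0.5 (back-computed here purely for illustration), the breakdown is 0.22, 0.24, 0.20 and 0.05, and the total is 0.71, matching the first result.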
# Find cheaper alternatives (boost price importance)
curl -X POST http://localhost:6333/collections/products/similar \
-H "Content-Type: application/json" \
-d '{
"example": {"name": "prosciutto", "category": "salumi"},
"weights": {"price": 0.6, "name": 0.2, "category": 0.2},
"limit": 3
}'
# Find products similar to ID 4 (Parmigiano Reggiano)
curl -X POST http://localhost:6333/collections/products/similar \
-H "Content-Type: application/json" \
-d '{"like_id": 4, "limit": 3}'
# Full demo with sample data
python scripts/similarity_demo.py
# Or run specific demos
python scripts/similarity_demo.py --demo products
python scripts/similarity_demo.py --demo suppliers
DistX also supports standard Qdrant-compatible vector operations:
# Create collection with vectors
curl -X PUT http://localhost:6333/collections/embeddings \
-H "Content-Type: application/json" \
-d '{"vectors": {"size": 128, "distance": "Cosine"}}'
# Insert with vectors
curl -X PUT http://localhost:6333/collections/embeddings/points \
-H "Content-Type: application/json" \
-d '{
"points": [
{"id": 1, "vector": [0.1, 0.2, ...], "payload": {"text": "example"}}
]
}'
# Vector search
curl -X POST http://localhost:6333/collections/embeddings/points/search \
-H "Content-Type: application/json" \
-d '{"vector": [0.1, 0.2, ...], "limit": 10}'
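The `embeddings` collection above is created with `"distance": "Cosine"`. As a reminder of what that metric measures, here is a small sketch of the standard cosine similarity between two vectors. The helper `cosine_similarity` is written for this example and is not part of the DistX or Qdrant API.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns None when the lengths differ, the input is empty,
/// or either vector has zero length (the angle is undefined).
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

Vectors pointing in the same direction score 1 regardless of magnitude, and orthogonal vectors score 0.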
# From crates.io
cargo install distx
distx --data-dir ./data
# From source
git clone https://github.com/antonellof/distx
cd distx && cargo build --release
./target/release/distx
| Metric | Performance |
|---|---|
| Vector Insert | ~8,000 ops/sec |
| Vector Search | ~400-500 ops/sec |
| Search Latency (p50) | ~2ms |
| Search Latency (p99) | ~5ms |
| Similarity Query | <1ms overhead |
Benchmarks: 5,000 vectors, 128 dimensions, Cosine distance
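The p50 and p99 rows are percentiles over per-query latencies. This README does not say which percentile method the benchmark used, so the sketch below simply shows the nearest-rank definition as one reasonable way to reproduce such figures from your own measurements. The function `percentile` is a hypothetical helper.

```rust
/// Nearest-rank percentile: the smallest sample such that at least `p`
/// percent of all samples are less than or equal to it.
/// Returns None for an empty slice or a `p` outside 0..=100.
pub fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Rank is 1-based; p = 0 maps to the smallest sample.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}
```

Feeding it latencies in milliseconds, `percentile(&latencies, 50.0)` and `percentile(&latencies, 99.0)` give values comparable to the p50 and p99 rows.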
| Guide | Description |
|---|---|
| Similarity Engine | Schema-driven similarity for tabular data |
| Similarity Demo | Interactive walkthrough with examples |
| Comparison | DistX vs Qdrant, Pinecone, Elasticsearch |
| Quick Start | Get started in 5 minutes |
| Docker Guide | Container deployment |
| API Reference | REST and gRPC endpoints |
| Architecture | System design |
Problem: "Show me products similar to this one" — but similarity means different things (style, price, brand).
# Similar products for "customers also viewed"
curl -X POST http://localhost:6333/collections/products/similar -H "Content-Type: application/json" -d '{
"example": {"name": "Nike Air Max 90", "price": 129, "category": "sneakers", "brand": "Nike"},
"limit": 6
}'
# Budget alternatives (boost price importance)
curl -X POST http://localhost:6333/collections/products/similar -H "Content-Type: application/json" -d '{
"like_id": 123,
"weights": {"price": 0.6, "category": 0.3, "brand": 0.1}
}'
Problem: Find the best supplier match based on multiple criteria without building ML pipelines.
# Find suppliers similar to your top performer
curl -X POST http://localhost:6333/collections/suppliers/similar -H "Content-Type: application/json" -d '{
"example": {
"industry": "manufacturing",
"annual_revenue": 5000000,
"employee_count": 150,
"certified": true,
"location": "Milan"
},
"limit": 10
}'
Problem: Find similar customers for segmentation, lead scoring, or churn prediction.
# Find customers similar to your best ones
curl -X POST http://localhost:6333/collections/customers/similar -H "Content-Type: application/json" -d '{
"example": {
"industry": "fintech",
"company_size": "enterprise",
"annual_spend": 50000,
"engagement_score": 85
}
}'
# Find leads similar to closed-won deals
curl -X POST http://localhost:6333/collections/leads/similar -H "Content-Type: application/json" -d '{
"like_id": "deal_12345",
"weights": {"deal_size": 0.4, "industry": 0.3, "company_size": 0.3}
}'
Problem: Find duplicate or near-duplicate records without exact matching.
# Find potential duplicates
curl -X POST http://localhost:6333/collections/contacts/similar -H "Content-Type: application/json" -d '{
"example": {"name": "John Smith", "email": "j.smith@acme.com", "company": "Acme Inc"},
"limit": 5
}'
# Response shows WHY records might be duplicates
# → name: 0.35 (similar names)
# → company: 0.25 (same company)
# → email: 0.15 (different email domain)
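The explanation above gives partial credit for similar but not identical names. This README does not describe how text fields are compared, so the sketch below uses character-trigram Jaccard similarity purely as a stand-in technique that is commonly used for near-duplicate matching. The functions `trigrams` and `trigram_jaccard` are hypothetical and not part of DistX.

```rust
use std::collections::BTreeSet;

/// Lowercased character trigrams of a string (empty for inputs shorter
/// than three characters).
fn trigrams(text: &str) -> BTreeSet<String> {
    let chars: Vec<char> = text.to_lowercase().chars().collect();
    chars
        .windows(3)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Jaccard similarity of the two trigram sets: |A ∩ B| / |A ∪ B|.
/// Returns 0.0 when neither input produces any trigram.
pub fn trigram_jaccard(a: &str, b: &str) -> f64 {
    let (ta, tb) = (trigrams(a), trigrams(b));
    let union = ta.union(&tb).count();
    if union == 0 {
        return 0.0;
    }
    ta.intersection(&tb).count() as f64 / union as f64
}
```

With this measure, "John Smith" and "Jon Smith" share 5 of 10 distinct trigrams, giving 0.5, while names that differ only in letter case score 1.0.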
Problem: Explore datasets by finding similar records without writing complex SQL.
# "Find transactions similar to this suspicious one"
curl -X POST http://localhost:6333/collections/transactions/similar -H "Content-Type: application/json" -d '{
"example": {"amount": 9999, "merchant_category": "travel", "country": "unusual"},
"weights": {"amount": 0.5, "merchant_category": 0.3}
}'
# "Find properties similar to this sold one"
curl -X POST http://localhost:6333/collections/properties/similar -H "Content-Type: application/json" -d '{
"example": {"sqft": 2500, "bedrooms": 4, "neighborhood": "downtown", "year_built": 2010}
}'
Problem: Need similarity search with full auditability — can't use black-box ML.
# Healthcare: Find similar patient cases
curl -X POST http://localhost:6333/collections/patients/similar -H "Content-Type: application/json" -d '{
"example": {"diagnosis_code": "E11.9", "age_group": "65+", "comorbidities": 3}
}'
# Response includes full explanation for audit trail:
# {
# "score": 0.78,
# "explain": {
# "diagnosis_code": 0.35, ← Same diagnosis
# "age_group": 0.25, ← Same age bracket
# "comorbidities": 0.18 ← Similar complexity
# }
# }
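For an audit trail, it can help to store the breakdown above as a single line with the largest contributions first. The sketch below is one way a client might do that once the response has been parsed. The function `audit_line`, its output format, and the record id used in the example are all hypothetical.

```rust
/// Formats a score and its per-field breakdown as one log line,
/// ordering fields from the largest contribution to the smallest.
pub fn audit_line(record_id: u64, score: f64, explain: &[(&str, f64)]) -> String {
    let mut ranked: Vec<(&str, f64)> = explain.to_vec();
    // total_cmp gives a total order over f64, so sorting cannot panic.
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    let parts: Vec<String> = ranked
        .iter()
        .map(|(field, value)| format!("{field}={value:.2}"))
        .collect();
    format!("id={record_id} score={score:.2} [{}]", parts.join(", "))
}
```

Given the breakdown shown above and a made-up record id of 42, this produces `id=42 score=0.78 [diagnosis_code=0.35, age_group=0.25, comorbidities=0.18]`.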
# Find comparable properties for valuation
curl -X POST http://localhost:6333/collections/properties/similar -H "Content-Type: application/json" -d '{
"example": {
"sqft": 2200,
"bedrooms": 3,
"bathrooms": 2,
"year_built": 2015,
"neighborhood": "downtown",
"property_type": "condo"
},
"weights": {"sqft": 0.3, "neighborhood": 0.25, "property_type": 0.2}
}'
# Find candidates similar to your top performers
curl -X POST http://localhost:6333/collections/employees/similar -H "Content-Type: application/json" -d '{
"example": {
"department": "engineering",
"years_experience": 5,
"skills": "rust,python",
"performance_rating": "exceeds"
}
}'
[dependencies]
distx = "0.2.5"
distx-similarity = "0.2.5" # Similarity Engine
distx-core = "0.2.5" # Core data structures
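use distx_similarity::{SimilaritySchema, FieldConfig, StructuredEmbedder, Reranker};
// DistanceType is used below; it is assumed here to be exported from the crate root.
use distx_similarity::DistanceType;
// The json! macro comes from the serde_json crate, which must also be listed under [dependencies].
use serde_json::json;
use std::collections::HashMap;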
// Define schema
let mut fields = HashMap::new();
fields.insert("name".to_string(), FieldConfig::text(0.5));
fields.insert("price".to_string(), FieldConfig::number(0.3, DistanceType::Relative));
fields.insert("category".to_string(), FieldConfig::categorical(0.2));
let schema = SimilaritySchema::new(fields);
let embedder = StructuredEmbedder::new(schema.clone());
// Auto-generate vector from payload
let payload = json!({"name": "Prosciutto", "price": 8.99, "category": "salumi"});
let vector = embedder.embed(&payload);
Licensed under MIT OR Apache-2.0 at your option.