| Crates.io | distx-schema |
| lib.rs | distx-schema |
| version | 0.2.7 |
| created_at | 2025-12-20 15:32:21.544906+00 |
| updated_at | 2025-12-20 17:32:24.161308+00 |
| description | Schema-driven structured similarity for tabular data - the Similarity Contract engine for DistX |
| homepage | |
| repository | https://github.com/antonellof/distx |
| max_upload_size | |
| id | 1996635 |
| size | 74,857 |
DistX does not store vectors that represent objects.
It stores objects, and derives vectors from their structure.
A high-performance vector database with the Similarity Contract — a schema-driven approach to structured similarity search that is deterministic, explainable, and requires no external ML.
The schema is not just configuration — it's a contract that governs:
| Aspect | What the Schema Controls |
|---|---|
| Ingest | How objects are converted to vectors (deterministic, reproducible) |
| Query | How similarity is computed across multiple field types |
| Ranking | How results are scored with structured distance functions |
| Explainability | How each field contributes to the final score |
This is an architectural difference, not just an API feature. You cannot replicate this with Qdrant hybrid queries without replicating half this codebase client-side.
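The "Ingest" row above says objects are converted to vectors deterministically. One common way to get that property is feature hashing with a fixed hash function, sketched below for a single text field. This is an assumption for illustration, not DistX's actual encoding; `fnv1a` and `embed_text` are hypothetical names.

```rust
/// FNV-1a with fixed constants, so the result is identical across runs and machines.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for b in bytes {
        hash ^= *b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Hash lowercase whitespace-separated tokens into `dim` buckets, then L2-normalize.
/// `dim` must be greater than zero.
fn embed_text(text: &str, dim: usize) -> Vec<f32> {
    let mut v = vec![0.0f32; dim];
    for token in text.to_lowercase().split_whitespace() {
        let idx = (fnv1a(token.as_bytes()) % dim as u64) as usize;
        v[idx] += 1.0;
    }
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in &mut v {
            *x /= norm;
        }
    }
    v
}

fn main() {
    // Same tokens, different casing and spacing: the vectors are identical.
    let a = embed_text("Prosciutto di Parma DOP", 16);
    let b = embed_text("prosciutto  DI parma dop", 16);
    assert_eq!(a, b);
    println!("{:?}", a);
}
```

Because nothing in the function depends on a trained model or random state, re-ingesting the same object always yields the same vector.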
| DistX IS | DistX is NOT |
|---|---|
| A contract-based similarity engine | A neural embedding model |
| Deterministic and reproducible | A probabilistic LLM system |
| Designed for structured/tabular data | A black-box recommender |
| Fully explainable (per-field scores) | Dependent on external ML APIs |
Target domains: ERP, e-commerce, CRM, financial data, any tabular dataset.
┌──────────────────────────────────────────────────────────────────────────┐
│ Traditional Vector Database │
│ ─────────────────────────── │
│ Your Data → External ML API → Embeddings → Vector DB → Score: 0.87 │
│ (cost per call) (black box) (unexplained) │
│ (model drift) (retraining) (no breakdown) │
├──────────────────────────────────────────────────────────────────────────┤
│ DistX with Similarity Contract │
│ ────────────────────────────── │
│ Your Data → Schema (JSON) → Deterministic → Explainable Results │
│ (contract) (no drift) (name: 0.25, price: 0.22) │
│ (stable) (reproducible) (auditable) │
└──────────────────────────────────────────────────────────────────────────┘
📖 Detailed comparison with Qdrant, Pinecone, Elasticsearch →
The first schema-driven structured similarity engine with built-in explainability.
Define a Similarity Contract, insert your data, and query by example — vectors are derived automatically from object structure. No external ML, no embedding pipelines, no black-box scores.
# 1. Define similarity schema
curl -X PUT http://localhost:6333/collections/products -H "Content-Type: application/json" -d '{
"similarity_schema": {
"fields": {
"name": {"type": "text", "weight": 0.4},
"price": {"type": "number", "distance": "relative", "weight": 0.3},
"category": {"type": "categorical", "weight": 0.2},
"brand": {"type": "categorical", "weight": 0.1}
}
}
}'
# 2. Insert data (vectors auto-generated)
curl -X PUT http://localhost:6333/collections/products/points -H "Content-Type: application/json" -d '{
"points": [
{"id": 1, "payload": {"name": "Prosciutto di Parma DOP", "price": 8.99, "category": "salumi", "brand": "Parma"}},
{"id": 2, "payload": {"name": "Prosciutto cotto", "price": 4.99, "category": "salumi", "brand": "Negroni"}},
{"id": 3, "payload": {"name": "iPhone 15 Pro", "price": 1199, "category": "electronics", "brand": "Apple"}}
]
}'
# 3. Query by example
curl -X POST http://localhost:6333/collections/products/similar -H "Content-Type: application/json" \
-d '{"example": {"name": "prosciutto crudo", "price": 8.0, "category": "salumi"}, "limit": 3}'
Response includes per-field explainability:
┌──────┬─────────────────────────────┬───────┬──────────────────────────────────┐
│ Rank │ Product │ Score │ Contribution Breakdown │
├──────┼─────────────────────────────┼───────┼──────────────────────────────────┤
│ 1 │ Prosciutto di Parma DOP │ 0.71 │ name: 0.22, price: 0.22 │
│ 2 │ Prosciutto cotto │ 0.68 │ name: 0.25, category: 0.20 │
│ 3 │ Coppa di Parma │ 0.53 │ category: 0.20, price: 0.25 │
└──────┴─────────────────────────────┴───────┴──────────────────────────────────┘
| Capability | Description |
|---|---|
| Schema-Driven | Declarative field definitions with typed similarity (text, number, categorical, boolean) |
| Auto-Embedding | Deterministic vector generation from structured payloads |
| Query by Example | Natural JSON queries instead of raw vectors |
| Explainable Scoring | Per-field contribution breakdown for every result |
| Dynamic Weights | Override field importance at query time without re-indexing |
| Zero External Dependencies | Fully self-contained, works offline and air-gapped |
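To make "typed similarity" concrete, here is a minimal sketch of one possible similarity function per field type, combined with the schema weights from the example above. The specific measures (token Jaccard for text, `1 - |a - b| / max(|a|, |b|)` for `relative` numbers, exact match for categoricals) are assumptions chosen for illustration; the README does not specify the formulas DistX uses, so the numbers will not match its sample responses.

```rust
use std::collections::HashSet;

/// Jaccard overlap of lowercase whitespace tokens (an assumed text measure).
fn text_similarity(a: &str, b: &str) -> f64 {
    let a_lower = a.to_lowercase();
    let b_lower = b.to_lowercase();
    let ta: HashSet<&str> = a_lower.split_whitespace().collect();
    let tb: HashSet<&str> = b_lower.split_whitespace().collect();
    let union = ta.union(&tb).count();
    if union == 0 {
        return 0.0;
    }
    ta.intersection(&tb).count() as f64 / union as f64
}

/// Relative closeness of two numbers, clamped to [0, 1] (an assumed formula).
fn relative_similarity(a: f64, b: f64) -> f64 {
    let denom = a.abs().max(b.abs());
    if denom == 0.0 {
        return 1.0; // both values are zero
    }
    (1.0 - (a - b).abs() / denom).max(0.0)
}

/// Exact-match similarity for categorical values.
fn categorical_similarity(a: &str, b: &str) -> f64 {
    if a == b { 1.0 } else { 0.0 }
}

fn main() {
    // Query and candidate from the example above; weights from its schema.
    let contributions = [
        ("name", 0.4 * text_similarity("prosciutto crudo", "Prosciutto cotto")),
        ("price", 0.3 * relative_similarity(8.0, 4.99)),
        ("category", 0.2 * categorical_similarity("salumi", "salumi")),
    ];
    let score: f64 = contributions.iter().map(|(_, c)| c).sum();
    for (field, c) in &contributions {
        println!("{field}: {c:.3}");
    }
    println!("score: {score:.3}");
}
```

Each contribution is a weight times a similarity in [0, 1], which is what makes a per-field breakdown possible.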
# Same data, different meaning of "similar" — no re-indexing required
# Query 1: "Find similar products" (balanced)
curl -X POST http://localhost:6333/collections/products/similar \
-H "Content-Type: application/json" \
-d '{"example": {"name": "iPhone 15"}, "limit": 5}'
# Query 2: "Find cheaper alternatives" (boost price)
curl -X POST http://localhost:6333/collections/products/similar \
-H "Content-Type: application/json" \
-d '{"example": {"name": "iPhone 15"}, "weights": {"price": 0.7, "name": 0.2}, "limit": 5}'
# Query 3: "Find same brand, any price" (boost brand)
curl -X POST http://localhost:6333/collections/products/similar \
-H "Content-Type: application/json" \
-d '{"example": {"name": "iPhone 15"}, "weights": {"brand": 0.6, "category": 0.3}, "limit": 5}'
In Qdrant: You would need to re-embed everything or build complex client-side logic.
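Why query-time weights need no re-indexing can be sketched as follows: if per-field similarities are computed independently, changing the weights only changes how they are combined. The function below is a hypothetical illustration (the name `weighted_score`, the normalization by total weight, and the treatment of unlisted fields are all assumptions, not DistX's documented behavior).

```rust
/// Weighted average of per-field similarities.
/// Fields missing from `weights` are ignored; fields missing from `sims` count as 0.
fn weighted_score(sims: &[(&str, f64)], weights: &[(&str, f64)]) -> f64 {
    let total: f64 = weights.iter().map(|(_, w)| w).sum();
    if total <= 0.0 {
        return 0.0;
    }
    let mut acc = 0.0;
    for (field, w) in weights {
        let sim = sims
            .iter()
            .find(|(f, _)| f == field)
            .map(|(_, s)| *s)
            .unwrap_or(0.0);
        acc += w * sim;
    }
    acc / total
}

fn main() {
    // Per-field similarities of two candidates against the same query (made-up values).
    let a = [("name", 0.9), ("price", 0.2)];
    let b = [("name", 0.4), ("price", 0.95)];

    let by_name = [("name", 0.8), ("price", 0.2)];
    let by_price = [("price", 0.7), ("name", 0.2)];

    // Name-heavy weights rank A first; price-heavy weights rank B first.
    assert!(weighted_score(&a, &by_name) > weighted_score(&b, &by_name));
    assert!(weighted_score(&b, &by_price) > weighted_score(&a, &by_price));
    println!("ranking flips with the weights; stored data is unchanged");
}
```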
# One Similarity Contract works across domains:
# Products
{"name": "iPhone 15", "price": 999, "category": "electronics", "brand": "Apple"}
# Suppliers
{"name": "Acme Corp", "price": 5000000, "category": "manufacturing", "brand": "certified"}
# Financial Assets
{"name": "AAPL Stock", "price": 178.50, "category": "equity", "brand": "tech"}
# Same schema, same queries, same explainability — across all datasets
This is not product-specific. The Similarity Contract is domain-agnostic.
📖 Documentation · Interactive Demo · Comparison with Alternatives
DistX maintains full compatibility with the Qdrant API. The Similarity Engine is additive: all standard vector operations work exactly as they do in Qdrant.
# Standard Qdrant-compatible vector search still works!
curl -X POST http://localhost:6333/collections/my_collection/points/search \
-d '{"vector": [0.1, 0.2, 0.3, ...], "limit": 10}'
# Pull and run (with persistent storage)
docker run -d --name distx \
-p 6333:6333 -p 6334:6334 \
-v distx_data:/qdrant/storage \
distx/distx:latest
# Or with docker-compose
docker-compose up -d
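DistX is now running. The REST API is served at http://localhost:6333, the address used in the examples below; the command above also publishes port 6334.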
curl -X PUT http://localhost:6333/collections/products \
-H "Content-Type: application/json" \
-d '{
"similarity_schema": {
"fields": {
"name": {"type": "text", "weight": 0.4},
"price": {"type": "number", "distance": "relative", "weight": 0.3},
"category": {"type": "categorical", "weight": 0.2},
"in_stock": {"type": "boolean", "weight": 0.1}
}
}
}'
curl -X PUT http://localhost:6333/collections/products/points \
-H "Content-Type: application/json" \
-d '{
"points": [
{"id": 1, "payload": {"name": "Prosciutto di Parma DOP", "price": 8.99, "category": "salumi", "in_stock": true}},
{"id": 2, "payload": {"name": "Prosciutto cotto", "price": 4.99, "category": "salumi", "in_stock": true}},
{"id": 3, "payload": {"name": "Mortadella Bologna", "price": 3.99, "category": "salumi", "in_stock": false}},
{"id": 4, "payload": {"name": "Parmigiano Reggiano", "price": 18.99, "category": "cheese", "in_stock": true}},
{"id": 5, "payload": {"name": "Grana Padano", "price": 14.99, "category": "cheese", "in_stock": true}}
]
}'
# Find products similar to "prosciutto crudo around $8"
curl -X POST http://localhost:6333/collections/products/similar \
-H "Content-Type: application/json" \
-d '{
"example": {
"name": "prosciutto crudo",
"price": 8.0,
"category": "salumi"
},
"limit": 3
}'
Response with explainable scores:
{
"result": [
{
"id": 1,
"score": 0.71,
"payload": {"name": "Prosciutto di Parma DOP", "price": 8.99, "category": "salumi"},
"explain": {"name": 0.22, "price": 0.24, "category": 0.20, "in_stock": 0.05}
},
{
"id": 2,
"score": 0.65,
"payload": {"name": "Prosciutto cotto", "price": 4.99, "category": "salumi"},
"explain": {"name": 0.25, "price": 0.15, "category": 0.20, "in_stock": 0.05}
}
]
}
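In the sample response above, each result's `explain` values add up to its `score` (0.22 + 0.24 + 0.20 + 0.05 = 0.71 and 0.25 + 0.15 + 0.20 + 0.05 = 0.65). A client could use that as an audit check. The sketch below assumes the relationship holds in general, which the README suggests but does not state; `explained_total` and `is_consistent` are hypothetical helper names.

```rust
/// Sum of the per-field contributions in an `explain` object.
fn explained_total(explain: &[(&str, f64)]) -> f64 {
    explain.iter().map(|(_, c)| c).sum()
}

/// True when the reported score matches the summed contributions within `tol`.
fn is_consistent(score: f64, explain: &[(&str, f64)], tol: f64) -> bool {
    (score - explained_total(explain)).abs() <= tol
}

fn main() {
    // Values copied from the sample response above.
    let first = [("name", 0.22), ("price", 0.24), ("category", 0.20), ("in_stock", 0.05)];
    let second = [("name", 0.25), ("price", 0.15), ("category", 0.20), ("in_stock", 0.05)];

    // The tolerance allows for scores rounded to two decimals.
    assert!(is_consistent(0.71, &first, 0.005));
    assert!(is_consistent(0.65, &second, 0.005));
    println!("both breakdowns add up to their scores");
}
```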
# Find cheaper alternatives (boost price importance)
curl -X POST http://localhost:6333/collections/products/similar \
-H "Content-Type: application/json" \
-d '{
"example": {"name": "prosciutto", "category": "salumi"},
"weights": {"price": 0.6, "name": 0.2, "category": 0.2},
"limit": 3
}'
# Find products similar to ID 4 (Parmigiano Reggiano)
curl -X POST http://localhost:6333/collections/products/similar \
-H "Content-Type: application/json" \
-d '{"like_id": 4, "limit": 3}'
# Full demo with sample data
python scripts/similarity_demo.py
# Or run specific demos
python scripts/similarity_demo.py --demo products
python scripts/similarity_demo.py --demo suppliers
DistX also supports standard Qdrant-compatible vector operations:
# Create collection with vectors
curl -X PUT http://localhost:6333/collections/embeddings \
-H "Content-Type: application/json" \
-d '{"vectors": {"size": 128, "distance": "Cosine"}}'
# Insert with vectors
curl -X PUT http://localhost:6333/collections/embeddings/points \
-H "Content-Type: application/json" \
-d '{
"points": [
{"id": 1, "vector": [0.1, 0.2, ...], "payload": {"text": "example"}}
]
}'
# Vector search
curl -X POST http://localhost:6333/collections/embeddings/points/search \
-H "Content-Type: application/json" \
-d '{"vector": [0.1, 0.2, ...], "limit": 10}'
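For reference, the `Cosine` distance configured above ranks vectors by the cosine of the angle between them. A minimal standalone sketch of that measure (not DistX's internal implementation, which may use optimized routines):

```rust
/// Cosine similarity of two equal-length vectors.
/// Returns `None` for empty, mismatched, or zero-length (all-zero) inputs.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

fn main() {
    let query = [0.1, 0.2, 0.3];
    let stored = [0.2, 0.4, 0.6]; // same direction, different magnitude
    println!("{:?}", cosine_similarity(&query, &stored));
}
```

Cosine similarity depends only on direction, so scaling a vector does not change its ranking.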
# From crates.io
cargo install distx
distx --data-dir ./data
# From source
git clone https://github.com/antonellof/distx
cd distx && cargo build --release
./target/release/distx
| Metric | Performance |
|---|---|
| Vector Insert | ~8,000 ops/sec |
| Vector Search | ~400-500 ops/sec |
| Search Latency (p50) | ~2ms |
| Search Latency (p99) | ~5ms |
| Similarity Query | <1ms overhead |
Benchmarks: 5,000 vectors, 128 dimensions, Cosine distance
| Guide | Description |
|---|---|
| Similarity Engine | Schema-driven similarity for tabular data |
| Similarity Demo | Interactive walkthrough with examples |
| Comparison | DistX vs Qdrant, Pinecone, Elasticsearch |
| Quick Start | Get started in 5 minutes |
| Docker Guide | Container deployment |
| API Reference | REST and gRPC endpoints |
| Architecture | System design |
Problem: "Show me products similar to this one" — but similarity means different things (style, price, brand).
# Similar products for "customers also viewed"
curl -X POST http://localhost:6333/collections/products/similar -H "Content-Type: application/json" -d '{
"example": {"name": "Nike Air Max 90", "price": 129, "category": "sneakers", "brand": "Nike"},
"limit": 6
}'
# Budget alternatives (boost price importance)
curl -X POST http://localhost:6333/collections/products/similar -H "Content-Type: application/json" -d '{
"like_id": 123,
"weights": {"price": 0.6, "category": 0.3, "brand": 0.1}
}'
Problem: Find the best supplier match based on multiple criteria without building ML pipelines.
# Find suppliers similar to your top performer
curl -X POST http://localhost:6333/collections/suppliers/similar -H "Content-Type: application/json" -d '{
"example": {
"industry": "manufacturing",
"annual_revenue": 5000000,
"employee_count": 150,
"certified": true,
"location": "Milan"
},
"limit": 10
}'
Problem: Find similar customers for segmentation, lead scoring, or churn prediction.
# Find customers similar to your best ones
curl -X POST http://localhost:6333/collections/customers/similar -H "Content-Type: application/json" -d '{
"example": {
"industry": "fintech",
"company_size": "enterprise",
"annual_spend": 50000,
"engagement_score": 85
}
}'
# Find leads similar to closed-won deals
curl -X POST http://localhost:6333/collections/leads/similar -H "Content-Type: application/json" -d '{
"like_id": "deal_12345",
"weights": {"deal_size": 0.4, "industry": 0.3, "company_size": 0.3}
}'
Problem: Find duplicate or near-duplicate records without exact matching.
# Find potential duplicates
curl -X POST http://localhost:6333/collections/contacts/similar -H "Content-Type: application/json" -d '{
"example": {"name": "John Smith", "email": "j.smith@acme.com", "company": "Acme Inc"},
"limit": 5
}'
# Response shows WHY records might be duplicates
# → name: 0.35 (similar names)
# → company: 0.25 (same company)
# → email: 0.15 (different email domain)
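A client can turn such a breakdown into a ranked list of reasons for a reviewer. The sketch below uses the contribution values from the comment above; `ranked_reasons` is a hypothetical helper, not part of the DistX API.

```rust
/// Sort per-field contributions from largest to smallest and format them for display.
fn ranked_reasons(explain: &[(&str, f64)]) -> Vec<String> {
    let mut fields: Vec<(&str, f64)> = explain.to_vec();
    fields.sort_by(|a, b| b.1.total_cmp(&a.1));
    fields.iter().map(|(f, c)| format!("{f}: {c:.2}")).collect()
}

fn main() {
    // Contributions from the duplicate-detection example, in arbitrary order.
    let explain = [("email", 0.15), ("name", 0.35), ("company", 0.25)];
    for reason in ranked_reasons(&explain) {
        println!("{reason}");
    }
}
```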
Problem: Explore datasets by finding similar records without writing complex SQL.
# "Find transactions similar to this suspicious one"
curl -X POST http://localhost:6333/collections/transactions/similar -H "Content-Type: application/json" -d '{
"example": {"amount": 9999, "merchant_category": "travel", "country": "unusual"},
"weights": {"amount": 0.5, "merchant_category": 0.3}
}'
# "Find properties similar to this sold one"
curl -X POST http://localhost:6333/collections/properties/similar -H "Content-Type: application/json" -d '{
"example": {"sqft": 2500, "bedrooms": 4, "neighborhood": "downtown", "year_built": 2010}
}'
Problem: Need similarity search with full auditability — can't use black-box ML.
# Healthcare: Find similar patient cases
curl -X POST http://localhost:6333/collections/patients/similar -H "Content-Type: application/json" -d '{
"example": {"diagnosis_code": "E11.9", "age_group": "65+", "comorbidities": 3}
}'
# Response includes full explanation for audit trail:
# {
# "score": 0.78,
# "explain": {
# "diagnosis_code": 0.35, ← Same diagnosis
# "age_group": 0.25, ← Same age bracket
# "comorbidities": 0.18 ← Similar complexity
# }
# }
# Find comparable properties for valuation
curl -X POST http://localhost:6333/collections/properties/similar -H "Content-Type: application/json" -d '{
"example": {
"sqft": 2200,
"bedrooms": 3,
"bathrooms": 2,
"year_built": 2015,
"neighborhood": "downtown",
"property_type": "condo"
},
"weights": {"sqft": 0.3, "neighborhood": 0.25, "property_type": 0.2}
}'
# Find candidates similar to your top performers
curl -X POST http://localhost:6333/collections/employees/similar -H "Content-Type: application/json" -d '{
"example": {
"department": "engineering",
"years_experience": 5,
"skills": "rust,python",
"performance_rating": "exceeds"
}
}'
[dependencies]
distx = "0.2.5"
distx-schema = "0.2.5" # Similarity Engine
distx-core = "0.2.5" # Core data structures
// Note: the crate is published as `distx-schema`; adjust the import path if the library name differs.
// `DistanceType` is assumed to be exported alongside the other types.
use distx_similarity::{DistanceType, FieldConfig, Reranker, SimilaritySchema, StructuredEmbedder};
use serde_json::json; // requires `serde_json` in [dependencies]
use std::collections::HashMap;
// Define the schema: one weighted field per similarity dimension
let mut fields = HashMap::new();
fields.insert("name".to_string(), FieldConfig::text(0.5));
fields.insert("price".to_string(), FieldConfig::number(0.3, DistanceType::Relative));
fields.insert("category".to_string(), FieldConfig::categorical(0.2));
let schema = SimilaritySchema::new(fields);
let embedder = StructuredEmbedder::new(schema.clone());
// Auto-generate a vector from a JSON payload
let payload = json!({"name": "Prosciutto", "price": 8.99, "category": "salumi"});
let vector = embedder.embed(&payload);
Licensed under MIT OR Apache-2.0 at your option.