| Crates.io | rank-refine |
| lib.rs | rank-refine |
| version | 0.7.36 |
| created_at | 2025-11-26 17:57:39.941439+00 |
| updated_at | 2025-11-29 22:09:48.273027+00 |
| description | SIMD-accelerated MaxSim, cosine, diversity (MMR/DPP) for vector search and RAG pipelines |
| homepage | |
| repository | https://github.com/arclabs561/rank-refine |
| max_upload_size | |
| id | 1951943 |
| size | 465,043 |
SIMD-accelerated similarity scoring for vector search and RAG. Provides MaxSim (ColBERT), cosine similarity, diversity selection (MMR, DPP), token pooling, and Matryoshka refinement.
cargo add rank-refine
Dense retrieval encodes each document as a single vector. This works for broad matching but loses token-level alignment.
Problem: A query like "capital of France" might match a document about "France's economic capital" with high similarity, even if it never mentions "capital" in the geographic sense.
Solution: Late interaction (ColBERT-style) keeps one vector per token instead of pooling. At query time, each query token finds its best-matching document token, then we sum those matches.
Dense: "the quick brown fox" → [0.1, 0.2, ...] (1 vector)
Late Interaction: "the quick brown fox" → [[...], [...], [...], [...]] (4 vectors)
This preserves token-level semantics that single-vector embeddings lose. Useful for reranking in RAG pipelines.
Scoring primitives for retrieval systems:
| You need | This crate provides |
|---|---|
| Score pre-computed embeddings | cosine, dot, maxsim |
| ColBERT/late interaction | maxsim_vecs, maxsim_batch |
| Diversity selection | mmr_cosine, dpp |
| Compress token embeddings | pool_tokens, pool_tokens_adaptive |
| Two-stage refinement | matryoshka::refine |
What this is NOT: embedding generation, model weights, or storage systems. This crate scores embeddings you provide; it does not run model inference. See fastembed-rs for inference. Trait-based interfaces available for custom models (see crossencoder module).
use rank_refine::simd::{cosine, maxsim_vecs};
// Dense similarity
let score = cosine(&query_embedding, &doc_embedding);
// Late interaction (ColBERT)
let score = maxsim_vecs(&query_tokens, &doc_tokens);
use rank_refine::colbert;
// Query: "capital of France" (32 tokens, 128-dim embeddings)
let query = vec![
vec![0.12, -0.45, 0.89, ...], // "capital" token
vec![0.34, 0.67, -0.23, ...], // "of" token
vec![0.78, -0.12, 0.45, ...], // "France" token
// ... 29 more tokens
];
// Document: "Paris is the capital of France" (100 tokens)
let doc = vec![
vec![0.11, -0.44, 0.90, ...], // "Paris" token
vec![0.35, 0.66, -0.24, ...], // "is" token
// ... 98 more tokens
];
// MaxSim finds best matches for each query token
let score = colbert::maxsim_vecs(&query, &doc);
// "capital" matches "capital" (0.95), "France" matches "France" (0.92)
// Score = 0.95 + 0.92 + ... (sum of best matches per query token)
| Function | Input | Notes |
|---|---|---|
| `cosine(a, b)` | `&[f32]` | Normalized, -1 to 1 |
| `dot(a, b)` | `&[f32]` | Unnormalized |
| `maxsim(q, d)` | `&[&[f32]]` | Sum of max similarities |
| `maxsim_cosine(q, d)` | `&[&[f32]]` | Cosine variant |
| `maxsim_weighted(q, d, w)` | `&[&[f32]]`, `&[f32]` | Per-token weights |
| `maxsim_batch(q, docs)` | `&[Vec<f32>]`, `&[Vec<Vec<f32>>]` | Batch scoring |
| Function | Tokens Kept | Notes |
|---|---|---|
| `pool_tokens(t, 2)` | 50% | Safe default |
| `pool_tokens(t, 4)` | 25% | Use hierarchical feature |
| `pool_tokens_adaptive(t, f)` | varies | Auto-selects greedy vs ward |
| `pool_tokens_with_protected(t, f, n)` | varies | Keeps first n tokens unpooled |
| Function | Algorithm |
|---|---|
| `mmr_cosine(candidates, embeddings, config)` | Maximal Marginal Relevance |
| `dpp(candidates, embeddings, config)` | Determinantal Point Process |
| Function | Purpose |
|---|---|
| `normalize_maxsim(score, qlen)` | Scale to [0,1] |
| `softmax_scores(scores)` | Probability distribution |
| `top_k_indices(scores, k)` | Top-k by score |
| `blend(a, b, α)` | Linear interpolation |
MaxSim scores token-level alignment. For each query token, find its best-matching document token, then sum:
$$\text{score}(Q, D) = \sum_{i} \max_{j} (q_i \cdot d_j)$$
Visual example:
Query tokens:     [q1]   [q2]   [q3]
                    \      |      /
                     \     |     /      Each query token searches
                      v    v    v       for its best match
Document tokens:  [d1]  [d2]  [d3]  [d4]
                   ↑           ↑
                  0.9         0.8       (best matches)

MaxSim = 0.9 + 0.8 + ... (sum of best matches)
Example: Query "capital of France" (2 tokens) vs document "Paris is the capital of France" (6 tokens):
dot("capital", "capital") = 0.95dot("France", "France") = 0.92This captures token-level alignment: "capital" and "France" both have strong matches, even if they appear in different parts of the document. Single-vector embeddings average these signals and lose precision.
When to use:

- Reranking a shortlist of candidates in a RAG pipeline where token-level precision matters (ColBERT-style late interaction).
- Queries where specific terms must align with specific document terms, as in the "capital of France" example above.

When not to use:

- Scoring single-vector embeddings; use cosine or dot instead.
- Scoring an entire corpus; MaxSim is intended for reranking roughly 100-1000 candidates (see the benchmarks below).
Problem: Top-k by relevance returns near-duplicates. A search for "async programming" might return 10 Python asyncio tutorials instead of examples in Python, Rust, JavaScript, and Go.
Solution: Balance relevance with diversity by penalizing similarity to already-selected items:
$$\text{MMR} = \arg\max_d \left[ \lambda \cdot \text{rel}(d) - (1-\lambda) \cdot \max_{s \in S} \text{sim}(d,s) \right]$$
Lambda parameter:
- λ = 1.0: Pure relevance (equivalent to top-k)
- λ = 0.5: Balanced (common default for RAG)
- λ = 0.3: Strong diversity (exploration mode)
- λ = 0.0: Maximum diversity (ignore relevance)

When to use:

- Selecting RAG context passages or search results where plain top-k returns near-duplicates.
- Result sets that should span multiple languages, frameworks, or subtopics (the "async programming" example above).
When not to use:

- Candidate sets that are already diverse; the similarity penalty then adds cost without changing much.
- When pure relevance ordering is the goal (λ = 1.0 reduces MMR to plain top-k).
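To make the selection rule concrete, here is a from-scratch greedy MMR loop over precomputed relevance and similarity scores; it illustrates the formula above rather than the crate's `mmr_cosine` API:

```rust
/// Greedy MMR: repeatedly pick the candidate maximizing
/// lambda * relevance - (1 - lambda) * max-similarity-to-selected.
fn mmr_select(relevance: &[f32], sim: &[Vec<f32>], lambda: f32, k: usize) -> Vec<usize> {
    let mut selected: Vec<usize> = Vec::new();
    while selected.len() < k.min(relevance.len()) {
        let mut best: Option<(usize, f32)> = None;
        for i in 0..relevance.len() {
            if selected.contains(&i) {
                continue;
            }
            // Penalty: similarity to the closest already-selected item.
            let max_sim = selected
                .iter()
                .map(|&j| sim[i][j])
                .fold(0.0_f32, f32::max);
            let score = lambda * relevance[i] - (1.0 - lambda) * max_sim;
            if best.map_or(true, |(_, s)| score > s) {
                best = Some((i, score));
            }
        }
        selected.push(best.unwrap().0);
    }
    selected
}

// Three near-duplicate Python results and one Rust result: with lambda = 0.5
// the second pick jumps to the Rust item even though its relevance is lower.
let relevance = vec![0.95, 0.94, 0.93, 0.80];
let sim = vec![
    vec![1.0, 0.98, 0.97, 0.20],
    vec![0.98, 1.0, 0.96, 0.22],
    vec![0.97, 0.96, 1.0, 0.21],
    vec![0.20, 0.22, 0.21, 1.0],
];
assert_eq!(mmr_select(&relevance, &sim, 0.5, 2), vec![0, 3]);
```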
Storage: ColBERT stores one vector per token. For 10M documents with 100 tokens each, at 128 dimensions in f32, that is 10M × 100 × 128 × 4 bytes ≈ 512 GB of token embeddings.
Token pooling: Cluster similar document tokens and store only cluster centroids:
Before:  [tok1] [tok2] [tok3] [tok4] [tok5] [tok6]   (6 vectors)
          similar--^           similar--^
After:   [mean(1,2)] [tok3] [mean(4,5)] [tok6]       (4 vectors = 33% reduction)
Why it works: Many tokens are redundant. Function words cluster together, as do related content words. Merging similar tokens loses little discriminative information.
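A toy version of that idea, merging adjacent near-duplicate tokens into their centroid. The crate's `pool_tokens` uses proper clustering rather than this adjacent-pair heuristic, so treat it purely as an illustration:

```rust
/// Toy pooling: walk the token sequence and average adjacent pairs whose
/// cosine similarity exceeds a threshold, keeping everything else as-is.
fn pool_adjacent(tokens: &[Vec<f32>], threshold: f32) -> Vec<Vec<f32>> {
    fn cosine(a: &[f32], b: &[f32]) -> f32 {
        let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
        let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
        let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
        dot / (na * nb).max(1e-12)
    }
    let mut pooled = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && cosine(&tokens[i], &tokens[i + 1]) > threshold {
            // Merge the similar pair into its centroid.
            let merged: Vec<f32> = tokens[i]
                .iter()
                .zip(&tokens[i + 1])
                .map(|(a, b)| (a + b) / 2.0)
                .collect();
            pooled.push(merged);
            i += 2;
        } else {
            pooled.push(tokens[i].clone());
            i += 1;
        }
    }
    pooled
}

// Six tokens with two near-duplicate pairs collapse to four vectors.
let tokens: Vec<Vec<f32>> = vec![
    vec![1.0, 0.0], vec![0.99, 0.01],  // near-duplicate pair
    vec![0.0, 1.0],
    vec![0.5, 0.5], vec![0.51, 0.49],  // near-duplicate pair
    vec![1.0, 1.0],
];
assert_eq!(pool_adjacent(&tokens, 0.95).len(), 4);
```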
| Factor | Tokens Kept | MRR@10 Loss | When to Use |
|---|---|---|---|
| 2 | 50% | 0.1–0.3% | Default choice |
| 3 | 33% | 0.5–1.0% | Good tradeoff |
| 4 | 25% | 1.5–3.0% | Storage-constrained (use hierarchical feature) |
| 8+ | 12% | 5–10% | Extreme storage constraints only |
Numbers from MS MARCO dev (Clavie et al., 2024). Pool at index time. Re-score with unpooled embeddings at query time if needed.
When pooling hurts more:

- Aggressive factors (4x and above), where greedy merging starts collapsing distinct content tokens; enable the hierarchical feature there.
- Short documents, which have fewer redundant tokens to merge in the first place.
Measured on Apple M3 Max with cargo bench:
| Operation | Dim | Time |
|---|---|---|
| `dot` | 128 | 13ns |
| `dot` | 768 | 126ns |
| `cosine` | 128 | 40ns |
| `cosine` | 768 | 380ns |
| `maxsim` | 32q×128d×128dim | 49μs |
| `maxsim` (pooled 2x) | 32q×64d×128dim | 25μs |
| `maxsim` (pooled 4x) | 32q×32d×128dim | 12μs |
These timings enable real-time reranking of 100-1000 candidates.
If you prefer not to add a dependency:
- `src/simd.rs` is self-contained (~600 lines)

What problem are you solving?
Scoring embeddings → Use cosine or dot
- cosine(&query, &doc)
- dot(&query, &doc)

Reranking with token-level precision → Use maxsim_vecs
Diversity selection → Use mmr_cosine or dpp
Compressing token embeddings → Use pool_tokens
Two-stage refinement with Matryoshka → Use matryoshka::refine
When not to use:

- You need embedding generation, model inference, or vector storage; this crate only scores embeddings you provide (see fastembed-rs for inference).
See REFERENCE.md for algorithm details and edge cases.
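Putting the guide together, a hedged sketch of a two-stage pipeline (dense filter, then MaxSim rerank) built on the quick-start imports; the placeholder embeddings and candidate counts are illustrative only:

```rust
use rank_refine::simd::{cosine, maxsim_vecs};

// Placeholder embeddings: 200 docs, 128-dim dense vectors, 32-token queries.
let query_vec: Vec<f32> = vec![0.1; 128];
let query_tokens: Vec<Vec<f32>> = vec![vec![0.1; 128]; 32];
let doc_vecs: Vec<Vec<f32>> = vec![vec![0.1; 128]; 200];
let doc_tokens: Vec<Vec<Vec<f32>>> = vec![vec![vec![0.1; 128]; 100]; 200];

// Stage 1: cheap dense filter over single-vector embeddings.
let mut candidates: Vec<(usize, f32)> = doc_vecs
    .iter()
    .enumerate()
    .map(|(i, d)| (i, cosine(&query_vec, d)))
    .collect();
candidates.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
candidates.truncate(100);

// Stage 2: MaxSim rerank of the shortlist with token-level embeddings.
let mut reranked: Vec<(usize, f32)> = candidates
    .iter()
    .map(|&(i, _)| (i, maxsim_vecs(&query_tokens, &doc_tokens[i])))
    .collect();
reranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
```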
| Feature | Dependency | Purpose |
|---|---|---|
| `hierarchical` | kodama | Ward's clustering (better at 4x+) |
MIT OR Apache-2.0