| Crates.io | jin |
| lib.rs | jin |
| version | 0.1.0 |
| created_at | 2026-01-18 16:47:57.766891+00 |
| updated_at | 2026-01-18 16:47:57.766891+00 |
| description | Approximate Nearest Neighbor Search: HNSW, DiskANN, IVF-PQ, ScaNN, quantization |
| homepage | |
| repository | https://github.com/arclabs561/jin |
| max_upload_size | |
| id | 2052648 |
| size | 2,578,171 |
Approximate nearest neighbor search in Rust.
Algorithms and benchmarks for vector search.
(jin: Chinese 近 "near")
Dual-licensed under MIT or Apache-2.0.
Distance semantics (what jin actually does today): different index implementations in jin currently assume different distance semantics. This is not yet uniform across the crate.
| Component | Metric | Notes |
|---|---|---|
| `hnsw::HNSWIndex` | cosine distance | Fast path assumes L2-normalized vectors |
| `ivf_pq::IVFPQIndex` | cosine distance | Uses dot-based cosine distance for IVF + PQ |
| `scann::SCANNIndex` | inner product / cosine | Uses dot products; reranking uses cosine distance |
| `hnsw::dual_branch::DualBranchHNSW` | L2 distance | Experimental implementation |
| `quantization` | Hamming-like / binary distances | See `quantization::simd_ops::hamming_distance` and ternary helpers |
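Several of these components assume cosine distance over unit-length vectors, so it usually pays to normalize embeddings before insertion. A minimal sketch in plain Rust (not a jin API):

```rust
/// L2-normalize a vector in place so that a dot product equals cosine
/// similarity. Plain helper for illustration; not part of jin's API.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}
```

The quick-start example below assumes its vectors are already unit length.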
```rust
use jin::hnsw::HNSWIndex;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Vectors should be L2-normalized for cosine distance.
    let vectors: Vec<Vec<f32>> = vec![
        vec![1.0, 0.0, 0.0, 0.0],
        vec![0.0, 1.0, 0.0, 0.0],
        vec![0.0, 0.0, 1.0, 0.0],
        vec![0.0, 0.0, 0.0, 1.0],
    ];
    let query = vec![1.0, 0.0, 0.0, 0.0];

    let mut index = HNSWIndex::new(4, 16, 32)?; // dim, m, m_max
    for (id, v) in vectors.iter().enumerate() {
        index.add_slice(id as u32, v)?;
    }
    index.build()?;

    let results = index.search(&query, 2, 50)?; // k, ef_search
    println!("{results:?}");
    Ok(())
}
```
Given a query vector, find the k most similar vectors from a collection. Brute force computes all N distances — O(N) per query. For 1M vectors, that's 1M distance computations per query.
ANN algorithms trade exactness for speed. Instead of guaranteeing the true nearest neighbors, they find neighbors that are probably correct, most of the time.
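For a sense of the baseline that ANN methods are avoiding, an exact scan looks like the sketch below (plain Rust, assuming unit-length vectors so a dot product is cosine similarity; not a jin API):

```rust
/// Exact k-NN by scanning every vector: O(N) dot products per query.
/// Returns (index, similarity) pairs, best first. Illustration only.
fn brute_force_knn(vectors: &[Vec<f32>], query: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = vectors
        .iter()
        .enumerate()
        .map(|(i, v)| (i, v.iter().zip(query).map(|(a, b)| a * b).sum()))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(k);
    scored
}
```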
HNSW (Hierarchical Navigable Small World) builds a multi-layer graph: upper layers are sparse and hold long-range edges, while the bottom layer contains every vector with denser connectivity. Search starts at the top layer, descends through the layers, and greedily follows edges toward the query (sketched after the diagram below).
```text
Layer 2:  A --------- B                (sparse, fast traversal)
          |           |
Layer 1:  A --- C --- B --- D          (medium density)
          |     |     |     |
Layer 0:  A--E--C--F--B--G--D--H       (dense, high recall)
```
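The descent itself can be sketched as pseudocode. This illustrates only the greedy step (effectively ef = 1), written over hypothetical `dist_to_query` and `neighbors` helpers; it is not jin's internal code:

```rust
/// Illustrative greedy descent over HNSW layers (not jin's internals).
/// `dist_to_query(node)` and `neighbors(layer, node)` are hypothetical helpers.
fn greedy_descent(
    top_layer: usize,
    entry: usize,
    dist_to_query: impl Fn(usize) -> f32,
    neighbors: impl Fn(usize, usize) -> Vec<usize>,
) -> usize {
    let mut current = entry;
    for layer in (0..=top_layer).rev() {
        // Move to the closest neighbor until no neighbor improves on the current node.
        loop {
            let best = neighbors(layer, current)
                .into_iter()
                .min_by(|&a, &b| dist_to_query(a).partial_cmp(&dist_to_query(b)).unwrap());
            match best {
                Some(n) if dist_to_query(n) < dist_to_query(current) => current = n,
                _ => break,
            }
        }
    }
    current
}
```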
The ef_search parameter controls how many candidates HNSW explores during a query: higher ef_search means better recall but slower queries. For most applications, ef_search in the 50-100 range achieves >95% recall.
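To choose a value empirically, sweep ef_search and measure recall against exact results. A rough sketch, where `search` is a hypothetical closure wrapping `index.search(query, k, ef)` from the example above and returning neighbor ids, and `truth[i]` holds the exact top-k ids for `queries[i]` (e.g. from the brute-force scan shown earlier):

```rust
/// Sweep candidate-list sizes and report recall@k against exact results.
/// `search(query, k, ef)` is a hypothetical closure returning neighbor ids.
fn sweep_ef(
    search: impl Fn(&[f32], usize, usize) -> Vec<u32>,
    queries: &[Vec<f32>],
    truth: &[Vec<u32>],
    k: usize,
) {
    for ef in [10, 25, 50, 100, 200] {
        let mut hits = 0usize;
        for (q, t) in queries.iter().zip(truth) {
            hits += search(q.as_slice(), k, ef)
                .iter()
                .filter(|&&id| t.contains(&id))
                .count();
        }
        let recall = hits as f64 / (queries.len() * k) as f64;
        println!("ef_search={ef:>4}  recall@{k} = {recall:.3}");
    }
}
```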
Not all datasets are equal: recall depends on data characteristics such as intrinsic dimensionality and hubness (He et al., 2012; Radovanović et al., 2010). The quick benchmark ships with datasets of varying difficulty:
```sh
cargo run --example 03_quick_benchmark --release                   # bench (medium)
JIN_DATASET=hard cargo run --example 03_quick_benchmark --release  # hard (stress test)
```
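One common way to quantify dataset difficulty is local intrinsic dimensionality (LID), estimated from nearest-neighbor distances. A minimal sketch of the usual maximum-likelihood estimator (illustration only; the LID example shipped with jin may compute this differently):

```rust
/// Maximum-likelihood LID estimate from distances to the k nearest neighbors
/// of one query point: lid = -1 / mean(ln(r_i / r_k)), with r_k the largest.
/// Distances must be positive and sorted ascending. Illustration only.
fn lid_mle(knn_dists: &[f32]) -> f32 {
    let r_k = *knn_dists.last().expect("need at least one neighbor");
    let mean_log: f32 = knn_dists
        .iter()
        .map(|&r| (r / r_k).ln())
        .sum::<f32>()
        / knn_dists.len() as f32;
    -1.0 / mean_log
}
```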
Different algorithms suit different constraints:
| Algorithm | Best For | Tradeoff |
|---|---|---|
| Brute force | < 10K vectors | Exact, but O(N) |
| LSH | Binary/sparse data | Fast, lower recall |
| IVF-PQ | Memory-constrained | Compressed, lower recall |
| HNSW | General use | Strong recall/latency tradeoff |
Graph construction time scales with M (edges per node): higher M gives better recall, but more memory and slower builds. Memory usage scales linearly with vector count; for dim=128 and M=16, expect roughly 0.5 KB per vector (vector data plus graph edges), as the back-of-the-envelope sketch below shows.
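As a back-of-the-envelope check, assuming f32 vector data, u32 neighbor ids, and roughly M edges per node at the base layer, and ignoring upper layers and per-node overhead (the exact layout depends on the implementation):

```rust
/// Rough per-vector memory estimate for a graph index: raw f32 vector data
/// plus base-layer neighbor ids. Illustration only.
fn approx_bytes_per_vector(dim: usize, edges_per_node: usize) -> usize {
    let vector_bytes = dim * std::mem::size_of::<f32>();          // 128 dims -> 512 B
    let edge_bytes = edges_per_node * std::mem::size_of::<u32>(); // 16 ids   ->  64 B
    vector_bytes + edge_bytes
}

// approx_bytes_per_vector(128, 16) == 576 bytes, i.e. roughly 0.5 KB per vector.
```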
| Type | Implementations |
|---|---|
| Graph | HNSW, NSW, Vamana (DiskANN), SNG |
| Hash | LSH, MinHash, SimHash |
| Partition | IVF-PQ, ScaNN |
| Quantization | PQ, RaBitQ |
```toml
[dependencies]
jin = { version = "0.1", features = ["hnsw"] }
```
- `hnsw` — HNSW graph index (default)
- `lsh` — Locality-Sensitive Hashing
- `ivf_pq` — Inverted File with Product Quantization
- `persistence` — WAL-based durability

Build with native CPU optimizations:
```sh
RUSTFLAGS="-C target-cpu=native" cargo build --release
```
Run benchmarks:
```sh
cargo bench
```
See examples/ for more: semantic search, IVF-PQ, LSH, LID, and real dataset benchmarks.
For benchmarking datasets, see doc/datasets.md — covers bundled data, ann-benchmarks.com datasets, and typical embedding dimensions.