| Crates.io | umap-rs |
| lib.rs | umap-rs |
| version | 0.4.5 |
| created_at | 2025-11-23 01:36:12.740535+00 |
| updated_at | 2025-12-01 05:27:26.609769+00 |
| description | Fast, parallel, memory-efficient Rust implementation of UMAP |
| homepage | https://github.com/wilsonzlin/umap-rs |
| repository | https://github.com/wilsonzlin/umap-rs |
| max_upload_size | |
| id | 1945971 |
| size | 129,715 |
Fast, parallel Rust implementation of the core UMAP algorithm. Clean, modern codebase focused on performance and correctness.
Scales to 264 million samples in 2 hours on a 126-core machine (see Performance).
See DIVERGENCES.md for detailed comparison to Python umap-learn.
use ndarray::Array2;
use umap_rs::{GraphParams, Umap, UmapConfig};
// Configure UMAP parameters
let config = UmapConfig {
n_components: 2,
graph: GraphParams {
n_neighbors: 15,
..Default::default()
},
..Default::default()
};
// Create UMAP instance
let umap = Umap::new(config);
// Precompute KNN (use your favorite ANN library: pynndescent, hnswlib, etc.)
let knn_indices: Array2<u32> = /* shape (n_samples, n_neighbors) */;
let knn_dists: Array2<f32> = /* shape (n_samples, n_neighbors) */;
// Provide initialization (see Initialization section below)
// Common choices: random, PCA, or your own custom embedding
let init: Array2<f32> = /* shape (n_samples, n_components) */;
// Fit UMAP to data
let model = umap.fit(
data.view(),
knn_indices.view(),
knn_dists.view(),
init.view(),
);
// Get the embedding
let embedding = model.embedding(); // Returns ArrayView2<f32>
// Or take ownership of the embedding
let embedding = model.into_embedding(); // Returns Array2<f32>
UMAP training has two phases: learning the manifold (building the graph) and optimizing the embedding (running gradient descent). The first is deterministic and expensive, the second is iterative and can be interrupted.
For long training runs, you can checkpoint the optimization and resume if interrupted:
use umap_rs::{Metric, Optimizer};
// Phase 1: Learn the manifold structure from your data
// This builds the fuzzy topological graph. It's slow but deterministic -
// same inputs always give same manifold.
let manifold = umap.learn_manifold(data.view(), knn_indices.view(), knn_dists.view());
// Phase 2: Optimize the embedding via gradient descent
// Create an optimizer that will run 500 epochs of SGD
let metric = umap_rs::EuclideanMetric;
let mut opt = Optimizer::new(manifold, init, 500, &config, metric.metric_type());
// Train in chunks of 10 epochs at a time
while opt.remaining_epochs() > 0 {
opt.step_epochs(10, &metric); // Run 10 more epochs
// Periodically save a checkpoint (embedding + all optimization state)
if opt.current_epoch() % 50 == 0 {
std::fs::write(
format!("checkpoint_{}.bin", opt.current_epoch()),
bincode::serialize(&opt)?
)?;
}
}
// Training done - convert to final lightweight model
let fitted = opt.into_fitted(config);
If your process is interrupted, load the checkpoint and continue:
// Deserialize the optimizer state from disk
let mut opt: Optimizer = bincode::deserialize(&std::fs::read("checkpoint_250.bin")?)?;
// Continue from epoch 250 to 500
while opt.remaining_epochs() > 0 {
opt.step_epochs(10, &metric);
}
let fitted = opt.into_fitted(config);
The checkpoint contains everything: current embedding, epoch counters, and the manifold. When training completes, convert to a final model:
// Training done - drop the heavy optimization state
let fitted = opt.into_fitted(config);
// Access the embedding
let embedding = fitted.embedding(); // Zero-copy view
// Or take ownership
let embedding = fitted.into_embedding();
// Save the final model (much smaller than checkpoints)
std::fs::write("model.bin", bincode::serialize(&fitted)?)?;
The serialized FittedUmap contains just the manifold and embedding, not the optimization state, making it lightweight for long-term storage.
You must provide your own initialization. This library is designed to be minimal and focused on the core UMAP optimization - initialization is left to the caller.
Random initialization (simplest):
use ndarray::Array2;
use rand::Rng;
fn random_init(n_samples: usize, n_components: usize) -> Array2<f32> {
let mut rng = rand::thread_rng();
Array2::from_shape_fn((n_samples, n_components), |_| {
rng.gen_range(-10.0..10.0)
})
}
PCA initialization (recommended for better convergence):
// Use any PCA library (e.g., linfa-reduction, ndarray-stats, etc.)
// Project data to first n_components principal components
// Scale to roughly [-10, 10] range
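If you don't want to pull in a PCA crate, a minimal PCA initialization can be sketched with power iteration on the covariance matrix. The function names below (`pca_init`, `top_component`) and the plain `Vec<Vec<f32>>` representation are illustrative only, not part of umap-rs; in practice you would convert the result into the `Array2<f32>` the library expects.

```rust
// Sketch: PCA initialization via power iteration, no external PCA crate.
// data is row-major: data[i] is sample i. Names here are illustrative.
fn top_component(cov: &Vec<Vec<f32>>) -> Vec<f32> {
    let d = cov.len();
    let mut v = vec![1.0f32; d];
    for _ in 0..100 {
        // w = cov * v
        let mut w = vec![0.0f32; d];
        for i in 0..d {
            for j in 0..d {
                w[i] += cov[i][j] * v[j];
            }
        }
        let norm = w.iter().map(|x| x * x).sum::<f32>().sqrt().max(1e-12);
        v = w.iter().map(|x| x / norm).collect();
    }
    v
}

fn pca_init(data: &[Vec<f32>], n_components: usize) -> Vec<Vec<f32>> {
    let (n, d) = (data.len(), data[0].len());
    // Center the data.
    let mut mean = vec![0.0f32; d];
    for row in data { for j in 0..d { mean[j] += row[j]; } }
    for j in 0..d { mean[j] /= n as f32; }
    // Covariance matrix (d x d).
    let mut cov = vec![vec![0.0f32; d]; d];
    for row in data {
        for i in 0..d {
            for j in 0..d {
                cov[i][j] += (row[i] - mean[i]) * (row[j] - mean[j]) / n as f32;
            }
        }
    }
    // Top components via power iteration + deflation.
    let mut comps = Vec::new();
    for _ in 0..n_components {
        let v = top_component(&cov);
        // Deflate: cov -= lambda * v v^T, with lambda = v^T cov v.
        let mut cv = vec![0.0f32; d];
        for i in 0..d { for j in 0..d { cv[i] += cov[i][j] * v[j]; } }
        let lambda: f32 = v.iter().zip(&cv).map(|(a, b)| a * b).sum();
        for i in 0..d { for j in 0..d { cov[i][j] -= lambda * v[i] * v[j]; } }
        comps.push(v);
    }
    // Project onto the components and rescale each dimension to roughly [-10, 10].
    let mut out = vec![vec![0.0f32; n_components]; n];
    for c in 0..n_components {
        for (i, row) in data.iter().enumerate() {
            out[i][c] = (0..d).map(|j| (row[j] - mean[j]) * comps[c][j]).sum();
        }
        let max_abs = out.iter().map(|r| r[c].abs()).fold(1e-12f32, f32::max);
        for r in out.iter_mut() { r[c] *= 10.0 / max_abs; }
    }
    out
}
```

For real workloads a proper PCA library is preferable; this sketch is O(n·d²) and only suitable for modest input dimensionality.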
Custom initialization: any `Array2<f32>` of shape `(n_samples, n_components)` works, for example coordinates carried over from a previous embedding of the same data.
UMAP parameters are grouped logically:
use umap_rs::UmapConfig;
let config = UmapConfig {
n_components: 2, // Output dimensions
..Default::default()
};
use umap_rs::config::ManifoldParams;
let manifold = ManifoldParams {
min_dist: 0.1, // Minimum distance in embedding
spread: 1.0, // Scale of embedding
a: None, // Auto-computed from min_dist/spread
b: None, // Auto-computed from min_dist/spread
};
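The `a` and `b` parameters define the low-dimensional similarity curve φ(d) = 1 / (1 + a·d^(2b)) used during optimization; they are normally fitted from `min_dist` and `spread`. As a point of reference, umap-learn's fit for the defaults `min_dist = 0.1`, `spread = 1.0` gives a ≈ 1.577, b ≈ 0.895 (values taken from umap-learn; this crate's auto-computation may differ slightly):

```rust
// Low-dimensional similarity curve used by UMAP's optimizer:
// phi(d) = 1 / (1 + a * d^(2b)).
// The constants in the test are umap-learn's fit for min_dist=0.1, spread=1.0.
fn phi(d: f32, a: f32, b: f32) -> f32 {
    1.0 / (1.0 + a * d.powf(2.0 * b))
}
```

Smaller `min_dist` makes the curve stay near 1 for a shorter range, which is why points pack more tightly in the embedding.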
use umap_rs::config::GraphParams;
let graph = GraphParams {
n_neighbors: 15, // Number of nearest neighbors
local_connectivity: 1.0, // Local neighborhood connectivity
set_op_mix_ratio: 1.0, // Fuzzy union (1.0) vs intersection (0.0)
disconnection_distance: None, // Auto-computed from metric
symmetrize: true, // Symmetrize graph (set false to save memory)
};
The symmetrize option controls whether the fuzzy graph is symmetrized via fuzzy set union. For very large datasets, setting symmetrize: false roughly halves memory usage with minimal impact on 2D visualization quality.
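Concretely, for two directed membership weights w_ab and w_ba, `set_op_mix_ratio` interpolates between fuzzy union and fuzzy intersection. The blend below is umap-learn's formula, assumed (not verified against this crate's source) to match the behavior here:

```rust
// Fuzzy set blend used when symmetrizing the directed kNN graph
// (umap-learn's formula; assumed to match this crate's behavior).
// mix = 1.0 -> probabilistic fuzzy union: a + b - a*b
// mix = 0.0 -> fuzzy intersection: a * b
fn blend(w_ab: f32, w_ba: f32, mix: f32) -> f32 {
    let union = w_ab + w_ba - w_ab * w_ba;
    let intersection = w_ab * w_ba;
    mix * union + (1.0 - mix) * intersection
}
```

With the default `set_op_mix_ratio: 1.0`, an edge survives if either endpoint considers the other a neighbor; at 0.0, both must agree.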
use umap_rs::config::OptimizationParams;
let optimization = OptimizationParams {
n_epochs: None, // Auto-determined from dataset size
learning_rate: 1.0, // SGD learning rate
negative_sample_rate: 5, // Negative samples per positive
repulsion_strength: 1.0, // Weight for negative samples
};
let config = UmapConfig {
n_components: 3,
manifold: ManifoldParams {
min_dist: 0.05,
..Default::default()
},
graph: GraphParams {
n_neighbors: 30,
..Default::default()
},
optimization: OptimizationParams {
n_epochs: Some(500),
..Default::default()
},
};
Implement the Metric trait:
use umap_rs::{EuclideanMetric, Metric};
use ndarray::{Array1, ArrayView1};
#[derive(Debug)]
struct MyMetric;
impl Metric for MyMetric {
fn distance(&self, a: ArrayView1<f32>, b: ArrayView1<f32>) -> (f32, Array1<f32>) {
// Return (distance, gradient)
// gradient = ∂distance/∂a
todo!()
}
// Optional: provide fast squared distance for optimization
fn squared_distance(&self, a: ArrayView1<f32>, b: ArrayView1<f32>) -> Option<f32> {
None // Return Some(dist_sq) if available
}
}
// Use custom metric
let umap = Umap::with_metrics(
config,
Box::new(MyMetric), // Input space metric
Box::new(EuclideanMetric), // Output space metric
);
See src/distances.rs for the Euclidean implementation example.
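To make the `(distance, gradient)` contract concrete, here is the underlying math for an L1 (Manhattan) metric on plain slices, with the trait wiring omitted. The gradient of Σᵢ |aᵢ − bᵢ| with respect to `a` is sign(a − b) elementwise (the function name is illustrative, not part of umap-rs):

```rust
// Distance and per-coordinate gradient for the L1 metric:
// d(a, b) = sum_i |a_i - b_i|;  d(d)/d(a_i) = sign(a_i - b_i).
fn l1_distance_and_grad(a: &[f32], b: &[f32]) -> (f32, Vec<f32>) {
    let mut dist = 0.0f32;
    let mut grad = vec![0.0f32; a.len()];
    for i in 0..a.len() {
        let diff = a[i] - b[i];
        dist += diff.abs();
        // Subgradient: sign of the difference (0 at the kink).
        grad[i] = if diff > 0.0 { 1.0 } else if diff < 0.0 { -1.0 } else { 0.0 };
    }
    (dist, grad)
}
```

An implementation of the `Metric` trait would wrap this, converting the `Vec<f32>` gradient into the `Array1<f32>` the trait returns.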
cargo build --release
This implementation is designed for large-scale datasets (100M+ samples) on high-core-count machines.
264 million samples embedded to 2D in 2 hours on a 126-core AMD EPYC 9J45 with 1.4 TB RAM.
Every phase is fully parallelized via Rayon.
Optimized for minimal memory footprint at scale: the library avoids `.clone()` on large arrays and uses parallel copies when needed. Memory scales with `n_samples × n_neighbors`:
| n_samples | n_neighbors | Approx. Memory |
|---|---|---|
| 10M | 30 | ~10 GB |
| 100M | 30 | ~100 GB |
| 250M | 30 | ~250 GB |
| 250M | 256 | ~2 TB |
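The table works out to roughly 32 bytes per stored (sample, neighbor) edge, which gives a simple back-of-envelope estimator. The 32-byte constant is inferred from the table above, not an exact accounting of the crate's allocations:

```rust
// Rough peak-memory estimate, assuming ~32 bytes per (sample, neighbor)
// edge as implied by the table above.
fn approx_memory_bytes(n_samples: u64, n_neighbors: u64) -> u64 {
    n_samples * n_neighbors * 32
}
```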
To reduce memory:
- Reduce `n_neighbors` (15-50 is typical for visualization)
- Set `config.graph.symmetrize = false`

let config = UmapConfig {
graph: GraphParams {
n_neighbors: 30, // Lower = less memory
symmetrize: false, // Skip symmetrization to save memory
..Default::default()
},
..Default::default()
};
The library emits structured logs via the tracing crate. Enable a subscriber to see timing for each phase:
tracing_subscriber::fmt::init(); // or your preferred subscriber
Example output:
INFO umap_rs::umap::fuzzy_simplicial_set: smooth_knn_dist complete duration_ms=52033
INFO umap_rs::umap::fuzzy_simplicial_set: csr row_counts complete duration_ms=48495
INFO umap_rs::umap::fuzzy_simplicial_set: csr indptr complete duration_ms=560 nnz=62586367074
INFO umap_rs::optimizer: optimizer edge filtering complete duration_ms=725 total_edges=23276942679
The fuzzy simplicial set graph is exposed as SparseMat (a CsMatI<f32, u32, usize>):
use umap_rs::SparseMat;
let manifold = umap.learn_manifold(data.view(), knn_indices.view(), knn_dists.view());
let graph: &SparseMat = manifold.graph();
// graph uses u32 column indices (memory efficient) and usize row pointers (handles large nnz)
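The CSR layout behind `SparseMat` is easy to traverse by hand: row `i`'s entries occupy positions `indptr[i]..indptr[i+1]` of the parallel `indices`/`data` arrays. A plain-Rust sketch using the conventional CSR field names (not necessarily this crate's accessors):

```rust
// Plain CSR traversal: for row `row`, column indices and edge weights
// live at indptr[row]..indptr[row + 1] in `indices` and `data`.
fn row_neighbors(indptr: &[usize], indices: &[u32], data: &[f32], row: usize) -> Vec<(u32, f32)> {
    (indptr[row]..indptr[row + 1])
        .map(|k| (indices[k], data[k]))
        .collect()
}
```

The `u32` columns / `usize` row pointers split mirrors the note above: columns are bounded by `n_samples`, while the number of nonzeros can exceed `u32::MAX` on large graphs.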
Run cargo doc --open to browse the API documentation.
The library uses `u32` indices internally for memory efficiency. If your KNN search couldn't find k neighbors for some points (e.g., isolated points), use `u32::MAX` as a sentinel index and any distance value (commonly `f32::INFINITY`). These entries are automatically skipped during graph construction:
// Point 5 only has 2 real neighbors, rest are sentinels
knn_indices[[5, 0]] = 10; // real neighbor
knn_indices[[5, 1]] = 23; // real neighbor
knn_indices[[5, 2]] = u32::MAX; // sentinel - skipped
knn_indices[[5, 3]] = u32::MAX; // sentinel - skipped
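A small helper can pad ragged neighbor lists to the fixed width the library expects (a hypothetical convenience, not part of umap-rs; it returns nested `Vec`s that you would then flatten into the `Array2` inputs):

```rust
// Pad each point's (index, distance) neighbor list to exactly k entries,
// filling missing slots with the u32::MAX / f32::INFINITY sentinels that
// graph construction skips.
fn pad_neighbors(ragged: &[Vec<(u32, f32)>], k: usize) -> (Vec<Vec<u32>>, Vec<Vec<f32>>) {
    let mut idx = Vec::with_capacity(ragged.len());
    let mut dst = Vec::with_capacity(ragged.len());
    for nbrs in ragged {
        let mut row_i = vec![u32::MAX; k];
        let mut row_d = vec![f32::INFINITY; k];
        for (slot, &(j, d)) in nbrs.iter().take(k).enumerate() {
            row_i[slot] = j;
            row_d[slot] = d;
        }
        idx.push(row_i);
        dst.push(row_d);
    }
    (idx, dst)
}
```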
This is a specialized tool for the core algorithm. Wrap it in validation/error handling for production use.
BSD-3-Clause (see LICENSE file)