Crates.io | pacmap |
lib.rs | pacmap |
version | 0.2.5 |
source | src |
created_at | 2024-11-05 06:53:46.055274 |
updated_at | 2024-11-18 02:59:52.971705 |
description | Pairwise Controlled Manifold Approximation (PaCMAP) for dimensionality reduction |
repository | https://github.com/beamform/pacmap-rs |
id | 1436090 |
size | 127,293 |
A Rust implementation of PaCMAP (Pairwise Controlled Manifold Approximation) for dimensionality reduction, based on the original Python implementation.
Dimensionality reduction transforms high-dimensional data into a lower-dimensional representation while preserving important relationships between points. This is useful for visualization, analysis, and as preprocessing for other algorithms.
PaCMAP is a relatively recent dimensionality reduction technique that preserves both local and global structure through three types of point relationships: nearest-neighbor pairs, mid-near pairs, and far pairs.
For details on the algorithm, see the original paper.
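To make the three pair types concrete, here is a minimal, dependency-free sketch of how each kind of partner could be chosen for a single point. It follows the sampling rules described in the original paper (nearest neighbors by distance, mid-near as the second closest of six random points, far as a random non-neighbor), but it is an illustration only and not this crate's internals: the helper names (`nearest_neighbors`, `mid_near_partner`, `far_partner`), the small `Lcg` generator, and the plain Euclidean distance are all assumptions made for the example.

```rust
// Illustrative sketch only; names and structure are hypothetical, not the pacmap crate's API.

/// Squared Euclidean distance between two points.
fn dist2(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Sort candidate indices by their distance to point `i`.
fn sort_by_distance(data: &[Vec<f64>], i: usize, idx: &mut [usize]) {
    idx.sort_by(|&a, &b| dist2(&data[i], &data[a]).total_cmp(&dist2(&data[i], &data[b])));
}

/// Tiny linear congruential generator so the sketch needs no external crates.
struct Lcg(u64);

impl Lcg {
    fn next_index(&mut self, n: usize) -> usize {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 33) as usize) % n
    }
}

/// Nearest-neighbor pairs: the `k` closest points to `i` (brute force).
fn nearest_neighbors(data: &[Vec<f64>], i: usize, k: usize) -> Vec<usize> {
    let mut others: Vec<usize> = (0..data.len()).filter(|&j| j != i).collect();
    sort_by_distance(data, i, &mut others);
    others.truncate(k);
    others
}

/// Mid-near pair: draw six distinct random points and keep the second closest.
/// Requires at least seven points in `data`.
fn mid_near_partner(data: &[Vec<f64>], i: usize, rng: &mut Lcg) -> usize {
    let mut sample: Vec<usize> = Vec::new();
    while sample.len() < 6 {
        let j = rng.next_index(data.len());
        if j != i && !sample.contains(&j) {
            sample.push(j);
        }
    }
    sort_by_distance(data, i, &mut sample);
    sample[1]
}

/// Far pair: a random point that is neither `i` nor one of its nearest neighbors.
fn far_partner(data: &[Vec<f64>], i: usize, neighbors: &[usize], rng: &mut Lcg) -> usize {
    loop {
        let j = rng.next_index(data.len());
        if j != i && !neighbors.contains(&j) {
            return j;
        }
    }
}

fn main() {
    // Twelve points spaced evenly along a line.
    let data: Vec<Vec<f64>> = (0..12).map(|i| vec![i as f64, 0.0]).collect();
    let mut rng = Lcg(42);

    let near = nearest_neighbors(&data, 0, 2);
    let mid = mid_near_partner(&data, 0, &mut rng);
    let far = far_partner(&data, 0, &near, &mut rng);
    println!("near: {near:?}, mid-near: {mid}, far: {far}");
}
```

Nearest-neighbor pairs pull close points together, mid-near pairs carry structure at intermediate range, and far pairs push unrelated points apart.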
Basic usage with default parameters:
use anyhow::Result;
use ndarray::Array2;
use ndarray_rand::RandomExt;
use ndarray_rand::rand_distr::Uniform;
use pacmap::{Configuration, fit_transform};

fn main() -> Result<()> {
    // Your high-dimensional data as an n × d array
    let n_samples = 1000;
    let n_features = 1000;
    let data = Array2::random((n_samples, n_features), Uniform::new(-1.0, 1.0));

    let config = Configuration::default();
    let (embedding, _) = fit_transform(data.view(), config)?;
    // embedding is now an n × 2 array
    Ok(())
}
Customized embedding:
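use anyhow::Result;
use ndarray::Array2;
use ndarray_rand::RandomExt;
use ndarray_rand::rand_distr::Uniform;
use pacmap::{Configuration, Initialization, fit_transform};

fn main() -> Result<()> {
    // Placeholder input; substitute your own n × d array
    let data = Array2::random((1000, 100), Uniform::new(-1.0, 1.0));

    let config = Configuration::builder()
        .embedding_dimensions(3)
        .initialization(Initialization::Random(Some(42)))
        .learning_rate(0.8)
        .num_iters((50, 50, 100))
        .mid_near_ratio(0.3)
        .far_pair_ratio(2.0)
        .approx_threshold(8_000) // Use approximate neighbors above this size
        .build();

    let (embedding, _) = fit_transform(data.view(), config)?;
    // embedding is now an n × 3 array
    Ok(())
}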
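To see what the two ratio settings in the customized configuration imply, the sketch below converts them into per-point pair counts. It assumes the heuristic used by the original Python implementation (10 neighbors for up to 10,000 samples, growing with the logarithm of the sample count beyond that, and ratios multiplied by the neighbor count and rounded). Whether this crate scales neighbors in exactly the same way is not stated here, so `default_neighbors` and `pairs_per_point` are hypothetical helpers, not part of the crate's API.

```rust
// Illustrative sketch only; helper names are hypothetical.

/// Neighbor count heuristic taken from the original Python implementation
/// (an assumption; the crate's auto-scaling may differ).
fn default_neighbors(n_samples: usize) -> usize {
    if n_samples <= 10_000 {
        10
    } else {
        (10.0 + 15.0 * ((n_samples as f64).log10() - 4.0)).round() as usize
    }
}

/// Mid-near and far pairs per point implied by the two ratio settings.
fn pairs_per_point(
    n_neighbors: usize,
    mid_near_ratio: f64,
    far_pair_ratio: f64,
) -> (usize, usize) {
    let mid_near = (n_neighbors as f64 * mid_near_ratio).round() as usize;
    let far = (n_neighbors as f64 * far_pair_ratio).round() as usize;
    (mid_near, far)
}

fn main() {
    let n_neighbors = default_neighbors(1_000);
    let (mid_near, far) = pairs_per_point(n_neighbors, 0.3, 2.0);
    println!("{n_neighbors} nearest, {mid_near} mid-near, {far} far pairs per point");
}
```

Under these assumptions, 10 neighbors with `mid_near_ratio(0.3)` gives 3 mid-near pairs per point, and `far_pair_ratio(2.0)` gives 20 far pairs per point.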
Capturing intermediate states:
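use anyhow::Result;
use ndarray::Array2;
use ndarray_rand::RandomExt;
use ndarray_rand::rand_distr::Uniform;
use pacmap::{Configuration, fit_transform};

fn main() -> Result<()> {
    // Placeholder input; substitute your own n × d array
    let data = Array2::random((1000, 100), Uniform::new(-1.0, 1.0));
    let config = Configuration::builder()
        .snapshots(vec![100, 200, 300])
        .build();
    // The second element is an Option, so match on it rather than destructuring in `let`
    let (embedding, states) = fit_transform(data.view(), config)?;
    if let Some(states) = states {
        // states is an s × n × d array where s is the number of snapshots
    }
    Ok(())
}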
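Snapshot iterations count across all three optimization phases, so it helps to know which phase a given snapshot falls in. The sketch below maps an iteration to the pair weights used in that phase, following the schedule from the original paper and Python implementation: the mid-near weight decays linearly from 1000 to 3 in the first phase, stays at 3 in the second, and drops to 0 in the third. The function `phase_weights` is a hypothetical helper and the 0-based iteration index is an assumption; the crate's own schedule and indexing may differ.

```rust
// Illustrative sketch only; `phase_weights` is a hypothetical helper.

/// Returns (mid-near weight, nearest-neighbor weight, far weight) for an iteration,
/// given the iteration counts of the three phases.
fn phase_weights(iter: usize, num_iters: (usize, usize, usize)) -> (f64, f64, f64) {
    let (phase1, phase2, _phase3) = num_iters;
    if iter < phase1 {
        // Phase 1: mid-near weight decays linearly from 1000 towards 3.
        let t = iter as f64 / phase1 as f64;
        ((1.0 - t) * 1000.0 + t * 3.0, 2.0, 1.0)
    } else if iter < phase1 + phase2 {
        // Phase 2: mid-near weight held at 3, neighbor weight raised to 3.
        (3.0, 3.0, 1.0)
    } else {
        // Phase 3: mid-near pairs switched off to refine local structure.
        (0.0, 1.0, 1.0)
    }
}

fn main() {
    let num_iters = (100, 100, 250);
    for iter in [0, 50, 100, 200, 300] {
        let (w_mn, w_nb, w_fp) = phase_weights(iter, num_iters);
        println!("iter {iter}: mid-near {w_mn}, neighbors {w_nb}, far {w_fp}");
    }
}
```

With the default `(100, 100, 250)` schedule and this indexing, snapshots at 100 and 200 land on the first iteration of the second and third phases, and a snapshot at 300 falls inside the third phase.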
For a standalone example, see the pacmap-rs-example repository.
embedding_dimensions: Output dimensionality (default: 2)
initialization: How to initialize coordinates:
  Pca - Project data using PCA (default)
  Value(array) - Use provided coordinates
  Random(seed) - Random initialization with optional seed
learning_rate: Learning rate for Adam optimizer (default: 1.0)
num_iters: Iteration counts for the three optimization phases (default: (100, 100, 250))
snapshots: Optional vector of iterations at which to save embedding states
approx_threshold: Number of samples above which to use approximate nearest neighbors (default: 8,000)
mid_near_ratio: Ratio of mid-near to nearest neighbor pairs (default: 0.5)
far_pair_ratio: Ratio of far to nearest neighbor pairs (default: 2.0)
override_neighbors: Optional fixed neighbor count override (default: None, auto-scaled with dataset size)
seed: Optional random seed for reproducible sampling and initialization

PairConfiguration::Generate - Generate all pairs from scratch (default)
PairConfiguration::NeighborsProvided { pair_neighbors } - Use provided nearest neighbors, generate remaining pairs
PairConfiguration::AllProvided { pair_neighbors, pair_mn, pair_fp } - Use all provided pairs

Only one BLAS/LAPACK backend feature should be enabled at a time. These are required for PCA operations except on macOS, which uses Accelerate by default.
intel-mkl-static - Static linking with Intel MKL
intel-mkl-system - Dynamic linking with system Intel MKL
openblas-static - Static linking with OpenBLAS
openblas-system - Dynamic linking with system OpenBLAS
netlib-static - Static linking with Netlib
netlib-system - Dynamic linking with system Netlib

For example:
[dependencies]
pacmap = { version = "0.2", features = ["openblas-static"] }
See ndarray-linalg's documentation for detailed information about BLAS/LAPACK configuration and performance considerations.
simsimd - Enable SIMD optimizations in USearch for faster approximate nearest neighbor search. Requires GCC 13+ for compilation and a recent glibc at runtime.

This implementation currently:
Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMap, and PaCMAP for Data Visualization. Wang, Y., Huang, H., Rudin, C., & Shaposhnik, Y. (2021). Journal of Machine Learning Research, 22(201), 1-73.
Apache License, Version 2.0