[![Rust](https://github.com/cjneely10/affinityprop/actions/workflows/rust.yml/badge.svg?branch=main)](https://github.com/cjneely10/affinityprop/actions/workflows/rust.yml) [![affinityprop crate](https://img.shields.io/crates/v/affinityprop?color=blue&style=flat)](https://crates.io/crates/affinityprop) [![affinityprop docs](https://img.shields.io/docsrs/affinityprop)](https://docs.rs/affinityprop/0.2.0/affinityprop/) [![GitHub](https://img.shields.io/github/license/cjneely10/affinityprop)](https://www.gnu.org/licenses/gpl-3.0.html) [![affinityprop: rustc 1.58](https://img.shields.io/badge/affinityprop-rustc__1.58-blue)](https://doc.rust-lang.org/rustc/what-is-rustc.html) ![coverage](https://img.shields.io/badge/coverage-95%25-success)

# affinityprop

The `affinityprop` crate provides an optimized implementation of the Affinity Propagation clustering algorithm, which identifies clusters in data without *a priori* knowledge of the number of clusters. The original algorithm was developed by Brendan Frey and Delbert Dueck.

# About

Affinity Propagation identifies a subset of representative examples from a dataset, known as **exemplars**.

Briefly, the algorithm accepts as input a matrix describing **pairwise similarity** for all data values. This information is used to calculate pairwise **responsibility** and **availability**. Responsibility *`r(i,j)`* describes how well-suited point *`j`* is to act as an exemplar for point *`i`* when compared to other potential exemplars. Availability *`a(i,j)`* describes how appropriate it is for point *`i`* to accept point *`j`* as its exemplar when compared to other exemplars.

Users provide a number of **convergence iterations** over which to repeat the calculations, after which the potential exemplars are extracted from the dataset. The algorithm then continues to iterate until the exemplar values stop changing, or until the **maximum iterations** are met.

# Why this crate?

The nature of Affinity Propagation demands an *O(n²)* runtime. The existing scikit-learn implementation is built on the Python library numpy, which incorporates vectorized row operations. Coupled with **SIMD** instructions, this results in decreased time to finish. However, for applications with large inputs, the *O(n²)* runtime is still prohibitive.

This crate implements Affinity Propagation using the rayon crate, which allows for a drastic decrease in overall runtime - as much as 30-60% when compiled in release mode!

# Dependencies

cargo with `rustc >=1.58`

# Installation

## In Rust code

```toml
[dependencies]
affinityprop = "0.2.0"
ndarray = "0.15.4"
```

## As a command-line tool

```shell
cargo install affinityprop
```

# Usage

## From Rust code

The `affinityprop` crate expects a type that defines how to calculate pairwise `Similarity` for all data points. This crate provides the `NegEuclidean`, `NegCosine`, and `LogEuclidean` structs, which are defined as `-1 * sum((a - b)**2)`, `-1 * (a . b)/(|a|*|b|)`, and `sum(log((a - b)**2))`, respectively.

Users who wish to calculate similarity differently are advised that **Affinity Propagation expects *s(i,j)* > *s(i,k)* iff *i* is more similar to *j* than it is to *k***.
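For a custom metric, implement the crate's `Similarity` trait. The sketch below assumes the trait exposes a single `similarity` method mapping an input matrix to a full similarity matrix, as the built-in structs do; check the docs.rs documentation for the authoritative signature. It computes a hypothetical negative Manhattan (L1) distance, which respects the ordering requirement above:

```rust
use affinityprop::Similarity;
use ndarray::{Array2, ArrayView2};

/// Illustrative custom metric: negative Manhattan (L1) distance.
/// NOTE: the `Similarity` trait signature used here is assumed from
/// context; consult the crate documentation before relying on it.
struct NegManhattan;

impl Similarity<f32> for NegManhattan {
    fn similarity(&self, x: ArrayView2<f32>) -> Array2<f32> {
        let n = x.nrows();
        let mut s = Array2::<f32>::zeros((n, n));
        for i in 0..n {
            for j in 0..n {
                let (ri, rj) = (x.row(i), x.row(j));
                // More-similar pairs must receive larger (less negative) values
                s[[i, j]] = -ri
                    .iter()
                    .zip(rj.iter())
                    .map(|(a, b)| (a - b).abs())
                    .sum::<f32>();
            }
        }
        s
    }
}
```

The built-in metrics and preference options are used as follows: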
```rust
use ndarray::{arr1, arr2, Array2};
use affinityprop::{AffinityPropagation, NegCosine, Preference};

let x: Array2<f32> = arr2(&[[0., 1., 0.], [2., 3., 2.], [3., 2., 3.]]);

// Cluster using negative cosine similarity with a pre-defined preference
let ap = AffinityPropagation::default();
let (converged, results) = ap.predict(&x, NegCosine::default(), Preference::Value(-10.));
assert!(converged && results.len() == 1 && results.contains_key(&0));

// Cluster with a list of preference values
let pref = arr1(&[0., -1., 0.]);
let (converged, results) = ap.predict(&x, NegCosine::default(), Preference::List(&pref));
assert!(converged);
assert!(results.len() == 2 && results.contains_key(&0) && results.contains_key(&2));

// Use damping=0.5, threads=2, convergence_iter=10, max_iterations=100,
// median similarity as preference
let ap = AffinityPropagation::new(0.5, 2, 10, 100);
let (converged, results) = ap.predict(&x, NegCosine::default(), Preference::Median);
assert!(converged);
assert!(results.len() == 2 && results.contains_key(&0) && results.contains_key(&2));

// Predict with a pre-calculated similarity matrix
let s: Array2<f32> = arr2(&[[0., -3., -12.], [-3., 0., -3.], [-12., -3., 0.]]);
let ap = AffinityPropagation::default();
let (converged, results) = ap.predict_precalculated(s, Preference::Value(-10.));
assert!(converged && results.len() == 1 && results.contains_key(&1));
```

## From the Command Line

`affinityprop` can be run from the command line to analyze a file of data:

```text
ID1 val1 val2
ID2 val3 val4
ID3 val5 val6
```

where ID*n* is any string identifier and val*n* are floating-point (decimal) values. The file delimiter is provided from the command line with the `-l` flag. Similarity will be calculated based on the option set by the `-s` flag.

For files without row IDs:

```text
val1 val2
val3 val4
val5 val6
```

provide the `-n` flag from the command line. IDs will automatically be assigned by zero-based index.

Users may instead provide a pre-calculated similarity matrix by passing the `-s 3` flag from the command line and by structuring their input file as:

```text
ID1 sim11 sim12 sim13
ID2 sim21 sim22 sim23
ID3 sim31 sim32 sim33
```

where row *i*, column *j* holds the pairwise similarity between inputs *i* and *j*. Or, for files without row labels, users may pass `-n -s 3`:

```text
sim11 sim12 sim13
sim21 sim22 sim23
sim31 sim32 sim33
```

IDs will automatically be assigned by zero-based index.

### Help Menu

```text
affinityprop 0.2.0
Chris N.
Vectorized and Parallelized Affinity Propagation

USAGE:
    affinityprop [OPTIONS] --input <INPUT>

FLAGS:
    -n, --no_labels    Input file does not contain IDs as the first column
    -h, --help         Prints help information
    -V, --version      Prints version information

OPTIONS:
    -c, --convergence_iter <CONVERGENCE_ITER>    Convergence iterations, default=10
    -d, --damping <DAMPING>                      Damping value in range (0, 1), default=0.9
    -l, --delimiter <DELIMITER>                  File delimiter, default '\t'
    -i, --input <INPUT>                          Path to input file
    -m, --max_iter <MAX_ITER>                    Maximum iterations, default=100
    -r, --precision <PRECISION>                  Set f32 or f64 precision, default=f32
    -p, --preference <PREFERENCE>                Preference to be own exemplar,
                                                 default=median pairwise similarity
    -s, --similarity <SIMILARITY>                Set similarity calculation method
                                                 (0=NegEuclidean,1=NegCosine,2=LogEuclidean,3=precalculated),
                                                 default=0
    -t, --threads <THREADS>                      Number of worker threads, default=4
```

### Results

Results are printed to stdout in the format:

```text
Converged=true/false nClusters=NC nSamples=NS
>Cluster=n size=N exemplar=i
[comma-separated cluster member IDs/indices]
>Cluster=n size=N exemplar=i
[comma-separated cluster member IDs/indices]
...
```
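For example, a hypothetical invocation (the file name `data.tsv` is a placeholder) that clusters a tab-delimited, labeled input using negative cosine similarity, a damping of 0.95, and 8 worker threads, redirecting results to a file, might look like:

```shell
affinityprop --input data.tsv --similarity 1 --damping 0.95 --threads 8 > clusters.txt
```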
# Runtime and Resource Notes

Affinity Propagation is *O(n²)* in both runtime and memory. This crate seeks to address the former, not the latter.

Estimated memory usage for `N` inputs can be calculated as:

```text
memory (GB) = p * 4 * N^2 / 2^30
```

where `p = 4` for 32-bit floating-point precision and `p = 8` for 64-bit.

## Comparison to sklearn implementation

This implementation was tested on 50-D isotropic Gaussian blobs generated by the sklearn [make_blobs](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html) function (n=300, 10000, 25000). A [dataset of biological relevance](https://github.com/edgraham/BinSanity/blob/master/example/Infant_gut_assembly.cov.x100.lognorm) was also selected (n=4189).

ARI/F1 scores >= 0.99 were obtained for the Gaussian data when compared to the labels provided by `make_blobs`. ARI/F1 = 0.98 was obtained for the biological dataset when compared to `sklearn`-derived labels.

This `affinityprop` implementation was compared against the [Affinity Propagation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html#sklearn.cluster.AffinityPropagation.fit_predict) implementation contained within scikit-learn-1.0.2, run using numpy-1.22.2 with Python 3.10.0. This analysis was completed using a Ryzen 9 3950X processor. In all analyses, damping=0.95, convergence_iter=400, and max_iter=4000. Preference was -1000.0 for the Gaussian data and -10.0 for the biological data.

![Time to complete analysis, scaled by the sklearn implementation, plotted against the number of cores.](assets/combined-data.png)
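As a quick sanity check of the memory formula above, a minimal sketch (this helper is illustrative and not part of the `affinityprop` API):

```rust
/// Illustrative helper, not part of the crate: estimates memory (GB)
/// needed to cluster `n` inputs, following the formula above.
/// `bytes_per_value` is 4 for 32-bit precision and 8 for 64-bit.
fn estimated_memory_gb(n: u64, bytes_per_value: u64) -> f64 {
    (bytes_per_value * 4 * n * n) as f64 / (1u64 << 30) as f64
}

fn main() {
    // 25,000 inputs at f32 precision -> roughly 9.3 GB
    println!("{:.2} GB", estimated_memory_gb(25_000, 4));
}
```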