clustr

Crates.io: clustr
lib.rs: clustr
version: 0.1.2
source: src
created_at: 2022-08-19 12:42:12.863225
updated_at: 2022-08-20 12:10:08.231777
description: Multithreaded string clustering
homepage: https://github.com/TristanBester/clustr
repository: https://github.com/TristanBester/clustr
max_upload_size:
id: 648676
size: 24,644
Tristan Bester (TristanBester)


CluStr

License: MIT


Documentation: https://docs.rs/clustr/0.1.2/clustr/

Crate: https://crates.io/crates/clustr

Source Code: https://github.com/TristanBester/clustr


Description

This crate provides a scalable string clustering implementation.

Strings are aggregated into clusters based on pairwise Levenshtein distance. If the distance between two strings does not exceed a set fraction of the shorter string’s length, the strings are added to the same cluster.
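
The clustering criterion above can be sketched as follows. This is a minimal illustration, not the crate's actual implementation: `levenshtein` and `same_cluster` are hypothetical helper names, lengths are assumed to be counted in characters, and the comparison is assumed to be inclusive, which is what the examples under Getting Started suggest (a threshold of 0.25 groups "aaaa" with "aaax", and a threshold of 0.0 groups identical strings).

```rust
/// Levenshtein distance between two strings, counted in characters.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] is the distance between the processed prefix of `a`
    // and the first j characters of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Hypothetical helper: true when the distance does not exceed
/// `fraction` times the length of the shorter string.
/// The inclusive comparison is an assumption inferred from the README examples.
fn same_cluster(a: &str, b: &str, fraction: f32) -> bool {
    let shorter = a.chars().count().min(b.chars().count());
    levenshtein(a, b) as f32 <= fraction * shorter as f32
}
```

Under these assumptions, `same_cluster("aaaa", "aaax", 0.25)` holds because the distance is 1 and the limit is 0.25 * 4 = 1, while "aaaa" and "bbbb" are four edits apart and stay in separate clusters.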

Multithreading Model

  • The input strings are evenly partitioned across the set of allocated threads.
  • Once each thread has clustered its associated input strings, result aggregation begins.
  • Clusters are merged in pairs across multiple threads, in a manner similar to traversing a binary tree from the leaves up to the root. The root of the tree is the final clustering.
  • Thus, if there are N threads allocated, there will be ceil(log2(N)) rounds of merge operations.
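
The aggregation stage described above can be sketched as a pairwise tree reduction. This is an illustrative sketch under stated assumptions rather than the crate's code: `Clustering`, `merge_pair`, `tree_merge`, and `merge_rounds` are hypothetical names, and `merge_pair` simply concatenates two partial results, whereas a real merge would also combine clusters that are close enough to each other.

```rust
use std::thread;

/// Hypothetical type for one thread's partial result.
type Clustering = Vec<Vec<String>>;

/// ceil(log2(n)): the number of pairwise merge rounds for n partial results.
fn merge_rounds(n: usize) -> u32 {
    if n <= 1 {
        0
    } else {
        usize::BITS - (n - 1).leading_zeros()
    }
}

/// Placeholder merge (an assumption): concatenates the two clusterings.
fn merge_pair(mut left: Clustering, right: Clustering) -> Clustering {
    left.extend(right);
    left
}

/// Merges partial results in pairs, one tree level per round, and
/// returns the final clustering together with the number of rounds used.
fn tree_merge(mut level: Vec<Clustering>) -> (Clustering, u32) {
    let mut rounds = 0;
    while level.len() > 1 {
        let mut handles = Vec::new();
        let mut carry = None;
        let mut iter = level.into_iter();
        while let Some(left) = iter.next() {
            match iter.next() {
                // Each pair is merged on its own thread.
                Some(right) => handles.push(thread::spawn(move || merge_pair(left, right))),
                // An unpaired result moves up to the next level unchanged.
                None => carry = Some(left),
            }
        }
        level = handles
            .into_iter()
            .map(|h| h.join().expect("merge thread panicked"))
            .collect();
        level.extend(carry);
        rounds += 1;
    }
    (level.pop().unwrap_or_default(), rounds)
}
```

With five partial results the levels shrink from 5 to 3 to 2 to 1, so `tree_merge` uses three rounds, matching `merge_rounds(5)`.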

Installation

[dependencies]
clustr = "0.1.2"

Getting Started

Basic usage:

let inputs = vec!["aaaa", "aaax", "bbbb", "bbbz"];
let expected = vec![vec!["aaaa", "aaax"], vec!["bbbb", "bbbz"]];

let clusters = clustr::cluster_strings(&inputs, 0.25, 1)?;

assert_eq!(clusters, expected);

Multithreading:

let inputs = vec!["aa", "bb", "aa", "bb"];
let expected = vec![vec!["aa", "aa"], vec!["bb", "bb"]];

let results = clustr::cluster_strings(&inputs, 0.0, 4)?;
  
// The order of the returned clusters is nondeterministic,
// so check membership rather than comparing the vectors directly.
for e in expected {
    assert!(results.contains(&e));
}