Crates.io | clusterphobia |
lib.rs | clusterphobia |
version | 0.1.0 |
source | src |
created_at | 2019-11-25 17:10:35.011522 |
updated_at | 2019-11-25 17:10:35.011522 |
description | Algorithms and data structures for unassisted clustering that employ the Hilbert Curve. |
homepage | |
repository | https://github.com/paulchernoch/clusterphobia |
max_upload_size | |
id | 184249 |
size | 2,062,435 |
This crate is based on ideas and algorithms originally developed in C# in this repository:
https://github.com/paulchernoch/HilbertTransformation
For an understanding of the theory, see SLASH.pdf in the clusterphobia
github repo:
https://github.com/paulchernoch/clusterphobia
Clusterphobia is intended to provide clustering tools for scalable, high-dimensional, unassisted, flat, single classification of data.
Two main approaches will be used to cluster data:
Once complete, the crate will have these main areas of functionality:
Points
suitable for the clustering algorithms. (The Point
structure is defined in the Hilbert crate and employs an optimized Euclidean distance formula, critical for nearest neighbor searches.)Point
data is analyzed (using the Hilbert curve) and the linkage distance and density threshold are derived. These values are crucial for the clustering algorithms. (Two points separated by less than the linkage distance are assumed to belong to the same cluster. If a set number of points fall inside a region whose radius does not exceed the density threshold, the Point
density is considered high enough to force a clustering.)Clustering
holds one Cluster
for each category in the partitioning scheme. It supports operations like adding points to clusters, recategorizing them, and merging clusters. The Clustering
is the principal output of the clustering process.Points
into a Clustering
.The following features are ready for use:
IntegerDataRange
and FloatDataRange
.Point
struct with optimized distance formula (from the hilbert crate)Clustering
struct which can be used to build and modify classification schemes.BCubed
struct which can represent a similarity measure and compute the similarity between two clusters (essential for unit tests and tuning).The literature is teaming with algorithms for cluster similarity. None handle all the edge cases perfectly. A measure that handles many commons use cases well and is not too expensive to run is called B-Cubed, named after initials taken from the names of its creators. Since initially proposed, others have published modifications to it.
The B-Cubed measure was proposed in this 1998 paper:
[1] A. Bagga and B. Baldwin. Entity-based cross-document coreferencing using the vector space model. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics - Volume 1, ACL โ98, pages 79โ85, 1998.
The following paper compared many clustering algorithms and found B-Cubed the best according to four formal constraints:
[2] A comparison of Extrinsic Clustering Evaluation Metrics based on Formal Constraints by Enrique Amigo, Julio Gonzalo, Javier Artiles, Felisa Verdejo of the Departamento de Lenguajes y Sistemas Informaticos UNED, Madrid, Spain, May 11, 2009
The above is available here: http://nlp.uned.es/docs/amigo2007a.pdf
A subsequent paper identified a use case where B-Cubed fared poorly: unbalanced datasets where one cluster dominates:
[3] Adapted B-CUBED Metrics to Unbalanced Datasets by Jose G. Moreno and Gaรซl Dias, both of Normandie University in France.
This third paper proposed a refined version of B-Cubed, but the added complexity adds significantly to processing time, so those refinements are not employed here. The definition of the algorithm used here is taken from section 2.1 of this last paper, where it combines the Precision and Recall values into a single number using the F-measure formula (a harmonic average). (The refined version is in section 2.2.)
๐ฝ = F-measure (final similarity measure)
โ = Precision (a measure of homogeneity)
โ = Recall (a measure of completeness)
ฮฑ = Weighting factor (defaults to 0.5)
โ = Number of points
k = Number of categories (varies between the ฯ and ฯ* Clusterings)
i = category index
ฯแตข = cluster solution for the ith category
ฯ*แตข= gold standard for the ith category
gโ = tests whether two items share the same category in the clustering
g*โ= tests whether two items share the same category in the gold standard
๐ ฮฑ ๐ - ฮฑ
โโโโโ โ โโโโโ + โโโโโ
๐ฝ โ โ
bยณ bยณ bยณ
k
โ ๐ โฒ ๐ โฒ โฒ
bยณ โ โโโ โณ โโโโโ โณ โณ g*โ(xโฑผ,xโ)
โ i=1 |ฯแตข| xโฑผโฯแตข xโโฯแตข
k
โ ๐ โฒ ๐ โฒ โฒ
bยณ โ โโโ โณ โโโโโ โณ โณ gโ(xโฑผ,xโ)
โ i=1 |ฯ*แตข| xโฑผโฯ*แตข xโโฯ*แตข
( ๐ โบ โl:xแตขโฯโ โง xโฑผโฯโ
gโ(xแตข,xโฑผ) โ <
( ๐, otherwise
( ๐ โบ โl:xแตขโฯ*โ โง xโฑผโฯ*โ
g*โ(xแตข,xโฑผ) โ <
( ๐, otherwise
If seeing the triply-nested summation symbols makes you balk and say, "that will take forever to run", that was my reaction as well. The inner two loops perform a sort of second moment computation, trying to see what proportion of items that share the same cluster according to one scheme share the matching clusters in the other scheme, and vice versa. Here is an example.
1 2 3 4
A A B B
โโโโโโโโโโโ
1 Aโ โโ โ
2 Aโ โโ โ
3 Bโ โโ โ
4 Bโ โโ โ
โโโโโโโโโโโ
In this example:
Measuring the effect of the second clustering on the first clustering's items consists in counting the checkmarks in the grid. The full B-Cubed measure normalizes the measure for each cluster by its cluster size.
The optimization that I found was that if two items end up in the same cluster, there will be four check marks, but if three match, there will be nine checks, etc. Thus if you had a list of all the categories in the second clustering that points map into and a count of how many items end up in each category, then you merely need to sum the squares of those counts.
I figured out another trick - you could do this in a single pass! Keeping a running sum of the squares of counts in each category, if you see a count was one and increase it to two, then you add the difference between one and four, which is three. If going from two to three, the increase in the sum of squares is 9 - 4 = 5. The difference is always 2 * previous_count + 1
.
Anyway, that is how I reduced a quadratic algorithm to a linear one.
The next functionality that will be implemented is Linkage analysis.