Crates.io | gzip-cmp |
lib.rs | gzip-cmp |
version | 0.1.0 |
source | src |
created_at | 2023-09-08 20:30:12.451376 |
updated_at | 2023-09-08 20:30:12.451376 |
description | This is a library that makes a distance measurement between binary data based on the difference of the compressed data length |
id | 967757 |
size | 21,352 |
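Zip-Dist is a library and program that compares binary data using the compressed length as a distance metric. The basic idea is to compare the lengths of C(ab) and C(ac), where C(x) is the compressed form of x and ab is the concatenation of a and b, to determine whether a is closer to b or to c. The shorter the compressed concatenation, the more structure the two inputs share.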
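// Taken from: “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors
// Source: https://aclanthology.org/2023.findings-acl.426.pdf
fn distance(a: &[u8], b: &[u8]) -> f64 {
    // Concatenate the two inputs so they can be compressed together.
    let mut ab = Vec::with_capacity(a.len() + b.len());
    ab.extend_from_slice(a);
    ab.extend_from_slice(b);
    let la = compressed_bytes(a);
    let lb = compressed_bytes(b);
    let lab = compressed_bytes(&ab);
    // Normalized compression distance: (C(ab) - min(C(a), C(b))) / max(C(a), C(b)).
    // Cast to f64 before subtracting so an unsigned length cannot underflow.
    (lab as f64 - la.min(lb) as f64) / la.max(lb) as f64
}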
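To try the formula without pulling in a compression crate, here is a minimal sketch that swaps the real compressor for a toy one. It is not the gzip-based length this library uses: `toy_compressed_len` is a hypothetical stand-in that counts LZ78-style phrases, and `toy_distance` is the same formula applied to that count. Both names are made up for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a real compressor (NOT gzip): returns the number
/// of phrases an LZ78-style parse produces. Repetitive data yields fewer
/// phrases, which is the only property the distance formula relies on.
fn toy_compressed_len(data: &[u8]) -> usize {
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    let mut phrase: Vec<u8> = Vec::new();
    let mut count = 0;
    for &byte in data {
        phrase.push(byte);
        // A phrase ends as soon as it is one we have not seen before.
        if !seen.contains(&phrase) {
            seen.insert(phrase.clone());
            count += 1;
            phrase.clear();
        }
    }
    // A trailing, already-seen phrase still has to be emitted.
    if !phrase.is_empty() {
        count += 1;
    }
    count
}

/// Same formula as `distance`, but on top of the toy length above.
fn toy_distance(a: &[u8], b: &[u8]) -> f64 {
    let ab = [a, b].concat();
    let la = toy_compressed_len(a) as f64;
    let lb = toy_compressed_len(b) as f64;
    let lab = toy_compressed_len(&ab) as f64;
    // Two empty inputs would divide by zero; treat them as identical.
    if la.max(lb) == 0.0 {
        return 0.0;
    }
    (lab - la.min(lb)) / la.max(lb)
}
```

With this stand-in, `toy_distance(b"abababababababab", b"abababababababab")` comes out smaller than `toy_distance(b"abababababababab", b"0123456789cdefgh")`, because the repeated pattern adds few new phrases while the unrelated bytes add one phrase each. Note that the distance of an input to itself is generally not zero with this kind of metric.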
Currently, the main application reads all files in a directory (text or binary) and groups them into clusters: it builds a minimum spanning tree (MST) over the pairwise distances, then walks that tree and cuts every edge whose weight is higher than a threshold. The files that remain connected form a cluster.
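The clustering step can be sketched roughly as follows. This is an illustration under my own assumptions, not the code the application ships: it takes a dense, symmetric distance matrix, builds the MST with Prim's algorithm, and labels the connected components left after dropping heavy edges. The names `minimum_spanning_tree` and `cluster_by_mst` are hypothetical.

```rust
/// Builds an MST over a dense, symmetric distance matrix with Prim's
/// algorithm. Returns edges as (parent, child, weight).
fn minimum_spanning_tree(dist: &[Vec<f64>]) -> Vec<(usize, usize, f64)> {
    let n = dist.len();
    let mut edges = Vec::new();
    if n == 0 {
        return edges;
    }
    let mut in_tree = vec![false; n];
    let mut best = vec![f64::INFINITY; n]; // cheapest known edge into the tree
    let mut parent = vec![0usize; n];
    in_tree[0] = true;
    for v in 1..n {
        best[v] = dist[0][v];
    }
    for _ in 1..n {
        // Pick the node outside the tree with the cheapest connecting edge.
        let mut next: Option<usize> = None;
        for v in 0..n {
            if !in_tree[v] && next.map_or(true, |u| best[v] < best[u]) {
                next = Some(v);
            }
        }
        let v = match next {
            Some(v) => v,
            None => break,
        };
        in_tree[v] = true;
        edges.push((parent[v], v, best[v]));
        // The new node may offer cheaper edges to the remaining nodes.
        for w in 0..n {
            if !in_tree[w] && dist[v][w] < best[w] {
                best[w] = dist[v][w];
                parent[w] = v;
            }
        }
    }
    edges
}

/// Cuts MST edges heavier than `threshold` and returns one cluster label per
/// item; items sharing a label are still connected in the pruned tree.
fn cluster_by_mst(dist: &[Vec<f64>], threshold: f64) -> Vec<usize> {
    let n = dist.len();
    let mut adj: Vec<Vec<usize>> = vec![Vec::new(); n];
    for (u, v, w) in minimum_spanning_tree(dist) {
        if w <= threshold {
            adj[u].push(v);
            adj[v].push(u);
        }
    }
    let mut label = vec![usize::MAX; n]; // usize::MAX marks "not visited yet"
    let mut next_label = 0;
    for start in 0..n {
        if label[start] != usize::MAX {
            continue;
        }
        // Depth-first walk over the edges that survived the cut.
        label[start] = next_label;
        let mut stack = vec![start];
        while let Some(u) = stack.pop() {
            for &v in &adj[u] {
                if label[v] == usize::MAX {
                    label[v] = next_label;
                    stack.push(v);
                }
            }
        }
        next_label += 1;
    }
    label
}
```

The threshold controls the granularity: a very low value leaves every file in its own cluster, and a very high value keeps the whole tree as a single cluster.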
This is only one approach that I found to work well; there are many other ways to go about this. In the paper that I used as reference and inspiration, a k-nearest-neighbors (kNN) classifier is used to classify data. It's also important to note that this approach is very simple and agnostic to the type of data that's fed to it.