Crates.io | textdistance |
lib.rs | textdistance |
version | 1.1.0 |
source | src |
created_at | 2022-05-10 11:19:53.877434 |
updated_at | 2024-09-11 06:30:29.823272 |
description | Lots of algorithms to compare how similar two sequences are |
homepage | |
repository | https://github.com/life4/textdistance.rs |
max_upload_size | |
id | 583911 |
size | 172,453 |
Rust library with lots of different algorithms to compare how similar two sequences are.
Features:

- Includes algorithms like EntropyNCD and Sift4.
- #![no_std] support (embedded systems).

Algorithms:

Edit-based:
- DamerauLevenshtein, both optimal string alignment and restricted.
- Hamming
- Jaro
- JaroWinkler
- Levenshtein
- Sift4Common
- Sift4Simple
- SmithWaterman
Token-based:
- Bag
- Cosine (aka Orchini, Tucker, Otsuka-Ochiai)
- EntropyNCD (Entropy-based Normalized Compression Distance)
- Jaccard (aka Tanimoto, Critical Success Index)
- Overlap (aka Szymkiewicz-Simpson)
- Roberts
- SorensenDice (aka F1, Czekanowski, Zijdenbos)
- Tversky
Sequence-based:
- LCSSeq (Longest Common SubSequence)
- LCSStr (Longest Common SubString)
- RatcliffObershelp (aka Gestalt pattern matching)

Naive:

- Prefix
- Suffix
- Length
Normalization for other metrics:
- LIG3 normalization for Hamming by Levenshtein
- MLIPNS normalization for Hamming
- YujianBo normalization for Levenshtein
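For illustration, here is a minimal sketch of how a normalization metric relates to the metric it normalizes, using the Algorithm trait API shown later in this README. It assumes YujianBo and Levenshtein expose the same default() constructor and for_str method as the DamerauLevenshtein example below; the printed numbers are illustrative, not guaranteed.

```rust
use textdistance::{Algorithm, Levenshtein, YujianBo};

fn main() {
    // Raw Levenshtein counts edit operations...
    let lev = Levenshtein::default().for_str("kitten", "sitting");
    // ...while YujianBo rescales the Levenshtein distance onto the 0.0..=1.0 range.
    let yb = YujianBo::default().for_str("kitten", "sitting");
    println!("levenshtein distance: {}", lev.val());
    println!("yujian-bo normalized distance: {}", yb.nval());
}
```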
To install the crate:

cargo add textdistance

Or if you're going to use it in a no_std project:

cargo add --no-default-features textdistance
The textdistance::str
module provides shortcut functions for each algorithm for calculating the distance/similarity between two strings:
use textdistance::str::damerau_levenshtein;
assert!(damerau_levenshtein("abc", "acbd") == 2);
The textdistance::nstr
module is the same but all algorithms return a normalized value (between 0.0 and 1.0):
use textdistance::nstr::damerau_levenshtein;
assert!(damerau_levenshtein("abc", "acbd") == 2./4.);
For more advanced usage, each algorithm is provided as a struct implementing the Algorithm
trait:
use textdistance::{Algorithm, DamerauLevenshtein};
let a = DamerauLevenshtein::default();
let r = a.for_str("abc", "acbd");
assert!(r.val() == 2);
assert!(r.nval() == 2./4.);
The Algorithm trait provides for_str, for_vec, and for_iter to calculate the result for two strings, vectors (slices), or iterators respectively. In addition, there are for_words and for_bigrams methods that split the text into words or bigrams respectively before calculating the distance.

Each algorithm returns a textdistance::Result that provides methods to get the absolute (val) or normalized (nval) value of the metric, the distance (dist and ndist), or the similarity (sim and nsim).

The for_str method (and so all functions in the str and nstr modules) uses String.chars to split the string and then runs it through the for_iter method. So, é will be considered two distinct characters ("latin small letter e" and "combining acute accent"). Usually, that's ok, and this is how Python works. You can read more in the official Rust documentation.
If you want é to be considered as a single symbol, use the unicode-segmentation crate:
use textdistance::{Algorithm, DamerauLevenshtein};
use unicode_segmentation::UnicodeSegmentation;
let s1 = "a̐éö̲\r\n";
let s2 = "éa̐ö̲\r\n";
let g1 = s1.graphemes(true);
let g2 = s2.graphemes(true);
let a = DamerauLevenshtein::default();
let r = a.for_iter(g1, g2);
assert!(r.val() == 1);
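To make the for_words, for_vec, and Result accessors described above concrete, here is a hedged sketch. It assumes for_words takes two &str and for_vec takes references to two sequences, as the descriptions suggest; the printed numbers are illustrative.

```rust
use textdistance::{Algorithm, Jaccard, Levenshtein};

fn main() {
    // Token-based comparison on words instead of characters.
    let jac = Jaccard::default();
    let r = jac.for_words("the quick brown fox", "the slow brown dog");
    // A Result exposes the same measurement in several forms:
    println!("val = {},  nval = {}", r.val(), r.nval());
    println!("dist = {}, ndist = {}", r.dist(), r.ndist());
    println!("sim = {},  nsim = {}", r.sim(), r.nsim());

    // Algorithms are not limited to strings: sequences of other
    // comparable elements (here, integers) work as well.
    let v1 = vec![1, 2, 3, 4];
    let v2 = vec![1, 3, 2, 4];
    let lev = Levenshtein::default();
    println!("levenshtein over integers = {}", lev.for_vec(&v1, &v2).val());
}
```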
The algorithm to use depends on your use case. First, you need to decide on the algorithm category: edit-based, token-based, sequence-based, or naive (see the lists above).
If you go with edit-based, the next thing is to decide what kind of changes you need to detect:
alg | sub | add | del | trans |
---|---|---|---|---|
Hamming | ✅ | ❌ | ❌ | ❌ |
Jaro | ❌ | ❌ | ❌ | ✅ |
JaroWinkler | ❌ | ❌ | ❌ | ✅ |
Sift4 | ❌ | ❌ | ❌ | ✅ |
Levenshtein | ✅ | ✅ | ✅ | ❌ |
DamerauLevenshtein | ✅ | ✅ | ✅ | ✅ |
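As a concrete illustration of the trans column, a small sketch (assuming the str module also exposes a levenshtein shortcut alongside the damerau_levenshtein one shown earlier):

```rust
use textdistance::str::{damerau_levenshtein, levenshtein};

fn main() {
    // A single adjacent transposition costs two edits for Levenshtein
    // (two substitutions) but only one for Damerau-Levenshtein.
    assert_eq!(levenshtein("ab", "ba"), 2);
    assert_eq!(damerau_levenshtein("ab", "ba"), 1);
}
```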
- Hamming is the fastest one but detects only substitutions.
- Sift4 is very fast but not as well-known and battle-tested as the other algorithms.
- Jaro is slower than Sift4 but well-known and battle-tested.
- JaroWinkler is like Jaro but gives more weight to strings with a matching prefix.
- Levenshtein detects everything but transpositions and is faster than DamerauLevenshtein (but slower than the other algorithms).
- DamerauLevenshtein ticks all the boxes but is very slow.

Some real-world use cases:

- DamerauLevenshtein with some optimizations is used in cargo to correct typos in command names.
- Jaro is included in the Elixir standard library (String.jaro_distance). It is used by the compiler and by mix (cargo for Elixir) to provide the "did you mean?" functionality for typos in module or command names.
- A RatcliffObershelp variation is included in the Python standard library (difflib.SequenceMatcher).
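A minimal sketch of the "did you mean?" pattern mentioned above, built on JaroWinkler from this crate. The helper name and the 0.8 cutoff are illustrative choices, not what cargo or Elixir actually use:

```rust
use textdistance::{Algorithm, JaroWinkler};

/// Return the known command most similar to `input`, if any candidate
/// clears an (arbitrary) similarity threshold.
fn did_you_mean<'a>(input: &str, known: &[&'a str]) -> Option<&'a str> {
    let alg = JaroWinkler::default();
    known
        .iter()
        .map(|&cand| (cand, alg.for_str(input, cand).nsim()))
        .filter(|&(_, score)| score > 0.8)
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(cand, _)| cand)
}

fn main() {
    let commands = ["build", "test", "bench", "run"];
    assert_eq!(did_you_mean("biuld", &commands), Some("build"));
}
```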
Benchmark results (methodology described below).

Edit-based (and their normalizations):

algorithm | time |
---|---|
hamming | 19.203 µs |
mlipns | 20.625 µs |
sift4_simple | 143.69 µs |
sift4_common | 238.86 µs |
jaro | 1.7148 ms |
jaro_winkler | 1.7174 ms |
levenshtein | 4.5999 ms |
yujian_bo | 4.6044 ms |
lig3 | 6.5563 ms |
smith_waterman | 9.5146 ms |
damerau_levenshtein_restricted | 10.301 ms |
damerau_levenshtein | 41.938 ms |
Token-based:

algorithm | time |
---|---|
cosine | 508.59 µs |
sorensen_dice | 510.75 µs |
tversky | 512.41 µs |
overlap | 513.76 µs |
bag | 523.06 µs |
jaccard | 580.79 µs |
roberts | 714.79 µs |
entropy_ncd | 731.68 µs |
Sequence-based:

algorithm | time |
---|---|
lcsstr | 3.2658 ms |
lcsseq | 7.4349 ms |
ratcliff_obershelp | 36.308 ms |
Naive:

algorithm | time |
---|---|
length | 2.5300 µs |
prefix | 22.473 µs |
suffix | 38.821 µs |
The benchmarks are powered by criterion and live in the benches directory. They are quite simple: grab 10 open-source licenses, take a 200-character prefix from each, and cross-compare these prefixes. The numbers might be very different for a different kind or length of input, when comparing words rather than characters, or when running the benchmarks on a different machine. The goal of these benchmarks is to provide basic guidance rather than a definitive answer. If performance is critical for your application, I encourage you to run your own benchmarks on the real data you have.
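If you do benchmark on your own data, a minimal criterion harness might look like the following. The file name, input strings, and choice of algorithm are illustrative; the crate's own benches directory is the authoritative setup.

```rust
// benches/my_data.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use textdistance::{Algorithm, Levenshtein};

fn bench_levenshtein(c: &mut Criterion) {
    // Replace these with samples of the real data you care about.
    let left = "the quick brown fox jumps over the lazy dog";
    let right = "the quick brown dog jumps over the lazy fox";
    let alg = Levenshtein::default();
    c.bench_function("levenshtein_on_my_data", |b| {
        b.iter(|| alg.for_str(black_box(left), black_box(right)).val())
    });
}

criterion_group!(benches, bench_levenshtein);
criterion_main!(benches);
```

You would also need criterion as a dev-dependency and a [[bench]] entry with harness = false in Cargo.toml.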
We stick to SemVer.

All algorithms in the crate implement the same Algorithm trait. Hence metrics that have additional limitations on the input sequence elements beyond Eq (like Editex and MRA, which work only with ASCII letters) currently cannot be implemented.

I used several existing libraries as a reference implementation and as the source of test cases.
Special thanks to Trevor Gross for transferring to me the ownership of the textdistance name on crates.io.
To run everything locally, all you need is Rust, Python, and task. Execute task all
to run all code formatters, linters, and tests.
Thank you ❤️