str-distance
=====================

[![Build Status](https://travis-ci.com/mattsse/str-distance.svg?branch=master)](https://travis-ci.com/mattsse/str-distance)
[![Crates.io](https://img.shields.io/crates/v/str-distance.svg)](https://crates.io/crates/str-distance)
[![Documentation](https://docs.rs/str-distance/badge.svg)](https://docs.rs/str-distance)

A crate to evaluate distances between strings (and other sequences).

Heavily inspired by the Julia package [StringDistances](https://github.com/matthieugomez/StringDistances.jl).

## Distance Metrics

- [Jaro Distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance)
- [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance)
- [Damerau-Levenshtein Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)
- [RatcliffObershelp Distance](https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html)
- Q-gram distances compare the set of all slices of length `q` in each str, where `q > 0`
  - QGram Distance `Qgram::new(usize)`
  - [Cosine Distance](https://en.wikipedia.org/wiki/Cosine_similarity) `Cosine::new(usize)`
  - [Jaccard Distance](https://en.wikipedia.org/wiki/Jaccard_index) `Jaccard::new(usize)`
  - [Sorensen-Dice Distance](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) `SorensenDice::new(usize)`
  - [Overlap Distance](https://en.wikipedia.org/wiki/Overlap_coefficient) `Overlap::new(usize)`
- The crate includes distance "modifiers" that can be applied to any distance.
  - [Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) diminishes the distance of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score, but this package defines it for any string distance.
  - [TokenSort](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order by reordering words alphabetically.
  - [TokenSet](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order and word count by comparing the intersection of two strings with each string.

## Usage

### The `str_distance::str_distance*` convenience functions

`str_distance` and `str_distance_normalized` take the two input strings and the `DistanceMetric` used to determine the distance between them.

`str_distance_normalized` evaluates the normalized distance between two strings. A value of `0.0` corresponds to the "zero distance": both strings are considered equal by means of the metric. A value of `1.0` corresponds to the maximum distance that can exist between the strings.

Calling the `str_distance::str_distance*` functions is just a convenience for calling `DistanceMetric.str_distance*("", "")` on the metric itself.

#### Example

Levenshtein metrics offer the possibility to define a maximum distance at which the calculation of the exact distance is aborted early.

**Distance**

```rust
use str_distance::*;

// calculate the exact distance
assert_eq!(str_distance("kitten", "sitting", Levenshtein::default()), DistanceValue::Exact(3));

// short circuit if distance exceeds 10
let s1 = "Wisdom is easily acquired when hiding under the bed with a saucepan on your head.";
let s2 = "The quick brown fox jumped over the angry dog.";
assert_eq!(str_distance(s1, s2, Levenshtein::with_max_distance(10)), DistanceValue::Exceeded(10));
```

**Normalized Distance**

```rust
use str_distance::*;

assert_eq!(str_distance_normalized("", "", Levenshtein::default()), 0.0);
assert_eq!(str_distance_normalized("nacht", "nacht", Levenshtein::default()), 0.0);
assert_eq!(str_distance_normalized("abc", "def", Levenshtein::default()), 1.0);
```

### The `DistanceMetric` trait

```rust
use str_distance::{DistanceMetric, SorensenDice};

// QGram metrics require the length of the underlying fragments to use for comparison.
// For `SorensenDice` the default is 2.
assert_eq!(SorensenDice::new(2).str_distance("nacht", "night"), 0.75);
```

`DistanceMetric` was designed for `str` types, but is not limited to them. Calculating the distance is possible for all data types that are comparable and can be passed as `IntoIterator`, e.g. as `Vec`.

```rust
use str_distance::{DistanceMetric, Levenshtein, DistanceValue};

assert_eq!(*Levenshtein::default().distance(&[1, 2, 3], &[1, 2, 3, 4, 5, 6]), 3);
```

## Documentation

Full docs available at [docs.rs](https://docs.rs/str-distance)

## References

- [StringDistances](https://github.com/matthieugomez/StringDistances.jl)
- [The stringdist Package for Approximate String Matching](https://journal.r-project.org/archive/2014-1/loo.pdf) Mark P.J. van der Loo
- [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)

## License

Licensed under either of these:

* Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or https://www.apache.org/licenses/LICENSE-2.0)
* MIT license ([LICENSE-MIT](LICENSE-MIT) or https://opensource.org/licenses/MIT)
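
## Appendix: a q-gram distance by hand

The `0.75` returned for `"nacht"` and `"night"` in the `SorensenDice` example can be reproduced without the crate. The sketch below is not the crate's implementation: `qgrams` and `sorensen_dice_distance` are hypothetical helpers that use only the standard library, compare plain sets of q-grams, and assume that two inputs without any q-grams have distance `0.0`.

```rust
use std::collections::HashSet;

/// Collects the set of all `q`-character slices (q-grams) of `s`.
/// Operates on `char`s rather than bytes so multi-byte text is split safely.
fn qgrams(s: &str, q: usize) -> HashSet<String> {
    assert!(q > 0, "q must be greater than zero");
    let chars: Vec<char> = s.chars().collect();
    // `windows` yields nothing when the input is shorter than `q`.
    chars
        .windows(q)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Hypothetical helper: Sorensen-Dice distance, 1 - 2|A ∩ B| / (|A| + |B|),
/// where A and B are the q-gram sets of the two inputs.
fn sorensen_dice_distance(a: &str, b: &str, q: usize) -> f64 {
    let (qa, qb) = (qgrams(a, q), qgrams(b, q));
    let total = qa.len() + qb.len();
    if total == 0 {
        // Assumption: inputs without any q-grams are treated as identical.
        return 0.0;
    }
    let shared = qa.intersection(&qb).count();
    1.0 - (2.0 * shared as f64) / total as f64
}
```

With `q = 2`, `"nacht"` yields the bigrams `na`, `ac`, `ch`, `ht` and `"night"` yields `ni`, `ig`, `gh`, `ht`. Only `ht` is shared, so the distance is `1 - 2 * 1 / (4 + 4) = 0.75`, which is what `sorensen_dice_distance("nacht", "night", 2)` evaluates to.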