| Crates.io | liblevenshtein |
| lib.rs | liblevenshtein |
| version | 0.8.0 |
| created_at | 2025-11-04 08:30:46.571692+00 |
| updated_at | 2025-12-21 05:26:09.240813+00 |
| description | Levenshtein/Universal Automata for approximate string matching using various dictionary backends |
| homepage | |
| repository | https://github.com/universal-automata/liblevenshtein-rust |
| max_upload_size | |
| id | 1915953 |
| size | 25,320,394 |
A Rust implementation of liblevenshtein for fast approximate string matching using Levenshtein Automata—deterministic finite automata that enable O(|W|) construction complexity for error-tolerant dictionary queries.
Based on "Fast String Correction with Levenshtein-Automata" (Schulz & Mihov, 2002).
This library provides efficient fuzzy string matching against large dictionaries using finite state automata. It supports multiple Levenshtein distance algorithms and pluggable dictionary backends.
Unlike naive approaches that compute Levenshtein distance for every dictionary word (O(|D| × |W| × |V|)), this library uses Levenshtein Automata—a deterministic finite automaton that accepts exactly all words within distance n from a query word.
Result: Query time is effectively independent of dictionary size for well-distributed tries!
Theory: Based on "Fast String Correction with Levenshtein-Automata" (Schulz & Mihov, 2002). See complete algorithm documentation for algorithmic details including Position structures, Subsumption relations, and Characteristic vectors.
Traditional approaches compute edit distance for every dictionary word, requiring O(|D| × |W| × |V|) operations. This library uses Levenshtein Automata—deterministic finite automata that accept exactly all words within distance n from query word W:
LEV_n(W) accepts the language L_Lev(n, W) = { V | d_L(W, V) ≤ n }
For example, LEV_1("test") accepts "test", "tes", "tests", and "tent", but not "toast" (distance 2).
- Position (i#e): represents "matched i characters of W with e errors"
- Subsumption (⊑): eliminates redundant positions to keep state sets minimal
- Characteristic vectors (χ): encode where characters appear in the query word
- Elementary transitions: move between positions based on the dictionary character

Result: Automaton construction is O(|W|), and dictionary traversal is at worst O(|D|); in practice the automaton prunes most of the trie, so query time is effectively independent of dictionary size.
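For intuition, a minimal sketch of a characteristic vector (illustrative only, not the library's internal representation): for a character c and a window of the query word, χ simply marks the positions where c occurs.

```rust
/// Illustrative only; liblevenshtein's internal encoding may differ.
/// χ(c, w) marks the positions in window `w` where character `c` occurs.
fn characteristic_vector(c: u8, window: &[u8]) -> Vec<bool> {
    window.iter().map(|&b| b == c).collect()
}

fn main() {
    // For the query "test" and dictionary character 't':
    let chi = characteristic_vector(b't', b"test");
    assert_eq!(chi, vec![true, false, false, true]);
}
```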
Core Algorithm:
- Standard: insert, delete, and substitute operations (Chapters 4-6 of the foundational paper)

Extensions:
- Transposition: adds adjacent character swaps (Chapter 7), giving Damerau-Levenshtein distance
- MergeAndSplit: adds two-character ↔ one-character operations (Chapter 8)
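To make the variants concrete, a short sketch using the query API shown in the quick start below: under Standard the swapped pair in "tset" costs two edits, while under Transposition it costs one.

```rust
use liblevenshtein::prelude::*;

fn main() {
    let dict = DoubleArrayTrie::from_terms(vec!["test"]);

    // Damerau-Levenshtein: the adjacent swap "ts" -> "st" counts as one edit,
    // so "test" is found at distance 1 (Standard would report distance 2).
    let transducer = Transducer::new(dict, Algorithm::Transposition);
    for candidate in transducer.query_with_distance("tset", 1) {
        println!("{}: {}", candidate.term, candidate.distance);
    }
}
```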
Feature highlights:
- Multiple dictionary backends, including PathMapDictionary (pathmap-backend feature), a high-performance trie with structural sharing that supports dynamic updates, and PathMapDictionaryChar (pathmap-backend feature), a character-level PathMap for Unicode fuzzy matching
- OrderedQueryIterator returns results sorted by distance first, then lexicographically
- Serialization (serialization feature), with optional compression (compression feature) for an 85% file size reduction
- Command-line tool with interactive REPL (cli feature)
- Thread-safe dynamic dictionaries via RwLock-based interior mutability
- SIMD acceleration (simd feature, x86_64 only)
- WebAssembly support (wasm feature); the wasm-phonetic feature combines WASM with phonetic rules
- FFI bindings (ffi feature) for native integration
- Phonetic rules (phonetic-rules feature): the .llev format for phonetic rewrite rules with metadata and includes, the .llre format for regex-style patterns with NFA compilation, quantifiers (*, +, ?, {n,m}), alternation, grouping, and named character classes ([:alpha:], [:vowel:], etc.)

Add to your Cargo.toml:
```toml
[dependencies]
liblevenshtein = "0.8"

# Or with SIMD acceleration (x86_64 only, requires SSE4.1/AVX2):
liblevenshtein = { version = "0.8", features = ["simd"] }
```
Or install the CLI tool:
```bash
cargo install liblevenshtein --features cli,compression,protobuf
```
Download pre-built packages from the GitHub Releases page:
- .deb packages
- .rpm packages
- .pkg.tar.zst packages
- .dmg disk images for easy installation (x86_64 and ARM64)
- .tar.gz and .zip archives for Linux and macOS (x86_64 and ARM64)

Basic usage:

```rust
use liblevenshtein::prelude::*;

// Create a dictionary from terms (using DoubleArrayTrie for best performance)
let terms = vec!["test", "testing", "tested", "tester"];
let dict = DoubleArrayTrie::from_terms(terms);

// Create a transducer with Standard algorithm
let transducer = Transducer::new(dict, Algorithm::Standard);

// Query for terms within edit distance 2
for term in transducer.query("tset", 2) {
    println!("Match: {}", term);
}

// Query with distances
for candidate in transducer.query_with_distance("tset", 2) {
    println!("Match: {} (distance: {})", candidate.term, candidate.distance);
}
```
Output:
```text
Match: test
Match: test (distance: 2)
```
For correct character-level Levenshtein distances with Unicode text, use the character-level dictionary variants:
```rust
use liblevenshtein::dictionary::double_array_trie_char::DoubleArrayTrieChar;
use liblevenshtein::prelude::*;

// Create a character-level dictionary for Unicode content
let terms = vec!["café", "naïve", "中文", "🎉"];
let dict = DoubleArrayTrieChar::from_terms(terms);
let transducer = Transducer::new(dict, Algorithm::Standard);

// Character-level distance: "" → "¡" is distance 1 (one character)
// Byte-level would incorrectly report distance 2 (two UTF-8 bytes)
let results: Vec<_> = transducer.query("", 1).collect();
// Results include single-character Unicode terms

// Works with accented characters
for candidate in transducer.query_with_distance("cafe", 1) {
    println!("{}: {}", candidate.term, candidate.distance);
    // Output: café: 1 (one character substitution: e→é)
}
```
Character-level backends:
- DoubleArrayTrieChar - character-level Double-Array Trie
- PathMapDictionaryChar (requires pathmap-backend feature) - character-level PathMap with dynamic updates

When to use:
- Character-level (*Char) variants for multi-language dictionaries with non-ASCII Unicode
- Byte-level backends (DoubleArrayTrie, PathMapDictionary) for ASCII/Latin-1 content
Performance: ~5% overhead for UTF-8 decoding, 4x memory for edge labels (char vs u8).
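The 4x figure follows directly from the label width, as a quick check confirms:

```rust
fn main() {
    // A char is a 4-byte Unicode scalar value; a UTF-8 byte label is 1 byte.
    assert_eq!(std::mem::size_of::<char>(), 4);
    assert_eq!(std::mem::size_of::<u8>(), 1);
}
```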
The DynamicDawg backend supports thread-safe runtime updates:
```rust
use liblevenshtein::prelude::*;

// Create dictionary with runtime update support
let dict = DynamicDawg::from_terms(vec!["cat", "dog"]);
let transducer = Transducer::new(dict.clone(), Algorithm::Standard);

// Insert new terms at runtime
dict.insert("cot");
dict.insert("bat");

// Queries immediately see the new terms
let results: Vec<_> = transducer.query("cot", 1).collect();
// Results: ["cat", "cot"]

// Remove terms
dict.remove("dog");
```
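Since updates go through internal locking (see the thread-safety notes below), a clone of the dictionary can be handed to another thread; a minimal sketch, assuming DynamicDawg clones share the underlying dictionary and are Send + Sync:

```rust
use liblevenshtein::prelude::*;
use std::thread;

fn main() {
    // Sketch: assumes DynamicDawg clones share state and implement Send + Sync,
    // as implied by the RwLock-based design described below.
    let dict = DynamicDawg::from_terms(vec!["cat", "dog"]);
    let writer = dict.clone();

    thread::spawn(move || {
        writer.insert("cot"); // takes the write lock internally
    })
    .join()
    .unwrap();

    // The original handle (and any Transducer built from it) sees the update.
    let transducer = Transducer::new(dict, Algorithm::Standard);
    for term in transducer.query("cot", 0) {
        println!("visible after concurrent insert: {}", term);
    }
}
```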
Thread Safety: The dictionary uses RwLock for interior mutability:
- Transducer instances automatically see updates

Performance Optimizations: Configure the Bloom filter and auto-minimization for better performance:
```rust
use liblevenshtein::prelude::*;

// Enable Bloom filter for 10x faster contains() operations
let dict = DynamicDawg::with_config(
    f32::INFINITY, // Auto-minimize threshold (disabled)
    Some(10000),   // Bloom filter capacity (enabled)
);

// Or enable auto-minimization for bulk insertions (30% faster for large datasets)
let dict = DynamicDawg::with_config(
    1.5,  // Auto-minimize at 50% growth
    None, // Bloom filter (disabled)
);

// Or enable both optimizations
let dict = DynamicDawg::with_config(
    1.5,         // Auto-minimize threshold
    Some(10000), // Bloom filter capacity
);
```
Optimization Guide:
- Enable the Bloom filter when the workload is dominated by membership checks (contains())
- Enable auto-minimization for bulk insertions into large datasets

Note: PathMapDictionary also supports runtime updates but requires the optional pathmap-backend feature.
See examples/dynamic_dictionary.rs for a complete demonstration.
Store metadata with dictionary terms using value-mapped dictionaries:
```rust
use liblevenshtein::prelude::*;
use std::collections::HashSet;

// DynamicDawg with integer values (e.g., word frequencies)
let dict: DynamicDawg<u32> = DynamicDawg::new();
dict.insert_with_value("apple", 42);
dict.insert_with_value("apply", 17);

// Retrieve values
assert_eq!(dict.get_value("apple"), Some(42));
assert_eq!(dict.get_value("banana"), None);

// Use with FuzzyMap for fuzzy lookups with values
let map = FuzzyMap::new(dict, Algorithm::Standard);
let results = map.get_with_distance("aple", 1); // fuzzy lookup
// Results include both terms and their values

// Or use FuzzyMultiMap to aggregate multiple values
let dict: DynamicDawg<HashSet<u32>> = DynamicDawg::new();
dict.insert_with_value("test", HashSet::from([1, 2, 3]));
```
Supported backends:
- DynamicDawg<V> - dynamic dictionary with values of type V
- DynamicDawgChar<V> - character-level dynamic dictionary with Unicode support
- PathMapDictionary<V> - PathMap with values (requires pathmap-backend)
- PathMapDictionaryChar<V> - character-level PathMap with values

Common value types:
- u32 / u64 - frequencies, scores, IDs
- HashSet<T> - multiple associations per term
- Vec<T> - ordered collections
- Any type implementing Clone + Send + Sync + 'static

Integration with contextual completion:
```rust
use liblevenshtein::prelude::*;

// Use DynamicDawg backend for contextual completion
let dict: DynamicDawg<Vec<u32>> = DynamicDawg::new();
let engine = ContextualCompletionEngine::with_dictionary(
    dict,
    Algorithm::Standard
);

// Insert terms with context IDs
let ctx = engine.create_root_context();
engine.insert_finalized(ctx, "variable", vec![ctx]);
```
Get results sorted by edit distance first, then alphabetically:
```rust
use liblevenshtein::prelude::*;

let dict = DoubleArrayTrie::from_terms(vec!["apple", "apply", "ape", "app"]);
let transducer = Transducer::new(dict, Algorithm::Standard);

// Results ordered by distance, then alphabetically
for candidate in transducer.query_ordered("aple", 1) {
    println!("{}: {}", candidate.term, candidate.distance);
}
```
Output:
```text
ape: 1
apple: 1
```
Filter results and enable prefix matching for code completion:
```rust
use liblevenshtein::prelude::*;

let dict = DoubleArrayTrie::from_terms(vec![
    "getValue", "getVariable", "setValue", "setVariable"
]);
let transducer = Transducer::new(dict, Algorithm::Standard);

// Prefix matching with filtering
for candidate in transducer
    .query_ordered("getVal", 1)
    .prefix()                              // Match terms starting with query ± edits
    .filter(|c| c.term.starts_with("get")) // Only getter methods
{
    println!("{}: {}", candidate.term, candidate.distance);
}
```
Output:
```text
getValue: 0
getVariable: 1
```
See examples/code_completion_demo.rs and examples/contextual_filtering_optimization.rs for more examples.
The library provides multiple dictionary backends optimized for different use cases:
| Backend | Best For | Performance Highlights |
|---|---|---|
| DoubleArrayTrie | General use (recommended) | 3x faster queries, 30x faster contains, 8 bytes/state |
| DoubleArrayTrieChar | Unicode text, character-level distances | Correct Unicode semantics, ~5% overhead |
| PathMap | Dynamic updates, runtime modifications | Thread-safe insert/delete, fastest dynamic backend |
| PathMapChar | Unicode + dynamic updates | Character-level distances with runtime modifications |
| DAWG | Static dictionaries, moderate size | Good balance of speed and memory |
| OptimizedDawg | Static dictionaries, construction speed | 20-25% faster construction than DAWG |
| DynamicDawg | Occasional modifications | Best fuzzy matching for dynamic use |
| DynamicDawgChar | Unicode + occasional modifications | Character-level with insert/remove, ~5% overhead |
| SuffixAutomaton | Substring/infix matching | Find patterns anywhere in text |
Performance Comparison (10,000 words):

```text
Construction:   DAT: 3.2ms    PathMap: 3.5ms    DAWG: 7.2ms
Exact Match:    DAT: 6.6µs    DAWG: 19.8µs      PathMap: 71.1µs
Contains (100): DAT: 0.22µs   DAWG: 6.7µs       PathMap: 132µs
Distance 1:     DAT: 12.9µs   DAWG: 319µs       PathMap: 888µs
Distance 2:     DAT: 16.3µs   DAWG: 2,150µs     PathMap: 5,919µs
Memory/state:   DAT: ~8B      OptDawg: ~13B     DAWG: ~32B
```
Why DoubleArrayTrie is fastest: All backends implement the dictionary automaton A^D, which is traversed in parallel with the Levenshtein automaton LEV_n(W) during queries. The choice of backend only affects the constant factors in the O(|D|) query complexity, and DoubleArrayTrie has the smallest constants thanks to hardware-friendly memory access patterns.
Prefix Matching Support: All backends except SuffixAutomaton support efficient prefix matching through the .prefix() method, making them ideal for code completion and autocomplete applications.
When to use each backend:
- DoubleArrayTrie - best overall performance, the default choice
- DoubleArrayTrieChar - correct character-level distances for Unicode text
- DynamicDawg - thread-safe runtime modifications
- DynamicDawgChar - character-level with insert/remove, best fuzzy matching for dynamic use
- PathMapChar - alternative character-level dynamic backend (requires pathmap-backend)
- SuffixAutomaton - finds patterns anywhere in text
- DoubleArrayTrie for memory-constrained deployments (8 bytes/state, most efficient)
- Prefer the character-level (*Char) variants for correct Unicode semantics

Save and load dictionaries with optional compression:
```rust
use liblevenshtein::prelude::*;
use std::fs::File;

let dict = DoubleArrayTrie::from_terms(vec!["test", "testing", "tested"]);

// Save with compression (85% file size reduction)
let file = File::create("dict.bin.gz")?;
GzipSerializer::<BincodeSerializer>::serialize(&dict, file)?;

// Load compressed dictionary
let file = File::open("dict.bin.gz")?;
let dict: DoubleArrayTrie = GzipSerializer::<BincodeSerializer>::deserialize(file)?;

// Or use optimized Protobuf formats
let file = File::create("dict.pb")?;
OptimizedProtobufSerializer::serialize(&dict, file)?; // 62% smaller than standard
```
Requires serialization and compression features:
```toml
[dependencies]
liblevenshtein = { git = "https://github.com/universal-automata/liblevenshtein-rust", tag = "v0.8.0", features = ["serialization", "compression"] }
```
The library includes a full-featured command-line tool with interactive REPL:
```bash
# Install with CLI support (from GitHub)
cargo install --git https://github.com/universal-automata/liblevenshtein-rust --tag v0.8.0 \
    --features cli,compression,protobuf liblevenshtein

# Or download pre-built binaries from GitHub Releases

# Query a dictionary
liblevenshtein query "test" --dict /usr/share/dict/words -m 2 -s

# Convert between formats with compression
liblevenshtein convert words.txt words.bin.gz \
    --to-format bincode-gz --to-backend path-map

# Launch interactive REPL
liblevenshtein repl --dict words.bin.gz
```
The CLI auto-detects formats from file extensions and supports:
- Text, bincode, and protobuf dictionary formats (including the -gz compressed variants)

See docs/developer-guide/building.md for comprehensive CLI documentation.
The FuzzyMultiMap enables fuzzy lookup of associated values, aggregating results from all matching terms:
```rust
use liblevenshtein::prelude::*;
use liblevenshtein::cache::multimap::FuzzyMultiMap;
use std::collections::HashSet;

// Create a dictionary with values (e.g., user IDs for each name)
let dict = PathMapDictionary::from_terms_with_values([
    ("alice", HashSet::from([101, 102])),
    ("alicia", HashSet::from([103])),
    ("bob", HashSet::from([201])),
    ("robert", HashSet::from([202, 203])),
]);

// Create fuzzy multimap
let fuzzy_map = FuzzyMultiMap::new(dict, Algorithm::Standard);

// Query "alicea" - matches both "alice" and "alicia" at distance 1
let user_ids = fuzzy_map.query("alicea", 1).unwrap();
// Returns: HashSet {101, 102, 103} - union of all matching values

// Practical use case: find all IDs for fuzzy name search
let ids = fuzzy_map.query("rob", 3).unwrap();
// Returns IDs for both "bob" (distance 1) and "robert" (distance 3)
```
Supported collection types:
- HashSet<T> - union of all matching sets
- BTreeSet<T> - union of all matching sets
- Vec<T> - concatenation of all matching vectors
The ContextualCompletionEngine provides hierarchical scope-aware code completion with draft state management:
```rust
use liblevenshtein::contextual::ContextualCompletionEngine;
use liblevenshtein::transducer::Algorithm;

// Create engine (thread-safe, shareable with Arc)
let engine = ContextualCompletionEngine::with_algorithm(Algorithm::Standard);

// Create hierarchical scopes (global → function → block)
let global = engine.create_root_context(0);
let function = engine.create_child_context(1, global).unwrap();
let block = engine.create_child_context(2, function).unwrap();

// Add finalized terms to each scope
engine.finalize_direct(global, "std::vector").unwrap();
engine.finalize_direct(global, "std::string").unwrap();
engine.finalize_direct(function, "parameter").unwrap();
engine.finalize_direct(function, "result").unwrap();

// Incremental typing in block scope (draft state)
engine.insert_str(block, "local_var").unwrap();

// Query completions - sees all visible scopes (block + function + global)
let completions = engine.complete(block, "par", 1);
for comp in completions {
    println!("{} (distance: {}, draft: {}, from context: {:?})",
        comp.term, comp.distance, comp.is_draft, comp.contexts);
}
// Output: parameter (distance: 0, draft: false, from context: [1])

// Checkpoint/undo support for editor integration
engine.checkpoint(block).unwrap();
engine.insert_str(block, "iable").unwrap(); // Now: "local_variable"
engine.undo(block).unwrap();                // Restore to: "local_var"

// Finalize draft to add to dictionary
let term = engine.finalize(block).unwrap(); // Returns "local_var"
assert!(engine.has_term("local_var"));

// Query now sees finalized term
let results = engine.complete(block, "loc", 1);
// Returns: local_var (now finalized, visible in this context)
```
Features:
- Hierarchical, scope-aware completion (child scopes see parent and global terms)
- Draft state with incremental typing, checkpoint, and undo
- Thread-safe and shareable with Arc

Performance: sub-millisecond for interactive use.
See examples/contextual_completion.rs for a complete example.
For efficient autocomplete and code completion scenarios, use the PrefixZipper trait to navigate directly to a prefix and iterate only matching terms. This is 5-10× faster than full dictionary iteration with filtering when the prefix matches a small subset of terms.
Performance: O(k) navigation + O(m) iteration, where k = prefix length, m = matching terms.
```rust
use liblevenshtein::prelude::*;
use liblevenshtein::dictionary::prefix_zipper::PrefixZipper;
use liblevenshtein::dictionary::double_array_trie_zipper::DoubleArrayTrieZipper;

let terms = vec!["process", "processUser", "produce", "product", "apple"];
let dict = DoubleArrayTrie::from_terms(terms.iter());

// Create zipper and navigate to prefix
let zipper = DoubleArrayTrieZipper::new_from_dict(&dict);
if let Some(iter) = zipper.with_prefix(b"proc") {
    for (path, _zipper) in iter {
        let term = String::from_utf8(path).unwrap();
        println!("Found: {}", term);
        // Prints only: "process" and "processUser" (not all 5 terms!)
    }
}
```
Unicode support (character-level):
```rust
use liblevenshtein::dictionary::double_array_trie_char::DoubleArrayTrieChar;
use liblevenshtein::dictionary::double_array_trie_char_zipper::DoubleArrayTrieCharZipper;
use liblevenshtein::dictionary::prefix_zipper::PrefixZipper;

let terms = vec!["café", "cafétéria", "naïve"];
let dict = DoubleArrayTrieChar::from_terms(terms.iter());

let zipper = DoubleArrayTrieCharZipper::new_from_dict(&dict);
let prefix: Vec<char> = "caf".chars().collect();
if let Some(iter) = zipper.with_prefix(&prefix) {
    for (path, _) in iter {
        let term: String = path.iter().collect();
        println!("Found: {}", term);
    }
}
```
Valued dictionaries (for metadata like scope IDs, frequencies, etc.):
```rust
use liblevenshtein::dictionary::prefix_zipper::ValuedPrefixZipper;

let dict = DoubleArrayTrie::from_terms_with_values(
    vec![("cat", 1), ("cats", 2), ("dog", 3)].into_iter()
);

let zipper = DoubleArrayTrieZipper::new_from_dict(&dict);
if let Some(iter) = zipper.with_prefix_values(b"cat") {
    for (path, value) in iter {
        let term = String::from_utf8(path).unwrap();
        println!("{} -> {}", term, value);
        // Prints: "cat -> 1", "cats -> 2"
    }
}
```
Backend support: All dictionary backends (DoubleArrayTrie, DynamicDawg, PathMap, etc.) support PrefixZipper through a blanket trait implementation.
For fine-grained control over traversal, use the zipper API directly:
```rust
use liblevenshtein::dictionary::pathmap::PathMapDictionary;
use liblevenshtein::dictionary::pathmap_zipper::PathMapZipper;
use liblevenshtein::transducer::{AutomatonZipper, Algorithm, StatePool};
use liblevenshtein::transducer::intersection_zipper::IntersectionZipper;

// Create dictionary
let dict = PathMapDictionary::<()>::new();
dict.insert("cat");
dict.insert("cats");
dict.insert("dog");

// Create zippers
let dict_zipper = PathMapZipper::new_from_dict(&dict);
let auto_zipper = AutomatonZipper::new("cot".as_bytes(), 1, Algorithm::Standard);

// Intersect dictionary and automaton
let intersection = IntersectionZipper::new(dict_zipper, auto_zipper);

// Manual traversal
let mut pool = StatePool::new();
for (label, child) in intersection.children(&mut pool) {
    if child.is_match() {
        println!("Match: {} (distance: {})",
            child.term(),
            child.distance().unwrap());
    }
    // Recurse into child for custom traversal algorithms
    for (_, grandchild) in child.children(&mut pool) {
        // Custom logic here...
    }
}
```
Note: Most users should use Transducer::query() or the ContextualCompletionEngine instead. The zipper API is lower-level, requires manual state management, and is intended for fine-grained control over traversal.
For performance optimization when querying dictionaries with scoped values:
```rust
use liblevenshtein::prelude::*;
use std::collections::HashSet;

// Dictionary with scope IDs
let dict = PathMapDictionary::from_terms_with_values([
    ("global_var", 0),     // Scope 0 (global)
    ("function_param", 1), // Scope 1 (function)
    ("local_var", 2),      // Scope 2 (block)
]);
let transducer = Transducer::new(dict, Algorithm::Standard);

// Query only specific scopes (e.g., visible from scope 1)
let visible_scopes = HashSet::from([0, 1]); // global + function
for term in transducer.query_by_value_set("param", 1, &visible_scopes) {
    println!("{}", term);
}
// Output: function_param (scope 1 is visible)
// Does NOT return: local_var (scope 2 not in visible set)

// Or use custom predicate
for term in transducer.query_filtered("var", 1, |scope_id| *scope_id <= 1) {
    println!("{}", term);
}
// Returns: global_var, function_param (scopes 0, 1)
```
Performance benefit: Filtering by value during traversal is significantly faster than post-filtering results, especially for large dictionaries with many out-of-scope matches.
The phonetic-rules feature provides mathematically verified phonetic transformation rules for fuzzy matching. Compose phonetic NFAs with Levenshtein automata for approximate string matching without normalization.
Use llre! and llev! macros to compile patterns and rules at build time (NFA embedded in binary):
```rust
use liblevenshtein::{llre, llev, llre_file, llev_file};

// Compile regex pattern at build time - NFA embedded in binary
let pattern = llre!(r"(ph|f)one");
assert!(pattern.matches("phone"));
assert!(pattern.matches("fone"));

// Compile LLev rules at build time
let rules = llev!(r#"
    ph -> f;                    // phone → fone
    gh -> / [:vowel:]_;         // night → nit (silent gh after vowel)
    c -> s / _[:front_vowel:];  // city → sity
"#);

// Load from files (imports resolved at compile time)
let phonetic = llre_file!("patterns/phonetic.llre");
let english = llev_file!("rules/english.llev");
```
Build fuzzy regex matchers by composing phonetic NFAs with Levenshtein automata:
```rust
use liblevenshtein::phonetic::nfa::{compile, ProductAutomatonChar};
use liblevenshtein::phonetic::regex::parse;

// Step 1: Parse phonetic regex pattern
let regex = parse("(ph|f)one")?;

// Step 2: Compile regex to NFA
let nfa = compile(&regex)?;

// Step 3: Compose NFA with Levenshtein automaton (max distance 2)
let product = ProductAutomatonChar::new(nfa, 2);

// Step 4: Fuzzy match without preprocessing
assert!(product.accepts("phone"));  // exact match (distance 0)
assert!(product.accepts("fone"));   // exact match (distance 0)
assert!(product.accepts("phones")); // distance 1 (insertion)
assert!(product.accepts("phon"));   // distance 1 (deletion)
assert!(product.accepts("phome"));  // distance 1 (substitution)

// Get minimum edit distance
assert_eq!(product.min_distance("phone"), Some(0));
assert_eq!(product.min_distance("fon"), Some(1));
assert_eq!(product.min_distance("xyz"), None); // outside budget
```
Combine pre-compiled English phonetic rules and search a dictionary for correction candidates:
```rust
use liblevenshtein::phonetic::rules::english;
use liblevenshtein::phonetic::llev::RuleSetChar;
use liblevenshtein::phonetic::verified::rules_to_nfa_char;
use liblevenshtein::phonetic::nfa::ProductAutomatonChar;
use liblevenshtein::dictionary::DynamicDawgChar;

// Load pre-compiled English phonetic rules
let zompist = english::zompist();       // 62 rules: ph→f, gh→∅, tion→ʃən
let homophones = english::homophones(); // too/two→to, their/there→ther
let text_speak = english::text_speak(); // u→you, 2→to, thx→thanks

// Combine all rules into a new RuleSetChar
let mut combined = RuleSetChar::new();
combined.merge(zompist.clone());
combined.merge(homophones.clone());
combined.merge(text_speak.clone());

// Convert combined rules to NFA for fuzzy matching (no normalization)
let rules_nfa = rules_to_nfa_char(&combined.rules);
let product = ProductAutomatonChar::new(rules_nfa, 1);

// Build a dictionary of terms to search
let mut dictionary = DynamicDawgChar::new();
dictionary.insert("phone");
dictionary.insert("fone");
dictionary.insert("today");
dictionary.insert("knight");

// Query for correction candidates by filtering dictionary terms
let query = "fone";
let mut candidates: Vec<(&str, u8)> = dictionary
    .iter()
    .filter_map(|term| {
        product.min_distance(term).map(|dist| (term, dist))
    })
    .collect();

// Sort by distance (best matches first)
candidates.sort_by_key(|(_, dist)| *dist);
// candidates = [("fone", 0), ("phone", 0), ...]

// Optional: Apply rules to normalize text
let normalized = combined.apply("phone");  // "fone" (ph→f)
let normalized = combined.apply("knight"); // "nit" (silent k, gh→∅)
```
The product automaton supports all three distance algorithms:

```rust
use liblevenshtein::transducer::Algorithm;

// Standard Levenshtein
let standard = ProductAutomatonChar::new(nfa.clone(), 1);

// Transposition-aware (character swaps count as 1 edit)
let transposition = ProductAutomatonChar::with_algorithm(
    nfa.clone(), 1, Algorithm::Transposition
);

// Merge-and-split (for OCR: "cl"→"d", "ä"→"ae")
let merge_split = ProductAutomatonChar::with_algorithm(
    nfa, 1, Algorithm::MergeAndSplit
);
```
For quick on-the-fly matching without building a dictionary:
```rust
use liblevenshtein::phonetic::grep::PhoneticGrep;
use std::path::Path;

// Quick on-the-fly matching
let grep = PhoneticGrep::from_pattern("phone", 1)?;
assert!(grep.matches("phone").is_some());
assert!(grep.matches("fone").is_some());

// With phonetic flags (case + accent insensitive)
let grep = PhoneticGrep::from_pattern("(?ia:cafe)", 1)?;
assert!(grep.matches("CAFÉ").is_some());

// With rules from file
let grep = PhoneticGrep::with_rules("fone", Path::new("english.llev"), 1)?;
assert!(grep.matches("phone").is_some());
```
Named Character Classes (phonetic-aware):
- [:vowel:] includes ASCII vowels + IPA vowels (ə, ɪ, ʊ, ɛ, etc.)
- [:consonant:] includes ASCII consonants + IPA symbols
- [:fricative:] matches f, v, s, z, h + digraphs (sh, th, zh)
- [:nasal:] matches m, n + the ng digraph + IPA ŋ
- [:stop:] / [:plosive:] matches p, b, t, d, k, g
- [:front_vowel:] / [:back_vowel:] grouped by tongue position
- [:voiced:] / [:voiceless:] grouped by voicing
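As a quick illustration, a hedged sketch combining a named class with the llre! macro shown earlier (assuming [:nasal:] expands to m and n as listed above):

```rust
use liblevenshtein::llre;

fn main() {
    // Assumes [:nasal:] covers m and n, per the class list above.
    let pattern = llre!(r"[:nasal:]ight");
    assert!(pattern.matches("night"));
    assert!(pattern.matches("might"));
    assert!(!pattern.matches("light")); // 'l' is not a nasal
}
```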
Formal Verification: All algorithms are proven correct in Coq/Rocq, covering the phonetic rules (5 theorems) and the Levenshtein automata.

Verification artifacts:
- docs/verification/phonetic/
- src/phonetic/
- src/phonetic/properties.rs
- benches/phonetic_rules.rs
- examples/phonetic_rewrite.rs

Enable with:
```toml
[dependencies]
liblevenshtein = { git = "https://github.com/universal-automata/liblevenshtein-rust", features = ["phonetic-rules"] }
```
Source: Based on Zompist's English spelling rules with complete formal verification in Coq/Rocq (630+ lines of proofs, 100% proven, zero Admitted).
For complex phonetic matching, use .llev files for rewrite rules or .llre files for regex-style patterns with phonetic extensions:
```rust
use liblevenshtein::phonetic::llev::parse_str;
use liblevenshtein::phonetic::llre;

// LLev: Rewrite rules with phonetic context and named classes
let rules = parse_str(r#"
    @name "English Phonetic Rules"
    ph -> f;                    // phone → fone
    c -> s / _[:front_vowel:];  // city → sity (before e,i)
    gh -> / [:vowel:]_;         // night → nit (silent after vowel)
"#)?;

// LLRE: Compile regex pattern to NFA
let pattern = llre::compile_pattern("[:fricative:]one")?;
assert!(pattern.matches("fone"));  // f ∈ [:fricative:]
assert!(pattern.matches("shone")); // sh ∈ [:fricative:]

// Compose LLRE pattern with Levenshtein for fuzzy matching
use liblevenshtein::phonetic::nfa::ProductAutomatonChar;
let product = ProductAutomatonChar::new(pattern.nfa.clone(), 1);
assert!(product.accepts("phone")); // distance 1 from pattern
```
See examples/phonetic_spellcheck for a complete demo, and the LLev Grammar / LLRE Grammar for full syntax reference.
For developers wanting to understand the implementation or extend the algorithms:
| Paper Concept | Code Location | Description |
|---|---|---|
| Position (i#e) | src/transducer/position.rs:11-35 | Index + error count structure |
| Subsumption (⊑) | src/transducer/position.rs:231-269 | Position redundancy elimination |
| Characteristic Vector (χ) | src/transducer/position.rs | Character occurrence encoding |
| Elementary Transitions (δ) | src/transducer/transition.rs:119-438 | Tables 4.1, 7.1, 8.1 from the paper |
| State Transitions (Δ) | src/transducer/query.rs | Parallel traversal implementation |
| Algorithm Variants | src/transducer/algorithm.rs | Standard/Transposition/MergeAndSplit |
| Imitation Method | src/transducer/query.rs:86-188 | Chapter 6 - on-demand state generation |
| Dictionary Automaton (A^D) | src/dictionary/ | Multiple backend implementations |
See Implementation Mapping for the key implementation patterns and detailed code-to-paper correspondence with examples.
The library maintains the theoretical guarantees from the foundational paper while adding practical optimizations:
vs Naive Approach: Computing Levenshtein distance for every dictionary entry requires O(|D| × |W| × |V|) operations. This library achieves 100-1000× speedup on large dictionaries by avoiding per-word distance calculations through automata-guided search.
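For contrast, a minimal sketch of the naive baseline being avoided here: the classic Wagner-Fischer dynamic program, run once per dictionary word.

```rust
/// Classic Wagner-Fischer edit distance: O(|w| * |v|) time per pair.
fn levenshtein(w: &str, v: &str) -> usize {
    let w: Vec<char> = w.chars().collect();
    let v: Vec<char> = v.chars().collect();
    let mut prev: Vec<usize> = (0..=v.len()).collect();
    for (i, &wc) in w.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &vc) in v.iter().enumerate() {
            let substitute = prev[j] + (wc != vc) as usize;
            curr.push(substitute.min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    prev[v.len()]
}

fn main() {
    // The naive approach computes a full distance for every dictionary word:
    // O(|D| x |W| x |V|) overall, which the automaton-guided search avoids.
    let dictionary = ["test", "testing", "tested"];
    for word in dictionary {
        if levenshtein("tset", word) <= 2 {
            println!("match: {}", word); // only "test" is within distance 2
        }
    }
}
```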
Beyond the algorithmic foundations, the implementation includes practical engineering optimizations: benchmarks show a 3.3x speedup for DAWG operations and 5-18% improvements across filtering and prefix scenarios.
When compiled with the simd feature on x86_64 platforms, the library achieves 20-64% performance gains from vectorized operations; the exact improvement varies by workload. See docs/analysis/ for detailed SIMD performance analysis (950+ lines of documentation).
Character-level dictionary variants (*Char) add Unicode support at a modest cost: roughly 5% overhead for UTF-8 decoding, and 4 bytes per char edge label versus 1 byte per u8. Use the *Char variants for multi-language dictionaries containing non-ASCII Unicode, and the byte-level backends (DoubleArrayTrie, PathMapDictionary) for ASCII/Latin-1 content.
This library includes extensive research documentation on potential enhancements:
Support for restricted substitutions, where only specific character pairs can be substituted.
Theory: Based on "Universal Levenshtein Automata. Building and Properties" (Mitankin, Mihov, Schulz, 2005). Maintains O(|W|) construction complexity with restricted substitution set S ⊆ Σ × Σ.
Status: Research complete, implementation planned. See Universal Levenshtein Automata Documentation for complete details.
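For concreteness, a hypothetical sketch of what a restricted substitution set S ⊆ Σ × Σ could look like once implemented; the HashSet representation below is illustrative only, not a planned API:

```rust
use std::collections::HashSet;

fn main() {
    // Hypothetical: permit only common OCR confusion pairs as substitutions.
    let s: HashSet<(char, char)> = HashSet::from([
        ('O', '0'), ('l', '1'), ('S', '5'), ('B', '8'),
    ]);

    // A restricted automaton would charge one edit for a substitution (a, b)
    // only when s.contains(&(a, b)); other substitutions would be disallowed.
    assert!(s.contains(&('O', '0')));
    assert!(!s.contains(&('O', 'X')));
}
```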
Support for variable operation costs through discretization.
Complexity: O(|W| × max_cost/precision) through cost discretization; this remains linear in the query length when the cost range and precision are fixed (for example, costs up to 2.0 at a precision of 0.25 yield 8 discrete error levels).
Status: Methodology documented, implementation research phase. See Weighted Levenshtein Automata Research for feasibility analysis and implementation approach.
Interested in implementing these features or researching new extensions? The research documentation provides complete theoretical foundations and implementation roadmaps for extending the library.
Licensed under the Apache License, Version 2.0. See LICENSE for details.