| Crates.io | dictutils |
| lib.rs | dictutils |
| version | 0.1.2 |
| created_at | 2025-12-04 12:03:13.303736+00 |
| updated_at | 2025-12-06 14:59:22.520983+00 |
| description | Dictionary utilities for Mdict and other formats |
| homepage | https://github.com/SuperKevinZhou/dictutils |
| repository | https://github.com/SuperKevinZhou/dictutils |
| max_upload_size | |
| id | 1966411 |
| size | 1,039,126 |
A high-performance Rust library for fast dictionary operations with support for multiple dictionary formats (MDict, StarDict, ZIM) and advanced indexing capabilities.
DictUtils is currently experimental and not suitable for production use. Many format parsers rely on placeholder logic that does not validate real dictionary files, index sidecars are not compatible with production dictionaries, and compression/IO helpers are best-effort prototypes. Use this crate only for prototyping or research experiments. Contributions are welcome to replace the mock parsing layers with real format support.
Add DictUtils to your Cargo.toml:
[dependencies]
dictutils = "0.1.0"
Or with optional features:
[dependencies]
dictutils = { version = "0.1.0", features = ["criterion", "rayon", "cli", "encoding-support"] }
Basic usage example:
use dictutils::prelude::*;
fn main() -> dictutils::Result<()> {
// Load dictionary with auto-detection
let loader = DictLoader::new();
let mut dict = loader.load("path/to/dictionary.mdict")?;
// Basic lookup
let entry = dict.get(&"hello".to_string())?;
println!("Found: {}", String::from_utf8_lossy(&entry));
// Prefix search
let results = dict.search_prefix("hel", Some(10))?;
for result in results {
println!("Found: {}", result.word);
}
Ok(())
}
// Auto-detection of dictionary format
let mut dict = DictLoader::new().load("dictionary.mdict")?;
// With custom configuration
let config = DictConfig {
load_btree: true, // Enable B-TREE indexing
load_fts: true, // Enable full-text search
use_mmap: true, // Memory mapping for large files
cache_size: 1000, // Entry cache size
batch_size: 100, // Batch operation size
..Default::default()
};
let loader = DictLoader::with_config(config);
let mut dict = loader.load("large_dictionary.zim")?;
use dictutils::traits::*;
// Prefix search - find words starting with "comp"
let prefix_results = dict.search_prefix("comp", Some(20))?;
// Fuzzy search - find words similar to "programing"
let fuzzy_results = dict.search_fuzzy("programing", Some(2))?;
// Full-text search - search within content
let fts_iterator = dict.search_fulltext("programming language")?;
let fts_results = fts_iterator.collect::<Result<Vec<_>, _>>()?;
// Range queries
let range_results = dict.get_range(100..200)?;
// Batch lookups
let keys = vec!["hello".to_string(), "world".to_string(), "rust".to_string()];
let batch_results = dict.get_batch(&keys, Some(50))?;
// Build indexes for better performance
dict.build_indexes()?;
// Configure for memory efficiency
let efficient_config = DictConfig {
use_mmap: true, // Better for large files
cache_size: 500, // Smaller cache for memory-constrained environments
load_btree: true, // Fast lookups
load_fts: false, // Disable if not needed
..Default::default()
};
// Monitor performance statistics
let stats = dict.stats();
println!("Memory usage: {} bytes", stats.memory_usage);
println!("Cache hit rate: {:.2}%", stats.cache_hit_rate * 100.0);
for (index_name, size) in &stats.index_sizes {
println!("{} index: {} bytes", index_name, size);
}
dictutils/
├── traits.rs          # Core trait definitions
├── dict/              # Dictionary format implementations
│   ├── mdict.rs       # MDict format
│   ├── stardict.rs    # StarDict format
│   └── zimdict.rs     # ZIM format
├── index/             # High-performance indexing
│   ├── btree.rs       # B-TREE index for fast lookups
│   └── fts.rs         # Full-text search index
├── util/              # Utility modules
│   ├── compression.rs # Compression algorithms
│   ├── encoding.rs    # Text encoding conversion
│   └── buffer.rs      # Binary buffer utilities
└── lib.rs             # Main library module
| Dictionary Size | Configuration | Memory Mapping | Indexes |
|---|---|---|---|
| < 10MB | Basic config | Optional | Optional |
| 10MB - 100MB | Standard | Recommended | B-TREE |
| 100MB - 1GB | Optimized | Recommended | B-TREE + FTS |
| > 1GB | Enterprise | Required | B-TREE + FTS |
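As a rough illustration, the 100MB - 1GB row of this table maps onto the configuration fields shown earlier. The values below are suggestions, not crate defaults; tune cache_size for your workload.

```rust
use dictutils::prelude::*;

// Sketch of the "100MB - 1GB" row: memory mapping plus B-TREE and FTS indexes.
let config = DictConfig {
    use_mmap: true,
    load_btree: true,
    load_fts: true,
    cache_size: 2000,
    ..Default::default()
};
let mut dict = DictLoader::with_config(config).load("large_dictionary.mdict")?;
```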
// Build B-TREE index for fast exact lookups
dict.build_indexes()?;
// Enable memory mapping for better I/O performance
let config = DictConfig {
use_mmap: true,
..Default::default()
};
// Cache frequently accessed entries
let config = DictConfig {
cache_size: 2000, // Increase cache size
..Default::default()
};
// Use batch operations for multiple lookups
let keys = vec!["word1".to_string(), "word2".to_string(), /* ... */];
let results = dict.get_batch(&keys, Some(100))?;
// Cache search results
let mut cache = HashMap::new();
// Prefix search with limits
let results = dict.search_prefix("prefix", Some(100))?;
// Use appropriate search type
if query.len() <= 3 {
dict.search_prefix(query, limit)?; // Fast for short prefixes
} else if query.contains(' ') {
dict.search_fulltext(query)?; // For phrases
} else {
dict.search_fuzzy(query, Some(2))?; // For typo tolerance
}
// Use memory mapping for large files
let config = DictConfig {
use_mmap: true,
..Default::default()
};
// Clear cache periodically
dict.clear_cache();
// Monitor memory usage
let stats = dict.stats();
println!("Memory usage: {} bytes", stats.memory_usage);
Run performance benchmarks:
# Run all benchmarks
cargo bench --all-features
# Run specific benchmark category
cargo bench --features criterion -- dict_lookup
# Profile memory usage
cargo run --features criterion --example performance_profiling
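For reference, a benchmark in the dict_lookup category might be structured as follows. This is a hypothetical sketch, not the crate's actual benchmark suite; it assumes a local fixture dictionary at fixtures/sample.mdict and a [[bench]] target with harness = false in Cargo.toml.

```rust
// benches/dict_lookup.rs (hypothetical sketch)
use criterion::{criterion_group, criterion_main, Criterion};
use dictutils::prelude::*;

fn bench_exact_lookup(c: &mut Criterion) {
    // Placeholder fixture path; replace with a real dictionary file.
    let mut dict = DictLoader::new()
        .load("fixtures/sample.mdict")
        .expect("failed to load fixture dictionary");

    c.bench_function("dict_lookup/exact", |b| {
        b.iter(|| dict.get(&"hello".to_string()))
    });
}

criterion_group!(benches, bench_exact_lookup);
criterion_main!(benches);
```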
Expected performance characteristics are summarized in the benchmark results table at the end of this README.
use std::sync::Arc;
use std::thread;
// Share the dictionary across threads (assumes lookups take &self;
// wrap the dictionary in a Mutex if your Dict methods need &mut self)
let dict = Arc::new(dict);
// Thread 1: Reading operations
let dict1 = Arc::clone(&dict);
let handle1 = thread::spawn(move || {
let entry = dict1.get(&"hello".to_string())?;
println!("Found: {}", String::from_utf8_lossy(&entry));
Ok::<(), dictutils::DictError>(())
});
// Thread 2: Search operations
let dict2 = Arc::clone(&dict);
let handle2 = thread::spawn(move || {
let results = dict2.search_prefix("test", Some(10))?;
println!("Found {} results", results.len());
Ok::<(), dictutils::DictError>(())
});
handle1.join().unwrap().unwrap();
handle2.join().unwrap().unwrap();
// Process large dictionaries efficiently
fn process_large_dictionary(dict_path: &str) -> dictutils::Result<()> {
let loader = DictLoader::new();
let mut dict = loader.load(dict_path)?;
// Build indexes for better performance
dict.build_indexes()?;
// Process entries in batches
let iterator = dict.iter()?;
let mut batch = Vec::new();
let batch_size = 1000;
for entry_result in iterator {
match entry_result {
Ok((key, value)) => {
batch.push((key, value));
if batch.len() >= batch_size {
process_batch(&batch)?;
batch.clear();
}
}
Err(e) => {
println!("Error processing entry: {}", e);
}
}
}
// Process remaining entries
if !batch.is_empty() {
process_batch(&batch)?;
}
Ok(())
}
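The process_batch helper above is left undefined; a minimal placeholder, shown only to make the example self-contained, could be:

```rust
// Placeholder batch handler: replace with real work (export, analysis, re-indexing, ...).
fn process_batch(batch: &[(String, Vec<u8>)]) -> dictutils::Result<()> {
    println!("processing {} entries", batch.len());
    Ok(())
}
```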
// Convert between dictionary formats
use dictutils::prelude::*;
use dictutils::traits::DictError;
fn convert_dictionary(source: &str, destination: &str, target_format: &str) -> dictutils::Result<()> {
let loader = DictLoader::new();
let mut source_dict = loader.load(source)?;
// Extract all entries
let entries: Vec<(String, Vec<u8>)> = source_dict.iter()?
.collect::<Result<Vec<_>, _>>()?;
// Create new dictionary in target format
// Note: This would require a DictBuilder implementation
// For now, create the new dictionary manually
match target_format {
"mdict" => {
// Create an MDict file at `destination` from the extracted entries
println!("Converting {} entries to MDict format at {}", entries.len(), destination);
}
"stardict" => {
// Create a StarDict file at `destination` from the extracted entries
println!("Converting {} entries to StarDict format at {}", entries.len(), destination);
}
_ => {
return Err(DictError::UnsupportedOperation(
format!("Target format '{}' not supported", target_format)
));
}
}
Ok(())
}
MDict: a high-performance binary format.
Best for: Large dictionaries, performance-critical applications
StarDict: a classic dictionary format.
Best for: General purpose dictionaries, simple implementations
ZIM: the offline Wikipedia format.
Best for: Offline wikis, reference materials
BGL: the Babylon format.
Important: the BGL implementation does NOT parse raw .bgl binaries. It only consumes externally built sidecar index files (.btree and .fts), which must be produced by an external tool such as GoldenDict's indexer.
Best for: Babylon dictionaries with pre-built indexes
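Assuming the sidecar indexes sit next to the .bgl file and are picked up by the loader's path conventions (the exact naming scheme is an assumption here, not documented above), loading looks the same as for any other format:

```rust
use dictutils::prelude::*;

// Assumption: "babylon.bgl" has pre-built sidecar indexes (e.g. .btree / .fts)
// in the same directory; the loader is expected to find them by convention.
let loader = DictLoader::new();
let mut dict = loader.load("babylon.bgl")?;
let entry = dict.get(&"example".to_string())?;
println!("{}", String::from_utf8_lossy(&entry));
```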
All operations return Result<T, DictError>:
use dictutils::prelude::*;
use dictutils::traits::{DictError, Result};
fn robust_dict_operation() -> Result<()> {
let loader = DictLoader::new();
match loader.load("dictionary.mdict") {
Ok(mut dict) => {
match dict.get(&"example".to_string()) {
Ok(entry) => {
println!("Found: {}", String::from_utf8_lossy(&entry));
}
Err(DictError::IndexError(msg)) => {
println!("Word not found: {}", msg);
}
Err(e) => {
println!("Lookup error: {}", e);
}
}
}
Err(DictError::FileNotFound(path)) => {
println!("Dictionary file not found: {}", path);
}
Err(DictError::InvalidFormat(msg)) => {
println!("Invalid dictionary format: {}", msg);
}
Err(DictError::IoError(msg)) => {
println!("I/O error: {}", msg);
}
Err(e) => {
println!("Other error: {}", e);
}
}
Ok(())
}
# Run all tests
cargo test
# Run specific test categories
cargo test unit_tests
cargo test integration_tests
cargo test error_tests
cargo test concurrent_tests
# Run library tests single-threaded
cargo test --lib -- --test-threads=1
# Run tests with the criterion feature enabled
cargo test --features criterion
# Run performance tests
cargo test --features criterion performance_tests
# Run memory leak detection
cargo test --features debug_leak_detector
# Run concurrent stress tests
cargo test concurrent_tests -- --nocapture
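As a sketch of what an error-path test can look like, using only the loading API shown in this README:

```rust
use dictutils::prelude::*;

#[test]
fn load_missing_file_fails() {
    let loader = DictLoader::new();
    // The exact error variant depends on the crate; we only assert failure here.
    assert!(loader.load("definitely_missing.mdict").is_err());
}
```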
Enable additional functionality with Cargo features:
[dependencies.dictutils]
version = "0.1.0"
features = [
"criterion", # Performance benchmarks
"rayon", # Parallel processing
"cli", # Command-line tools
"serde", # Serialization support
"debug_leaks" # Memory leak detection
]
We welcome contributions! Please see our Contributing Guide for details.
# Clone repository
git clone https://github.com/SuperKevinZhou/dictutils.git
cd dictutils
# Install development dependencies
cargo install cargo-watch
cargo install cargo-audit
# Run tests
cargo test
# Run linting
cargo fmt --check
cargo clippy --all-targets --all-features
# Run benchmarks
cargo bench --all-features
To add support for a new dictionary format:
1. Implement the DictFormat trait
2. Implement the Dict trait for your format
3. Register the format with DictLoader
Example template:
use std::path::Path;
use dictutils::traits::*;
pub struct NewDict {
// Your implementation
}
impl DictFormat<String> for NewDict {
const FORMAT_NAME: &'static str = "newdict";
fn is_valid_format(path: &Path) -> Result<bool> {
// Implement format validation
Ok(false) // Placeholder
}
fn load(path: &Path, config: DictConfig) -> Result<Box<dyn Dict<String>>> {
// Implement format loading
Err(DictError::UnsupportedOperation("Not implemented".to_string()))
}
}
impl Dict<String> for NewDict {
// Implement all required methods
// ...
}
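The template can be exercised directly through its trait methods; registration with DictLoader (step 3) is omitted below because the registration API is not shown in this README. A minimal, illustrative driver:

```rust
use std::path::Path;
use dictutils::traits::*;

// Illustrative only: validate and load a file through the NewDict template above.
fn try_new_format(path: &Path) -> Result<()> {
    if <NewDict as DictFormat<String>>::is_valid_format(path)? {
        let _dict = <NewDict as DictFormat<String>>::load(path, DictConfig::default())?;
        println!(
            "loaded {} via the {} format",
            path.display(),
            <NewDict as DictFormat<String>>::FORMAT_NAME
        );
    }
    Ok(())
}
```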
This project is licensed under the MIT License - see the LICENSE file for details.
Performance results on typical hardware (Intel i7, 16GB RAM):
| Operation | Small Dict (<1MB) | Medium Dict (10MB) | Large Dict (100MB) |
|---|---|---|---|
| Load Time | < 10ms | < 100ms | < 500ms |
| Exact Lookup | < 0.1ms | < 0.1ms | < 0.1ms |
| Prefix Search | < 1ms | < 5ms | < 20ms |
| Fuzzy Search | < 10ms | < 50ms | < 200ms |
| Full-Text Search | < 20ms | < 100ms | < 500ms |
Made with ❤️ by the DictUtils team