jmdict-fast

Crates.io: jmdict-fast
lib.rs: jmdict-fast
version: 0.1.1
created_at: 2025-10-24 11:40:42.655365+00
updated_at: 2025-10-24 11:40:42.655365+00
description: Blazing-fast Japanese dictionary engine with FST-based indexing
homepage: https://github.com/theGlenn/jmdict-fst
repository: https://github.com/theGlenn/jmdict-fst
max_upload_size:
id: 1898326
size: 118,693
Glenn Sonna (theGlenn)

documentation

https://docs.rs/jmdict-fast

README

🚀 jmdict-fast

Blazing-fast Japanese dictionary engine


Note: This crate uses bunpo for Japanese conjugation handling. Both crates are part of the same monorepo but are published separately to crates.io.


✨ Features

  • 💾 Compile-time indexed data: FST + binary blob for maximum efficiency
  • ⚡ Instant lookups: O(log n) exact matching across all writing systems
  • 🔎 Multimodal search: Kanji, kana, and romaji support
  • 📦 Ergonomic Rust API: Usable as a library or binary
  • 🪶 Tiny binary: Zero runtime parsing, no allocations during lookup
  • 🎯 Memory-mapped: Zero-copy access to all dictionary data

🏎️ Performance at a Glance

Metric        Value
Index Size    ~888KB (FSTs)
Data Size     16MB binary blob
Entries       22,569
Unique Keys   24,342
Lookup Speed  O(log n), instant
Memory Usage  Memory-mapped, zero allocations

🚀 Quick Start

Building the Dictionary

cargo build

This creates:

  • OUT_DIR/kanji.fst: Kanji lookup index
  • OUT_DIR/kana.fst: Kana lookup index
  • OUT_DIR/romaji.fst: Romaji lookup index
  • OUT_DIR/entries.bin: Binary blob with all entries

Using the Library

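Prefix Search

use jmdict_fast::Dict;

fn main() -> anyhow::Result<()> {
    // Load the dictionary from its default location.
    let dict = Dict::load_default()?;
    // Look up entries by the partial term "こんに".
    let results = dict.lookup_partial("こんに");
    for entry in &results {
        println!("Found: {:?}", entry.kanji);
        println!("Reading: {:?}", entry.kana);
        // Use first() instead of [0] so an entry without senses cannot panic.
        if let Some(sense) = entry.sense.first() {
            println!("Meanings: {:?}", sense.gloss);
        }
    }
    Ok(())
}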

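Exact Search

use jmdict_fast::Dict;

fn main() -> anyhow::Result<()> {
    // Load the dictionary from its default location.
    let dict = Dict::load_default()?;
    // Look up entries whose key matches "こんにちは" exactly.
    let results = dict.lookup_exact("こんにちは");
    for entry in &results {
        println!("Found: {:?}", entry.kanji);
        println!("Reading: {:?}", entry.kana);
        // Use first() instead of [0] so an entry without senses cannot panic.
        if let Some(sense) = entry.sense.first() {
            println!("Meanings: {:?}", sense.gloss);
        }
    }
    Ok(())
}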

📊 Data Structure

┌────────────────────┐    ┌────────────────────┐    ┌────────────────────┐
│ kanji.fst          │    │ kana.fst           │    │ romaji.fst         │
│ (243KB)            │    │ (257KB)            │    │ (388KB)            │
│                    │    │                    │    │                    │
│ 漢字 → Entry ID    │    │ かな → Entry ID    │    │ romaji → Entry ID  │
└────────────────────┘    └────────────────────┘    └────────────────────┘
          │                         │                         │
          └─────────────────────────┼─────────────────────────┘
                                    │
                          ┌────────────────────┐
                          │ entries.bin        │
                          │ (16MB)             │
                          │                    │
                          │ Offset Table       │
                          │ + JSON Entries     │
                          └────────────────────┘
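
To make the diagram concrete, here is a minimal sketch of how an entry ID could be resolved through an offset table into one contiguous blob. The byte layout (a u32 count, then count + 1 little-endian u32 offsets, then the payloads) and the names build_blob, read_u32 and entry_bytes are assumptions made for illustration; the real entries.bin format is not documented here and may differ.

```rust
/// Builds a blob laid out as:
/// [count: u32 LE][count + 1 offsets: u32 LE][payload bytes...]
/// Offsets are relative to the start of the payload area (assumed layout).
fn build_blob(entries: &[&str]) -> Vec<u8> {
    let mut offsets = Vec::with_capacity(entries.len() + 1);
    let mut payload = Vec::new();
    for e in entries {
        offsets.push(payload.len() as u32);
        payload.extend_from_slice(e.as_bytes());
    }
    // Sentinel offset marking the end of the last entry.
    offsets.push(payload.len() as u32);

    let mut blob = Vec::new();
    blob.extend_from_slice(&(entries.len() as u32).to_le_bytes());
    for off in &offsets {
        blob.extend_from_slice(&off.to_le_bytes());
    }
    blob.extend_from_slice(&payload);
    blob
}

/// Reads a little-endian u32 at `pos`, or None if out of bounds.
fn read_u32(blob: &[u8], pos: usize) -> Option<u32> {
    let b = blob.get(pos..pos.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Returns the raw bytes of entry `id` as a borrowed slice of the blob.
fn entry_bytes(blob: &[u8], id: u32) -> Option<&[u8]> {
    let count = read_u32(blob, 0)?;
    if id >= count {
        return None;
    }
    let table_start = 4usize;
    let payload_start = table_start + (count as usize + 1) * 4;
    let start = read_u32(blob, table_start + id as usize * 4)? as usize;
    let end = read_u32(blob, table_start + (id as usize + 1) * 4)? as usize;
    blob.get(payload_start + start..payload_start + end)
}

fn main() {
    // Hand-written sample payloads, not real dictionary data.
    let blob = build_blob(&["{\"id\":\"1\"}", "{\"id\":\"2\"}"]);
    if let Some(raw) = entry_bytes(&blob, 1) {
        println!("{}", String::from_utf8_lossy(raw));
    }
}
```

Because entry_bytes only slices into the existing buffer, the same function would work unchanged over a memory-mapped file exposed as &[u8].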

🔧 API Reference

  • Dict::load<P: AsRef<Path>>(base_dir: P) -> Result<Self>: Loads the dictionary from the specified directory.
  • dict.lookup_exact(term: &str) -> Vec<Entry>: Performs exact lookup across all writing systems.

Entry Structure:

pub struct Entry {
    pub id: String,                    // JMdict entry ID
    pub kanji: Vec<KanjiEntry>,        // Kanji forms
    pub kana: Vec<KanaEntry>,          // Kana readings
    pub sense: Vec<SenseEntry>,        // Meanings and metadata
}

pub struct KanjiEntry {
    pub common: bool,                  // Is this a common kanji?
    pub text: String,                  // The kanji text
    pub tags: Vec<String>,             // JMdict tags
}

pub struct KanaEntry {
    pub common: bool,                  // Is this a common reading?
    pub text: String,                  // The kana text
    pub tags: Vec<String>,             // JMdict tags
    pub applies_to_kanji: Vec<String>, // Which kanji this applies to
}

pub struct SenseEntry {
    pub part_of_speech: Vec<String>,   // Grammatical information
    pub applies_to_kanji: Vec<String>, // Which kanji this sense applies to
    pub applies_to_kana: Vec<String>,  // Which kana this sense applies to
    pub gloss: Vec<GlossEntry>,        // English translations
    // ... other JMdict fields
}
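
As a rough illustration of how these fields fit together, the sketch below walks an entry to find its first common kanji form and the kana readings that apply to a given kanji. The structs are trimmed local copies of the ones above so the example compiles on its own. The helper names (first_common_kanji, readings_for, sample_entry) are hypothetical, and treating "*" in applies_to_kanji as "applies to every kanji form" is an assumption, not something this README states.

```rust
// Trimmed local copies of the structs above; only the fields used here.
struct KanjiEntry {
    common: bool,
    text: String,
}

struct KanaEntry {
    text: String,
    applies_to_kanji: Vec<String>,
}

struct Entry {
    kanji: Vec<KanjiEntry>,
    kana: Vec<KanaEntry>,
}

/// First kanji form flagged as common, if any.
fn first_common_kanji(entry: &Entry) -> Option<&str> {
    entry.kanji.iter().find(|k| k.common).map(|k| k.text.as_str())
}

/// Kana readings that apply to `kanji`.
/// Assumption: "*" means the reading applies to every kanji form.
fn readings_for<'a>(entry: &'a Entry, kanji: &str) -> Vec<&'a str> {
    entry
        .kana
        .iter()
        .filter(|k| k.applies_to_kanji.iter().any(|a| a == "*" || a == kanji))
        .map(|k| k.text.as_str())
        .collect()
}

/// Hand-written sample data for the sketch, not taken from the dictionary.
fn sample_entry() -> Entry {
    Entry {
        kanji: vec![
            KanjiEntry { common: false, text: "貓".to_string() },
            KanjiEntry { common: true, text: "猫".to_string() },
        ],
        kana: vec![
            KanaEntry { text: "ねこ".to_string(), applies_to_kanji: vec!["*".to_string()] },
            KanaEntry { text: "ネコ".to_string(), applies_to_kanji: vec!["貓".to_string()] },
        ],
    }
}

fn main() {
    let entry = sample_entry();
    println!("common form: {:?}", first_common_kanji(&entry));
    println!("readings for 猫: {:?}", readings_for(&entry, "猫"));
}
```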

🛠️ Development

Caching System

The build script implements a robust caching system to avoid re-downloading the large JMdict dataset. See CACHING.md and CACHE_QUICK_REFERENCE.md for details.


🔍 How It Works

  1. Build Phase: The build tool processes the JMdict JSON and creates FST indexes and a binary blob for instant retrieval.
  2. Runtime Phase: The library provides memory-mapped loading, FST-based lookups, and efficient entry retrieval.
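
The two phases can be sketched with the standard library alone. Note that this stand-in swaps the FST for a sorted Vec searched with binary search, which is a different and much less compact structure; it only illustrates the split between indexing once and then answering exact and prefix queries against an immutable index. The function names build_index, exact_ids and prefix_ids are hypothetical and are not part of the jmdict-fast API.

```rust
/// Build phase: sort (key, entry_id) pairs once so lookups can binary-search.
fn build_index(mut pairs: Vec<(String, u32)>) -> Vec<(String, u32)> {
    pairs.sort();
    pairs
}

/// Runtime phase: entry IDs whose key equals `term`.
/// partition_point finds the first key >= term in O(log n) comparisons.
fn exact_ids(index: &[(String, u32)], term: &str) -> Vec<u32> {
    let start = index.partition_point(|(k, _)| k.as_str() < term);
    index[start..]
        .iter()
        .take_while(|(k, _)| k == term)
        .map(|(_, id)| *id)
        .collect()
}

/// Runtime phase: entry IDs whose key starts with `prefix`.
/// Keys sharing a prefix are contiguous in byte-wise sorted order.
fn prefix_ids(index: &[(String, u32)], prefix: &str) -> Vec<u32> {
    let start = index.partition_point(|(k, _)| k.as_str() < prefix);
    index[start..]
        .iter()
        .take_while(|(k, _)| k.starts_with(prefix))
        .map(|(_, id)| *id)
        .collect()
}

fn main() {
    // Hand-written sample keys and IDs, not real dictionary data.
    let index = build_index(vec![
        ("ねこ".to_string(), 3),
        ("こんばんは".to_string(), 2),
        ("こんにちは".to_string(), 1),
    ]);
    println!("exact ねこ: {:?}", exact_ids(&index, "ねこ"));
    println!("prefix こん: {:?}", prefix_ids(&index, "こん"));
}
```

An FST serves the same kind of query while sharing common prefixes and suffixes between keys, which is presumably why the indexes listed earlier stay under a megabyte.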

📈 Real Benchmark Results

Criterion (lookup_word.rs), MacBook, Rust 1.70+

lookup_exact 猫 (jmdict-fast)
    time:   [4.06 µs]
lookup_word 猫 (jmdict)
    time:   [511.96 µs]

  • jmdict-fast is ~126x faster (511.96 µs / 4.06 µs) than a traditional filter-based approach for exact lookups.
  • Both methods are stable, but jmdict-fast is highly optimized for speed and memory.

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

📄 License

MIT License; see LICENSE for details.


🙏 Acknowledgments


Built with ❤️ and Rust 🦀
