mecrab-word2vec 0.1.0

High-performance Word2Vec implementation with Hogwild! parallelization for MeCrab

Repository: https://github.com/cool-japan/mecrab
Author: KitaSan (cool-japan)

README

mecrab-word2vec

High-performance Word2Vec training library for Japanese text, optimized for multi-core CPUs.

Features

  • Skip-gram with Negative Sampling - Industry-standard algorithm (sketched after this list)
  • Hogwild! Parallelization - Lock-free multi-threading (83% efficiency on 6 cores)
  • MCV1 Binary Format - Memory-mapped vector storage for instant loading
  • Pure Rust - Memory-safe, no external C dependencies
  • Zero-copy - Direct pointer arithmetic, minimal allocations
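
The core training step is the standard skip-gram negative-sampling (SGNS) gradient update (Mikolov et al., 2013). For reference, here is a generic single-pair update in plain Rust; this is a sketch of the textbook algorithm, not this crate's internal code, and all names and the memory layout are illustrative:

use std::iter;

/// One SGNS gradient step for a (center, context) pair plus negative samples.
/// `input` and `output` are the two embedding tables, flattened row-major
/// with `dim` floats per word. Illustrative only.
fn sgns_update(
    input: &mut [f32],
    output: &mut [f32],
    dim: usize,
    center: usize,
    context: usize,
    negatives: &[usize],
    alpha: f32,
) {
    let mut grad_in = vec![0.0f32; dim];
    let in_row = &input[center * dim..(center + 1) * dim];

    // The context word is a positive example (label 1); negatives get label 0.
    let targets = iter::once((context, 1.0f32))
        .chain(negatives.iter().map(|&n| (n, 0.0f32)));

    for (target, label) in targets {
        let out_row = &mut output[target * dim..(target + 1) * dim];
        let dot: f32 = in_row.iter().zip(out_row.iter()).map(|(a, b)| a * b).sum();
        let pred = 1.0 / (1.0 + (-dot).exp()); // sigmoid
        let g = (label - pred) * alpha;        // scaled gradient
        for d in 0..dim {
            grad_in[d] += g * out_row[d]; // accumulate gradient for the input row
            out_row[d] += g * in_row[d];  // update the output row in place
        }
    }
    // Apply the accumulated gradient to the center word's input vector.
    let in_row = &mut input[center * dim..(center + 1) * dim];
    for d in 0..dim {
        in_row[d] += grad_in[d];
    }
}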

Quick Start

Training Vectors

use mecrab_word2vec::{Word2VecBuilder, TrainingConfig};

let model = Word2VecBuilder::new()
    .vector_size(100)
    .window_size(5)
    .negative_samples(5)
    .min_count(10)
    .epochs(3)
    .threads(6)
    .build_from_corpus("corpus_word_ids.txt")?;

model.save_mcv1("vectors.bin", 100000)?;

Using Trained Vectors

use mecrab::vectors::VectorStorage;

let vectors = VectorStorage::load_mcv1("vectors.bin")?;
let vec1 = vectors.get(42)?;  // Get vector for word_id 42
let vec2 = vectors.get(123)?;

let similarity = vectors.cosine_similarity(42, 123)?;
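
On top of get and cosine_similarity, simple nearest-neighbour queries are just a scan over candidate ids. A sketch follows; the candidate list, the u32 id type, and the most_similar helper are assumptions made for illustration, not crate APIs:

use mecrab::vectors::VectorStorage;

// Return the candidate word_id most similar to `query_id`, scanning a
// caller-supplied list of valid ids. `most_similar` is not a crate API;
// it only combines the `cosine_similarity` call shown above.
fn most_similar(
    vectors: &VectorStorage,
    query_id: u32,
    candidates: &[u32],
) -> Option<(u32, f32)> {
    let mut best: Option<(u32, f32)> = None;
    for &id in candidates {
        if id == query_id {
            continue;
        }
        if let Ok(sim) = vectors.cosine_similarity(query_id, id) {
            if best.map_or(true, |(_, s)| sim > s) {
                best = Some((id, sim));
            }
        }
    }
    best
}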

Performance

Tested on a Japanese Wikipedia corpus (1B+ words):

Metric            Value
Training Speed    ~500K words/sec/core
CPU Efficiency    83% (6 cores)
Memory Usage      ~2GB for a 160K-word vocabulary
Vector Lookup     O(1) via memory mapping

Input Format

Corpus File (Word IDs)

42 123 456 789
111 222 333
...

Each line is a sentence. Each number is a word_id (from MeCrab's vocabulary).
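
Because the corpus is plain whitespace-separated text, producing or inspecting it needs only the standard library. A minimal reading sketch (the crate may provide its own loader; this helper is illustrative):

use std::fs::File;
use std::io::{BufRead, BufReader};

/// Read a word-id corpus: one sentence per line, space-separated word_ids.
fn read_corpus(path: &str) -> std::io::Result<Vec<Vec<u32>>> {
    let reader = BufReader::new(File::open(path)?);
    let mut sentences = Vec::new();
    for line in reader.lines() {
        let line = line?;
        // Skip malformed tokens rather than failing the whole file.
        let ids: Vec<u32> = line
            .split_whitespace()
            .filter_map(|tok| tok.parse().ok())
            .collect();
        if !ids.is_empty() {
            sentences.push(ids);
        }
    }
    Ok(sentences)
}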

Output Formats

MCV1 Binary (Recommended)

model.save_mcv1("vectors.bin", max_word_id)?;

Benefits:

  • Memory-mapped (instant loading, no RAM copy)
  • Direct indexing by word_id
  • Compatible with MeCrab's semantic features

Word2Vec Text Format

model.save_text("vectors.txt")?;

Format:

163922 100
word1 0.123 -0.456 0.789 ...
word2 -0.234 0.567 -0.890 ...
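
The header line holds the vocabulary size (163922 here) and the vector dimension (100); every following line is a word followed by its components. Reading the format back takes only the standard library; this is a sketch, not a crate API:

use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

/// Parse word2vec text format into a word -> vector map.
/// The header line is "<vocab_size> <dimension>"; only the dimension is used here.
fn load_text_vectors(path: &str) -> std::io::Result<HashMap<String, Vec<f32>>> {
    let mut lines = BufReader::new(File::open(path)?).lines();
    let header = lines.next().transpose()?.unwrap_or_default();
    let dim: usize = header
        .split_whitespace()
        .nth(1)
        .and_then(|d| d.parse().ok())
        .unwrap_or(0);

    let mut map = HashMap::new();
    for line in lines {
        let line = line?;
        let mut parts = line.split_whitespace();
        if let Some(word) = parts.next() {
            let vec: Vec<f32> = parts.filter_map(|x| x.parse().ok()).collect();
            if vec.len() == dim {
                map.insert(word.to_string(), vec);
            }
        }
    }
    Ok(map)
}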

Architecture

See IMPLEMENTATION.md for technical details on:

  • Hogwild! lock-free parallelization (see the sketch after this list)
  • Direct pointer arithmetic optimization
  • Safety guarantees and race condition analysis
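
The Hogwild! idea in brief: all worker threads read and write one shared weight table with no locks; because word2vec updates are sparse, collisions are rare and SGD tolerates the occasional lost write. The sketch below expresses that pattern with relaxed atomics over f32 bit patterns. The crate itself is described as using direct pointer arithmetic, so treat this as an illustration of the approach rather than the actual implementation:

use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

/// Hogwild!-style training loop skeleton: every thread applies its own shard
/// of updates to the same shared weight table without any locks. Concurrent
/// writes to the same cell may overwrite each other; SGD tolerates that
/// because updates are sparse and mostly touch disjoint rows.
fn hogwild_train(weights: &[AtomicU32], dim: usize, shards: Vec<Vec<(usize, f32)>>) {
    thread::scope(|s| {
        for shard in &shards {
            s.spawn(move || {
                for &(row, grad) in shard {
                    for d in 0..dim {
                        let cell = &weights[row * dim + d];
                        // Unsynchronized read-modify-write on the f32 bit pattern.
                        let old = f32::from_bits(cell.load(Ordering::Relaxed));
                        cell.store((old + grad).to_bits(), Ordering::Relaxed);
                    }
                }
            });
        }
    });
}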

Configuration

TrainingConfig {
    vector_size: 100,           // Embedding dimension
    window_size: 5,             // Context window
    negative_samples: 5,        // Negative samples per positive
    min_count: 10,              // Minimum word frequency
    sample: 1e-4,               // Subsampling threshold
    alpha: 0.025,               // Initial learning rate
    min_alpha: 0.0001,          // Final learning rate
    epochs: 3,                  // Training epochs
    threads: 6,                 // Parallel threads
}
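
alpha and min_alpha define the learning-rate schedule. The original word2vec tools decay the rate linearly with progress through the training words and clamp it at the minimum; assuming this crate follows the same convention (not verified here), the effective rate looks like this:

/// Linearly decayed learning rate, clamped at `min_alpha`.
/// `progress` is the fraction of all training words processed so far (0.0..=1.0).
fn current_alpha(alpha: f32, min_alpha: f32, progress: f32) -> f32 {
    (alpha - (alpha - min_alpha) * progress.clamp(0.0, 1.0)).max(min_alpha)
}

With the defaults above (alpha = 0.025, min_alpha = 0.0001), the rate halfway through training would be roughly 0.0126.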

License

See the project LICENSE file.

References

  • Mikolov et al. (2013) - "Distributed Representations of Words and Phrases and their Compositionality"
  • Recht et al. (2011) - "Hogwild!: A Lock-Free Approach to Parallelizing SGD"