| Crates.io | mecrab-word2vec |
| lib.rs | mecrab-word2vec |
| version | 0.1.0 |
| created_at | 2026-01-05 23:44:00.309783+00 |
| updated_at | 2026-01-05 23:44:00.309783+00 |
| description | High-performance Word2Vec implementation with Hogwild! parallelization for MeCrab |
| homepage | |
| repository | https://github.com/cool-japan/mecrab |
| max_upload_size | |
| id | 2024808 |
| size | 68,827 |
High-performance Word2Vec training library for Japanese text, optimized for multi-core CPUs.
```rust
use mecrab_word2vec::{Word2VecBuilder, TrainingConfig};

// Train a 100-dimensional model from a word_id corpus.
let model = Word2VecBuilder::new()
    .vector_size(100)
    .window_size(5)
    .negative_samples(5)
    .min_count(10)
    .epochs(3)
    .threads(6)
    .build_from_corpus("corpus_word_ids.txt")?;

// Save in the binary MCV1 format (second argument: maximum word_id).
model.save_mcv1("vectors.bin", 100000)?;
```
```rust
use mecrab::vectors::VectorStorage;

// Memory-map the saved vectors and look up embeddings by word_id.
let vectors = VectorStorage::load_mcv1("vectors.bin")?;
let vec1 = vectors.get(42)?;   // vector for word_id 42
let vec2 = vectors.get(123)?;  // vector for word_id 123
let similarity = vectors.cosine_similarity(42, 123)?;
```
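For reference, `cosine_similarity` boils down to the dot product of two embeddings divided by the product of their norms. A standalone sketch of that computation (illustrative only, not the crate's internal code):

```rust
// Cosine similarity of two dense vectors: dot(a, b) / (|a| * |b|).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}
```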
Tested on the Japanese Wikipedia corpus (1B+ words):
| Metric | Value |
|---|---|
| Training Speed | ~500K words/sec/core |
| CPU Efficiency | 83% (6 cores) |
| Memory Usage | ~2GB for 160K vocab |
| Vector Lookup | O(1) with memory-mapping |
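The O(1) lookup comes from treating the vector table as a flat array that can be memory-mapped. A minimal sketch of the idea, assuming the `memmap2` crate and a raw `count × dim × 4`-byte little-endian layout (not the actual MCV1 layout):

```rust
use std::fs::File;
use memmap2::Mmap;

/// Hypothetical helper: read one f32 vector out of a memory-mapped flat file.
/// Lookup is O(1): the offset is just word_id * dim * 4 bytes.
fn vector_at(mmap: &Mmap, dim: usize, word_id: usize) -> Vec<f32> {
    let stride = dim * std::mem::size_of::<f32>();
    let start = word_id * stride;
    mmap[start..start + stride]
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}

fn main() -> std::io::Result<()> {
    let file = File::open("vectors.raw")?; // hypothetical raw file, not MCV1
    let mmap = unsafe { Mmap::map(&file)? };
    let v = vector_at(&mmap, 100, 42);
    println!("word_id 42, first component: {}", v[0]);
    Ok(())
}
```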
The input corpus is a plain-text file with one sentence per line, where each number is a word_id from MeCrab's vocabulary:

```text
42 123 456 789
111 222 333
...
```
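If your tokenized sentences are already sequences of word_ids, producing this file is simple. A small sketch (the `write_corpus` helper is hypothetical, not part of the crate's API):

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

/// Write one sentence per line, word_ids separated by spaces.
fn write_corpus(path: &str, sentences: &[Vec<u32>]) -> std::io::Result<()> {
    let mut out = BufWriter::new(File::create(path)?);
    for sentence in sentences {
        let ids: Vec<String> = sentence.iter().map(|id| id.to_string()).collect();
        writeln!(out, "{}", ids.join(" "))?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Word_id values below are illustrative only.
    let sentences = vec![vec![42, 123, 456, 789], vec![111, 222, 333]];
    write_corpus("corpus_word_ids.txt", &sentences)
}
```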
Saving in the binary MCV1 format:

```rust
model.save_mcv1("vectors.bin", max_word_id)?;
```

Benefits: the binary format can be memory-mapped for O(1) vector lookup (see the table above).

Saving in plain text:

```rust
model.save_text("vectors.txt")?;
```
Format:

```text
163922 100
word1 0.123 -0.456 0.789 ...
word2 -0.234 0.567 -0.890 ...
```

The first line gives the vocabulary size and the vector dimension; each following line is a word and its components.
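Reading this text format back is straightforward. A sketch of a consumer-side parser (an assumption about how one might load the file, not an API of this crate):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    let mut lines = BufReader::new(File::open("vectors.txt")?).lines();

    // Header line: "<vocab_size> <vector_dim>".
    let header = lines.next().expect("empty file")?;
    let mut head = header.split_whitespace();
    let vocab: usize = head.next().unwrap().parse().unwrap();
    let dim: usize = head.next().unwrap().parse().unwrap();
    println!("{vocab} words, {dim} dimensions");

    // Remaining lines: a word followed by `dim` float components.
    for line in lines {
        let line = line?;
        let mut parts = line.split_whitespace();
        let word = parts.next().unwrap().to_string();
        let components: Vec<f32> = parts.map(|x| x.parse().unwrap()).collect();
        assert_eq!(components.len(), dim);
        let _ = (word, components); // store or use the pair here
    }
    Ok(())
}
```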
See IMPLEMENTATION.md for technical details, including the Hogwild! parallelization strategy.
```rust
TrainingConfig {
    vector_size: 100,     // Embedding dimension
    window_size: 5,       // Context window
    negative_samples: 5,  // Negative samples per positive example
    min_count: 10,        // Minimum word frequency
    sample: 1e-4,         // Subsampling threshold for frequent words
    alpha: 0.025,         // Initial learning rate
    min_alpha: 0.0001,    // Final learning rate
    epochs: 3,            // Training epochs
    threads: 6,           // Parallel worker threads
}
```
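The `alpha` / `min_alpha` pair suggests the usual word2vec-style linear learning-rate decay over the course of training. A sketch of that schedule (an assumption about the decay curve, not verified against this crate's source):

```rust
/// Linear decay from `alpha` down to `min_alpha` as training progresses.
fn learning_rate(alpha: f32, min_alpha: f32, words_seen: u64, total_words: u64) -> f32 {
    let progress = words_seen as f32 / total_words as f32;
    (alpha - (alpha - min_alpha) * progress).max(min_alpha)
}

fn main() {
    // Halfway through training with the defaults above:
    let lr = learning_rate(0.025, 0.0001, 500, 1000);
    println!("{lr}"); // roughly midway between alpha and min_alpha (~0.0126)
}
```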
See the project's LICENSE file.