| Crates.io | axonml-text |
| lib.rs | axonml-text |
| version | 0.2.4 |
| created_at | 2026-01-19 21:36:37.822848+00 |
| updated_at | 2026-01-25 22:31:16.925352+00 |
| description | Text processing utilities for the Axonml ML framework |
| homepage | |
| repository | |
| max_upload_size | |
| id | 2055402 |
| size | 69,539 |
axonml-text provides natural language processing utilities for the AxonML machine learning framework. It includes vocabulary management, multiple tokenization strategies, and dataset implementations for common NLP tasks like text classification, language modeling, and sequence-to-sequence learning.
| Module | Description |
|---|---|
| vocab | Vocabulary management with token-to-index mapping and special token support |
| tokenizer | Tokenizer trait and implementations (Whitespace, Char, WordPunct, NGram, BPE, Unigram) |
| datasets | Dataset implementations for text classification, language modeling, and seq2seq tasks |
Add the dependency to your Cargo.toml:
```toml
[dependencies]
axonml-text = "0.2.4"
```
Build a vocabulary and convert between tokens and indices:

```rust
use axonml_text::prelude::*;

// Build a vocabulary from text with a minimum frequency threshold
let text = "the quick brown fox jumps over the lazy dog";
let vocab = Vocab::from_text(text, 1);

// Or create one with special tokens and add tokens manually
let mut vocab = Vocab::with_special_tokens();
vocab.add_token("hello");
vocab.add_token("world");

// Encode tokens to indices and decode them back
let indices = vocab.encode(&["hello", "world"]);
let tokens = vocab.decode(&indices);
```
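A quick round-trip sanity check, using only the calls above (this assumes `decode` returns the token strings in their original order, which the example suggests but the crate docs would need to confirm):

```rust
use axonml_text::prelude::*;

let mut vocab = Vocab::with_special_tokens();
vocab.add_token("hello");
vocab.add_token("world");

// encode then decode should recover the original in-vocabulary tokens
let indices = vocab.encode(&["hello", "world"]);
assert_eq!(vocab.decode(&indices), vec!["hello", "world"]);
```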
Several tokenization strategies are provided:

```rust
use axonml_text::prelude::*;

// Whitespace tokenizer
let tokenizer = WhitespaceTokenizer::new();
let tokens = tokenizer.tokenize("Hello World"); // ["Hello", "World"]

// Character-level tokenizer
let char_tokenizer = CharTokenizer::new();
let chars = char_tokenizer.tokenize("Hi!"); // ["H", "i", "!"]

// Word-punctuation tokenizer
let wp_tokenizer = WordPunctTokenizer::lowercase();
let tokens = wp_tokenizer.tokenize("Hello, World!"); // ["hello", ",", "world", "!"]

// N-gram tokenizer
let bigrams = NGramTokenizer::word_ngrams(2);
let tokens = bigrams.tokenize("one two three"); // ["one two", "two three"]

// BPE tokenizer: train merges on a corpus, then tokenize
let mut bpe = BasicBPETokenizer::new();
bpe.train("low lower lowest newer newest", 10);
let tokens = bpe.tokenize("lowest");
```
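Tokenizers compose naturally with `Vocab`: tokenize first, then encode the resulting tokens. A minimal sketch built only from the calls shown above (the borrowing step assumes `tokenize` returns owned `String`s and `encode` accepts a slice of `&str`, as the earlier examples suggest):

```rust
use axonml_text::prelude::*;

let text = "the quick brown fox jumps over the lazy dog";
let tokenizer = WhitespaceTokenizer::new();
let vocab = Vocab::from_text(text, 1);

// Tokenize, then borrow the tokens as &str for encoding
let tokens = tokenizer.tokenize(text);
let refs: Vec<&str> = tokens.iter().map(|t| t.as_str()).collect();
let indices = vocab.encode(&refs);
```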
Build a text-classification dataset from labeled samples:

```rust
use axonml_text::prelude::*;

let samples = vec![
    ("good movie".to_string(), 1),
    ("bad movie".to_string(), 0),
    ("great film".to_string(), 1),
    ("terrible movie".to_string(), 0),
];

let tokenizer = WhitespaceTokenizer::new();
let dataset = TextDataset::from_samples(&samples, &tokenizer, 1, 10);

// Use with DataLoader
let loader = DataLoader::new(dataset, 16);
for batch in loader.iter() {
    // batch.data: [batch_size, max_length]
    // batch.target: [batch_size, num_classes]
}
```
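The `[batch_size, num_classes]` target shape suggests one-hot labels. As a plain-Rust illustration of that layout (a sketch of the idea, not the crate's internal code):

```rust
// One-hot encode integer class labels into a flat [batch_size, num_classes] buffer.
fn one_hot(labels: &[usize], num_classes: usize) -> Vec<f32> {
    let mut out = vec![0.0; labels.len() * num_classes];
    for (row, &label) in labels.iter().enumerate() {
        out[row * num_classes + label] = 1.0;
    }
    out
}

let targets = one_hot(&[1, 0, 1, 0], 2); // the labels from the samples above
assert_eq!(&targets[..4], &[0.0, 1.0, 1.0, 0.0]); // first two rows: class 1, class 0
```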
Create a language-modeling dataset of shifted input/target windows:

```rust
use axonml_text::prelude::*;

let text = "one two three four five six seven eight nine ten";
let dataset = LanguageModelDataset::from_text(text, 3, 1);

let (input, target) = dataset.get(0).unwrap();
// input:  [seq_length] - tokens at positions 0..seq_length
// target: [seq_length] - tokens at positions 1..seq_length+1
```
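To make the windowing concrete, here is a plain-Rust sketch of how shifted (input, target) pairs are typically built from a token stream. This is an illustration of the idea, not the crate's internal code, and it reads the two integer arguments above as `seq_length = 3` and `stride = 1` (an assumption based on the comments):

```rust
/// Build (input, target) windows where target is input shifted right by one token.
fn shifted_windows<'a>(tokens: &[&'a str], seq_len: usize, stride: usize)
    -> Vec<(Vec<&'a str>, Vec<&'a str>)>
{
    let mut pairs = Vec::new();
    let mut start = 0;
    // A pair needs seq_len + 1 consecutive tokens.
    while start + seq_len + 1 <= tokens.len() {
        pairs.push((
            tokens[start..start + seq_len].to_vec(),
            tokens[start + 1..start + seq_len + 1].to_vec(),
        ));
        start += stride;
    }
    pairs
}

let tokens: Vec<&str> = "one two three four five".split_whitespace().collect();
let pairs = shifted_windows(&tokens, 3, 1);
assert_eq!(pairs[0].0, ["one", "two", "three"]);
assert_eq!(pairs[0].1, ["two", "three", "four"]);
```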
Synthetic datasets are included for quick experiments and testing:

```rust
use axonml_text::prelude::*;

// Sentiment dataset for testing
let sentiment = SyntheticSentimentDataset::small(); // 100 samples
let sentiment = SyntheticSentimentDataset::train(); // 10,000 samples

// Seq2seq copy/reverse task
let seq2seq = SyntheticSeq2SeqDataset::copy_task(100, 5, 50);
```
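The synthetic datasets plug into the same `DataLoader` API used above; for example, as a quick smoke test of a pipeline (batch size chosen arbitrarily):

```rust
use axonml_text::prelude::*;

// 100-sample synthetic set, batched exactly like the real datasets
let dataset = SyntheticSentimentDataset::small();
let loader = DataLoader::new(dataset, 16);
for batch in loader.iter() {
    // same batch layout as in the text-classification example
}
```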
Run the test suite:

```bash
cargo test -p axonml-text
```
Licensed under either of:
at your option.