Crates.io | bpe-tokenizer |
lib.rs | bpe-tokenizer |
version | 0.1.4 |
source | src |
created_at | 2024-09-24 17:56:38.39648 |
updated_at | 2024-09-26 00:08:34.7389 |
description | A BPE Tokenizer library. |
homepage | https://github.com/swaits/bpe-tokenizer/ |
repository | https://github.com/swaits/bpe-tokenizer/ |
A Rust implementation of Byte Pair Encoding (BPE) tokenization. This crate tokenizes text into subword units using pre-trained vocabularies. BPE is widely used in natural language processing (NLP): it breaks words down into subword tokens using a vocabulary built from the most frequent token pairs in a training corpus.
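To make the pair-merging idea concrete, here is a minimal, self-contained sketch of BPE inference. It is independent of this crate's API; the `bpe_merge` helper and the toy merge table are illustrative assumptions, not part of `bpe-tokenizer`:

```rust
use std::collections::HashMap;

// Illustrative sketch, not this crate's API: given merge ranks
// (lower rank = more frequent pair during training), repeatedly
// merge the best-ranked adjacent pair until no merge applies.
fn bpe_merge(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    // Start from individual characters.
    let mut parts: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the best (lowest) merge rank.
        let best = parts
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| ranks.get(&(w[0].clone(), w[1].clone())).map(|&r| (r, i)))
            .min();
        match best {
            Some((_rank, i)) => {
                // Replace the two parts with their merged token.
                let merged = format!("{}{}", parts[i], parts[i + 1]);
                parts.splice(i..=i + 1, [merged]);
            }
            None => break,
        }
    }
    parts
}

fn main() {
    // Toy merge table: "l"+"o" merges first, then "lo"+"w".
    let ranks: HashMap<(String, String), usize> = HashMap::from([
        (("l".into(), "o".into()), 0),
        (("lo".into(), "w".into()), 1),
    ]);
    assert_eq!(bpe_merge("low", &ranks), vec!["low"]);
    println!("{:?}", bpe_merge("lowest", &ranks)); // ["low", "e", "s", "t"]
}
```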
It supports Unicode-aware text segmentation for sentence and word splitting (via the `unicode-segmentation` crate), making it suitable for processing a variety of languages and scripts.

To add this crate to your project, run:
```sh
cargo add bpe-tokenizer
```
Or manually include it in your `Cargo.toml`:
```toml
[dependencies]
bpe-tokenizer = "<version>"
```
Here is an example of how to create a `BytePairEncoder` from a string and use it to tokenize text:
```rust
use bpe_tokenizer::BytePairEncoder;

let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let tokenized = vocab.tokenize("Hello, world!");
println!("{:?}", tokenized);
```
The output will be a vector of tokens:

```text
["<s>", "▁hello", "▁world", "</s>"]
```
Or load a vocabulary from a file:

```rust
use bpe_tokenizer::BytePairEncoder;

let vocab = BytePairEncoder::new_from_file("path/to/file.vocab").unwrap();
```
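Judging by the `new_from_str` example above, the vocabulary file is presumably plain text with one tab-separated `token<TAB>count` entry per line, for example:

```text
hello	1
world	2
```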
The crate also ships optional pre-trained default vocabularies, enabled via Cargo features. They are built from Wikipedia data as part of the BPEmb project; these MIT-licensed vocabularies cover 275 languages and come in three sizes depending on usage needs:
- `default-small` (100,000 tokens): suitable for memory-constrained environments.
- `default-medium` (320,000 tokens): balances token coverage against memory use.
- `default-large` (1,000,000 tokens): the most detailed token representations, for high-granularity tasks.

To use one of these vocabularies, enable the corresponding feature in your `Cargo.toml`:
```toml
[dependencies]
bpe-tokenizer = { version = "<version>", features = ["default-medium"] }
```
An example of using the `default-medium` vocabulary (320,000 tokens):
```rust
#[cfg(feature = "default-medium")]
{
    use bpe_tokenizer::BytePairEncoder;

    let encoder = BytePairEncoder::new_default_medium().unwrap();
    let tokenized = encoder.tokenize("This is a test sentence.");
    println!("{:?}", tokenized);
    // Output: ["<s>", "▁this", "▁is", "▁a", "▁test", "▁sentence", "</s>"]
}
```
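Note that code behind a feature gate like this only compiles when the feature is enabled, e.g. `cargo run --features default-medium`.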
The crate provides various ways to interact with the tokenizer:
- **Tokenize into a flat `Vec<String>`** with `BytePairEncoder::tokenize`, which splits and flattens the text into tokens:

  ```rust
  let tokenized = vocab.tokenize("Example sentence.");
  // Output: ["<s>", "▁example", "▁sentence", "</s>"]
  ```
- **Tokenize into nested sentence vectors (`Vec<Vec<String>>`)** with `BytePairEncoder::tokenize_sentences`, useful for processing sentences separately:

  ```rust
  let tokenized = vocab.tokenize_sentences("This is sentence one. And this is sentence two.");
  // Output: [["<s>", "▁this", "▁is", "▁sentence", "▁one", "</s>"],
  //          ["<s>", "▁and", "▁this", "▁is", "▁sentence", "▁two", "</s>"]]
  ```
- **Iterative tokenization** with `BytePairEncoder::tokenize_iter` and `BytePairEncoder::tokenize_sentences_iter`, which yield tokens lazily for better memory efficiency on large-scale text:

  ```rust
  let tokens_iter: Vec<String> = vocab.tokenize_iter("Example sentence").collect();
  // Output: ["<s>", "▁example", "▁sentence", "</s>"]
  ```
This crate is licensed under the MIT License.
Contributions are welcome! Please open an issue, submit a pull request, or reach out if you'd like to contribute awesome new features or fixes to this crate.