# bpe-tokenizer

A Rust implementation of Byte Pair Encoding (BPE) tokenization. This crate provides functionality to tokenize text into subword units using pre-trained vocabularies.

BPE is widely used in natural language processing (NLP) tasks: it breaks words down into subword tokens using a vocabulary of the most frequent token pairs. The crate supports **Unicode-aware** text segmentation for sentence and word splitting, making it suitable for processing a wide variety of languages and scripts.

## Features

- **Bring your own BPE token vocabularies**, or use one of the provided default vocabularies (see below).
- **Pre-trained multilingual vocabularies** sourced from the [BPEmb](https://github.com/bheinzerling/bpemb) project, with support for tokenizing text in **275 languages**.
- **Unicode-aware sentence and word segmentation**, leveraging the [`unicode-segmentation`](https://docs.rs/unicode-segmentation) crate for proper text splitting.

## Installation

To add this crate to your project, run:

```bash
cargo add bpe-tokenizer
```

Or manually include it in your `Cargo.toml`:

```toml
[dependencies]
bpe-tokenizer = ""
```

## Full Example

Here is an example of how to create a `BytePairEncoder` from a string and use it to tokenize text:

```rust
use bpe_tokenizer::BytePairEncoder;

let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let tokenized = vocab.tokenize("Hello, world!");
println!("{:?}", tokenized);
```

The output will be a vector of tokens:

```text
["<s>", "▁hello", "▁world", "</s>"]
```

Or load a vocabulary from a file:

```rust
use bpe_tokenizer::BytePairEncoder;

let vocab = BytePairEncoder::new_from_file("path/to/file.vocab").unwrap();
```

These examples call `.unwrap()` for brevity; an error-handling sketch appears in the appendix at the end of this README.

## Cargo Features

The crate also includes several sizes of default pre-trained vocabularies, which are **optional** and can be enabled via Cargo features. They are sourced from Wikipedia data and pre-trained as part of the [BPEmb](https://github.com/bheinzerling/bpemb) project. These MIT-licensed vocabularies support 275 languages and come in different sizes depending on usage needs.

### Available Optional Features

- **`default-small` (100,000 tokens)**: Suitable for memory-constrained environments.
- **`default-medium` (320,000 tokens)**: Balances token coverage and memory efficiency.
- **`default-large` (1,000,000 tokens)**: Provides the most detailed token representations for high-granularity tasks.

### Enabling Optional Features

To use these default vocabularies, specify the feature in your `Cargo.toml`:

```toml
[dependencies]
bpe-tokenizer = { version = "", features = ["default-medium"] }
```

### Example with `default-medium` Vocabulary

An example of using the **medium** vocabulary (320,000 tokens), which requires the `default-medium` feature:

```rust
use bpe_tokenizer::BytePairEncoder;

let encoder = BytePairEncoder::new_default_medium().unwrap();
let tokenized = encoder.tokenize("This is a test sentence.");
println!("{:?}", tokenized);
// Output: ["<s>", "▁this", "▁is", "▁a", "▁test", "▁sentence", "</s>"]
```

## Tokenization Functions

The crate provides several ways to interact with the tokenizer:

- **Tokenize into a flat `Vec<String>`**:
  `BytePairEncoder::tokenize` splits the text and flattens it into a single vector of tokens.

  ```rust
  let tokenized = vocab.tokenize("Example sentence.");
  // Output: ["<s>", "▁example", "▁sentence", "</s>"]
  ```

- **Tokenize into nested sentence vectors (`Vec<Vec<String>>`)**:
  `BytePairEncoder::tokenize_sentences` is useful for processing multiple sentences separately.

  ```rust
  let tokenized = vocab.tokenize_sentences("This is sentence one. And this is sentence two.");
  // Output: [["<s>", "▁this", "▁is", "▁sentence", "▁one", "</s>"], ["<s>", "▁and", "▁this", "▁is", "▁sentence", "▁two", "</s>"]]
  ```

- **Iterative tokenization**:
  `BytePairEncoder::tokenize_iter` and `BytePairEncoder::tokenize_sentences_iter` provide iterators over the generated tokens, for better memory efficiency on large-scale text (a streaming sketch appears in the appendix at the end of this README).

  ```rust
  let tokens_iter: Vec<String> = vocab.tokenize_iter("Example sentence").collect();
  // Output: ["<s>", "▁example", "▁sentence", "</s>"]
  ```

## Licensing

This crate is licensed under the [MIT License](LICENSE).

## Contributing

Contributions are welcome! Please open an issue, submit a pull request, or reach out if you'd like to contribute awesome new features or fixes to this crate.
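
## Appendix: Streaming Tokenization Sketch

This section expands on the *Iterative tokenization* bullet above. It is a minimal sketch, not part of the crate's documented examples: the only API it assumes is what already appears in this README (`new_from_str` and `tokenize_iter`); the tiny vocabulary, the input string, and the token counting are purely illustrative.

```rust
use bpe_tokenizer::BytePairEncoder;

// Tiny vocabulary in the same `token<TAB>count` format as the Full Example above;
// a real vocabulary would be far larger.
let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();

// Stand-in for a large document; in practice this could be read from a file.
let large_input = "Hello, world! Hello again, world.";

// `tokenize_iter` yields tokens lazily, so the full token list never has to be
// collected into a vector up front.
let mut n_tokens = 0usize;
for token in vocab.tokenize_iter(large_input) {
    // Hand each token to downstream processing as it is produced.
    println!("{}", token);
    n_tokens += 1;
}
println!("{} tokens produced", n_tokens);
```

Consuming the iterator one token at a time avoids materializing the full token vector, which is where the memory savings mentioned above come from.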
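
## Appendix: Error Handling Sketch

The examples in this README call `.unwrap()` for brevity. The following is a minimal sketch of propagating errors instead; it assumes only what the examples above already show, namely that the constructors return a `Result` whose error type is `BytePairEncoderError` (and that the error type implements `Debug`, which `.unwrap()` already relies on). The `load_vocab` helper name is illustrative, not part of the crate's API.

```rust
use bpe_tokenizer::{BytePairEncoder, BytePairEncoderError};

// Illustrative helper: propagate failures with `?` instead of panicking.
fn load_vocab(path: &str) -> Result<BytePairEncoder, BytePairEncoderError> {
    let vocab = BytePairEncoder::new_from_file(path)?;
    Ok(vocab)
}

match load_vocab("path/to/file.vocab") {
    Ok(vocab) => println!("{:?}", vocab.tokenize("Hello, world!")),
    Err(err) => eprintln!("could not load vocabulary: {:?}", err),
}
```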