Crates.io | bpetok |
lib.rs | bpetok |
version | 0.1.2 |
source | src |
created_at | 2024-09-26 02:44:37.095635 |
updated_at | 2024-09-26 03:34:50.881585 |
description | A simple CLI for tokenizing text input using Byte Pair Encoding (BPE). |
id | 1386902 |
size | 29,036 |
bpetok is a simple command-line interface (CLI) application written in Rust for tokenizing text input using Byte Pair Encoding (BPE). The primary goal of the tool is to provide efficient and flexible tokenization for various applications that rely on text processing, natural language processing (NLP), or any pipeline where tokenized input is necessary.
Given an input text stream from stdin, bpetok produces tokenized sentences to stdout. It supports multiple built-in vocabulary sizes (small, medium, large), and also allows for the loading of custom vocabularies.
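To make the idea of BPE tokenization with a scored vocabulary concrete, here is a minimal sketch in Rust. It is not taken from bpetok's source and may differ from what the tool actually does. It assumes that a higher score means a higher merge priority, that word boundaries are marked with the ▁ character (U+2581) as in the example vocabulary lines, and that pieces missing from the vocabulary are simply left as single characters. The function name bpe_segment is hypothetical.

```rust
use std::collections::HashMap;

/// Illustrative only: segment `text` by repeatedly merging the adjacent pair
/// whose concatenation has the highest score in `vocab`.
/// Assumption: higher score = merged earlier. Not bpetok's actual code.
fn bpe_segment(text: &str, vocab: &HashMap<String, f64>) -> Vec<String> {
    // Mark word boundaries with U+2581, mirroring the example vocabulary lines.
    let marked = format!("\u{2581}{}", text.trim().replace(' ', "\u{2581}"));
    // Start from single characters.
    let mut pieces: Vec<String> = marked.chars().map(|c| c.to_string()).collect();

    loop {
        // Find the best-scoring adjacent pair that exists in the vocabulary.
        let mut best: Option<(usize, f64)> = None;
        for i in 0..pieces.len().saturating_sub(1) {
            let merged = format!("{}{}", pieces[i], pieces[i + 1]);
            if let Some(&score) = vocab.get(&merged) {
                if best.map_or(true, |(_, s)| score > s) {
                    best = Some((i, score));
                }
            }
        }
        match best {
            // Merge the winning pair in place and look again.
            Some((i, _)) => {
                let right = pieces.remove(i + 1);
                pieces[i].push_str(&right);
            }
            // No mergeable pair left: segmentation is done.
            None => break,
        }
    }
    pieces
}
```

With a vocabulary containing only an, ▁d, en, and er, the input "den an" is segmented into ▁d, en, ▁, an under these assumptions. A real tokenizer would also need a policy for unknown symbols (for example, mapping them to <unk>), which this sketch leaves out.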
You can install bpetok directly from crates.io using Cargo:
cargo install bpetok
Once installed, the bpetok binary is available to use globally on your system.
bpetok [OPTIONS]
Default Vocabulary Size (medium): By default, it uses the medium vocabulary (320k tokens).
-s, --small: Use the smaller vocabulary (100k tokens).
-l, --large: Use the larger vocabulary (1M tokens).
-v, --vocab FILE: Path to a custom BPE vocabulary file. When this flag is set, the built-in vocabularies are ignored.
A BPE vocabulary file is expected to follow this format:
<token>\t<score>\n
Each line should consist of a token and a score, separated by a tab character (\t).
Example lines from the file:
<unk> 0
<s> 0
</s> 0
00 -0
an -1
▁d -2
en -3
er -4
▁s -5
in -6
▁p -7
ar -8
▁a -9
▁00 -10
▁m -11
▁t -12
es -13
on -14
▁k -15
or -16
▁n -17
la -18
▁b -19
is -20
▁c -21
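If you are preparing a custom vocabulary, it can help to validate the file before passing it to --vocab. The following is a minimal Rust sketch of a reader for the <token>\t<score> format described above. It is not bpetok's own loader: the function names parse_vocab_line and read_vocab are hypothetical, and parsing the score as f64 is an assumption (the example lines only show integer-looking scores).

```rust
use std::io::BufRead;

/// Parse one `<token>\t<score>` line. Hypothetical helper, not bpetok's API.
fn parse_vocab_line(line: &str) -> Result<(String, f64), String> {
    // The token and score must be separated by a tab character.
    let (token, raw_score) = line
        .split_once('\t')
        .ok_or_else(|| format!("missing tab separator in {line:?}"))?;
    if token.is_empty() {
        return Err(format!("empty token in {line:?}"));
    }
    // Assumption: scores are numeric; f64 accepts both "-3" and "-3.5".
    let score = raw_score
        .trim_end()
        .parse::<f64>()
        .map_err(|e| format!("invalid score {raw_score:?}: {e}"))?;
    Ok((token.to_string(), score))
}

/// Read every line, stopping at the first invalid one and reporting
/// its 1-based line number.
fn read_vocab<R: BufRead>(reader: R) -> Result<Vec<(String, f64)>, String> {
    let mut entries = Vec::new();
    for (idx, line) in reader.lines().enumerate() {
        let line = line.map_err(|e| format!("line {}: {e}", idx + 1))?;
        let entry = parse_vocab_line(&line).map_err(|e| format!("line {}: {e}", idx + 1))?;
        entries.push(entry);
    }
    Ok(entries)
}
```

To check a file on disk, wrap it in a buffered reader, for example read_vocab(std::io::BufReader::new(std::fs::File::open(path)?)), and print any returned error message.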
echo "Hello world" | bpetok
echo "Hello world" | bpetok --small
echo "Hello world" | bpetok --large
echo "Hello world" | bpetok --vocab path/to/vocabulary.bpe
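The same invocations can be driven from another Rust program by piping text through the installed binary. The sketch below assumes bpetok is on your PATH and uses only the flags documented above; the Vocab enum and the bpetok_args and tokenize functions are hypothetical names introduced for illustration, not part of the crate.

```rust
use std::ffi::OsString;
use std::io::Write;
use std::path::PathBuf;
use std::process::{Command, Stdio};

/// Which vocabulary to request (hypothetical type for this example).
enum Vocab {
    Small,
    Medium,
    Large,
    Custom(PathBuf),
}

/// Map a vocabulary choice to the documented command-line flags.
fn bpetok_args(vocab: &Vocab) -> Vec<OsString> {
    match vocab {
        // Medium is the default, so no flag is needed.
        Vocab::Medium => vec![],
        Vocab::Small => vec!["--small".into()],
        Vocab::Large => vec!["--large".into()],
        Vocab::Custom(path) => vec!["--vocab".into(), path.clone().into_os_string()],
    }
}

/// Pipe `text` through the `bpetok` binary and return what it writes to stdout.
/// Suitable for small inputs: all input is written before any output is read.
fn tokenize(text: &str, vocab: &Vocab) -> std::io::Result<String> {
    let mut child = Command::new("bpetok")
        .args(bpetok_args(vocab))
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    // The temporary stdin handle is dropped at the end of this statement,
    // which closes the pipe so the child sees end-of-input.
    child
        .stdin
        .take()
        .expect("stdin was configured as piped")
        .write_all(text.as_bytes())?;
    let output = child.wait_with_output()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

For example, tokenize("Hello world\n", &Vocab::Small) corresponds to the second shell command above. Error messages from bpetok go to stderr, which this sketch leaves attached to the parent process.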
If an invalid vocabulary file is provided using the --vocab option, the program will gracefully fail and print an error message to stderr.
Similarly, any issues with initializing the default vocabularies (small, medium, or large) will result in an error with appropriate feedback printed to the terminal.
To contribute to or modify bpetok, you'll need a working installation of the Rust toolchain. Once set up, feel free to modify or extend the functionality of the tool and submit a PR.
The default pre-built small, medium, and large vocabularies are 275-language multilingual vocabularies that were originally trained on Wikipedia by the BPEmb project. We thank the contributors from BPEmb for making these vocabularies open and accessible to the community.
This project is licensed under the MIT License. See the LICENSE file for more details.