gpt_tokenizer

Crates.io: gpt_tokenizer
lib.rs: gpt_tokenizer
version: 0.1.0
source: src
created_at: 2023-03-17 16:49:54.05399
updated_at: 2023-03-17 16:49:54.05399
description: Rust BPE Encoder Decoder (Tokenizer) for GPT-2 / GPT-3
repository: https://github.com/cloudbridgeuy/a/tree/main/lib/tokenizer
id: 812847
size: 1,514,304
owner: Cloud Bridge UY (cloudbridgeuy)


README

GPT-Tokenizer

An implementation of the GPT-3 tokenizer, created by converting the GPT-3-Encoder JavaScript package to Rust (with the help of ChatGPT-4). You can use it to estimate how many tokens a prompt will consume. You can also create your own custom encoding and decoding functions by providing your own encoder.json and vocab.bpe files (a sketch of this follows the example below).

As a rule of thumb, OpenAI suggests that 100 tokens correspond to roughly 75 words.
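Under that rule of thumb, tokens ≈ words * 4 / 3, which is the same arithmetic the example further down prints as "Rule of Thumb". The helper below is a minimal, self-contained sketch of that estimate; the function name estimate_tokens is illustrative and not part of this crate.

// Illustrative helper (not part of the crate): applies the
// 100-tokens-per-75-words rule of thumb, i.e. tokens ~= words * 4 / 3.
fn estimate_tokens(text: &str) -> usize {
    text.split_whitespace().count() * 4 / 3
}

fn main() {
    let prompt = "The quick brown fox jumps over the lazy dog"; // 9 words
    // 9 words * 4 / 3 = 12 estimated tokens
    println!("Estimated tokens: {}", estimate_tokens(prompt));
}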

You can compare its output against the tokenizer published by OpenAI:

https://platform.openai.com/tokenizer

use tokenizer::DefaultTokenizer;

fn main() {
    let tokenizer = DefaultTokenizer::new();

    let text = r#"I'Many words map to one token, but some don't: indivisible.

Unicode characters like emojis may be split into many tokens containing the underlying bytes: 🤚🏾

Sequences of characters commonly found next to each other may be grouped together: 1234567890"#;

    let encoded = &tokenizer.encode(text);
    let decoded = &tokenizer.decode(encoded);

    println!("Original text: {}", text);
    println!("Encoded text: {:#?}", encoded);
    println!("Decoded text: {}", decoded

    println!("Text size: {}", text.len());
    println!("Words: {}", text.split(" ").count());
    println!("Rule of Thumb: {}", text.split(" ").count() * 4 / 3);
    println!("Tokens: {}", encoded.len());
}
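
As mentioned above, you can also build a tokenizer from your own encoder.json and vocab.bpe files. The sketch below only shows the general shape: it reads both files from disk and passes their contents to a constructor. The type name Tokenizer and the new(encoder_json, vocab_bpe) signature are assumptions for illustration, not the crate's confirmed API; check the crate documentation and the ./examples directory for the exact calls.

use std::fs;

// Assumed type and constructor, for illustration only; the actual API may differ.
use tokenizer::Tokenizer;

fn main() {
    // Load your own vocabulary files from disk.
    let encoder_json = fs::read_to_string("encoder.json").expect("failed to read encoder.json");
    let vocab_bpe = fs::read_to_string("vocab.bpe").expect("failed to read vocab.bpe");

    // Assumption: a constructor that accepts the raw contents of both files.
    let tokenizer = Tokenizer::new(&encoder_json, &vocab_bpe);

    let encoded = tokenizer.encode("Custom vocabularies are encoded the same way.");
    println!("Tokens: {}", encoded.len());
}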

See the ./examples directory for more examples of how to use it.

