gpt_tokenizer

Crates.io: gpt_tokenizer
lib.rs: gpt_tokenizer
version: 0.1.0
source: src
created_at: 2023-03-17 16:49:54.05399
updated_at: 2023-03-17 16:49:54.05399
description: Rust BPE Encoder Decoder (Tokenizer) for GPT-2 / GPT-3
repository: https://github.com/cloudbridgeuy/a/tree/main/lib/tokenizer
id: 812847
size: 1,514,304
owner: Cloud Bridge UY (cloudbridgeuy)


README

GPT-Tokenizer

An implementation of the GPT-3 tokenizer, created by converting the GPT-3-Encoder JavaScript package to Rust (with the help of ChatGPT-4). You can use it to estimate how many tokens a prompt will consume. You can also create your own custom encoding and decoding functions by providing your own encoder.json and vocab.bpe files (a sketch of this follows the example below).

As a rule of thumb, OpenAI suggests that 100 tokens correspond to roughly 75 words.
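Under that rule of thumb, tokens ≈ words * 4 / 3, which is the same arithmetic the example further down prints as "Rule of Thumb". The helper below is a minimal, self-contained sketch of that estimate; the function name estimate_tokens is illustrative and not part of this crate.

// Illustrative helper (not part of the crate): applies the
// 100-tokens-per-75-words rule of thumb, i.e. tokens ~= words * 4 / 3.
fn estimate_tokens(text: &str) -> usize {
    text.split_whitespace().count() * 4 / 3
}

fn main() {
    let prompt = "The quick brown fox jumps over the lazy dog"; // 9 words
    // 9 words * 4 / 3 = 12 estimated tokens
    println!("Estimated tokens: {}", estimate_tokens(prompt));
}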

You can compare its output against the tokenizer published by OpenAI:

https://platform.openai.com/tokenizer

use tokenizer::DefaultTokenizer;

fn main() {
    let tokenizer = DefaultTokenizer::new();

    let text = r#"I'Many words map to one token, but some don't: indivisible.

Unicode characters like emojis may be split into many tokens containing the underlying bytes: 🤚🏾

Sequences of characters commonly found next to each other may be grouped together: 1234567890"#;

    let encoded = &tokenizer.encode(text);
    let decoded = &tokenizer.decode(encoded);

    println!("Original text: {}", text);
    println!("Encoded text: {:#?}", encoded);
    println!("Decoded text: {}", decoded

    println!("Text size: {}", text.len());
    println!("Words: {}", text.split(" ").count());
    println!("Rule of Thumb: {}", text.split(" ").count() * 4 / 3);
    println!("Tokens: {}", encoded.len());
}
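
As mentioned above, you can also build a tokenizer from your own encoder.json and vocab.bpe files. The sketch below only shows the general shape: it reads both files from disk and passes their contents to a constructor. The type name Tokenizer and the new(encoder_json, vocab_bpe) signature are assumptions for illustration, not the crate's confirmed API; check the crate documentation and the ./examples directory for the exact calls.

use std::fs;

// Assumed type and constructor, for illustration only; the actual API may differ.
use tokenizer::Tokenizer;

fn main() {
    // Load your own vocabulary files from disk.
    let encoder_json = fs::read_to_string("encoder.json").expect("failed to read encoder.json");
    let vocab_bpe = fs::read_to_string("vocab.bpe").expect("failed to read vocab.bpe");

    // Assumption: a constructor that accepts the raw contents of both files.
    let tokenizer = Tokenizer::new(&encoder_json, &vocab_bpe);

    let encoded = tokenizer.encode("Custom vocabularies are encoded the same way.");
    println!("Tokens: {}", encoded.len());
}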

See the ./examples directory for more examples of how to use it.

