# GPT-Tokenizer

An implementation of the GPT-3 tokenizer created by converting the [`GPT-3-Encoder`](https://www.npmjs.com/package/gpt-3-encoder) JavaScript package to Rust (with the help of ChatGPT-4).

You can use it to estimate the number of tokens that your prompt would approximately consume. You can also create your own custom `encoding` and `decoding` functions by providing your own `encoder.json` and `vocab.bpe` files (see the sketch at the end of this README).

> As a rule of thumb, OpenAI suggests that 100 tokens equal roughly 75 words.

See how it compares against the tokenizer published by OpenAI: [https://platform.openai.com/tokenizer](https://platform.openai.com/tokenizer)

```rust
use tokenizer::DefaultTokenizer;

fn main() {
    let tokenizer = DefaultTokenizer::new();

    let text = r#"Many words map to one token, but some don't: indivisible.

Unicode characters like emojis may be split into many tokens containing the underlying bytes: 🤚🏾

Sequences of characters commonly found next to each other may be grouped together: 1234567890"#;

    let encoded = &tokenizer.encode(text);
    let decoded = &tokenizer.decode(encoded);

    println!("Original text: {}", text);
    println!("Encoded text: {:#?}", encoded);
    println!("Decoded text: {}", decoded);

    println!("Text size: {}", text.len());
    println!("Words: {}", text.split(" ").count());
    // Rule of thumb: ~100 tokens per 75 words, i.e. words * 4 / 3
    println!("Rule of Thumb: {}", text.split(" ").count() * 4 / 3);
    println!("Tokens: {}", encoded.len());
}
```

See the [./examples](./examples) directory for more examples of how to use it.
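For reference, the sketch below shows one way custom `encoder.json` and `vocab.bpe` files might be wired in. The `Tokenizer::new` type name and signature are assumptions for illustration only and are not taken from this crate's actual API; check the [./examples](./examples) directory for the real constructor.

```rust
use std::fs;

// Assumed type name; the crate may expose a different one.
use tokenizer::Tokenizer;

fn main() {
    // Load your own vocabulary files instead of the bundled GPT-3 ones.
    let encoder_json = fs::read_to_string("path/to/encoder.json")
        .expect("failed to read encoder.json");
    let vocab_bpe = fs::read_to_string("path/to/vocab.bpe")
        .expect("failed to read vocab.bpe");

    // Hypothetical constructor: the real crate may use another name or
    // accept file paths instead of file contents.
    let tokenizer = Tokenizer::new(&encoder_json, &vocab_bpe);

    let encoded = tokenizer.encode("Hello, world!");
    let decoded = tokenizer.decode(&encoded);

    println!("Tokens: {:?}", encoded);
    println!("Round-trip: {}", decoded);
}
```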