liendl_tokenizer

version: 0.1.0
created_at: 2025-04-21 21:49:42.890135+00
updated_at: 2025-04-21 21:49:42.890135+00
description: A simple BPE tokenizer for Rust
id: 1643275
size: 30,515
Jonas Liendl (jonasliendl)

README

BPE-Tokenizer

This is a simple BPE tokenizer I created for the elective module Foundational Generative Models at my university. I will document everything in this repo and on a website I plan to build, where you can also try out different versions of the tokenizer with different methods.

🛣️ Roadmap

| Status | Feature |
|--------|---------|
| | Text Loader supporting .TXT files |
| | Training process function |
| | Tokenize text function |
| | Export function for vocabulary |
| 🕣 | Website to show off tokenizer |
| | Support case sensitivity |
| | Support UTF-8 text |
| ☑️ | Handle unknown characters using the Out-of-Vocabulary method |
| 🕣 | Train tokenizer on English and German text |
| | Add token decoding |
| ☑️ | Support additional languages like Dutch, Spanish, Polish, French, etc. |
| ☑️ | Support additional file formats like CSV or Excel for Text Loader |
| | Convert to library |
| 🕣 | Include ability to add special tokens |
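
To give a rough idea of what the training and tokenize steps on the roadmap involve, here is a minimal character-level BPE sketch. It is not this crate's API: the function names `count_pairs`, `merge_pair`, `train` and `tokenize` are hypothetical, and the choices to break ties lexicographically and to stop once no pair occurs at least twice are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical helper: counts how often each adjacent token pair occurs.
fn count_pairs(tokens: &[String]) -> HashMap<(String, String), usize> {
    let mut counts = HashMap::new();
    for w in tokens.windows(2) {
        *counts.entry((w[0].clone(), w[1].clone())).or_insert(0) += 1;
    }
    counts
}

/// Hypothetical helper: replaces each left-to-right occurrence of `pair`
/// with the single concatenated token.
fn merge_pair(tokens: &[String], pair: &(String, String)) -> Vec<String> {
    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && tokens[i] == pair.0 && tokens[i + 1] == pair.1 {
            out.push(format!("{}{}", pair.0, pair.1));
            i += 2;
        } else {
            out.push(tokens[i].clone());
            i += 1;
        }
    }
    out
}

/// Learns up to `num_merges` merge rules, starting from single characters.
/// Stops early when no pair occurs at least twice (an assumption of this sketch).
fn train(text: &str, num_merges: usize) -> Vec<(String, String)> {
    let mut tokens: Vec<String> = text.chars().map(|c| c.to_string()).collect();
    let mut merges = Vec::new();
    for _ in 0..num_merges {
        // Most frequent pair wins; ties go to the lexicographically smallest pair
        // so that the result does not depend on HashMap iteration order.
        let best = count_pairs(&tokens)
            .into_iter()
            .filter(|(_, n)| *n >= 2)
            .max_by(|a, b| a.1.cmp(&b.1).then_with(|| b.0.cmp(&a.0)));
        match best {
            Some((pair, _)) => {
                tokens = merge_pair(&tokens, &pair);
                merges.push(pair);
            }
            None => break,
        }
    }
    merges
}

/// Tokenizes `text` by replaying the learned merges in training order.
/// Characters never seen in training simply remain single-character tokens.
fn tokenize(text: &str, merges: &[(String, String)]) -> Vec<String> {
    let mut tokens: Vec<String> = text.chars().map(|c| c.to_string()).collect();
    for pair in merges {
        tokens = merge_pair(&tokens, pair);
    }
    tokens
}
```

Calling `train("aaabdaaabac", 10)` in this sketch learns three merges, (`a`, `a`), (`a`, `b`) and (`aa`, `ab`), and `tokenize` then splits the same text into `aaab`, `d`, `aaab`, `a`, `c`. Because the text is split with `chars()`, UTF-8 input is handled per Unicode scalar value rather than per byte.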