| Crates.io | liendl_tokenizer |
| lib.rs | liendl_tokenizer |
| version | 0.1.0 |
| created_at | 2025-04-21 21:49:42.890135+00 |
| updated_at | 2025-04-21 21:49:42.890135+00 |
| description | A simple BPE tokenizer for Rust |
| homepage | |
| repository | |
| max_upload_size | |
| id | 1643275 |
| size | 30,515 |
This is a simple BPE tokenizer I created for the elective module Foundational Generative Models at my university. I will document everything in this repo and on a website I'm going to build, where you can also try out different versions of the tokenizer with different methods.
| Status | Feature |
|---|---|
| ✅ | Text Loader supporting .TXT files |
| ✅ | Training process function |
| ✅ | Tokenize text function |
| ✅ | Export function for vocabulary |
| 🕣 | Website to show off tokenizer |
| ✅ | Support Case-Sensitivity |
| ✅ | Support UTF-8 text |
| ☑️ | Handle unknown characters using the Out-of-Vocabulary method |
| 🕣 | Train Tokenizer on English and German text |
| ✅ | Add token decoding |
| ☑️ | Support additional languages like Dutch, Spanish, Polish, French, etc. |
| ☑️ | Support additional file formats like CSV or Excel for Text Loader |
| ✅ | Convert to library |
| 🕣 | Include ability to add special tokens |
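
To make the training, tokenizing and decoding features above more concrete, here is a minimal sketch of byte-level BPE in plain Rust. It is not this crate's actual API: the names `Merge`, `merge_pair`, `train`, `encode` and `decode` are made up for illustration, and starting from raw UTF-8 bytes (ids 0 to 255) is an assumption that may differ from how the crate builds its vocabulary.

```rust
use std::collections::HashMap;

/// One learned merge: the pair of token ids that was fused, and the new id.
/// (Hypothetical type, not part of the crate.)
type Merge = ((u32, u32), u32);

/// Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`.
fn merge_pair(ids: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(ids.len());
    let mut i = 0;
    while i < ids.len() {
        if i + 1 < ids.len() && (ids[i], ids[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(ids[i]);
            i += 1;
        }
    }
    out
}

/// Learn up to `num_merges` merges from `text`.
/// Assumption: the base vocabulary is the 256 possible byte values.
fn train(text: &str, num_merges: usize) -> Vec<Merge> {
    let mut ids: Vec<u32> = text.bytes().map(u32::from).collect();
    let mut merges: Vec<Merge> = Vec::new();
    for _ in 0..num_merges {
        // Count how often each adjacent pair of ids occurs.
        let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
        for w in ids.windows(2) {
            *counts.entry((w[0], w[1])).or_insert(0) += 1;
        }
        // Pick the most frequent pair; ties go to the smallest pair so the
        // result does not depend on HashMap iteration order.
        let best = counts
            .into_iter()
            .max_by(|a, b| a.1.cmp(&b.1).then(b.0.cmp(&a.0)));
        let Some((pair, count)) = best else { break };
        if count < 2 {
            break; // nothing repeats any more, so merging gains nothing
        }
        let new_id = 256 + merges.len() as u32;
        ids = merge_pair(&ids, pair, new_id);
        merges.push((pair, new_id));
    }
    merges
}

/// Tokenize `text` by replaying the learned merges in training order.
fn encode(text: &str, merges: &[Merge]) -> Vec<u32> {
    let mut ids: Vec<u32> = text.bytes().map(u32::from).collect();
    for &(pair, new_id) in merges {
        ids = merge_pair(&ids, pair, new_id);
    }
    ids
}

/// Turn token ids back into text by expanding each id to its byte sequence.
fn decode(ids: &[u32], merges: &[Merge]) -> String {
    // vocab[id] holds the bytes for that id; merged ids are appended in order,
    // so index 256 + k corresponds to the k-th merge.
    let mut vocab: Vec<Vec<u8>> = (0..=255u8).map(|b| vec![b]).collect();
    for &((a, b), _) in merges {
        let mut bytes = vocab[a as usize].clone();
        bytes.extend_from_slice(&vocab[b as usize]);
        vocab.push(bytes);
    }
    let bytes: Vec<u8> = ids
        .iter()
        .flat_map(|&id| vocab[id as usize].iter().copied())
        .collect();
    String::from_utf8_lossy(&bytes).into_owned()
}
```

With this sketch, `train("aaabdaaabac", 3)` learns three merges and `encode` on the same string produces five tokens, which `decode` turns back into the original text. Working on bytes means any UTF-8 input can be encoded without an unknown-token fallback. A character-level vocabulary with an out-of-vocabulary token, as the feature list suggests, would be a different design.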