| Crates.io | skimtoken |
| lib.rs | skimtoken |
| version | 0.2.2 |
| created_at | 2025-07-04 12:40:40.57032+00 |
| updated_at | 2025-07-08 09:09:12.247775+00 |
| description | Fast token count estimation library |
| homepage | https://github.com/masaishi/skimtoken |
| repository | https://github.com/masaishi/skimtoken |
| max_upload_size | |
| id | 1737980 |
| size | 691,879 |
⚠️ WARNING: This is an early beta version. The current implementation is not production-ready.
A lightweight, fast token count estimation library written in Rust with Python bindings.
The Problem: tiktoken is great for precise tokenization, but it requires ~59.6 MB of memory just to count tokens, which is problematic for memory-constrained environments.
The Solution: skimtoken estimates token counts from statistical patterns instead of loading an entire vocabulary, achieving a small fraction of tiktoken's memory footprint and startup time (see the benchmarks below).
```bash
pip install skimtoken
```
Requirements: Python 3.9+
Simple method (just character length × a coefficient):
```python
from skimtoken import estimate_tokens

# Basic usage
text = "Hello, world! How are you today?"
token_count = estimate_tokens(text)
print(f"Estimated tokens: {token_count}")
```
Multilingual simple method:
```python
from skimtoken.multilingual_simple import estimate_tokens

multilingual_text = """
For non-space separated languages, the number of tokens is difficult to predict.
スペースで区切られていない言語の場合トークン数を予測するのは難しいです。
स्पेसद्वारावियोजितनहींभाषाओंकेलिएटोकनकीसंख्याकाअनुमानलगानाकठिनहै।
بالنسبةللغاتالتيلاتفصلبمسافاتفإنالتنبؤبعددالرموزصعب
"""
token_count = estimate_tokens(multilingual_text)
print(f"Estimated tokens (multilingual): {token_count}")
```
When approximate counts are good enough, skimtoken works well:
| Use Case | Why It Works | Example |
|---|---|---|
| Rate Limiting | Overestimating is safe | Prevent API quota exceeded |
| Cost Estimation | Users prefer conservative estimates | "$0.13" (actual: $0.10) |
| Progress Bars | Approximate progress is fine | Processing documents |
| Serverless/Edge | Memory constraints (128MB limits) | Cloudflare Workers |
| Quick Filtering | Remove obviously too-long content | Pre-screening |
| Model Switching | Switch to smart model when context long | Auto-escalation |
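For instance, the rate-limiting case can be a cheap pre-check before an API call. A minimal sketch (the `MAX_TOKENS` value and the 15% headroom are illustrative choices, not part of skimtoken):
```python
from skimtoken import estimate_tokens

MAX_TOKENS = 4096  # hypothetical provider limit

def within_budget(prompt: str) -> bool:
    # Estimates carry roughly 15% error, so compare against a
    # conservative budget rather than the hard limit.
    return estimate_tokens(prompt) <= MAX_TOKENS * 0.85

print(within_budget("Hello, world!"))  # True for short prompts
```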
When exact counts matter, use tiktoken instead:
| Use Case | Why It Fails | Use Instead |
|---|---|---|
| Context Limits | Underestimating causes failures | tiktoken |
| Exact Billing | 15% error = unhappy customers | tiktoken |
| Token Splitting | Chunks might exceed limits | tiktoken |
| Embeddings | Need exact token boundaries | tiktoken |
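For those cases, counting exactly with tiktoken is straightforward (the encoding name shown is just an example):
```python
import tiktoken

# Exact token count via tiktoken's vocabulary-based tokenizer.
enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode("Hello, world!")))  # exact count: 4
```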
Multilingual simple method results:
- Total samples: 100,726
- Total characters: 13,062,391
- Mean RMSE: 21.3034 tokens
- Mean error rate: 15.11%
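These metrics read as the standard definitions; a hedged sketch of how they are typically computed (skimtoken_benchmark may define them slightly differently):
```python
import math

def rmse(estimates, actuals):
    # Root-mean-square error, in tokens.
    return math.sqrt(sum((e - a) ** 2 for e, a in zip(estimates, actuals)) / len(actuals))

def mean_error_rate(estimates, actuals):
    # Mean relative error, as a fraction of the actual token count.
    return sum(abs(e - a) / a for e, a in zip(estimates, actuals)) / len(actuals)
```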
| Metric | tiktoken | skimtoken | Ratio |
|---|---|---|---|
| Init Time | 1.005490 s | 0.002389 s | 0.002x |
| Init Memory | 42.2310 MB | 0.0265 MB | 0.001x |
| Exec Time | 6.689203 s | 6.911931 s | 1.033x |
| Exec Memory | 17.3251 MB | 0.8950 MB | 0.052x |
| Total Time | 7.694694 s | 6.914320 s | 0.899x |
| Total Memory | 59.5561 MB | 0.9215 MB | 0.015x |
For up-to-date performance comparisons and detailed accuracy metrics across all methods, see the automated benchmark suite in the skimtoken_benchmark repository.
| Method | Import | Memory | Error | Best For |
|---|---|---|---|---|
| Simple | `from skimtoken.simple import estimate_tokens` | 1.0MB | ~21.63% | English text, minimum memory |
| Basic | `from skimtoken.basic import estimate_tokens` | 0.9MB | ~27.05% | General use |
| Multilingual | `from skimtoken.multilingual import estimate_tokens` | 0.9MB | ~15.93% | Non-English, mixed languages |
| Multilingual Simple | `from skimtoken.multilingual_simple import estimate_tokens` | 0.9MB | ~15.11% | Fast multilingual estimation |
```python
# Example: choose a method based on your needs
if memory_critical:
    from skimtoken.simple import estimate_tokens
elif mixed_languages:
    from skimtoken.multilingual import estimate_tokens
else:
    from skimtoken import estimate_tokens  # Default: simple
```
```bash
# From the command line
echo "Hello, world!" | skimtoken
# Output: 5

# From a file
skimtoken -f document.txt
# Output: 236

# Multiple files
cat *.md | skimtoken
# Output: 4846
```
Unlike tiktoken's vocabulary-based approach, skimtoken uses statistical patterns:
```text
tiktoken:
Text → Tokenizer → ["Hello", ",", " world"] → Vocabulary Lookup → [1234, 11, 4567] → Count: 3
                                                     ↑
                                         Requires ~60MB dictionary

skimtoken:
Text → Feature Extraction → {chars: 13, words: 2, lang: "en"} → Statistical Model → ~3 tokens
                                                                       ↑
                                                           Only 0.92MB of parameters
```
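In other words, the estimator is a small learned function over cheap text features. A minimal sketch of the idea (not skimtoken's actual model; the features and coefficients below are invented for illustration):
```python
# Illustrative sketch only: a linear model over cheap text features.
# The real skimtoken methods use learned parameters shipped in params/.
def estimate_tokens_sketch(text: str) -> int:
    n_chars = len(text)
    n_words = len(text.split())
    # Invented coefficients; skimtoken fits its own from labeled data.
    return round(0.25 * n_chars + 0.30 * n_words)

print(estimate_tokens_sketch("Hello, world! How are you today?"))  # ~10
```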
Improve accuracy on domain-specific content:
```bash
# 1. Prepare labeled data
#    Format: {"text": "your content", "actual_tokens": 123}
uv run scripts/prepare_dataset.py --input your_texts.txt

# 2. Optimize parameters
uv run scripts/optimize_all.py --dataset your_data.jsonl

# 3. Rebuild with custom parameters
uv run maturin build --release
```
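If you need to produce the labeled JSONL yourself, one straightforward approach is to label texts with tiktoken as ground truth. A sketch (file names and the encoding are illustrative choices; prepare_dataset.py may handle this for you):
```python
# Sketch: build the labeled JSONL format above, one record per line,
# using tiktoken as ground truth.
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
with open("your_texts.txt", encoding="utf-8") as src, \
        open("your_data.jsonl", "w", encoding="utf-8") as out:
    for line in src:
        text = line.rstrip("\n")
        if not text:
            continue
        record = {"text": text, "actual_tokens": len(enc.encode(text))}
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
```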
```text
skimtoken/
├── src/
│   ├── lib.rs                     # Core Rust library with PyO3 bindings
│   └── methods/
│       ├── method_simple.rs       # Character-based estimation
│       ├── method_basic.rs        # Multi-feature regression
│       └── method_multilingual.rs # Language-aware estimation
├── skimtoken/                     # Python package
│   ├── __init__.py                # Main API
│   └── {method}.py                # Method-specific imports
├── params/                        # Learned parameters (TOML)
└── scripts/
    ├── benchmark.py               # Performance testing
    └── optimize/                  # Parameter training
```
```bash
# Setup
git clone https://github.com/masaishi/skimtoken
cd skimtoken
uv sync

# Development build
uv run maturin dev --features python

# Run tests
cargo test
uv run pytest

# Benchmark
uv run scripts/benchmark.py
```
Q: Can I improve accuracy?
A: Yes. You can tune the parameters on your own data to improve accuracy; see Advanced Usage for details.
Q: Is the API stable?
A: Not yet. skimtoken is in beta, so breaking changes are possible.
We are actively working to improve skimtoken's accuracy and performance.
MIT License - see LICENSE for details.