| Crates.io | llm-token-saver-rs |
| lib.rs | llm-token-saver-rs |
| version | 0.1.0 |
| created_at | 2026-01-19 01:07:19.361617+00 |
| updated_at | 2026-01-19 01:07:19.361617+00 |
| description | Rust crate for reducing LLM token consumption in production applications. |
| homepage | |
| repository | https://github.com/snailer-team/llm-token-saver-rs |
| max_upload_size | |
| id | 2053429 |
| size | 166,293 |
Rust crate for reducing LLM token consumption in production applications.
This crate is extracted from snailer.ai (an AI coding agent with 5.4K+ downloads). In production we reduced average token usage by ~30–40% while maintaining response quality.
cargo add llm-token-saver-rs
This works once the crate is published to crates.io. Until then, install directly from the Git repository:
cargo add llm-token-saver-rs --git https://github.com/snailer-team/llm-token-saver-rs
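Equivalently, the Git dependency can be declared by hand in Cargo.toml (a sketch using standard Cargo syntax):
[dependencies]
llm-token-saver-rs = { git = "https://github.com/snailer-team/llm-token-saver-rs" }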
To use a local checkout instead, add this to your app’s Cargo.toml:
[dependencies]
llm-token-saver-rs = { path = "../llm-token-saver-rs" }
In code, import it as llm_token_saver_rs:
use llm_token_saver_rs::UnifiedContextManager;
Quick start (Tier 1 extractive compression):
use llm_token_saver_rs::UnifiedContextManager;
use serde_json::json;
# async fn demo() -> anyhow::Result<()> {
let mut mgr = UnifiedContextManager::new("claude-3-5-sonnet");
let messages = vec![
    json!({"role":"system","content":"You are a helpful assistant."}),
    json!({"role":"user","content":"Please refactor this module."}),
    json!({"role":"assistant","content":"Sure—what constraints?"}),
];
let tokens_before = mgr.estimate_tokens(&messages);
let compressed = mgr.compress_tier1_extractive(messages, None).await?;
let tokens_after = mgr.estimate_tokens(&compressed);
println!("{} -> {} tokens", tokens_before, tokens_after);
# Ok(()) }
Run the test suite (includes budget invariants and a smoke test that Tier 1 reduces token estimates):
cargo test
Run a deterministic efficiency demo that prints before/after token estimates and time:
cargo run --release --example efficiency
Example output:
llm-token-saver-rs efficiency demo (Tier 1)
before_tokens: 116128
after_tokens: 11589
saved_tokens: 104539 (90.0%)
elapsed_ms: 7
budget: 8000 (bounded_tokens=7775)
Notes:
estimate_tokens() is a heuristic (≈ 1 token per 4 chars); use provider usage numbers for billing-accurate metrics.
Tier 2–5 can call an LLM to summarize parts of the context. You provide the client by implementing LlmClient.
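As a rough illustration of the chars/4 heuristic mentioned above (not the crate's internal implementation, which may differ in details):

fn rough_token_estimate(text: &str) -> usize {
    // ≈ 1 token per 4 characters, rounded up; a coarse proxy, not a real tokenizer.
    (text.chars().count() + 3) / 4
}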
You'll typically want these dependencies in your app:
cargo add anyhow async-trait serde_json
use llm_token_saver_rs::{LlmClient, UnifiedContextManager};
use async_trait::async_trait;
use serde_json::json;
struct MyClient;
#[async_trait]
impl LlmClient for MyClient {
    async fn send_message_simple(&self, messages: Vec<serde_json::Value>) -> anyhow::Result<String> {
        // Call your provider here (OpenAI/Anthropic/etc) and return the text response.
        // This crate intentionally does not bundle a provider SDK.
        let _ = messages;
        Ok("summary".to_string())
    }
}
# async fn demo() -> anyhow::Result<()> {
let mut mgr = UnifiedContextManager::new("claude-3-5-sonnet");
let client = MyClient;
let messages = vec![
    json!({"role":"user","content":"Long history ..."}),
    json!({"role":"assistant","content":"More ..."}),
];
let compressed = mgr
    .compress_tier2_query_aware(messages, "What did we decide about API design?", &client)
    .await?;
println!("tokens: {}", mgr.estimate_tokens(&compressed));
# Ok(()) }
Note: examples are async; run them inside an async runtime like tokio.
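For example, a minimal binary that runs the Tier 1 flow above under tokio (this assumes you have added tokio with the macros and rt-multi-thread features to your app):

use llm_token_saver_rs::UnifiedContextManager;
use serde_json::json;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let mut mgr = UnifiedContextManager::new("claude-3-5-sonnet");
    let messages = vec![json!({"role":"user","content":"Please refactor this module."})];
    // Tier 1 is extractive, so no LlmClient is needed here.
    let compressed = mgr.compress_tier1_extractive(messages, None).await?;
    println!("{} tokens", mgr.estimate_tokens(&compressed));
    Ok(())
}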
If you must guarantee a max budget, use enforce_budget() (deterministic) or compress_hard_cap() (may summarize via LlmClient depending on tier choices).
use llm_token_saver_rs::UnifiedContextManager;
use serde_json::json;
let mgr = UnifiedContextManager::new("claude-3-5-sonnet");
let budget = 8_000;
let messages = vec![json!({"role":"user","content":"..."})];
let bounded = mgr.enforce_budget(messages, budget);
assert!(mgr.estimate_tokens(&bounded) <= budget);
Use SelectiveContextFilter for long text blobs where you want to keep only the most informative parts.
use llm_token_saver_rs::{BudgetPolicy, SelectiveContextFilter};
# async fn demo() -> anyhow::Result<()> {
let filter = SelectiveContextFilter::new("cheap-model");
let policy = BudgetPolicy::default();
let content = "very long file content ...";
let filtered = filter
    .filter_file_content(content, Some("bug in parser"), &policy)
    .await?;
println!("{}", filtered);
# Ok(()) }
This crate currently operates on serde_json::Value messages to stay compatible with common provider payload formats.
Minimum supported shape:
{"role": "...", "content": "..."} or {"role":"...", "content":[{"type":"text","text":"..."}]}If you use tool calls, you may also want:
If you use tool calls, you may also want:
UnifiedContextManager::repair_claude_tool_integrity()
UnifiedContextManager::repair_tool_call_integrity()
UnifiedContextManager::remove_empty_openai_messages()
UnifiedContextManager::new() can read optional environment overrides:
LLM_TOKEN_SAVER_MAX_CONTEXT_TOKENS
LLM_TOKEN_SAVER_CONTEXT_PRIMACY_SIZE (default 3)
LLM_TOKEN_SAVER_CONTEXT_RECENCY_SIZE (default 5)
LLM_TOKEN_SAVER_TIER1_THRESHOLD (default 0.85)
LLM_TOKEN_SAVER_TIER2_THRESHOLD (default 0.95)
LLM_TOKEN_SAVER_CHUNK_SIZE (default 128)
LLM_TOKEN_SAVER_FINAL_BUDGET_MARGIN (default is model-dependent)
LLM_TOKEN_SAVER_DEBUG (set to any value to print debug logs)
Contributing: see CONTRIBUTING.md.
Security: see SECURITY.md.
License: Apache-2.0. See LICENSE.