| Crates.io | ollama-proxy-rs |
| lib.rs | ollama-proxy-rs |
| version | 0.1.1 |
| created_at | 2025-11-26 17:45:57.00064+00 |
| updated_at | 2025-11-26 17:45:57.00064+00 |
| description | A lightweight Rust proxy for Ollama that intelligently adjusts request parameters to match each model's training configuration |
| homepage | https://github.com/arosboro/ollama-proxy |
| repository | https://github.com/arosboro/ollama-proxy |
| max_upload_size | |
| id | 1951922 |
| size | 176,767 |
A lightweight Rust proxy for Ollama that intelligently adjusts request parameters to match each model's training configuration.
Some AI clients (like Elephas) send the same context length parameter for all models. This causes issues when:
- The requested context exceeds a model's training context (n_ctx_train)
- Very large contexts make Ollama stall, time out, or crash
This proxy sits between your client and Ollama, automatically:
- Reading each model's metadata (n_ctx_train)
- Capping num_ctx if it exceeds the model's capabilities
- Injecting num_predict to limit output

Build the proxy:

cargo build --release
# Default: Listen on 127.0.0.1:11435, proxy to 127.0.0.1:11434
cargo run --release
# Or with custom settings:
OLLAMA_HOST=http://127.0.0.1:11434 PROXY_PORT=11435 RUST_LOG=info cargo run --release
Point your AI client (Elephas, etc.) to the proxy instead of Ollama directly:
Before: http://127.0.0.1:11434
After: http://127.0.0.1:11435
The proxy will log all requests and modifications:
📨 Incoming request: POST /v1/embeddings
📋 Request body: {
"model": "nomic-embed-text",
"input": "test",
"options": {
"num_ctx": 131072
}
}
🔍 Detected model: nomic-embed-text
📊 Model metadata - n_ctx_train: 8192
⚠️ num_ctx (131072) exceeds model training context (8192)
✏️ Modified options.num_ctx: 131072 → 8192
🔧 ContextLimitModifier applied modifications
📬 Response status: 200 OK
Environment variables:
- OLLAMA_HOST - Target Ollama server (default: http://127.0.0.1:11434)
- PROXY_PORT - Port to listen on (default: 11435)
- RUST_LOG - Log level: error, warn, info, debug, trace (default: info)

Prevent Ollama stalls with large contexts:
- MAX_CONTEXT_OVERRIDE - Hard cap for context size regardless of model support (default: 16384)
- REQUEST_TIMEOUT_SECONDS - Timeout for requests to Ollama (default: 120)

Why This Matters:
Models may claim to support very large contexts (e.g., 131K tokens), but Ollama can stall or hang when actually processing them, especially with flash attention enabled. The MAX_CONTEXT_OVERRIDE provides a safety limit.
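In effect, the proxy ends up sending the smallest of the requested num_ctx, the model's n_ctx_train, and MAX_CONTEXT_OVERRIDE. A minimal standalone sketch of that arithmetic (illustrative only; the function name is not the crate's):

fn effective_num_ctx(requested: u64, n_ctx_train: u64, max_override: u64) -> u64 {
    // Never exceed what the model was trained on, nor the safety cap.
    requested.min(n_ctx_train).min(max_override)
}

fn main() {
    // Mirrors the log example above: 131072 requested, model trained on 8192,
    // safety cap of 16384 -> the proxy sends 8192.
    let capped = effective_num_ctx(131_072, 8_192, 16_384);
    assert_eq!(capped, 8_192);
    println!("num_ctx sent to Ollama: {capped}");
}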
Recommended Settings:
# Conservative (most reliable)
MAX_CONTEXT_OVERRIDE=16384 REQUEST_TIMEOUT_SECONDS=120 cargo run --release
# Moderate (test with your hardware)
MAX_CONTEXT_OVERRIDE=32768 REQUEST_TIMEOUT_SECONDS=180 cargo run --release
# Aggressive (may cause stalls on some systems)
MAX_CONTEXT_OVERRIDE=65536 REQUEST_TIMEOUT_SECONDS=300 cargo run --release
Note: If requests time out, reduce MAX_CONTEXT_OVERRIDE before increasing REQUEST_TIMEOUT_SECONDS.
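For reference, a minimal sketch of reading these settings with the documented defaults, assuming plain std::env lookups (not necessarily how the crate parses its configuration):

use std::env;

// Read an environment variable, falling back to the documented default
// when it is unset or fails to parse.
fn env_or<T: std::str::FromStr>(key: &str, default: T) -> T {
    env::var(key).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
}

fn main() {
    let ollama_host = env::var("OLLAMA_HOST")
        .unwrap_or_else(|_| "http://127.0.0.1:11434".to_string());
    let proxy_port: u16 = env_or("PROXY_PORT", 11435);
    let max_context_override: u64 = env_or("MAX_CONTEXT_OVERRIDE", 16_384);
    let request_timeout_seconds: u64 = env_or("REQUEST_TIMEOUT_SECONDS", 120);
    println!("{ollama_host} {proxy_port} {max_context_override} {request_timeout_seconds}");
}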
THE CRITICAL FIX FOR TIMEOUTS:
The proxy automatically injects num_predict into all chat requests to prevent infinite generation loops.
The Problem:
- Ollama's default num_predict is -1 (infinite generation)

How the Proxy Fixes This:
- Detects chat requests (those with a messages array)
- Skips requests where num_predict is already set
- Otherwise injects num_predict, using max_tokens from the request if available (e.g., 4096 from Elephas)

Example:
// Your request:
{
"model": "gpt-oss:20b",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 2048
}
// Proxy automatically adds:
{
"model": "gpt-oss:20b",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 2048,
"options": {
"num_predict": 2048 // ← Added by proxy
}
}
Why This Matters:
Without num_predict, a simple "say hello" request can generate for 3+ minutes, filling the entire context buffer with elaborations, examples, and repetitions until it crashes or times out.
Override if Needed:
If you want different generation limits, set num_predict explicitly in your request - the proxy preserves existing values.
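A rough sketch of the injection rule described above, assuming serde_json for body manipulation; the function and structure are illustrative, not the crate's actual code:

use serde_json::{json, Value};

// Illustrative sketch: for chat requests (a "messages" array), copy
// max_tokens into options.num_predict unless the caller already set it.
fn inject_num_predict(body: &mut Value) {
    let is_chat = body.get("messages").map_or(false, |m| m.is_array());
    if !is_chat {
        return;
    }
    let already_set = body
        .get("options")
        .and_then(|o| o.get("num_predict"))
        .is_some();
    if already_set {
        return; // preserve explicit values, as documented above
    }
    if let Some(max_tokens) = body.get("max_tokens").and_then(|v| v.as_u64()) {
        let options = body
            .as_object_mut()
            .unwrap()
            .entry("options")
            .or_insert_with(|| json!({}));
        options["num_predict"] = json!(max_tokens);
    }
}

fn main() {
    let mut req = json!({
        "model": "gpt-oss:20b",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 2048
    });
    inject_num_predict(&mut req);
    assert_eq!(req["options"]["num_predict"], 2048);
    println!("{}", serde_json::to_string_pretty(&req).unwrap());
}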
For large embeddings inputs, the proxy can automatically chunk text to prevent Ollama memory errors:
- MAX_EMBEDDING_INPUT_LENGTH - Maximum characters per embedding input (default: 2000)
- ENABLE_AUTO_CHUNKING - Enable automatic chunking for large inputs (default: true)

How Chunking Works:
When an embeddings request contains text longer than MAX_EMBEDDING_INPUT_LENGTH, the proxy splits the input into pieces no longer than that limit and sends them to Ollama sequentially instead of forwarding one oversized request.
Example:
# Allow larger inputs before chunking (4000 characters)
MAX_EMBEDDING_INPUT_LENGTH=4000 cargo run --release
# Disable chunking (return error for large inputs)
ENABLE_AUTO_CHUNKING=false cargo run --release
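As an illustration of the chunking step, a simple character-count split like the sketch below satisfies the limit. Whether the crate splits exactly this way (for example, on sentence boundaries) is an assumption here:

// Illustrative sketch: split an input into pieces no longer than
// MAX_EMBEDDING_INPUT_LENGTH characters so each piece can be embedded
// separately.
fn chunk_input(text: &str, max_chars: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    chars
        .chunks(max_chars)
        .map(|c| c.iter().collect())
        .collect()
}

fn main() {
    let long_input = "lorem ipsum ".repeat(500); // 6,000 characters
    let chunks = chunk_input(&long_input, 2000);
    assert_eq!(chunks.len(), 3);
    assert!(chunks.iter().all(|c| c.chars().count() <= 2000));
    println!("split into {} chunks", chunks.len());
}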
Performance Considerations: chunking trades speed for stability; a large input becomes several sequential Ollama calls, so expect added latency (see Troubleshooting below).
Flash Attention is an optimization technique that speeds up inference and reduces memory usage. Ollama can enable it automatically for supported models.
Flash Attention is global only (environment variable), not per-request:
# Let Ollama decide (RECOMMENDED - unset the variable)
unset OLLAMA_FLASH_ATTENTION
ollama serve
# Explicitly enable (may cause issues with large contexts)
export OLLAMA_FLASH_ATTENTION=1
ollama serve
# Explicitly disable (may help with large context stalls)
export OLLAMA_FLASH_ATTENTION=0
ollama serve
Symptoms: requests with large contexts stall or hang, and eventually time out.
Why This Happens: Flash attention with very large contexts can trigger memory allocation deadlocks or exceed Metal's working set limits on macOS, especially with M-series chips.
Solutions:
Unset flash attention (let Ollama decide per-model):
unset OLLAMA_FLASH_ATTENTION
pkill ollama
ollama serve
Reduce context size (use the proxy's safety cap):
MAX_CONTEXT_OVERRIDE=16384 cargo run --release
Test systematically to find your hardware's limits:
./test_context_limits.sh gpt-oss:20b
✅ DO:
- Keep OLLAMA_FLASH_ATTENTION unset (let Ollama auto-detect)
- Use MAX_CONTEXT_OVERRIDE=16384 for reliability
- Run test_context_limits.sh to find your system's sweet spot

❌ DON'T:
- Set flash attention to false globally (disables it for all models)

Symptoms: Ollama crashes, or its logs show SIGABRT: abort or output_reserve: reallocating output buffer.
Cause: Ollama's embedding models crash when trying to allocate large buffers for very long inputs.
Solutions:
Enable chunking (should be on by default):
ENABLE_AUTO_CHUNKING=true cargo run --release
Reduce chunk size if still seeing errors:
MAX_EMBEDDING_INPUT_LENGTH=1500 cargo run --release
Check Ollama logs for details:
tail -f ~/.ollama/logs/server.log
Symptoms: embedding requests with long inputs fail with an error instead of being processed.
Cause: the input exceeds MAX_EMBEDDING_INPUT_LENGTH and chunking is disabled.
Solution: Enable chunking:
ENABLE_AUTO_CHUNKING=true cargo run --release
Symptoms: embedding requests take noticeably longer than expected.
Cause: Large inputs are being chunked and processed sequentially.
This is expected behavior! Chunking prevents crashes but adds latency.
To improve speed:
- Increase MAX_EMBEDDING_INPUT_LENGTH if your hardware can handle it

How It Works:
- Translates /v1/embeddings → Ollama /api/embed
- Injects options.num_ctx with the correct value for the model

Client (Elephas)
↓ OpenAI API format (/v1/embeddings)
Proxy (Port 11435)
↓ Translates to native Ollama API (/api/embed)
↓ Injects options.num_ctx based on model
Ollama (Port 11434)
↓ Returns native response
Proxy
↓ Translates back to OpenAI format
Client receives OpenAI-compatible response
Key Innovation: The proxy acts as a translation layer, converting between OpenAI's API format (which doesn't support runtime options) and Ollama's native API (which does), enabling per-request parameter control without changing global settings.
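A simplified sketch of that translation for an embeddings request, assuming serde_json; the real proxy forwards and translates more fields than shown here:

use serde_json::{json, Value};

// Illustrative sketch: map an OpenAI-style /v1/embeddings body onto
// Ollama's native /api/embed body and inject options.num_ctx for the
// target model. Field names follow the log example above.
fn translate_embeddings_request(openai_body: &Value, num_ctx: u64) -> Value {
    json!({
        "model": openai_body["model"],
        "input": openai_body["input"],
        "options": { "num_ctx": num_ctx }
    })
}

fn main() {
    let incoming = json!({ "model": "nomic-embed-text", "input": "test" });
    let outgoing = translate_embeddings_request(&incoming, 8192);
    println!("{}", serde_json::to_string_pretty(&outgoing).unwrap());
}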
The modifier framework is designed for easy extension:
pub trait ParameterModifier {
    fn modify(&self, json: &mut Value, metadata: &ModelMetadata) -> bool;
    fn name(&self) -> &str;
}
Add new modifiers in src/modifier.rs and register them in apply_modifiers().
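As a hypothetical example, a modifier that caps options.num_predict at the model's training context might look like the sketch below. The ModelMetadata definition here is a stand-in so the example compiles on its own; the crate's real types and registration live in src/modifier.rs.

use serde_json::Value;

// Stand-in metadata type for this example only; the crate's real
// ModelMetadata may expose different fields.
pub struct ModelMetadata {
    pub n_ctx_train: u64,
}

pub trait ParameterModifier {
    fn modify(&self, json: &mut Value, metadata: &ModelMetadata) -> bool;
    fn name(&self) -> &str;
}

// Hypothetical modifier: clamp options.num_predict to n_ctx_train.
struct PredictCapModifier;

impl ParameterModifier for PredictCapModifier {
    fn modify(&self, json: &mut Value, metadata: &ModelMetadata) -> bool {
        let Some(num_predict) = json
            .get("options")
            .and_then(|o| o.get("num_predict"))
            .and_then(|v| v.as_u64())
        else {
            return false;
        };
        if num_predict > metadata.n_ctx_train {
            json["options"]["num_predict"] = Value::from(metadata.n_ctx_train);
            return true; // signal that a modification was applied
        }
        false
    }

    fn name(&self) -> &str {
        "PredictCapModifier"
    }
}

fn main() {
    let meta = ModelMetadata { n_ctx_train: 8192 };
    let mut req = serde_json::json!({ "options": { "num_predict": 100000 } });
    let modifier = PredictCapModifier;
    assert!(modifier.modify(&mut req, &meta));
    assert_eq!(req["options"]["num_predict"], 8192);
    println!("{} -> {}", modifier.name(), req);
}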
Run the test suite:

cargo test
License: MIT