| Crates.io | vllora_llm |
| lib.rs | vllora_llm |
| version | 0.1.22 |
| created_at | 2025-12-02 11:56:00.92319+00 |
| updated_at | 2026-01-15 13:44:06.004129+00 |
| description | LLM client layer for the Vllora AI Gateway: unified chat-completions over multiple providers (OpenAI, Anthropic, Gemini, Bedrock, LangDB proxy) with optional tracing/telemetry. |
| homepage | |
| repository | https://github.com/vllora/vllora |
| max_upload_size | |
| id | 1961673 |
| size | 764,172 |
This crate powers the Vllora AI Gateway’s LLM layer. It provides:

- Gateway-native chat-completion types (ChatCompletionRequest, ChatCompletionMessage, routing & tools support)
- A ModelInstance trait for plugging in provider / router implementations
- Tracing support, with a console example in llm/examples/tracing (spans/events to stdout) and an OTLP example in llm/examples/tracing_otlp (send spans to external collectors such as New Relic)

Use it when you want to talk to the gateway’s LLM engine from Rust code, without worrying about provider-specific SDKs.
Run cargo add vllora_llm or add to your Cargo.toml:
[dependencies]
vllora_llm = "0.1"
Here's a minimal example to get started:
use vllora_llm::client::VlloraLLMClient;
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};
use vllora_llm::error::LLMResult;
#[tokio::main]
async fn main() -> LLMResult<()> {
// 1) Build a chat completion request using gateway-native types
let request = ChatCompletionRequest {
model: "gpt-4.1-mini".to_string(),
messages: vec![
ChatCompletionMessage::new_text(
"system".to_string(),
"You are a helpful assistant.".to_string(),
),
ChatCompletionMessage::new_text(
"user".to_string(),
"Stream numbers 1 to 20 in separate lines.".to_string(),
),
],
..Default::default()
};
// 2) Construct a VlloraLLMClient
let client = VlloraLLMClient::new();
// 3) Non-streaming: send the request and print the final reply
let response = client
.completions()
.create(request.clone())
.await?;
// ... handle response
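    // For example, mirroring the accessor pattern used in the fuller examples
    // below (response.message().content / content.as_string()):
    if let Some(content) = &response.message().content {
        if let Some(text) = content.as_string() {
            println!("Reply: {text}");
        }
    }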
Ok(())
}
Note: By default, VlloraLLMClient::new() fetches API keys from environment variables following the pattern VLLORA_{PROVIDER_NAME}_API_KEY. For example, for OpenAI, it will look for VLLORA_OPENAI_API_KEY.
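For reference, here is a minimal sketch of that convention; the only assumption is that the variable has been exported in your shell before the program runs:
use vllora_llm::client::VlloraLLMClient;

fn main() {
    // The default constructor looks up keys using the VLLORA_{PROVIDER_NAME}_API_KEY
    // pattern described above, e.g. `export VLLORA_OPENAI_API_KEY=sk-...` for OpenAI.
    assert!(
        std::env::var("VLLORA_OPENAI_API_KEY").is_ok(),
        "set VLLORA_OPENAI_API_KEY before constructing the client"
    );
    // With the variable present, no explicit credentials need to be passed.
    let _client = VlloraLLMClient::new();
}
If you prefer to pass a key explicitly, use with_credentials as shown in the next example.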
If you already build OpenAI-compatible requests (e.g. via async-openai-compat), you can send both non‑streaming and streaming completions through VlloraLLMClient.
use async_openai::types::{
ChatCompletionRequestMessage,
ChatCompletionRequestSystemMessageArgs,
ChatCompletionRequestUserMessageArgs,
CreateChatCompletionRequestArgs,
};
use tokio_stream::StreamExt;
use vllora_llm::client::VlloraLLMClient;
use vllora_llm::error::LLMResult;
use vllora_llm::types::credentials::{ApiKeyCredentials, Credentials};
#[tokio::main]
async fn main() -> LLMResult<()> {
// 1) Build an OpenAI-style request using async-openai-compatible types
let openai_req = CreateChatCompletionRequestArgs::default()
.model("gpt-4.1-mini")
.messages([
ChatCompletionRequestMessage::System(
ChatCompletionRequestSystemMessageArgs::default()
.content("You are a helpful assistant.")
.build()?,
),
ChatCompletionRequestMessage::User(
ChatCompletionRequestUserMessageArgs::default()
.content("Stream numbers 1 to 20 in separate lines.")
.build()?,
),
])
.build()?;
// 2) Construct a VlloraLLMClient (here: direct OpenAI key)
let client = VlloraLLMClient::new().with_credentials(Credentials::ApiKey(
ApiKeyCredentials {
api_key: std::env::var("VLLORA_OPENAI_API_KEY")
.expect("VLLORA_OPENAI_API_KEY must be set"),
},
));
// 3) Non-streaming: send the request and print the final reply
let response = client
.completions()
.create(openai_req.clone())
.await?;
if let Some(content) = &response.message().content {
if let Some(text) = content.as_string() {
println!("Non-streaming reply:\\n{text}");
}
}
// 4) Streaming: send the same request and print chunks as they arrive
let mut stream = client
.completions()
.create_stream(openai_req)
.await?;
while let Some(chunk) = stream.next().await {
let chunk = chunk?;
for choice in chunk.choices {
if let Some(delta) = choice.delta.content {
print!("{delta}");
}
}
}
Ok(())
}
The main entrypoint is VlloraLLMClient, which gives you a CompletionsClient for chat completions using the gateway-native request/response types.
use std::sync::Arc;
use vllora_llm::client::{VlloraLLMClient, ModelInstance, DummyModelInstance};
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};
use vllora_llm::error::LLMResult;
#[tokio::main]
async fn main() -> LLMResult<()> {
// In production you would pass a real ModelInstance implementation
// that knows how to call your configured providers / router.
let instance: Arc<Box<dyn ModelInstance>> = Arc::new(Box::new(DummyModelInstance {}));
// Build the high-level client
let client = VlloraLLMClient::new_with_instance(instance);
// Build a simple chat completion request
let request = ChatCompletionRequest {
model: "gpt-4.1-mini".to_string(), // or any gateway-configured model id
messages: vec![
ChatCompletionMessage::new_text(
"system".to_string(),
"You are a helpful assistant.".to_string(),
),
ChatCompletionMessage::new_text(
"user".to_string(),
"Say hello in one short sentence.".to_string(),
),
],
..Default::default()
};
// Send the request and get a single response message
let response = client.completions().create(request).await?;
let message = response.message();
if let Some(content) = &message.content {
if let Some(text) = content.as_string() {
println!("Model reply: {text}");
}
}
Ok(())
}
Key pieces:
- VlloraLLMClient: wraps a ModelInstance and exposes .completions().
- CompletionsClient::create: sends a one-shot completion request and returns a ChatCompletionMessageWithFinishReason.
- Gateway-native types (ChatCompletionRequest, ChatCompletionMessage) abstract over provider-specific formats.
- CompletionsClient::create_stream: returns a ResultStream that yields streaming chunks:
use std::sync::Arc;
use tokio_stream::StreamExt;
use vllora_llm::client::{VlloraLLMClient, ModelInstance, DummyModelInstance};
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};
use vllora_llm::error::LLMResult;
#[tokio::main]
async fn main() -> LLMResult<()> {
let instance: Arc<Box<dyn ModelInstance>> = Arc::new(Box::new(DummyModelInstance {}));
let client = VlloraLLMClient::new_with_instance(instance);
let request = ChatCompletionRequest {
model: "gpt-4.1-mini".to_string(),
messages: vec![ChatCompletionMessage::new_text(
"user".to_string(),
"Stream the alphabet, one chunk at a time.".to_string(),
)],
..Default::default()
};
let mut stream = client.completions().create_stream(request).await?;
while let Some(chunk) = stream.next().await {
let chunk = chunk?;
for choice in chunk.choices {
if let Some(delta) = choice.delta.content {
print!("{delta}");
}
}
}
Ok(())
}
The stream API mirrors OpenAI-style streaming but uses gateway-native ChatCompletionChunk types.
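If you want the full text at the end rather than incremental printing, the same loop can accumulate the deltas. The sketch below reuses the DummyModelInstance setup and the chunk field names from the example above; everything else is just illustration:
use std::sync::Arc;
use tokio_stream::StreamExt;
use vllora_llm::client::{DummyModelInstance, ModelInstance, VlloraLLMClient};
use vllora_llm::error::LLMResult;
use vllora_llm::types::gateway::{ChatCompletionMessage, ChatCompletionRequest};

#[tokio::main]
async fn main() -> LLMResult<()> {
    let instance: Arc<Box<dyn ModelInstance>> = Arc::new(Box::new(DummyModelInstance {}));
    let client = VlloraLLMClient::new_with_instance(instance);
    let request = ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(),
        messages: vec![ChatCompletionMessage::new_text(
            "user".to_string(),
            "Stream the alphabet, one chunk at a time.".to_string(),
        )],
        ..Default::default()
    };
    // Same create_stream call as above, but fold the deltas into one String.
    let mut stream = client.completions().create_stream(request).await?;
    let mut full_reply = String::new();
    while let Some(chunk) = stream.next().await {
        for choice in chunk?.choices {
            if let Some(delta) = choice.delta.content {
                // to_string() keeps this working whether the delta content is a
                // String or another Display type.
                full_reply.push_str(&delta.to_string());
            }
        }
    }
    println!("Full reply: {full_reply}");
    Ok(())
}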
The table below lists which ChatCompletionRequest (and provider-specific) parameters are honored by each provider when using VlloraLLMClient:
| Parameter | OpenAI / Proxy | Anthropic | Gemini | Bedrock | Notes |
|---|---|---|---|---|---|
| model | yes | yes | yes | yes | Taken from ChatCompletionRequest.model or engine config. |
| max_tokens | yes | yes | yes | yes | Mapped to provider-specific max_tokens / max_output_tokens. |
| temperature | yes | yes | yes | yes | Sampling temperature. |
| top_p | yes | yes | yes | yes | Nucleus sampling. |
| n | no | no | yes | no | For Gemini, mapped to candidate_count; other providers always use n = 1. |
| stop / stop_sequences | yes | yes | yes | yes | Converted to each provider’s stop / stop-sequences field. |
| presence_penalty | yes | no | yes | no | OpenAI / Gemini only. |
| frequency_penalty | yes | no | yes | no | OpenAI / Gemini only. |
| logit_bias | yes | no | no | no | OpenAI-only token bias map. |
| user | yes | no | no | no | OpenAI “end-user id” field. |
| seed | yes | no | yes | no | Deterministic sampling where supported. |
| response_format (JSON schema, etc.) | yes | no | yes | no | Gemini additionally normalizes JSON schema for its API. |
| prompt_cache_key | yes | no | no | no | OpenAI-only prompt caching hint. |
| provider_specific.top_k | no | yes | no | no | Anthropic-only: maps to Claude top_k. |
| provider_specific.thinking | no | yes | no | no | Anthropic “thinking” options (e.g. budget tokens). |
| Bedrock additional_parameters map | no | no | no | yes | Free-form JSON, passed through to Bedrock model params. |
Additionally, for Anthropic, the first system message in the conversation is mapped into a SystemPrompt (either as a single text string or as multiple TextContentBlocks), and any cache_control options on those blocks are translated into Anthropic’s ephemeral cache-control settings.
All other fields on ChatCompletionRequest (such as stream, tools, tool_choice, functions, function_call) are handled at the gateway layer and/or per-provider tool integration, but are not mapped 1:1 into provider primitive parameters.
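To make the table concrete, here is a hedged sketch of setting a few of the mapped tunables on a gateway-native request. The field names (max_tokens, temperature, top_p, seed) come from the table above, but the Option wrappers and numeric types are assumptions; check them against the actual ChatCompletionRequest definition before use:
use vllora_llm::types::gateway::{ChatCompletionMessage, ChatCompletionRequest};

fn build_tuned_request() -> ChatCompletionRequest {
    ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(),
        messages: vec![ChatCompletionMessage::new_text(
            "user".to_string(),
            "Summarize the Vllora gateway in one sentence.".to_string(),
        )],
        // Field names from the table above; the Option / f32 / u32 types are assumed.
        max_tokens: Some(256),
        temperature: Some(0.2),
        top_p: Some(0.9),
        seed: Some(42),
        ..Default::default()
    }
}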
There are runnable examples under llm/examples/ that mirror the patterns above:
- openai: Direct OpenAI chat completions using VlloraLLMClient (non-streaming + streaming).
- anthropic: Anthropic (Claude) chat completions via the unified client.
- gemini: Gemini chat completions via the unified client.
- bedrock: AWS Bedrock chat completions (Nova etc.) via the unified client.
- proxy: Using InferenceModelProvider::Proxy("proxy_name") to call an OpenAI completions-compatible endpoint.
- tracing: Same OpenAI-style flow as openai, but with tracing_subscriber::fmt() configured to emit spans and events to the console (stdout).
- tracing_otlp: Shows how to wire vllora_telemetry::events::layer to an OTLP HTTP exporter (e.g. New Relic / any OTLP collector) and emit spans from VlloraLLMClient calls to a remote telemetry backend.

Each example is a standalone Cargo binary; you can cd into a directory and run:
cargo run
after setting the provider-specific environment variables noted in the example’s main.rs.
A few notes:

- ModelInstance implementations are created by the core executor based on your models.yaml and routing rules; the examples above use DummyModelInstance only to illustrate the public API of the CompletionsClient.
- Client calls return LLMResult<T>, which wraps rich LLMError variants (network, mapping, provider errors, etc.).
- Types in vllora_llm::types::gateway are also used for tools, MCP, routing, embeddings, and image generation; see the main repository docs at https://vllora.dev/docs for higher-level gateway features.

Roadmap:

- Integrate responses API
- Support builtin MCP tool calls
- Gemini prompt caching support
- Full thinking messages support