vllora_llm

crates.io: vllora_llm
lib.rs: vllora_llm
version: 0.1.22
created_at: 2025-12-02 11:56:00.92319+00
updated_at: 2026-01-15 13:44:06.004129+00
description: LLM client layer for the Vllora AI Gateway: unified chat-completions over multiple providers (OpenAI, Anthropic, Gemini, Bedrock, LangDB proxy) with optional tracing/telemetry.
repository: https://github.com/vllora/vllora
size: 764,172
owner: Karolis Gudiškis (karolisg)
documentation: https://vllora.dev/docs

README

Vllora LLM crate (vllora_llm)

This crate powers the Vllora AI Gateway’s LLM layer. It provides:

  • Unified chat-completions client over multiple providers (OpenAI-compatible, Anthropic, Gemini, Bedrock, …)
  • Gateway-native types (ChatCompletionRequest, ChatCompletionMessage, routing & tools support)
  • Streaming responses and telemetry hooks via a common ModelInstance trait
  • Tracing integration out of the box, with a console example in llm/examples/tracing (spans/events to stdout) and an OTLP example in llm/examples/tracing_otlp (sending spans to external collectors such as New Relic)
  • Supported parameters: see the Supported parameters section below for a detailed table of which parameters are honored by each provider

Use it when you want to talk to the gateway’s LLM engine from Rust code, without worrying about provider-specific SDKs.


Installation

Run cargo add vllora_llm or add it to your Cargo.toml:

[dependencies]
vllora_llm = "0.1"

Quick start

Here's a minimal example to get started:

use vllora_llm::client::VlloraLLMClient;
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};
use vllora_llm::error::LLMResult;

#[tokio::main]
async fn main() -> LLMResult<()> {
    // 1) Build a chat completion request using gateway-native types
    let request = ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(),
        messages: vec![
            ChatCompletionMessage::new_text(
                "system".to_string(),
                "You are a helpful assistant.".to_string(),
            ),
            ChatCompletionMessage::new_text(
                "user".to_string(),
                "Stream numbers 1 to 20 in separate lines.".to_string(),
            ),
        ],
        ..Default::default()
    };

    // 2) Construct a VlloraLLMClient
    let client = VlloraLLMClient::new();

    // 3) Non-streaming: send the request and print the final reply
    let response = client
        .completions()
        .create(request)
        .await?;

    if let Some(content) = &response.message().content {
        if let Some(text) = content.as_string() {
            println!("Model reply: {text}");
        }
    }
    Ok(())
}

Note: By default, VlloraLLMClient::new() fetches API keys from environment variables following the pattern VLLORA_{PROVIDER_NAME}_API_KEY. For example, for OpenAI, it will look for VLLORA_OPENAI_API_KEY.
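
For example (VLLORA_OPENAI_API_KEY is the name documented above; the Anthropic variant is inferred from the same VLLORA_{PROVIDER_NAME}_API_KEY pattern):

export VLLORA_OPENAI_API_KEY="<your OpenAI key>"
export VLLORA_ANTHROPIC_API_KEY="<your Anthropic key>"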


Quick start with async-openai-compatible types

If you already build OpenAI-compatible requests (e.g. via async-openai-compat), you can send both non-streaming and streaming completions through VlloraLLMClient.

use async_openai::types::{
    ChatCompletionRequestMessage,
    ChatCompletionRequestSystemMessageArgs,
    ChatCompletionRequestUserMessageArgs,
    CreateChatCompletionRequestArgs,
};
use tokio_stream::StreamExt;

use vllora_llm::client::VlloraLLMClient;
use vllora_llm::error::LLMResult;
use vllora_llm::types::credentials::{ApiKeyCredentials, Credentials};

#[tokio::main]
async fn main() -> LLMResult<()> {
    // 1) Build an OpenAI-style request using async-openai-compatible types
    let openai_req = CreateChatCompletionRequestArgs::default()
        .model("gpt-4.1-mini")
        .messages([
            ChatCompletionRequestMessage::System(
                ChatCompletionRequestSystemMessageArgs::default()
                    .content("You are a helpful assistant.")
                    .build()?,
            ),
            ChatCompletionRequestMessage::User(
                ChatCompletionRequestUserMessageArgs::default()
                    .content("Stream numbers 1 to 20 in separate lines.")
                    .build()?,
            ),
        ])
        .build()?;

    // 2) Construct a VlloraLLMClient (here: direct OpenAI key)
    let client = VlloraLLMClient::new().with_credentials(Credentials::ApiKey(
        ApiKeyCredentials {
            api_key: std::env::var("VLLORA_OPENAI_API_KEY")
                .expect("VLLORA_OPENAI_API_KEY must be set"),
        },
    ));

    // 3) Non-streaming: send the request and print the final reply
    let response = client
        .completions()
        .create(openai_req.clone())
        .await?;

    if let Some(content) = &response.message().content {
        if let Some(text) = content.as_string() {
            println!("Non-streaming reply:\\n{text}");
        }
    }

    // 4) Streaming: send the same request and print chunks as they arrive
    let mut stream = client
        .completions()
        .create_stream(openai_req)
        .await?;

    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        for choice in chunk.choices {
            if let Some(delta) = choice.delta.content {
                print!("{delta}");
            }
        }
    }

    Ok(())
}

Basic usage: completions client (gateway-native)

The main entrypoint is VlloraLLMClient, which gives you a CompletionsClient for chat completions using the gateway-native request/response types.

use std::sync::Arc;

use vllora_llm::client::{VlloraLLMClient, ModelInstance, DummyModelInstance};
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};
use vllora_llm::error::LLMResult;

#[tokio::main]
async fn main() -> LLMResult<()> {
    // In production you would pass a real ModelInstance implementation
    // that knows how to call your configured providers / router.
    let instance: Arc<Box<dyn ModelInstance>> = Arc::new(Box::new(DummyModelInstance {}));

    // Build the high-level client
    let client = VlloraLLMClient::new_with_instance(instance);

    // Build a simple chat completion request
    let request = ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(), // or any gateway-configured model id
        messages: vec![
            ChatCompletionMessage::new_text(
                "system".to_string(),
                "You are a helpful assistant.".to_string(),
            ),
            ChatCompletionMessage::new_text(
                "user".to_string(),
                "Say hello in one short sentence.".to_string(),
            ),
        ],
        ..Default::default()
    };

    // Send the request and get a single response message
    let response = client.completions().create(request).await?;

    let message = response.message();
    if let Some(content) = &message.content {
        if let Some(text) = content.as_string() {
            println!("Model reply: {text}");
        }
    }

    Ok(())
}

Key pieces:

  • VlloraLLMClient: wraps a ModelInstance and exposes .completions().
  • CompletionsClient::create: sends a one-shot completion request and returns a ChatCompletionMessageWithFinishReason.
  • Gateway types (ChatCompletionRequest, ChatCompletionMessage) abstract over provider-specific formats.


Streaming completions

CompletionsClient::create_stream returns a ResultStream that yields streaming chunks:

use std::sync::Arc;

use tokio_stream::StreamExt;

use vllora_llm::client::{VlloraLLMClient, ModelInstance, DummyModelInstance};
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};
use vllora_llm::error::LLMResult;

#[tokio::main]
async fn main() -> LLMResult<()> {
    let instance: Arc<Box<dyn ModelInstance>> = Arc::new(Box::new(DummyModelInstance {}));
    let client = VlloraLLMClient::new_with_instance(instance);

    let request = ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(),
        messages: vec![ChatCompletionMessage::new_text(
            "user".to_string(),
            "Stream the alphabet, one chunk at a time.".to_string(),
        )],
        ..Default::default()
    };

    let mut stream = client.completions().create_stream(request).await?;

    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        for choice in chunk.choices {
            if let Some(delta) = choice.delta.content {
                print!("{delta}");
            }
        }
    }

    Ok(())
}

The stream API mirrors OpenAI-style streaming but uses gateway-native ChatCompletionChunk types.
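
For example, here is a small sketch (reusing the client and a request built as in the example above) that accumulates the streamed deltas into one String instead of printing them as they arrive:

// Collect the streamed deltas into the complete reply text.
// Uses the same chunk shape (chunk.choices / choice.delta.content)
// as the streaming example above.
let mut stream = client.completions().create_stream(request).await?;
let mut full_reply = String::new();
while let Some(chunk) = stream.next().await {
    for choice in chunk?.choices {
        if let Some(delta) = choice.delta.content {
            full_reply.push_str(&delta);
        }
    }
}
println!("Full reply: {full_reply}");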


Supported parameters

The table below lists which ChatCompletionRequest (and provider-specific) parameters are honored by each provider when using VlloraLLMClient:

Parameter                             OpenAI / Proxy  Anthropic  Gemini  Bedrock  Notes
model                                 yes             yes        yes     yes      Taken from ChatCompletionRequest.model or engine config.
max_tokens                            yes             yes        yes     yes      Mapped to provider-specific max_tokens / max_output_tokens.
temperature                           yes             yes        yes     yes      Sampling temperature.
top_p                                 yes             yes        yes     yes      Nucleus sampling.
n                                     no              no         yes     no       For Gemini, mapped to candidate_count; other providers always use n = 1.
stop / stop_sequences                 yes             yes        yes     yes      Converted to each provider’s stop / stop-sequences field.
presence_penalty                      yes             no         yes     no       OpenAI / Gemini only.
frequency_penalty                     yes             no         yes     no       OpenAI / Gemini only.
logit_bias                            yes             no         no      no       OpenAI-only token bias map.
user                                  yes             no         no      no       OpenAI “end-user id” field.
seed                                  yes             no         yes     no       Deterministic sampling where supported.
response_format (JSON schema, etc.)   yes             no         yes     no       Gemini additionally normalizes JSON schema for its API.
prompt_cache_key                      yes             no         no      no       OpenAI-only prompt caching hint.
provider_specific.top_k               no              yes        no      no       Anthropic-only: maps to Claude top_k.
provider_specific.thinking            no              yes        no      no       Anthropic “thinking” options (e.g. budget tokens).
Bedrock additional_parameters map     no              no         no      yes      Free-form JSON, passed through to Bedrock model params.
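
As a rough sketch of setting some of these on the gateway-native request (field names follow the table above, but the Option wrapping and numeric types are assumptions about the struct, not confirmed signatures):

// Sketch only: Option wrappers and numeric types are assumptions.
let request = ChatCompletionRequest {
    model: "gpt-4.1-mini".to_string(),
    messages: vec![ChatCompletionMessage::new_text(
        "user".to_string(),
        "Summarize this repository in one line.".to_string(),
    )],
    max_tokens: Some(256),  // mapped to max_tokens / max_output_tokens per provider
    temperature: Some(0.2), // sampling temperature
    top_p: Some(0.9),       // nucleus sampling
    ..Default::default()
};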

Additionally, for Anthropic, the first system message in the conversation is mapped into a SystemPrompt (either as a single text string or as multiple TextContentBlocks), and any cache_control options on those blocks are translated into Anthropic’s ephemeral cache-control settings.

All other fields on ChatCompletionRequest (such as stream, tools, tool_choice, functions, function_call) are handled at the gateway layer and/or per-provider tool integration, but are not mapped 1:1 into provider primitive parameters.

Provider-specific examples

There are runnable examples under llm/examples/ that mirror the patterns above:

  • openai: Direct OpenAI chat completions using VlloraLLMClient (non-streaming + streaming).
  • anthropic: Anthropic (Claude) chat completions via the unified client.
  • gemini: Gemini chat completions via the unified client.
  • bedrock: AWS Bedrock chat completions (Nova etc.) via the unified client.
  • proxy: Using InferenceModelProvider::Proxy("proxy_name") to call an OpenAI completions-compatible endpoint.
  • tracing: Same OpenAI-style flow as openai, but with tracing_subscriber::fmt() configured to emit spans and events to the console (stdout).
  • tracing_otlp: Shows how to wire vllora_telemetry::events::layer to an OTLP HTTP exporter (e.g. New Relic / any OTLP collector) and emit spans from VlloraLLMClient calls to a remote telemetry backend.

Each example is a standalone Cargo binary; you can cd into an example’s directory and run:

cargo run

after setting the provider-specific environment variables noted in the example’s main.rs.
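
For reference, a minimal sketch of the console setup the tracing example uses (assuming the tracing and tracing-subscriber crates are in your dependencies; the example’s exact configuration may differ):

// Minimal console tracing setup (sketch): emit spans and events
// from VlloraLLMClient calls to stdout.
tracing_subscriber::fmt()
    .with_max_level(tracing::Level::DEBUG)
    .init();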

Notes

  • Real usage: In the full LangDB / Vllora gateway, concrete ModelInstance implementations are created by the core executor based on your models.yaml and routing rules; the examples above use DummyModelInstance only to illustrate the public API of the CompletionsClient.
  • Error handling: All client methods return LLMResult<T>, which wraps rich LLMError variants (network, mapping, provider errors, etc.); a minimal handling sketch follows this list.
  • More features: The same types in vllora_llm::types::gateway are used for tools, MCP, routing, embeddings, and image generation; see the main repository docs at https://vllora.dev/docs for higher-level gateway features.
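
A minimal sketch of that kind of handling at the call site (matching on concrete LLMError variants is left out; this relies only on the error’s Debug formatting):

// Surface the error at the call site instead of propagating with `?`.
match client.completions().create(request).await {
    Ok(response) => {
        if let Some(content) = &response.message().content {
            if let Some(text) = content.as_string() {
                println!("Model reply: {text}");
            }
        }
    }
    Err(err) => eprintln!("LLM call failed: {err:?}"),
}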

Roadmap and issues

  • GitHub issues / roadmap: See open LLM crate issues for planned and outstanding work.
  • Planned enhancements:
    • Integrate the responses API
    • Support builtin MCP tool calls
    • Support Gemini prompt caching
    • Full thinking-messages support


License

Licensed under Apache License, Version 2.0.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this crate by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.