candle-pipelines-models

Version: 0.0.2
Created: 2025-12-29
Updated: 2026-01-05
Description: Custom model implementations for candle-pipelines (patches for candle-transformers)
Repository: https://github.com/ljt019/candle-pipelines/
Size: 109,078
Owner: Lucien Thomas (ljt019)

README

candle-pipelines-models

Patched model implementations for candle-pipelines.

This crate provides fixed versions of models from candle-transformers where the upstream implementation has design issues.

The Problem

Candle's quantized models embed the KV cache inside the model itself, so forward takes &mut self:

// Upstream candle-transformers (broken design)
pub fn forward(&mut self, input: &Tensor, offset: usize) -> Result<Tensor>

This requires:

  1. Manual position tracking - you must pass offset yourself; the model doesn't track it
  2. Weight cloning - every conversation needs its own copy of the entire model weights just to get an independent KV cache
  3. No sharing - multiple conversations can't run against the same weights (see the sketch below)
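
For contrast, here is a minimal sketch of the caller-side bookkeeping the upstream design forces. It assumes candle-transformers' quantized_llama module with the forward signature shown above; prompt_then_step and its arguments are illustrative, not part of either crate.

use candle_core::{Result, Tensor};
use candle_transformers::models::quantized_llama::ModelWeights;

// Sketch only: the bookkeeping the &mut self API pushes onto the caller.
fn prompt_then_step(model: &mut ModelWeights, prompt: &Tensor, next: &Tensor) -> Result<Tensor> {
    // Manual position tracking: the caller owns `offset` and must keep it in
    // sync with how many tokens the model's internal KV cache has seen.
    let mut offset = 0;
    let _prompt_logits = model.forward(prompt, offset)?;
    offset += prompt.dim(1)?; // prompt shape: (batch, seq_len)

    // Decode one more token at the tracked position.
    let logits = model.forward(next, offset)?;

    // A second, independent conversation would need its own copy of the whole
    // model, because the KV cache lives inside it, and &mut self rules out
    // simply sharing the weights behind an Arc.
    Ok(logits)
}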

The Fix

We use an external cache, the same approach the non-quantized llama implementation takes:

// Our patched version (correct design)
pub fn forward(&self, input: &Tensor, cache: &mut Cache) -> Result<Tensor>

Benefits:

  1. Automatic position tracking - the cache knows its position via cache.current_seq_len()
  2. No weight cloning - share one Arc<ModelWeights> across conversations
  3. Independent caches - each conversation gets its own Cache (see the sketch below)
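
A minimal sketch of a single decode step under this design, using the forward signature above. The step helper is illustrative, and treating current_seq_len() as a token count is an assumption; only the method name comes from this crate.

use std::sync::Arc;
use candle_core::{Result, Tensor};
use candle_pipelines_models::quantized_qwen3::{Cache, ModelWeights};

// One decode step. There is no offset parameter: the cache already knows how
// many tokens it holds, so positions come from cache.current_seq_len().
fn step(weights: &Arc<ModelWeights>, cache: &mut Cache, next_token: &Tensor) -> Result<Tensor> {
    let before = cache.current_seq_len(); // assumed to return the tokens seen so far
    let logits = weights.forward(next_token, cache)?;
    debug_assert!(cache.current_seq_len() > before); // cache advanced automatically
    Ok(logits)
}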

Patched Models

quantized_qwen3

use std::sync::Arc;
use candle_pipelines_models::quantized_qwen3::{ModelWeights, Cache};

// `content`, `reader`, and `device` come from opening the GGUF file
// (candle's gguf_file::Content) and choosing a Device.
let weights = Arc::new(ModelWeights::from_gguf(content, &mut reader, &device)?);

// Each conversation gets its own cache but shares the same weights
let mut cache1 = weights.new_cache();
let mut cache2 = weights.new_cache();

let logits = weights.forward(&input, &mut cache1)?;
cache1.reset(); // Clear the cache to start a new conversation
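
Putting it together, a sketch of a small generation loop on top of this API. sample_next is a hypothetical stand-in for whatever sampling you use (greedy argmax, candle's LogitsProcessor, ...), and the (batch, seq_len) token layout is an assumption carried over from the usual candle quantized-model examples.

use std::sync::Arc;
use candle_core::{Device, Result, Tensor};
use candle_pipelines_models::quantized_qwen3::{Cache, ModelWeights};

// Hypothetical sampling hook: turns logits into the next token id.
fn sample_next(_logits: &Tensor) -> Result<u32> {
    // plug in greedy argmax, LogitsProcessor, etc.
    unimplemented!()
}

fn generate(
    weights: &Arc<ModelWeights>,
    prompt_tokens: &[u32],
    max_new_tokens: usize,
    device: &Device,
) -> Result<Vec<u32>> {
    // Per-conversation state lives in the cache; the weights stay shared.
    let mut cache: Cache = weights.new_cache();
    let mut tokens = prompt_tokens.to_vec();

    // Feed the whole prompt once, then decode token by token. No offset
    // bookkeeping anywhere: the cache tracks positions itself.
    let prompt = Tensor::new(prompt_tokens, device)?.unsqueeze(0)?;
    let mut logits = weights.forward(&prompt, &mut cache)?;

    for _ in 0..max_new_tokens {
        let next = sample_next(&logits)?;
        tokens.push(next);
        let input = Tensor::new(&[next], device)?.unsqueeze(0)?;
        logits = weights.forward(&input, &mut cache)?;
    }
    Ok(tokens)
}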

quantized_gemma3

Same API as qwen3.

quantized_llama

Same API as qwen3.

quantized_olmo3

Same API as qwen3.
