| Field | Value |
|---|---|
| Crates.io | candle-pipelines-models |
| lib.rs | candle-pipelines-models |
| version | 0.0.2 |
| created_at | 2025-12-29 07:39:08.889846+00 |
| updated_at | 2026-01-05 00:04:30.743085+00 |
| description | Custom model implementations for candle-pipelines (patches for candle-transformers) |
| homepage | |
| repository | https://github.com/ljt019/candle-pipelines/ |
| max_upload_size | |
| id | 2010190 |
| size | 109,078 |
Patched model implementations for candle-pipelines.
This crate provides fixed versions of models from candle-transformers where the upstream implementation has design issues.
Candle's quantized models embed the KV cache inside the model itself, forcing a `&mut self` forward method:
```rust
// Upstream candle-transformers (broken design)
pub fn forward(&mut self, input: &Tensor, offset: usize) -> Result<Tensor>
```
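To make the cost concrete, here is a rough sketch of the bookkeeping this design pushes onto every caller (the `candle_transformers::models::quantized_qwen3` import path is assumed; `sample_next_token` is a placeholder, not a real API):

```rust
use candle_core::{Result, Tensor};
use candle_transformers::models::quantized_qwen3::ModelWeights;

// Hypothetical decoding loop against the upstream API.
fn generate(model: &mut ModelWeights, prompt: Tensor, max_new_tokens: usize) -> Result<()> {
    let mut input = prompt; // shape (1, seq_len)
    let mut offset = 0;
    for _ in 0..max_new_tokens {
        // Needs exclusive &mut access on every step, so these weights
        // cannot serve a second conversation at the same time.
        let logits = model.forward(&input, offset)?;
        // The model forgets its position; the caller must advance the
        // offset by however many tokens were just fed in.
        offset += input.dim(1)?;
        input = sample_next_token(&logits)?; // placeholder, not a real API
    }
    Ok(())
}

// Placeholder so the sketch stands alone; a real loop would sample from
// `logits` and build the next (1, 1) input tensor.
fn sample_next_token(_logits: &Tensor) -> Result<Tensor> {
    unimplemented!()
}
```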
This requires:

- exclusive `&mut` access to the model on every call, so one set of weights cannot serve two conversations
- callers to track `offset` themselves; the model doesn't track it

We use an external cache instead, the way the non-quantized Llama implementation does:
```rust
// Our patched version (correct design)
pub fn forward(&self, input: &Tensor, cache: &mut Cache) -> Result<Tensor>
```
Benefits:
- the position is tracked by the cache and exposed via `cache.current_seq_len()`, so callers never pass an `offset`
- a single `Arc<ModelWeights>` can be shared across conversations
- each conversation gets its own `Cache`

## quantized_qwen3

```rust
use std::sync::Arc;

use candle_pipelines_models::quantized_qwen3::{ModelWeights, Cache};
let weights = Arc::new(ModelWeights::from_gguf(content, &mut reader, &device)?);

// Each conversation gets its own cache; all share the same weights.
let mut cache1 = weights.new_cache();
let mut cache2 = weights.new_cache();

let logits = weights.forward(&input, &mut cache1)?;

cache1.reset(); // clear the cache for a new conversation
```
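Because `forward` only needs `&self`, the shared weights can serve multiple conversations at once. A minimal sketch, assuming `ModelWeights` is `Send + Sync` and that `input1`/`input2` are prompt tensors prepared elsewhere:

```rust
use std::sync::Arc;
use std::thread;

use candle_core::{Result, Tensor};
use candle_pipelines_models::quantized_qwen3::ModelWeights;

fn serve_two(weights: Arc<ModelWeights>, input1: Tensor, input2: Tensor) -> Result<()> {
    let w = Arc::clone(&weights);

    // Conversation 1 runs on its own thread with its own cache.
    let handle = thread::spawn(move || -> Result<Tensor> {
        let mut cache = w.new_cache();
        w.forward(&input1, &mut cache)
    });

    // Conversation 2 decodes on this thread, against the same weights.
    let mut cache = weights.new_cache();
    let _logits2 = weights.forward(&input2, &mut cache)?;
    // The cache, not the caller, knows how far this conversation has decoded.
    let _seen = cache.current_seq_len();

    let _logits1 = handle.join().expect("worker thread panicked")?;
    Ok(())
}
```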
## quantized_gemma3

Same API as `quantized_qwen3`.

## quantized_llama

Same API as `quantized_qwen3`.

## quantized_olmo3

Same API as `quantized_qwen3`.
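Because the API is identical, switching models only changes the import path. A minimal sketch, reusing the `content`, `reader`, `device`, and `input` setup from the `quantized_qwen3` example above:

```rust
use std::sync::Arc;

use candle_pipelines_models::quantized_gemma3::{Cache, ModelWeights};

// Same calls as quantized_qwen3; only the module path differs.
let weights = Arc::new(ModelWeights::from_gguf(content, &mut reader, &device)?);
let mut cache: Cache = weights.new_cache();
let logits = weights.forward(&input, &mut cache)?;
```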