| Crates.io | candle-cuda-vmm |
| lib.rs | candle-cuda-vmm |
| version | 0.1.1 |
| created_at | 2025-10-27 16:10:04.983076+00 |
| updated_at | 2025-11-15 21:47:10.758325+00 |
| description | CUDA Virtual Memory Management bindings for elastic KV cache allocation in Candle |
| homepage | |
| repository | https://github.com/ciresnave/candle_cuda_vmm |
| max_upload_size | |
| id | 1903197 |
| size | 157,616 |
CUDA Virtual Memory Management bindings for elastic KV cache allocation in Candle.
candle-cuda-vmm provides safe Rust bindings to CUDA's Virtual Memory Management (VMM) APIs, enabling elastic memory allocation for LLM inference workloads. The crate is designed to integrate with the Candle deep learning framework.
```rust
use candle_core::Device;
use candle_cuda_vmm::{Result, VirtualMemoryPool};

fn main() -> Result<()> {
    let device = Device::new_cuda(0)?;

    // Create a pool with 128 GiB of virtual address space backed by 2 MiB pages.
    let mut pool = VirtualMemoryPool::new(
        128 * 1024 * 1024 * 1024, // 128 GiB virtual capacity
        2 * 1024 * 1024,          // 2 MiB page size
        device,
    )?;

    // Allocate 1 GiB of physical memory on demand.
    let addr = pool.allocate(0, 1024 * 1024 * 1024)?;
    println!("Allocated at virtual address: 0x{:x}", addr);

    // Physical memory usage: ~1 GiB.
    println!("Physical usage: {} bytes", pool.physical_memory_usage());

    // Deallocate when done.
    pool.deallocate(0, 1024 * 1024 * 1024)?;
    Ok(())
}
```
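Because physical memory is mapped in whole pages, allocation sizes are effectively rounded up to the page granularity (2 MiB in the example above). A small, self-contained helper shows the arithmetic; `align_up` is illustrative and not part of the crate's API:

```rust
/// Round `size` up to the next multiple of `page_size` (a power of two).
/// Illustrative only; not part of the candle-cuda-vmm API.
fn align_up(size: usize, page_size: usize) -> usize {
    debug_assert!(page_size.is_power_of_two());
    (size + page_size - 1) & !(page_size - 1)
}

fn main() {
    const PAGE: usize = 2 * 1024 * 1024; // 2 MiB
    // A 3 MiB request consumes two 2 MiB pages of physical memory.
    println!("{}", align_up(3 * 1024 * 1024, PAGE)); // prints 4194304
}
```

This is why a 2 MiB page size keeps waste bounded: at most one page (minus one byte) of physical memory is over-mapped per allocation.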
The crate provides two main abstractions:
`VirtualMemoryPool`: an elastic memory pool with virtual memory backing. It reserves a large virtual address space but only maps physical memory on demand.
```rust
let mut pool = VirtualMemoryPool::new(capacity, page_size, device)?;
let addr = pool.allocate(offset, size)?;
pool.deallocate(offset, size)?;
```
`SharedMemoryPool`: a multi-model memory pool with a global physical memory limit and per-model statistics.
```rust
let mut shared_pool = SharedMemoryPool::new(physical_limit, device)?;
shared_pool.register_model("llama-7b", virtual_capacity)?;
let addr = shared_pool.allocate_for_model("llama-7b", size)?;
```
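The accounting a shared pool performs can be sketched without touching the GPU. The following is an illustrative model of the bookkeeping only: the method names mirror the API above, but the struct and its internals are an assumption, not the crate's implementation.

```rust
use std::collections::HashMap;

/// Illustrative bookkeeping for a shared pool: a global physical limit
/// plus per-model usage counters. Not the crate's actual implementation.
struct SharedAccounting {
    physical_limit: usize,
    physical_used: usize,
    per_model: HashMap<String, usize>,
}

impl SharedAccounting {
    fn new(physical_limit: usize) -> Self {
        Self { physical_limit, physical_used: 0, per_model: HashMap::new() }
    }

    fn register_model(&mut self, name: &str) {
        self.per_model.entry(name.to_string()).or_insert(0);
    }

    /// Charge `size` bytes to `name`, failing if the global limit would be exceeded.
    fn allocate_for_model(&mut self, name: &str, size: usize) -> Result<(), String> {
        if self.physical_used + size > self.physical_limit {
            return Err(format!("global physical limit exceeded for {name}"));
        }
        self.physical_used += size;
        *self.per_model.entry(name.to_string()).or_insert(0) += size;
        Ok(())
    }

    fn model_usage(&self, name: &str) -> usize {
        self.per_model.get(name).copied().unwrap_or(0)
    }
}

fn main() {
    let mut pool = SharedAccounting::new(8 * 1024 * 1024 * 1024); // 8 GiB global limit
    pool.register_model("llama-7b");
    pool.allocate_for_model("llama-7b", 1024 * 1024 * 1024).unwrap();
    println!("llama-7b uses {} bytes", pool.model_usage("llama-7b"));
}
```

The key design point this illustrates: each model reserves its own virtual range up front, while the limit is enforced only on the physical memory actually mapped across all models.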
Based on KVCached benchmarks:
This crate was built to enable elastic KV cache management in the Lightbulb inference engine:
```rust
use candle_cuda_vmm::{Result, VirtualMemoryPool};

pub struct ElasticCacheBuilder {
    virtual_pool: VirtualMemoryPool,
    allocated_blocks: Vec<(usize, usize)>, // (offset, size) of mapped regions
    // ...
}

impl ElasticCacheBuilder {
    /// Map physical memory for `num_tokens` additional tokens at the end of the cache.
    pub fn allocate_for_tokens(&mut self, num_tokens: usize) -> Result<()> {
        let size = num_tokens * self.token_size();
        let offset = self.current_tokens * self.token_size();
        self.virtual_pool.allocate(offset, size)?;
        self.allocated_blocks.push((offset, size));
        self.current_tokens += num_tokens;
        Ok(())
    }
}
```
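The grow-only layout above keeps the KV cache contiguous in virtual address space: each growth step maps a new physical region immediately after the tokens already cached. The offset arithmetic can be checked in isolation; the tracker below is illustrative, not the engine's actual type, and `token_size` is assumed constant.

```rust
/// Illustrative tracker for a grow-only, contiguous KV-cache layout.
/// `token_size` is the number of bytes of KV state per token.
struct TokenLayout {
    token_size: usize,
    current_tokens: usize,
}

impl TokenLayout {
    /// Return the (offset, size) of the region that backs `num_tokens` new tokens.
    fn grow(&mut self, num_tokens: usize) -> (usize, usize) {
        let offset = self.current_tokens * self.token_size;
        let size = num_tokens * self.token_size;
        self.current_tokens += num_tokens;
        (offset, size)
    }
}

fn main() {
    let mut layout = TokenLayout { token_size: 4096, current_tokens: 0 };
    let first = layout.grow(128);
    let second = layout.grow(64);
    // Regions are contiguous: the second starts where the first ends.
    assert_eq!(first.0 + first.1, second.0);
    println!("first = {:?}, second = {:?}", first, second);
}
```

Because the virtual range is reserved up front, growing the cache never moves existing data; only new pages are mapped at the tail.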
Licensed under either of:
at your option.
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.