| Crates.io | candle-coreml |
| lib.rs | candle-coreml |
| version | 0.3.1 |
| created_at | 2025-07-24 20:25:53.530664+00 |
| updated_at | 2025-09-10 19:57:23.155316+00 |
| description | CoreML inference engine for Candle tensors - provides Apple CoreML/ANE integration with real tokenization, safety fixes, and model calibration awareness |
| homepage | https://github.com/mazhewitt/candle-cormel |
| repository | https://github.com/mazhewitt/candle-cormel |
| max_upload_size | |
| id | 1766893 |
| size | 20,067,247 |
CoreML inference engine for Candle tensors - providing Apple CoreML integration for Rust machine learning applications.
candle-coreml is a standalone crate that bridges Candle tensors with Apple's CoreML framework, enabling efficient on-device inference on macOS and iOS. Unlike generic CoreML bindings, this crate provides direct Candle tensor integration, Candle-style APIs and error handling, and CPU/Metal device validation.
Add to your Cargo.toml:
[dependencies]
candle-coreml = "0.3.1"
candle-core = "0.9.1"
Basic usage with UnifiedModelLoader (Recommended):
use candle_coreml::UnifiedModelLoader;
// Load model directly from HuggingFace with automatic setup
let loader = UnifiedModelLoader::new()?;
let mut model = loader.load_model("anemll/anemll-Qwen-Qwen3-0.6B-LUT888-ctx512_0.3.4")?;
// Generate text using the new API
let response = model.complete_text(
"Hello, how are you?",
50, // max tokens
0.8, // temperature
)?;
println!("Response: {}", response);
Manual CoreML model loading:
use candle_coreml::{CoreMLModel, ModelConfig};
// Load model config (typically auto-generated)
let config = ModelConfig::load_from_file("model_config.json")?;
// Load CoreML model components
let model = CoreMLModel::load_from_file("model.mlpackage", &config)?;
// Create input tensor
let input = candle_core::Tensor::zeros((1, 128), candle_core::DType::I64, &candle_core::Device::Cpu)?;
// Run inference
let output = model.forward(&[input])?;
ANEMLL (pronounced "animal") provides state-of-the-art Apple Neural Engine optimizations for large language models. Our crate provides comprehensive support for ANEMLL's multi-component architecture.
ANEMLL converts large models into multiple specialized components that maximize Apple Neural Engine utilization:
| Model | Size | Context | Components | Status |
|---|---|---|---|---|
| Qwen 3 | 0.5B-7B | 512-32K | 3-part split | ✅ Fully Supported |
| Qwen 2.5 | 0.5B-7B | 512-32K | 3-part split | ✅ Fully Supported |
ANEMLL splits models into specialized components for optimal ANE performance:
Input Tokens → [Embeddings] → [FFN Transformer] → [LM Head] → Output Logits
                   ↓                      ↓                    ↓
           embeddings.mlmodelc   FFN_chunk_01.mlmodelc   lm_head.mlmodelc
Embeddings Model (qwen_embeddings.mlmodelc): produces hidden states of shape [batch, seq_len, hidden_dim]
FFN Model (qwen_FFN_PF_lut8_chunk_01of01.mlmodelc): transforms hidden states of shape [batch, seq_len, hidden_dim]
LM Head Model (qwen_lm_head_lut8.mlmodelc): maps [batch, 1, hidden_dim] to logits of shape [batch, 1, vocab_size]
use candle_coreml::UnifiedModelLoader;
// Load complete multi-component model with automatic setup
let loader = UnifiedModelLoader::new()?;
let mut model = loader.load_model("anemll/anemll-Qwen-Qwen3-0.6B-LUT888-ctx512_0.3.4")?;
// Generate text using the new API methods
let response = model.complete_text(
"Hello, how are you?",
50, // max tokens
0.8, // temperature
)?;
// Or use the more advanced generation method
let tokens = model.generate_tokens_topk_temp(
"Hello, how are you?",
50, // max tokens
0.8, // temperature
Some(50), // top_k
)?;
For advanced use cases, load components individually:
use candle_coreml::{CoreMLModel, ModelConfig, QwenModel, QwenConfig};
// Option 1: Load from directory with auto-generated config
let model_dir = "/path/to/downloaded/model";
let mut model = QwenModel::load_from_directory(&model_dir, None)?;
// Option 2: Manual component loading with ModelConfig
let config = ModelConfig::load_from_file("model_config.json")?;
let embeddings = CoreMLModel::load_from_file("embeddings.mlpackage", &config)?;
let ffn_prefill = CoreMLModel::load_from_file("ffn_prefill.mlpackage", &config)?;
let ffn_infer = CoreMLModel::load_from_file("ffn_infer.mlpackage", &config)?;
let lm_head = CoreMLModel::load_from_file("lm_head.mlpackage", &config)?;
// Use the high-level API for text generation
let response = model.complete_text("Hello!", 20, 0.7)?;
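For orientation, here is a rough sketch (not the crate's internal pipeline) of how the individually loaded components relate when wired by hand. It assumes each forward call returns a single output tensor, as in the simple example above, and it omits the position ids, causal mask, and KV state that QwenModel manages for you; the input_ids tensor below is purely illustrative.
use candle_core::{DType, Device, Tensor};
// Hypothetical token-id batch matching the embeddings input length from the config (illustration only)
let input_ids = Tensor::zeros((1, 64), DType::I64, &Device::Cpu)?;
// Sketch of the data flow; the real pipeline also feeds position_ids, causal_mask,
// current_pos and shared KV state into the FFN components
let hidden_states = embeddings.forward(&[input_ids])?;  // [batch, seq_len, hidden_dim]
let ffn_out = ffn_prefill.forward(&[hidden_states])?;   // plus mask/position inputs in practice
let logits = lm_head.forward(&[ffn_out])?;              // [batch, 1, vocab_size], possibly split into logits1..N
In practice you should not need this; QwenModel and the loader APIs wire the components up for you.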
# Recommended API demonstration
cargo run --example recommended_api_demo
# Multi-component chat with Qwen models (downloads ~2GB models)
cargo run --example qwen_chat
# Test thinking behavior and quality
cargo run --example test_thinking_behavior
cargo run --example proper_quality_test
# Performance comparisons
cargo run --example compare_loading_approaches
ANEMLL models are hosted on HuggingFace and downloaded automatically:
# Models are cached in ~/.cache/candle-coreml/
# First run downloads all components (~2GB for Qwen 0.6B)
# Available models:
# - anemll/anemll-Qwen-Qwen3-0.6B-ctx512_0.3.4
# - anemll/anemll-Qwen-Qwen2.5-0.5B-ctx512_0.3.4
# - More models available at: https://huggingface.co/anemll
ANEMLL provides reference apps showing production usage. Our crate provides the missing piece for Rust developers wanting to use ANEMLL's optimized models.
This makes ANEMLL's advanced ANE optimizations accessible to the entire Candle ecosystem.
📚 Complete ANEMLL Integration Guide - Comprehensive documentation covering architecture, usage patterns, and production deployment.
This crate follows the inference engine pattern rather than treating CoreML as a device backend:
| Feature | coreml-rs | candle-coreml |
|---|---|---|
| Bindings | swift-bridge | objc2 direct |
| Purpose | Generic CoreML | Candle tensor integration |
| API | Raw CoreML interface | Candle patterns (T5-like) |
| Error Handling | Generic | Candle error types |
| Device Support | Generic | CPU/Metal validation |
👉 BERT CoreML Inference - Step-by-Step Guide
A comprehensive tutorial covering:
Model formats (.mlpackage/.mlmodelc files)
Not all models run on the ANE! Apple's Neural Engine has strict requirements.
Recommendation: Use Apple's pre-optimized models (like their optimized BERT) for guaranteed ANE acceleration, or stick with Metal/CPU backends for general use.
ANE (fastest, most efficient) > GPU/Metal (fast) > CPU (most compatible)
Apple automatically chooses the best available backend, but your model must be ANE-compatible to benefit from the fastest option.
The recommended approach for loading and using models is the UnifiedModelLoader, which handles model downloading from HuggingFace, caching, and automatic config generation:
use candle_coreml::UnifiedModelLoader;
// Create loader (initializes cache and config generation)
let loader = UnifiedModelLoader::new()?;
// Load any ANEMLL model from HuggingFace
let mut model = loader.load_model("anemll/anemll-Qwen-Qwen3-0.6B-LUT888-ctx512_0.3.4")?;
// Available generation methods:
// 1. High-level text completion (recommended)
let response = model.complete_text("Hello, world!", 50, 0.8)?;
// 2. Advanced token generation with top-k sampling
let tokens = model.generate_tokens_topk_temp("Hello!", 20, 0.7, Some(40))?;
// 3. Single token prediction
let next_token = model.forward_text("Hello")?;
// 4. Text generation with parameters
let result = model.generate_text_with_params("Hello!", 30, 0.9)?;
The QwenModel provides several methods for text generation:
| Method | Description | Use Case |
|---|---|---|
| `complete_text(prompt, max_tokens, temperature)` | Recommended - High-level text completion | General text generation |
| `generate_tokens_topk_temp(prompt, max_tokens, temp, top_k)` | Advanced generation with top-k sampling | Fine-tuned control over generation |
| `forward_text(text)` | Single token prediction | Next token prediction, embeddings |
| `generate_text_with_params(prompt, max_tokens, temperature)` | Text generation with custom parameters | Custom generation logic |
| `generate_tokens()` | Deprecated - Use `generate_tokens_topk_temp()` instead | Legacy compatibility only |
Models and configs are cached automatically:
// Models cached in: ~/.cache/candle-coreml/models/
// Configs cached in: ~/.cache/candle-coreml/configs/
// Clear caches if needed
use candle_coreml::CacheManager;
let cache = CacheManager::new()?;
// cache.clear_model_cache()?; // if needed
Complex multi-component language models (e.g. ANEMLL Qwen variants, custom fine-tunes) are described declaratively using a ModelConfig JSON file. This removes hardcoded shapes and enables, among other things, selecting the FFN execution mode (ffn_execution = split | unified). Example config:
{
"model_info": { "model_type": "qwen", "path": "/path/to/model" },
"shapes": { "batch_size": 64, "context_length": 256, "hidden_size": 1024, "vocab_size": 151669 },
"components": {
"embeddings": { "file_path": "embeddings.mlpackage", "inputs": { "input_ids": {"shape": [1,64], "data_type": "INT32", "name": "input_ids" } }, "outputs": { "hidden_states": {"shape": [1,64,1024], "data_type": "FLOAT16", "name": "hidden_states" } }, "functions": [] },
"ffn_prefill": { "file_path": "ffn_prefill.mlpackage", "inputs": { "hidden_states": {"shape": [1,64,1024], "data_type": "FLOAT16","name":"hidden_states"}, "position_ids": {"shape":[64],"data_type":"INT32","name":"position_ids"}, "causal_mask": {"shape":[1,1,64,256],"data_type":"FLOAT16","name":"causal_mask"}, "current_pos": {"shape":[1],"data_type":"INT32","name":"current_pos"} }, "outputs": { "output_hidden_states": {"shape":[1,1,1024],"data_type":"FLOAT16","name":"output_hidden_states"} }, "functions":["prefill"] },
"ffn_infer": { "file_path": "ffn_infer.mlpackage", "inputs": { "hidden_states": {"shape": [1,1,1024], "data_type": "FLOAT16","name":"hidden_states"}, "position_ids": {"shape":[1],"data_type":"INT32","name":"position_ids"}, "causal_mask": {"shape":[1,1,1,256],"data_type":"FLOAT16","name":"causal_mask"}, "current_pos": {"shape":[1],"data_type":"INT32","name":"current_pos"} }, "outputs": { "output_hidden_states": {"shape":[1,1,1024],"data_type":"FLOAT16","name":"output_hidden_states"} }, "functions":["infer"] },
"lm_head": { "file_path": "lm_head.mlpackage", "inputs": { "hidden_states": {"shape":[1,1,1024],"data_type":"FLOAT16","name":"hidden_states" } }, "outputs": { "logits1": {"shape":[1,1,9480],"data_type":"FLOAT16","name":"logits1"}, "logits2": {"shape":[1,1,9479],"data_type":"FLOAT16","name":"logits2"} }, "functions": [] }
},
"ffn_execution": "split"
}
| Mode | When | Behavior |
|---|---|---|
| `unified` | Single CoreML package exposes `prefill` & `infer` functions | Shared file, one state, batched prefill then token-by-token infer |
| `split` | Separate `ffn_prefill` & `ffn_infer` model files | Distinct model files; state created from prefill model and reused for infer |
If ffn_execution is omitted, the system infers split when ffn_prefill.file_path != ffn_infer.file_path.
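A minimal sketch of that fallback rule (illustrative pseudocode, not the crate's actual function):
// Illustrative only: infer the execution mode from the two component file paths
fn infer_ffn_execution(prefill_path: &str, infer_path: &str) -> &'static str {
    if prefill_path != infer_path { "split" } else { "unified" }
}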
Prefill can be either batch (process full sequence in one call) or sequential (one token at a time). Sequential mode is auto-enabled when ffn_prefill.hidden_states shape has seq_len == 1 (e.g. [1,1,H]) indicating a single-token CoreML prefill variant. This matches certain fine-tuned or distilled models exported with single-token kernels.
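As a sketch, the detection amounts to checking the sequence dimension of the configured prefill input (function name here is illustrative):
// Illustrative only: sequential prefill is used when the ffn_prefill hidden_states
// shape is single-token, e.g. [1, 1, H] instead of [1, 64, H]
fn use_sequential_prefill(prefill_hidden_states_shape: &[usize]) -> bool {
    prefill_hidden_states_shape.get(1) == Some(&1)
}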
The LM head may output logits1..logitsN. The library detects the count dynamically and stitches them into a contiguous logits tensor; no manual configuration is needed beyond listing the outputs.
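Conceptually the stitching is just a concatenation along the vocab axis; a minimal Candle sketch of the idea (the crate performs this automatically):
use candle_core::{D, Result, Tensor};
// Illustrative only: join logits1..logitsN along the last (vocab) dimension
fn stitch_logits(parts: &[Tensor]) -> Result<Tensor> {
    Tensor::cat(parts, D::Minus1)
}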
ModelConfig::validate() checks basic consistency; validate_internal_wiring() ensures adjacent component tensor shapes align (e.g. embeddings → ffn_prefill). Warnings are logged but loading proceeds to aid iterative development.
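A minimal usage sketch, assuming both checks are invoked after loading and report problems through the usual Result/warning flow (exact signatures may differ):
use candle_coreml::ModelConfig;
// Illustrative only: run the consistency checks on a hand-edited config
let config = ModelConfig::load_from_file("model_config.json")?;
config.validate()?;                  // basic field/shape consistency
config.validate_internal_wiring()?;  // embeddings → ffn_prefill → lm_head shapes line up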
See CUSTOM_MODEL_GUIDE.md for deep-dive shape discovery tooling and advanced customization.
Legacy filename pattern discovery has been removed. Always set file_path for each component—this avoids ambiguity and improves reproducibility.
| Symptom | Likely Cause | Fix |
|---|---|---|
| `MultiArray shape (64) does not match shape (1)` | Prefill or infer mismatch between batch vs single-token tensors | Ensure correct `ffn_prefill` / `ffn_infer` shapes, or switch to sequential mode by setting the prefill `hidden_states` shape to `[1,1,H]` |
| Missing logits concatenation | Outputs not named `logits*` | Rename outputs or manually post-process |
| Incorrect token length padding | Embeddings `input_ids` shape mismatch | Align `embeddings.inputs.input_ids.shape` with the expected max prefill length |
| LM head shape mismatch | `output_hidden_states` vs `lm_head.hidden_states` differ | Regenerate config with the discovery tool; fix shapes |
For detailed examples, see the configs/ directory (e.g. anemll-qwen3-0.6b.json).
The examples/ directory demonstrates various usage patterns:
# Start with the recommended API
cargo run --example recommended_api_demo
# Interactive Qwen chat (downloads ~2GB on first run)
cargo run --example qwen_chat
# Test model quality
cargo run --example proper_quality_test
# Compare loading approaches
cargo run --example compare_loading_approaches
This is an independent project providing CoreML integration for the Candle ecosystem. Contributions welcome!
Licensed under either of Apache License, Version 2.0 or MIT license at your option.