| Crates.io | cuttle |
| lib.rs | cuttle |
| version | 0.1.1 |
| created_at | 2025-10-24 07:02:00.313095+00 |
| updated_at | 2025-10-24 07:05:58.902741+00 |
| description | A large language model inference engine in Rust |
| homepage | |
| repository | |
| max_upload_size | |
| id | 1898093 |
| size | 170,422 |
A CPU-based large language model inference engine implemented in pure Rust, optimized specifically for the Qwen3-0.6B model.
Cuttle adopts a modular design with the following main components:
- `tensor`: High-performance tensor operations using pure Rust
- `model`: Transformer architecture implementation
- `tokenizer`: Text tokenization and encoding
- `inference`: Complete inference pipeline
- `utils`: Performance monitoring and utility functions

# Clone repository
git clone https://github.com/passchaos/cuttle.git
cd cuttle
# Debug build
cargo build
# Release build (recommended for production use)
cargo build --release
# Install command line tool
cargo install --path .
# Download Qwen3-0.6B model files to assets directory
cargo run -- download
# Force re-download (if files already exist)
cargo run -- download --force
# Chinese text generation
cargo run -- generate --prompt "δ½ ε₯½οΌθ―·δ»η»δΈδΈθͺε·±γ"
# English text generation
cargo run -- generate --prompt "Hello, how are you?"
# Interactive mode
cargo run -- generate --interactive
# Custom parameters
cargo run -- generate \
--prompt "θ―·εδΈι¦ε
³δΊζ₯倩ηθ―γ" \
--max-length 200 \
--temperature 0.8 \
--top-p 0.9
# Display model information
cargo run -- info
use cuttle::{
    InferenceEngine, Model, ModelConfig,
    Tokenizer, InferenceConfig
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create model configuration
    let config = ModelConfig::default();
    let model = Model::new(config)?;

    // Create tokenizer
    let mut tokenizer = cuttle::tokenizer::create_default_tokenizer();
    let texts = vec!["hello world".to_string()];
    tokenizer.build_vocab(&texts)?;

    // Create inference engine
    let engine = InferenceEngine::new(model, tokenizer);

    // Generate text
    let response = engine.generate("Hello, how are you?")?;
    println!("Generated: {}", response);

    Ok(())
}
let inference_config = InferenceConfig {
max_length: 512,
temperature: 0.8,
top_p: 0.9,
top_k: 50,
do_sample: true,
repetition_penalty: 1.1,
};
let engine = InferenceEngine::with_config(model, tokenizer, inference_config);
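The `temperature`, `top_p`, and `top_k` fields above are the usual sampling knobs. For intuition, here is a minimal, self-contained sketch of how these filters are typically combined when turning raw logits into a sampling distribution. It is illustrative only, not Cuttle's internal implementation, and the function name `filter_probs` is made up for the example:

```rust
/// Illustrative sketch only (not Cuttle's internal code): combine temperature,
/// top-k, and top-p filtering over raw next-token logits.
fn filter_probs(logits: &[f32], temperature: f32, top_k: usize, top_p: f32) -> Vec<f32> {
    // 1. Temperature scaling followed by a numerically stable softmax.
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> = exps.iter().map(|&e| e / sum).enumerate().collect();

    // 2. Top-k: keep only the k most likely tokens.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    probs.truncate(top_k);

    // 3. Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches top_p.
    let mut cumulative = 0.0f32;
    let mut cutoff = probs.len();
    for (i, &(_, p)) in probs.iter().enumerate() {
        cumulative += p;
        if cumulative >= top_p {
            cutoff = i + 1;
            break;
        }
    }
    probs.truncate(cutoff);

    // 4. Renormalize the survivors back into a full-vocabulary distribution.
    let kept: f32 = probs.iter().map(|&(_, p)| p).sum();
    let mut dist = vec![0.0; logits.len()];
    for (idx, p) in probs {
        dist[idx] = p / kept;
    }
    dist
}
```

Lower temperatures sharpen the distribution before filtering, while `top_k` and `top_p` discard the low-probability tail; setting `do_sample: false` typically corresponds to greedy (argmax) decoding instead.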
let prompts = vec![
"What is AI?".to_string(),
"Explain machine learning".to_string(),
"How does deep learning work?".to_string(),
];
let responses = engine.generate_batch(&prompts)?;
for (prompt, response) in prompts.iter().zip(responses.iter()) {
println!("Q: {}\nA: {}\n", prompt, response);
}
use cuttle::tensor::Tensor;
// Create tensors
let a = Tensor::randn(&[128, 256])?;
let b = Tensor::randn(&[256, 512])?;
// Matrix multiplication
let c = a.matmul(&b)?;
// Activation function
let activated = c.gelu();
// Softmax
let probs = activated.softmax(1)?;
{
"vocab_size": 32000,
"hidden_size": 4096,
"num_layers": 32,
"num_attention_heads": 32,
"intermediate_size": 11008,
"max_position_embeddings": 2048,
"rms_norm_eps": 1e-6
}
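As a sketch of setting these values through the library instead of a JSON file, the snippet below builds a `ModelConfig` by hand. The field names are assumed to mirror the JSON keys above and may differ from the actual struct, so treat this as illustrative:

```rust
use cuttle::{Model, ModelConfig};

// Assumed field names mirroring the JSON keys above; adjust to the real struct.
let config = ModelConfig {
    vocab_size: 32000,
    hidden_size: 4096,
    num_layers: 32,
    num_attention_heads: 32,
    intermediate_size: 11008,
    max_position_embeddings: 2048,
    rms_norm_eps: 1e-6,
    ..ModelConfig::default()
};
let model = Model::new(config)?;
```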
- `--max-length`: Maximum generation length (default: 512)
- `--temperature`: Temperature parameter, controls randomness (default: 1.0)
- `--top-p`: Top-p sampling parameter (default: 0.9)
- `--top-k`: Top-k sampling parameter (default: 50)
- `--interactive`: Interactive mode
- `--force`: Force re-download of model files

Run benchmarks:
# Run all benchmarks
cargo bench
# Run specific benchmarks
cargo bench tensor_operations
cargo bench inference
Benchmarks are always built in `--release` mode.

# Run unit tests
cargo test
# Run integration tests
cargo test --test integration
# Run documentation tests
cargo test --doc
Generate and view API documentation:
cargo doc --open
cuttle/
βββ src/
β   βββ lib.rs          # Library entry point
β   βββ main.rs         # Command line tool
β   βββ model.rs        # Model definition
β   βββ inference.rs    # Inference engine
β   βββ tensor.rs       # Tensor operations
β   βββ tokenizer.rs    # Tokenizer
β   βββ downloader.rs   # Model downloader
β   βββ error.rs        # Error handling
β   βββ utils.rs        # Utility functions
βββ assets/             # Model file storage directory
β   βββ qwen3-0.6b/     # Qwen3-0.6B model files
βββ examples/           # Example code
βββ benches/            # Performance tests
βββ tests/              # Integration tests
βββ Cargo.toml          # Project configuration
βββ README.md           # Project documentation
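For orientation, a sketch of how `src/lib.rs` might declare these modules and the root re-exports used by the examples earlier in this README; the exact re-export list is an assumption, not the crate's verbatim source:

```rust
// src/lib.rs (sketch based on the layout above)
pub mod downloader;
pub mod error;
pub mod inference;
pub mod model;
pub mod tensor;
pub mod tokenizer;
pub mod utils;

// Root re-exports assumed from the `use cuttle::{...}` examples above.
pub use inference::{InferenceConfig, InferenceEngine};
pub use model::{Model, ModelConfig};
pub use tokenizer::Tokenizer;
```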
cargo run -- generate --prompt "θ―·εδΈι¦ε
³δΊζ₯倩ηθ―γ" --max-length 150
cargo run -- generate --prompt "Explain quantum computing in simple terms." --max-length 200
cargo run -- generate --interactive
1. Create a feature branch (`git checkout -b feature/amazing-feature`)
2. Commit your changes (`git commit -m 'Add amazing feature'`)
3. Push the branch (`git push origin feature/amazing-feature`)

Code style:
- Use `rustfmt` to format code
- Use `clippy` for code linting

# Format code
cargo fmt
# Code linting
cargo clippy
Q: Compilation errors
A: Ensure you have the latest Rust toolchain:
# Update Rust
rustup update
# The Rust 2024 edition requires Rust 1.85 or newer; nightly is optional
rustup toolchain install nightly
Q: Slow inference speed
A: Check the following optimization options:
- Build and run in `--release` mode

Q: High memory usage
A: Try the following approaches:
This project is licensed under the MIT License - see the LICENSE file for details.
Cuttle - Power your AI inference with Rust π¦β¨