cuttle

Crate: cuttle
Version: 0.1.1
Created: 2025-10-24 07:02:00.313095+00
Updated: 2025-10-24 07:05:58.902741+00
Description: A large language model inference engine in Rust
Size: 170,422
Owner: Greedwolf DSS (passchaos)

README

Cuttle πŸ¦€

A CPU-based large language model inference engine implemented in pure Rust, specifically optimized for the Qwen3-0.6B model.

✨ Features

  • πŸ¦€ Pure Rust Implementation: No Python dependencies, high-performance CPU inference
  • πŸ€– Qwen3-0.6B Support: Specifically optimized for the Qwen3-0.6B model
  • 🌐 Bilingual Support: Supports both Chinese and English text generation
  • πŸ“¦ Auto Download: Automatic model download functionality
  • πŸ’» Command Line Interface: Easy-to-use CLI tool
  • πŸ”§ Flexible Configuration: Configurable inference parameters and tokenization system
  • πŸ“Š Performance Monitoring: Built-in performance analysis and benchmarking

πŸ—οΈ Architecture

Cuttle adopts a modular design with the following main components (a short import sketch follows the list):

  • Tensor Module (tensor): High-performance tensor operations using pure Rust
  • Model Module (model): Transformer architecture implementation
  • Tokenizer Module (tokenizer): Text tokenization and encoding
  • Inference Engine (inference): Complete inference pipeline
  • Utils Module (utils): Performance monitoring and utility functions
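
A minimal import sketch of how these modules map onto types. The module paths are assumptions based on the source layout; the main types are also re-exported at the crate root, as the Basic Usage section below shows:

// Module paths follow the architecture list above; adjust if the crate layout differs.
use cuttle::tensor::Tensor;                       // tensor operations
use cuttle::model::{Model, ModelConfig};          // Transformer model definition
use cuttle::tokenizer::create_default_tokenizer;  // tokenization and encoding
use cuttle::inference::InferenceEngine;           // end-to-end inference pipeline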

πŸ“¦ Installation and Build

System Requirements

  • Rust 1.70+
  • Memory: 4GB+ recommended
  • Storage: ~2GB for model files
  • Network: Internet connection required for initial model download

Build from Source

# Clone repository
git clone https://github.com/passchaos/cuttle.git
cd cuttle

# Debug build
cargo build

# Release build (recommended for production use)
cargo build --release

# Install command line tool
cargo install --path .

πŸš€ Quick Start

1. Download Qwen3-0.6B Model

# Download Qwen3-0.6B model files to assets directory
cargo run -- download

# Force re-download (if files already exist)
cargo run -- download --force

2. Text Generation

# Chinese text generation
cargo run -- generate --prompt "δ½ ε₯½οΌŒθ―·δ»‹η»δΈ€δΈ‹θ‡ͺ己。"

# English text generation
cargo run -- generate --prompt "Hello, how are you?"

# Interactive mode
cargo run -- generate --interactive

# Custom parameters
cargo run -- generate \
  --prompt "θ―·ε†™δΈ€ι¦–ε…³δΊŽζ˜₯ε€©ηš„θ―—γ€‚" \
  --max-length 200 \
  --temperature 0.8 \
  --top-p 0.9

3. View Model Information

# Display model information
cargo run -- info

πŸ’» Programming Interface

Basic Usage

use cuttle::{
    InferenceEngine, Model, ModelConfig, 
    Tokenizer, InferenceConfig
};

// Create model configuration
let config = ModelConfig::default();
let model = Model::new(config)?;

// Create tokenizer
let mut tokenizer = cuttle::tokenizer::create_default_tokenizer();
let texts = vec!["hello world".to_string()];
tokenizer.build_vocab(&texts)?;

// Create inference engine
let engine = InferenceEngine::new(model, tokenizer);

// Generate text
let response = engine.generate("Hello, how are you?")?;
println!("Generated: {}", response);

Custom Inference Configuration

let inference_config = InferenceConfig {
    max_length: 512,
    temperature: 0.8,
    top_p: 0.9,
    top_k: 50,
    do_sample: true,
    repetition_penalty: 1.1,
};

let engine = InferenceEngine::with_config(model, tokenizer, inference_config);

Batch Processing

let prompts = vec![
    "What is AI?".to_string(),
    "Explain machine learning".to_string(),
    "How does deep learning work?".to_string(),
];

let responses = engine.generate_batch(&prompts)?;
for (prompt, response) in prompts.iter().zip(responses.iter()) {
    println!("Q: {}\nA: {}\n", prompt, response);
}

Tensor Operations

use cuttle::tensor::Tensor;

// Create tensors
let a = Tensor::randn(&[128, 256])?;
let b = Tensor::randn(&[256, 512])?;

// Matrix multiplication
let c = a.matmul(&b)?;

// Activation function
let activated = c.gelu();

// Softmax
let probs = activated.softmax(1)?;
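
The same handful of operations is enough to sketch a tiny feed-forward block. The snippet below is illustrative only: it uses nothing beyond the randn, matmul, gelu, and softmax calls shown above, and the shapes are arbitrary:

use cuttle::tensor::Tensor;

// A minimal two-layer MLP head: expand, activate, project back, normalize.
let x  = Tensor::randn(&[8, 256])?;    // a batch of 8 input vectors
let w1 = Tensor::randn(&[256, 512])?;  // expansion weights
let w2 = Tensor::randn(&[512, 256])?;  // projection weights

let hidden = x.matmul(&w1)?.gelu();    // expand and apply GELU
let logits = hidden.matmul(&w2)?;      // project back to the input width
let probs  = logits.softmax(1)?;       // normalize along the last dimension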

βš™οΈ Configuration

Model Configuration (config.json)

{
  "vocab_size": 32000,
  "hidden_size": 4096,
  "num_layers": 32,
  "num_attention_heads": 32,
  "intermediate_size": 11008,
  "max_position_embeddings": 2048,
  "rms_norm_eps": 1e-6
}

Configuration Options

  • --max-length: Maximum generation length (default: 512)
  • --temperature: Temperature parameter, controls randomness (default: 1.0)
  • --top-p: Top-p sampling parameter (default: 0.9)
  • --top-k: Top-k sampling parameter (default: 50)
  • --interactive: Interactive mode
  • --force: Force re-download model
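
When driving the library directly instead of the CLI, the same options map onto InferenceConfig. The sketch below simply mirrors the documented defaults; the flag-to-field correspondence is an assumption based on the names, and the last two fields reuse the values from the Custom Inference Configuration example above:

use cuttle::InferenceConfig;

// CLI defaults expressed as an InferenceConfig (assumed 1:1 mapping).
let cli_defaults = InferenceConfig {
    max_length: 512,          // --max-length
    temperature: 1.0,         // --temperature
    top_p: 0.9,               // --top-p
    top_k: 50,                // --top-k
    do_sample: true,          // sampling enabled, as in the earlier example
    repetition_penalty: 1.1,  // value taken from the earlier example
};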

πŸ“Š Performance Benchmarks

Run benchmarks:

# Run all benchmarks
cargo bench

# Run specific benchmarks
cargo bench tensor_operations
cargo bench inference

Performance Optimization Tips

  1. Compilation Optimization: Use --release mode
  2. Pure Rust Implementation: No external BLAS dependencies required
  3. Parallel Processing: Utilize Rayon for parallel computation (see the sketch after this list)
  4. Memory Management: Avoid unnecessary memory allocations
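
As a standalone illustration of the Rayon point above (this is not cuttle's internal code), data-parallel iterators let independent rows be processed across cores with almost no change to the serial loop:

use rayon::prelude::*;

// Normalize each row of a matrix in parallel; rows are independent,
// so par_iter_mut splits the work across the available cores.
fn normalize_rows(rows: &mut [Vec<f32>]) {
    rows.par_iter_mut().for_each(|row| {
        let sum: f32 = row.iter().sum();
        if sum != 0.0 {
            for v in row.iter_mut() {
                *v /= sum;
            }
        }
    });
}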

πŸ§ͺ Testing

# Run unit tests
cargo test

# Run integration tests
cargo test --test integration

# Run documentation tests
cargo test --doc

πŸ“š API Documentation

Generate and view API documentation:

cargo doc --open

πŸ› οΈ Development

Project Structure

cuttle/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ lib.rs          # Library entry point
β”‚   β”œβ”€β”€ main.rs         # Command line tool
β”‚   β”œβ”€β”€ model.rs        # Model definition
β”‚   β”œβ”€β”€ inference.rs    # Inference engine
β”‚   β”œβ”€β”€ tensor.rs       # Tensor operations
β”‚   β”œβ”€β”€ tokenizer.rs    # Tokenizer
β”‚   β”œβ”€β”€ downloader.rs   # Model downloader
β”‚   β”œβ”€β”€ error.rs        # Error handling
β”‚   └── utils.rs        # Utility functions
β”œβ”€β”€ assets/             # Model file storage directory
β”‚   └── qwen3-0.6b/    # Qwen3-0.6B model files
β”œβ”€β”€ examples/           # Example code
β”œβ”€β”€ benches/           # Performance tests
β”œβ”€β”€ tests/             # Integration tests
β”œβ”€β”€ Cargo.toml         # Project configuration
└── README.md          # Project documentation

πŸ€– Qwen3-0.6B Model Configuration

  • Parameters: 0.6B
  • Vocabulary Size: 151,936
  • Hidden Dimension: 1,024
  • Layers: 28
  • Attention Heads: 16
  • Key-Value Heads: 8 (GQA)
  • Supported Languages: Chinese, English, and other languages
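
Note that these figures differ from the example config.json shown earlier, which appears to reflect generic defaults rather than Qwen3-0.6B. A hedged sketch of expressing the Qwen3-0.6B values as a ModelConfig, assuming the struct exposes public fields named after the config.json keys:

use cuttle::ModelConfig;

// Field names follow the config.json keys above; whether ModelConfig exposes
// them publicly, and how the 8 key-value heads (GQA) are configured, is an
// assumption rather than a confirmed API.
let qwen3_config = ModelConfig {
    vocab_size: 151_936,
    hidden_size: 1_024,
    num_layers: 28,
    num_attention_heads: 16,
    ..ModelConfig::default()
};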

πŸ“ Usage Examples

Chinese Text Generation

cargo run -- generate --prompt "θ―·ε†™δΈ€ι¦–ε…³δΊŽζ˜₯ε€©ηš„θ―—γ€‚" --max-length 150

English Text Generation

cargo run -- generate --prompt "Explain quantum computing in simple terms." --max-length 200

Interactive Dialogue

cargo run -- generate --interactive

Contributing Guidelines

  1. Fork the project
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Create a Pull Request

Code Style

  • Use rustfmt to format code
  • Use clippy for code linting
  • Write comprehensive documentation and tests

# Format code
cargo fmt

# Code linting
cargo clippy

πŸ”§ Troubleshooting

Common Issues

Q: Compilation errors

A: Ensure you have the latest Rust toolchain:

# Update Rust
rustup update

# If the build needs the Rust 2024 edition, use Rust 1.85+ (or a nightly toolchain)
rustup toolchain install nightly

Q: Slow inference speed

A: Check the following optimization options:

  • Compile with --release mode
  • Adjust batch processing size
  • Use smaller models for testing
  • Enable parallel processing

Q: High memory usage

A: Try the following approaches:

  • Reduce model size
  • Lower batch processing size
  • Use smaller sequence lengths

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • rayon - Parallel computing framework
  • serde - Serialization framework
  • clap - Command line argument parsing
  • tokio - Asynchronous runtime


Cuttle - Power your AI inference with Rust πŸ¦€βœ¨
