minillm

version: 0.1.1
created_at: 2025-09-20 18:28:03 UTC
updated_at: 2025-09-25 17:51:18 UTC
description: A mini inference engine for running transformer language models
homepage: https://github.com/bmqube/minillm
repository: https://github.com/bmqube/minillm
id: 1848106
size: 137,443 bytes
BM Monjur Morshed (bmqube)

documentation: https://docs.rs/minillm

README

MiniLLM πŸ€–

A lightweight, efficient transformer inference engine written in Rust. MiniLLM provides a clean, well-documented implementation of GPT-2 style transformer models with support for text generation.

✨ Features

  • πŸš€ Fast Inference: Efficient tensor operations using ndarray
  • πŸ”’ Memory Safe: Written in Rust with zero-copy operations where possible
  • πŸ“¦ Easy to Use: High-level API for quick integration
  • 🎯 Well Tested: Comprehensive examples and documentation
  • πŸ”§ Extensible: Modular architecture for easy customization
  • πŸ€– GPT-2 Compatible: Load and run GPT-2 models from HuggingFace
  • πŸ“Š SafeTensors Support: Fast and secure model weight loading

πŸ—οΈ Architecture

src/
β”œβ”€β”€ lib.rs          # Library entry point and public API
β”œβ”€β”€ main.rs         # Simple CLI example (~27 lines)
β”œβ”€β”€ inference.rs    # High-level inference engine
β”œβ”€β”€ gpt.rs          # GPT model implementation
β”œβ”€β”€ transformer.rs  # Transformer block components
β”œβ”€β”€ attention.rs    # Multi-head attention mechanism
β”œβ”€β”€ mlp.rs          # Feed-forward network layers
β”œβ”€β”€ tensor.rs       # Tensor operations and math
β”œβ”€β”€ weights.rs      # Model weight loading (SafeTensors)
└── config.rs       # Model configuration handling

examples/
β”œβ”€β”€ basic_generation.rs  # Simple text generation
β”œβ”€β”€ interactive_chat.rs  # Interactive chat interface
└── tokenization.rs      # Tokenization examples

πŸš€ Quick Start

Library Usage

use minillm::inference::InferenceEngine;

fn main() -> minillm::Result<()> {
    // Load a GPT-2 model
    let engine = InferenceEngine::new("openai-community/gpt2")?;
    
    // Generate text
    let prompt = "The future of AI is";
    let generated = engine.generate(prompt, 20)?;
    
    println!("Generated: {}", generated);
    Ok(())
}
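
To use MiniLLM in your own project, pull it in from crates.io first:

cargo add minillm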

Command Line

# Run the main example
cargo run

# Run specific examples  
cargo run --example basic_generation
cargo run --example interactive_chat
cargo run --example tokenization

πŸ“‹ Requirements

  • Rust 1.70+
  • HuggingFace token (optional, for private models)

Set your HuggingFace token:

echo "HF_TOKEN=your_token_here" > .env

πŸ”§ Dependencies

  • ndarray - Tensor operations
  • safetensors - Model weight loading
  • tokenizers - Text tokenization
  • hf-hub - HuggingFace model downloading
  • serde - Configuration parsing

πŸ“– API Documentation

InferenceEngine

The main high-level interface:

// Create engine
let engine = InferenceEngine::new("openai-community/gpt2")?;

// Generate text
let result = engine.generate("prompt", max_tokens)?;

// Tokenization
let tokens = engine.tokenize("text")?;
let text = engine.decode(&tokens)?;

// Get model info
let config = engine.config();

Low-Level Components

For custom implementations, you can use the individual components:

  • GPTModel - Complete transformer model
  • TransformerBlock - Individual transformer layers
  • MultiHeadAttention - Attention mechanism
  • MLP - Feed-forward networks
  • Tensor - Mathematical operations

🎯 Examples

Basic Generation

cargo run --example basic_generation

Demonstrates simple text generation with model configuration display.

Interactive Chat

cargo run --example interactive_chat

Interactive command-line chat interface with the model.

Tokenization

cargo run --example tokenization

Shows tokenization, encoding/decoding, and round-trip verification.
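
For reference, a minimal round-trip check using only the tokenize/decode calls shown in the API section above (a sketch; exact token IDs and whitespace handling depend on the GPT-2 tokenizer):

use minillm::inference::InferenceEngine;

fn main() -> minillm::Result<()> {
    let engine = InferenceEngine::new("openai-community/gpt2")?;

    // Encode a prompt into token IDs, then decode the IDs back to text.
    let text = "Hello, world!";
    let tokens = engine.tokenize(text)?;
    let decoded = engine.decode(&tokens)?;

    // Round-trip verification: the decoded string should match the input.
    assert_eq!(text, decoded);
    println!("{:?} -> {}", tokens, decoded);
    Ok(())
}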

πŸ“Š Performance

MiniLLM is designed for inference efficiency:

  • Memory: ~1GB RAM for GPT-2 (117M parameters)
  • Speed: ~10-50 tokens/second (CPU, varies by hardware)
  • Accuracy: Identical outputs to reference implementations
  • Models: Currently supports GPT-2 architecture

πŸ› οΈ Development

# Clone and build
git clone https://github.com/bmqube/minillm
cd minillm
cargo build --release

# Run tests
cargo test

# Check examples
cargo check --examples

# Generate documentation
cargo doc --open

πŸ“š Architecture Details

Transformer Implementation

  • Multi-head attention with causal masking
  • Feed-forward networks with GELU activation
  • Layer normalization and residual connections
  • Position and token embeddings

Tensor Operations

  • Dynamic 1D-4D tensor support
  • Optimized matrix multiplication
  • Element-wise operations (add, softmax, layer_norm)
  • Memory-efficient implementations

Model Loading

  • SafeTensors format support
  • Automatic model downloading from HuggingFace
  • Configuration parsing and validation
  • Error handling with detailed messages

βœ… Current Status

  • βœ… Core Architecture: Complete GPT-2 implementation
  • βœ… Inference Engine: High-level API ready
  • βœ… Examples: Comprehensive usage examples
  • βœ… Documentation: Well-documented codebase
  • βœ… Testing: All components tested and working

πŸ—ΊοΈ Roadmap

  • Performance: GPU acceleration support
  • Models: Support for larger GPT variants
  • Features: Beam search and sampling options
  • Optimization: Quantization and pruning
  • Integration: Python bindings

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

πŸ™ Acknowledgments

  • Inspired by Andrej Karpathy's educational implementations
  • Built on the excellent Rust ecosystem (ndarray, tokenizers, etc.)
  • Model weights from HuggingFace transformers library

πŸ‘¨β€πŸ’» Author

BM Monjur Morshed
