optical-embeddings 0.3.0

DeepSeek-OCR - compress text into images

Repository: https://github.com/tuned-org-uk/optical-embeddings-rs
Documentation: https://docs.rs/optical-embeddings-rs
Author: Lorenzo (Mec-iS)

Optical Embeddings

A Rust implementation of DeepSeek-OCR, a vision-language model that compresses long text documents through optical encoding, built on the Burn deep learning framework.

📄 About DeepSeek-OCR

This implementation is based on the paper:

DeepSeek-OCR: Contexts Optical Compression
Haoran Wei, Yaofeng Sun, Yukun Li (DeepSeek-AI)
arXiv:2510.18234v1 [cs.CV], 21 Oct 2025
Paper (arXiv) | Official Repository

Key Innovation

DeepSeek-OCR addresses the computational challenges of processing long textual contexts in Large Language Models (LLMs) by leveraging context optical compression: a novel approach that treats rendered text images as an efficient compression medium. Instead of processing thousands of text tokens, the model encodes document images into a compact set of vision tokens.

Architecture Highlights

The system consists of two main components (a short sketch of the resulting token counts follows the list):

  1. DeepEncoder (~380M parameters): A hybrid vision encoder combining:
    • SAM-base (80M): Window attention for efficient local feature extraction
    • 16× Convolutional Compressor: Reduces spatial dimensions via two stride-2 conv layers
    • CLIP-large (300M): Global attention for semantic understanding
  2. DeepSeek3B-MoE Decoder (570M activated): A Mixture-of-Experts language model that reconstructs text from compressed vision tokens
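
Putting the two encoder stages together, the token budget is easy to derive. Below is a minimal, dependency-free sketch of that arithmetic (the function name and constants are illustrative, not the crate's API): SAM-base patchifies the input at 16×16 pixels, and the two stride-2 convolutions then halve each spatial dimension twice.

/// Illustrative arithmetic only: how an input resolution maps to a vision-token count.
/// The name `vision_token_count` is hypothetical, not part of the optical-embeddings API.
fn vision_token_count(image_side: usize) -> usize {
    let patch = 16;                              // SAM-base patch size: 16x16 pixels
    let patches_per_side = image_side / patch;   // e.g. 1024 / 16 = 64
    // Two stride-2 convolutions halve each spatial dimension twice,
    // i.e. a 4x reduction per side and a 16x reduction in token count.
    let compressed_per_side = patches_per_side / 4;
    compressed_per_side * compressed_per_side
}

fn main() {
    assert_eq!(vision_token_count(512), 64);    // Tiny mode
    assert_eq!(vision_token_count(640), 100);   // Small mode
    assert_eq!(vision_token_count(1024), 256);  // Base mode
    assert_eq!(vision_token_count(1280), 400);  // Large mode
}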

Compression Performance

According to the paper's findings on the Fox benchmark:

  • ~10× compression: Achieves 97% OCR decoding precision
  • ~20× compression: Maintains 60% accuracy
  • Token efficiency: Processes 1000 text words using only 100 vision tokens

The model supports multiple resolution modes optimized for different compression ratios:

Mode     Resolution    Vision Tokens   Compression Target
Tiny     512×512       64              Ultra-fast inference
Small    640×640       100             Balanced performance
Base     1024×1024     256             Default (10× target)
Large    1280×1280     400             High-precision OCR
Gundam   Dynamic       <800            Complex documents
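
The fixed-resolution rows of the table above follow the same 16× arithmetic as the sketch in the architecture section. The enum below is only a sketch of how that mapping might be captured in code (the type and method names are assumptions, and the dynamic Gundam mode is omitted):

/// Hypothetical mode table; not necessarily the types exposed by optical-embeddings.
#[derive(Clone, Copy, Debug)]
enum ResolutionMode {
    Tiny,  // 512x512   -> 64 tokens
    Small, // 640x640   -> 100 tokens
    Base,  // 1024x1024 -> 256 tokens
    Large, // 1280x1280 -> 400 tokens
}

impl ResolutionMode {
    fn resolution(self) -> usize {
        match self {
            ResolutionMode::Tiny => 512,
            ResolutionMode::Small => 640,
            ResolutionMode::Base => 1024,
            ResolutionMode::Large => 1280,
        }
    }

    fn vision_tokens(self) -> usize {
        // Same 16x reduction as above: (side / 16 / 4) squared.
        let per_side = self.resolution() / 16 / 4;
        per_side * per_side
    }
}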

🎯 Implementation Features

This Rust implementation provides:

  • ✅ Complete DeepEncoder architecture with SAM and CLIP encoders
  • ✅ Window & global attention mechanisms for efficient processing
  • ✅ 16× spatial compression via convolutional layers
  • ✅ Multi-resolution support (Tiny/Small/Base/Large modes)
  • ✅ GPU acceleration via WGPU (cross-platform) or CUDA (NVIDIA)
  • ✅ Information-theoretic compression metrics for analysis
  • ✅ Production-ready with proper error handling and logging

Run

  1. Download a font of your choice to assets/font.ttf
  2. cargo run --release
  3. Check the output image

Tests

cargo test --all-features -- --nocapture

Build Commands

CPU only (default):

cargo build --release
cargo run --release

With WGPU (GPU - works with NVIDIA, AMD, Intel, Apple Silicon):

cargo build --release --features wgpu
cargo run --release --features wgpu

With CUDA (NVIDIA only - fastest):

# Make sure CUDA toolkit is installed first:
# Ubuntu/Debian: sudo apt install nvidia-cuda-toolkit
# Or download from: https://developer.nvidia.com/cuda-downloads

cargo build --release --features cuda
cargo run --release --features cuda

Check which GPU you have:

# NVIDIA
nvidia-smi

# AMD/Intel/General
vulkaninfo | grep -i "device name"

# Or just try WGPU (works with most GPUs)
cargo run --release --features wgpu
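
The feature flags above usually translate into a compile-time choice of Burn backend. The snippet below is a sketch of that pattern, assuming the default build maps to Burn's NdArray CPU backend and the wgpu feature to the Wgpu backend; the actual wiring inside optical-embeddings may differ.

// Sketch of feature-driven backend selection; whether optical-embeddings
// wires its cargo features exactly like this is an assumption.
#[cfg(feature = "wgpu")]
type Backend = burn::backend::Wgpu;

#[cfg(not(feature = "wgpu"))]
type Backend = burn::backend::NdArray;

fn main() {
    // The same model code then runs on CPU or GPU depending on the build flags.
    println!("Selected backend: {}", std::any::type_name::<Backend>());
}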

Metrics

Try: cargo test -- test_information_compression_pipeline --nocapture

Information compression:

╔═══════════════════════════════════════════════════════════╗
║          Optical Embeddings Information Analysis           ║
╚═══════════════════════════════════════════════════════════╝

📝 TEXT INFORMATION:
  ├─ Bytes:                   641
  ├─ Characters:              641
  ├─ Words:                    87
  ├─ Unique chars:             42
  └─ Entropy (bits):       4.3794

🖼️  IMAGE INFORMATION:
  ├─ Bytes:                786432
  ├─ Pixels:               262144
  ├─ Unique colors:             2
  └─ Entropy (bits):       0.1507

🎯 VISION TOKENS:
  ├─ Token count:              64
  ├─ Embedding dim:          1024
  └─ Total values:          65536

📊 COMPRESSION METRICS:
  ├─ Text→Image:           0.0008× (smaller)
  ├─ Text→Tokens:          0.0098× (smaller)
  ├─ Image→Tokens:        12.0000× (compressed)
  └─ Effective (ent):     39.5030× (compression)

📈 INFORMATION FLOW:
  Original text:    641 bytes (4.3793884228266 bits entropy)
  Rendered image:   786432 bytes (0.15070330510950625 bits entropy)
  Vision tokens:    64 tokens × 1024 dims = 65536 values
  Effective rate:   80.12 bits/token

📊 COMPRESSION RESULTS:
  ├─ Text words: 87
  ├─ Vision tokens: 64
  ├─ Words/token: 1.36
  └─ Spatial compression: 1024 patches → 64 tokens = 16.0× reduction

✅ Compression test passed!
   - Achieved 16× spatial compression (1024 → 64 tokens)
   - Word-to-token ratio: 1.36
   - ✅ Effective compression: 1.36 words per vision token
test tests::tests::test_information_compression_pipeline ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 7 filtered out; finished in 1.28s
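
The entropy figures in the log appear to be plain Shannon entropies over symbol frequencies (per character for the text, per pixel value for the rendered image). A minimal sketch of the text-side computation, which the crate's own metrics may implement differently:

use std::collections::HashMap;

/// Shannon entropy in bits per character: H = -sum(p_i * log2(p_i)).
fn char_entropy_bits(text: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    for c in text.chars() {
        *counts.entry(c).or_insert(0) += 1;
    }
    let total = text.chars().count() as f64;
    counts
        .values()
        .map(|&n| {
            let p = n as f64 / total;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    // In the log above, a 641-character sample with 42 distinct characters
    // comes out at about 4.38 bits per character.
    let sample = "the quick brown fox jumps over the lazy dog";
    println!("{:.4} bits/char", char_entropy_bits(sample));
}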