kreuzberg-tesseract

Crates.io: kreuzberg-tesseract
lib.rs: kreuzberg-tesseract
version: 4.1.2
created_at: 2025-11-15 08:28:00.057882+00
updated_at: 2026-01-25 12:35:55.414638+00
description: Rust bindings for Tesseract OCR with cross-compilation, C++17, and caching improvements
homepage: https://kreuzberg.dev
repository: https://github.com/kreuzberg-dev/kreuzberg
id: 1934151
size: 245,401
Na'aman Hirschfeld (Goldziher)

documentation: https://docs.kreuzberg.dev

README

kreuzberg-tesseract

Rust bindings for Tesseract OCR with built-in compilation of Tesseract and Leptonica libraries. Provides a safe and idiomatic Rust interface to Tesseract's functionality while handling the complexity of compiling the underlying C++ libraries.

Based on the original tesseract-rs by Cafer Can Gündoğdu, this maintained version adds critical improvements for production use:

  • C++17 Support: Upgraded for Tesseract 5.5.1, which requires the C++17 filesystem library
  • Cross-Compilation: Fixed CXX compiler detection for cross-platform builds
  • Architecture Validation: Validates target architecture before using cached libraries
  • Windows Static Linking: Fixed MSVC static linking issues
  • Build Caching: Improved caching with OUT_DIR-based cache directory
  • MinGW Support: Added support for MinGW toolchains

Features

  • Safe Rust bindings for Tesseract OCR
  • Multiple linking options:
    • Static linking (default): Built-in compilation with no runtime dependencies
    • Dynamic linking: Link to system-installed libraries for faster builds
  • Uses existing Tesseract training data (expects English data for tests)
  • High-level Rust API for common OCR tasks
  • Caching of compiled libraries for faster subsequent builds
  • Support for multiple operating systems (Linux, macOS, Windows)

Installation

Static Linking (Default)

Static linking builds Tesseract and Leptonica from source and embeds them in your binary. No runtime dependencies required:

[dependencies]
kreuzberg-tesseract = "4.1.2"
# or explicitly:
kreuzberg-tesseract = { version = "4.1.2", features = ["static-linking"] }

Dynamic Linking

Dynamic linking uses system-installed Tesseract and Leptonica libraries. Faster builds, but requires libraries installed on the system:

[dependencies]
kreuzberg-tesseract = { version = "4.1.2", features = ["dynamic-linking"], default-features = false }

System requirements for dynamic linking:

  • Tesseract 5.x libraries installed (libtesseract, libleptonica)
  • macOS: brew install tesseract leptonica
  • Ubuntu/Debian: sudo apt-get install libtesseract-dev libleptonica-dev
  • RHEL/CentOS/Fedora: sudo dnf install tesseract-devel leptonica-devel
  • Windows: Install from Tesseract releases or vcpkg

Development Dependencies

For development and testing, you'll also need these dependencies:

[dev-dependencies]
image = "0.25.5"

System Requirements

For Static Linking (Default)

When building with static linking, the crate will compile Tesseract and Leptonica from source. You need:

  • Rust 1.85.0 or later
  • A C++ compiler (e.g., gcc, clang, MSVC on Windows)
  • CMake 3.x or later
  • Internet connection (for downloading Tesseract source code)

For Dynamic Linking

When using dynamic linking with system-installed libraries, you need:

  • Rust 1.85.0 or later
  • Tesseract 5.x and Leptonica libraries installed on your system (see Installation section)

No C++ compiler or CMake required for dynamic linking builds.

For a full development environment checklist (including optional tooling suggestions), see CONTRIBUTING.md.

Environment Variables

The following environment variables affect the build and test process:

Build Variables

  • CARGO_CLEAN: If set, cleans the cache directory before building
  • RUSTC_WRAPPER: If set to "sccache", enables compiler caching with sccache
  • CC: Compiler selection for C code (affects Linux builds)
  • HOME (Unix) or APPDATA (Windows): Used to determine cache directory location
  • TESSERACT_RS_CACHE_DIR: Optional override for the cache root. When unset or not writable, the build falls back to the default OS-specific directory, and if that still fails, a temporary directory under the system temp folder is used automatically.
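
The fallback order described for TESSERACT_RS_CACHE_DIR can be pictured with a small sketch. This is not the crate's build script: the function name resolve_cache_dir, the injected is_writable check, and the "tesseract-rs" folder name under the temp directory are all assumptions made for illustration.

```rust
use std::path::{Path, PathBuf};

/// Illustrative only: picks a cache root in the order
/// override -> OS default -> directory under the system temp folder.
/// `is_writable` is injected so the policy can be checked without touching disk.
fn resolve_cache_dir(
    override_dir: Option<&Path>,
    os_default: Option<&Path>,
    temp_dir: &Path,
    is_writable: impl Fn(&Path) -> bool,
) -> PathBuf {
    override_dir
        .into_iter()
        .chain(os_default)
        // Take the first candidate that passes the writability check.
        .find(|dir| is_writable(*dir))
        .map(Path::to_path_buf)
        // Assumed folder name; the real temp fallback may be named differently.
        .unwrap_or_else(|| temp_dir.join("tesseract-rs"))
}
```

In a real build the override would come from std::env::var_os("TESSERACT_RS_CACHE_DIR") and the last argument from std::env::temp_dir().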

Test Variables

  • TESSDATA_PREFIX (Optional): Path to override the default tessdata directory. If not set, the crate will use its default cache directory.

Cache and Data Directories

The crate uses the following directory structure based on your operating system:

  • macOS: ~/Library/Application Support/tesseract-rs
  • Linux: ~/.tesseract-rs
  • Windows: %APPDATA%/tesseract-rs

The cache includes:

  • Compiled Tesseract and Leptonica libraries
  • Third-party source code

Training data is not downloaded during the build. Provide eng.traineddata (and any other languages you need) via TESSDATA_PREFIX or your system Tesseract installation.
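
Because training data has to be supplied by you, it can help to check for the expected files before initializing the OCR engine. The following is a minimal sketch using only the standard library; the helper names missing_languages and traineddata_path are hypothetical and not part of kreuzberg-tesseract. It assumes the usual Tesseract layout of one <lang>.traineddata file per language directly inside the tessdata directory.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: path where Tesseract would look for a language file.
fn traineddata_path(tessdata_dir: &Path, lang: &str) -> PathBuf {
    tessdata_dir.join(format!("{lang}.traineddata"))
}

/// Hypothetical helper: returns the languages whose `.traineddata` file
/// is not present in `tessdata_dir`.
fn missing_languages(tessdata_dir: &Path, languages: &[&str]) -> Vec<String> {
    languages
        .iter()
        .filter(|lang| !traineddata_path(tessdata_dir, lang).is_file())
        .map(|lang| (*lang).to_owned())
        .collect()
}
```

Calling missing_languages(&tessdata_dir, &["eng"]) before init gives a clearer failure message than an initialization error when eng.traineddata is absent.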

Testing

The project includes several integration tests that verify OCR functionality. To run the tests:

  1. Ensure you have the required test dependencies:

    [dev-dependencies]
    image = "0.25.9"
    
  2. Run the tests:

    cargo test
    

Note: Make sure eng.traineddata is available in your tessdata directory before running tests. If TESSDATA_PREFIX is not set, the tests look in the default cache location. You can point the tests at a custom tessdata directory by setting:

# Linux/macOS
export TESSDATA_PREFIX=/path/to/custom/tessdata

# Windows (PowerShell)
$env:TESSDATA_PREFIX="C:\path\to\custom\tessdata"

Available test cases:

  • OCR on English sample images
  • Error handling and invalid input coverage

Test images are sourced from the shared test_documents/ directory in the repository:

  • images/test_hello_world.png: Simple English text
  • tables/simple_table.png: Basic table with English headers

Usage

Here's a basic example of how to use kreuzberg-tesseract:

use std::path::PathBuf;
use std::error::Error;
use kreuzberg_tesseract::TesseractAPI;

fn get_default_tessdata_dir() -> PathBuf {
    if cfg!(target_os = "macos") {
        let home_dir = std::env::var("HOME").expect("HOME environment variable not set");
        PathBuf::from(home_dir)
            .join("Library")
            .join("Application Support")
            .join("tesseract-rs")
            .join("tessdata")
    } else if cfg!(target_os = "linux") {
        let home_dir = std::env::var("HOME").expect("HOME environment variable not set");
        PathBuf::from(home_dir)
            .join(".tesseract-rs")
            .join("tessdata")
    } else if cfg!(target_os = "windows") {
        PathBuf::from(std::env::var("APPDATA").expect("APPDATA environment variable not set"))
            .join("tesseract-rs")
            .join("tessdata")
    } else {
        panic!("Unsupported operating system");
    }
}

fn get_tessdata_dir() -> PathBuf {
    match std::env::var("TESSDATA_PREFIX") {
        Ok(dir) => {
            let path = PathBuf::from(dir);
            println!("Using TESSDATA_PREFIX directory: {:?}", path);
            path
        }
        Err(_) => {
            let default_dir = get_default_tessdata_dir();
            println!(
                "TESSDATA_PREFIX not set, using default directory: {:?}",
                default_dir
            );
            default_dir
        }
    }
}

fn main() -> Result<(), Box<dyn Error>> {
    let api = TesseractAPI::new()?;

    // Get tessdata directory (uses default location or TESSDATA_PREFIX if set)
    let tessdata_dir = get_tessdata_dir();
    api.init(tessdata_dir.to_str().unwrap(), "eng")?;

    let width = 24;
    let height = 24;
    let bytes_per_pixel = 1;
    let bytes_per_line = width * bytes_per_pixel;

    // Initialize image data with all white pixels
    let mut image_data = vec![255u8; width * height];

    // Draw number 9 with clearer distinction
    for y in 4..19 {
        for x in 7..17 {
            // Top bar
            if y == 4 && x >= 8 && x <= 15 {
                image_data[y * width + x] = 0;
            }
            // Top curve left side
            if y >= 4 && y <= 10 && x == 7 {
                image_data[y * width + x] = 0;
            }
            // Top curve right side
            if y >= 4 && y <= 11 && x == 16 {
                image_data[y * width + x] = 0;
            }
            // Middle bar
            if y == 11 && x >= 8 && x <= 15 {
                image_data[y * width + x] = 0;
            }
            // Bottom right vertical line
            if y >= 11 && y <= 18 && x == 16 {
                image_data[y * width + x] = 0;
            }
            // Bottom bar
            if y == 18 && x >= 8 && x <= 15 {
                image_data[y * width + x] = 0;
            }
        }
    }

    // Set the image data
    api.set_image(
        &image_data,
        width.try_into().unwrap(),
        height.try_into().unwrap(),
        bytes_per_pixel.try_into().unwrap(),
        bytes_per_line.try_into().unwrap(),
    )?;

    // Set whitelist for digits only
    api.set_variable("tessedit_char_whitelist", "0123456789")?;

    // Set PSM mode to single character
    api.set_variable("tessedit_pageseg_mode", "10")?;

    // Get the recognized text
    let text = api.get_utf8_text()?;
    println!("Recognized text: {}", text.trim());

    Ok(())
}
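
The set_image call above takes the raw buffer together with width, height, bytes per pixel, and bytes per line, and these values must agree with the buffer length. As a sketch of how to keep them consistent, the hypothetical ImageLayout type below (not part of kreuzberg-tesseract) derives bytes_per_line for a tightly packed buffer with no row padding and rejects buffers of the wrong size.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: describes a tightly packed image buffer in the
/// terms the `set_image` call expects (all values as `i32`).
#[derive(Debug, PartialEq)]
struct ImageLayout {
    width: i32,
    height: i32,
    bytes_per_pixel: i32,
    bytes_per_line: i32,
}

impl ImageLayout {
    /// Returns `None` if the buffer length does not equal
    /// width * height * bytes_per_pixel, or if a value does not fit in `i32`.
    fn for_buffer(
        data: &[u8],
        width: usize,
        height: usize,
        bytes_per_pixel: usize,
    ) -> Option<Self> {
        let bytes_per_line = width.checked_mul(bytes_per_pixel)?;
        let expected_len = bytes_per_line.checked_mul(height)?;
        if data.len() != expected_len {
            return None;
        }
        Some(Self {
            width: i32::try_from(width).ok()?,
            height: i32::try_from(height).ok()?,
            bytes_per_pixel: i32::try_from(bytes_per_pixel).ok()?,
            bytes_per_line: i32::try_from(bytes_per_line).ok()?,
        })
    }
}
```

For the 24x24 grayscale buffer in the example, ImageLayout::for_buffer(&image_data, 24, 24, 1) yields a bytes_per_line of 24; an RGB buffer would use 3 bytes per pixel and 3 * width bytes per line.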

Advanced Usage

The API also supports more complex OCR tasks, including thread-safe operation. The example below clones a configured TesseractAPI and runs OCR on the same image from three threads:

use kreuzberg_tesseract::TesseractAPI;
use std::error::Error;
use std::path::PathBuf;
use std::sync::Arc;
use std::thread;

fn main() -> Result<(), Box<dyn Error>> {
    let tessdata_dir = get_tessdata_dir();
    let api = TesseractAPI::new()?;

    // Initialize the main API
    api.init(tessdata_dir.to_str().unwrap(), "eng")?;
    api.set_variable("tessedit_pageseg_mode", "1")?;

    // Load and prepare image data
    let (image_data, width, height) = load_test_image("sample_text.png")?;

    // Share image data across threads
    let image_data = Arc::new(image_data);
    let mut handles = vec![];

    // Spawn multiple threads for parallel OCR processing
    for _ in 0..3 {
        let api_clone = api.clone(); // Clones the API with all configurations
        let image_data = Arc::clone(&image_data);

        let handle = thread::spawn(move || {
            // Set image in each thread
            let res = api_clone.set_image(
                &image_data,
                width as i32,
                height as i32,
                3,
                3 * width as i32,
            );
            assert!(res.is_ok());

            // Perform OCR in parallel
            let text = api_clone.get_utf8_text()
                .expect("Failed to get text");
            println!("Thread result: {}", text);
        });
        handles.push(handle);
    }

    // Wait for all threads to complete
    for handle in handles {
        handle.join().unwrap();
    }

    Ok(())
}

// Helper function to get tessdata directory
fn get_tessdata_dir() -> PathBuf {
    // ... (implementation as shown in basic example)
}

// Helper function to load test image
fn load_test_image(filename: &str) -> Result<(Vec<u8>, u32, u32), Box<dyn Error>> {
    let img = image::open(filename)?
        .to_rgb8();
    let (width, height) = img.dimensions();
    Ok((img.into_raw(), width, height))
}
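
Both examples set tessedit_pageseg_mode with a bare number ("10" for a single character, "1" for automatic segmentation with orientation and script detection). A small local enum can make those values self-describing. The sketch below lists Tesseract's page segmentation modes with their numeric values; the PageSegMode type and as_variable_value method are hypothetical names for this illustration, and the crate may well offer its own typed equivalent, so check the API documentation first.

```rust
/// Tesseract page segmentation modes and their numeric values.
/// Local illustration only; not a type exported by kreuzberg-tesseract.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PageSegMode {
    OsdOnly = 0,
    AutoOsd = 1,
    AutoOnly = 2,
    Auto = 3,
    SingleColumn = 4,
    SingleBlockVertText = 5,
    SingleBlock = 6,
    SingleLine = 7,
    SingleWord = 8,
    CircleWord = 9,
    SingleChar = 10,
    SparseText = 11,
    SparseTextOsd = 12,
    RawLine = 13,
}

impl PageSegMode {
    /// String form suitable as the value of `tessedit_pageseg_mode`.
    fn as_variable_value(self) -> String {
        (self as i32).to_string()
    }
}
```

With this in place, the call in the basic example could be written as api.set_variable("tessedit_pageseg_mode", &PageSegMode::SingleChar.as_variable_value())?.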

Building

Static Linking (Default)

With static linking, the crate will automatically download and compile Tesseract and Leptonica during the build process. This may take some time on the first build (5-10 minutes), but subsequent builds will use the cached libraries.

To clean the cache and force a rebuild:

CARGO_CLEAN=1 cargo build

Dynamic Linking

With dynamic linking, the build is much faster (seconds instead of minutes) since it only links against system-installed libraries:

cargo build --no-default-features --features dynamic-linking

Note: Dynamic linking requires Tesseract and Leptonica to be installed on your system (see Installation section).

Documentation

For more detailed information, please check the API documentation.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

This project is based on the original tesseract-rs by Cafer Can Gündoğdu. We are grateful for the foundational work that made this project possible.

Contributing

We welcome contributions! Please see our Contributing Guide for details.

Quick Start for Contributors

  1. Fork and clone the repository
  2. Install uv and set up git hooks:
    curl -LsSf https://astral.sh/uv/install.sh | sh
    uvx prek install
    
  3. Make your changes following our commit message format
  4. Run tests: cargo test
  5. Submit a Pull Request

Our commit messages follow the Conventional Commits specification.

Acknowledgements

This project uses Tesseract OCR and Leptonica. We are grateful to the maintainers and contributors of these projects.

