| Crates.io | kreuzberg-tesseract |
| lib.rs | kreuzberg-tesseract |
| version | 4.1.2 |
| created_at | 2025-11-15 08:28:00.057882+00 |
| updated_at | 2026-01-25 12:35:55.414638+00 |
| description | Rust bindings for Tesseract OCR with cross-compilation, C++17, and caching improvements |
| homepage | https://kreuzberg.dev |
| repository | https://github.com/kreuzberg-dev/kreuzberg |
| max_upload_size | |
| id | 1934151 |
| size | 245,401 |
Rust bindings for Tesseract OCR with built-in compilation of Tesseract and Leptonica libraries. Provides a safe and idiomatic Rust interface to Tesseract's functionality while handling the complexity of compiling the underlying C++ libraries.
Based on the original tesseract-rs by Cafer Can Gündoğdu, this maintained version adds production-oriented improvements to cross-compilation, C++17 support, and caching.
Static linking builds Tesseract and Leptonica from source and embeds them in your binary. No runtime dependencies required:
[dependencies]
kreuzberg-tesseract = "1.0.0-rc.1"
# or explicitly:
kreuzberg-tesseract = { version = "1.0.0-rc.1", features = ["static-linking"] }
Dynamic linking uses system-installed Tesseract and Leptonica libraries. Faster builds, but requires libraries installed on the system:
[dependencies]
kreuzberg-tesseract = { version = "1.0.0-rc.1", features = ["dynamic-linking"], default-features = false }
System requirements for dynamic linking:
Tesseract and Leptonica libraries (libtesseract, libleptonica)
macOS (Homebrew): brew install tesseract leptonica
Debian/Ubuntu: sudo apt-get install libtesseract-dev libleptonica-dev
Fedora/RHEL: sudo dnf install tesseract-devel leptonica-devel
For development and testing, you'll also need these dependencies:
[dev-dependencies]
image = "0.25.5"
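When building with static linking, the crate compiles Tesseract and Leptonica from source, so you need at least a C++ compiler and CMake.
When using dynamic linking with system-installed libraries, you need the Tesseract and Leptonica development packages listed under the system requirements for dynamic linking.
No C++ compiler or CMake is required for dynamic linking builds.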
For a full development environment checklist (including optional tooling suggestions), see CONTRIBUTING.md.
The following environment variables affect the build and test process:
CARGO_CLEAN: If set, cleans the cache directory before building
RUSTC_WRAPPER: If set to "sccache", enables compiler caching with sccache
CC: Compiler selection for C code (affects Linux builds)
HOME (Unix) or APPDATA (Windows): Used to determine cache directory location
TESSERACT_RS_CACHE_DIR: Optional override for the cache root. When unset or not writable, the build falls back to the default OS-specific directory, and if that still fails, a temporary directory under the system temp folder is used automatically.
TESSDATA_PREFIX (Optional): Path to override the default tessdata directory. If not set, the crate will use its default cache directory.
The crate uses the following directory structure based on your operating system:
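macOS: ~/Library/Application Support/tesseract-rs
Linux: ~/.tesseract-rs
Windows: %APPDATA%/tesseract-rs
The cache includes the compiled Tesseract and Leptonica libraries, which subsequent builds reuse.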
Training data is not downloaded during the build. Provide eng.traineddata (and any other languages you need) via TESSDATA_PREFIX or your system Tesseract installation.
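The cache-root fallback described for TESSERACT_RS_CACHE_DIR can be sketched roughly as follows. This is an illustration of the documented order (override, then OS default, then temp folder), not the crate's actual build script: the function names, the probe file name, and the "tesseract-rs" subfolder under the temp directory are all assumptions made for the example.

```rust
use std::env;
use std::fs;
use std::path::{Path, PathBuf};

/// Returns true if `dir` exists (or can be created) and a file can be written into it.
fn is_writable(dir: &Path) -> bool {
    if fs::create_dir_all(dir).is_err() {
        return false;
    }
    // Hypothetical probe file name, used only for this check.
    let probe = dir.join(".write-probe");
    let ok = fs::write(&probe, b"").is_ok();
    let _ = fs::remove_file(&probe);
    ok
}

/// OS-specific default cache root, mirroring the directories listed above.
fn default_cache_dir() -> Option<PathBuf> {
    if cfg!(target_os = "macos") {
        env::var_os("HOME").map(|home| {
            PathBuf::from(home)
                .join("Library")
                .join("Application Support")
                .join("tesseract-rs")
        })
    } else if cfg!(target_os = "windows") {
        env::var_os("APPDATA").map(|appdata| PathBuf::from(appdata).join("tesseract-rs"))
    } else if cfg!(target_os = "linux") {
        env::var_os("HOME").map(|home| PathBuf::from(home).join(".tesseract-rs"))
    } else {
        None
    }
}

/// Picks the first writable candidate: explicit override, then the OS default,
/// then a folder under the system temp directory (folder name is an assumption).
fn resolve_cache_dir(override_dir: Option<PathBuf>) -> PathBuf {
    override_dir
        .into_iter()
        .chain(default_cache_dir())
        .find(|dir| is_writable(dir))
        .unwrap_or_else(|| env::temp_dir().join("tesseract-rs"))
}
```

A caller would typically pass the environment override in, for example resolve_cache_dir(std::env::var_os("TESSERACT_RS_CACHE_DIR").map(PathBuf::from)). Taking the override as a parameter keeps the function easy to test without mutating the process environment.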
The project includes several integration tests that verify OCR functionality. To run the tests:
Ensure you have the required test dependencies:
[dev-dependencies]
image = "0.25.9"
Run the tests:
cargo test
Note: Make sure eng.traineddata is available in your tessdata directory before running tests. If TESSDATA_PREFIX is not set, the tests look in the default cache location. You can point the tests at a custom tessdata directory by setting:
# Linux/macOS
export TESSDATA_PREFIX=/path/to/custom/tessdata
# Windows (PowerShell)
$env:TESSDATA_PREFIX="C:\path\to\custom\tessdata"
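Before running the tests, it can help to confirm that the language files are actually present in the tessdata directory. The helper below is a small sketch for that purpose; missing_languages is a hypothetical name and is not part of the crate. It relies only on the file naming already mentioned above (eng.traineddata, and the same pattern for other languages).

```rust
use std::path::Path;

/// Returns the languages from `langs` that have no `<lang>.traineddata`
/// file directly inside `tessdata_dir`.
fn missing_languages<'a>(tessdata_dir: &Path, langs: &[&'a str]) -> Vec<&'a str> {
    langs
        .iter()
        .copied()
        .filter(|lang| !tessdata_dir.join(format!("{lang}.traineddata")).is_file())
        .collect()
}
```

Calling missing_languages(Path::new("/path/to/custom/tessdata"), &["eng"]) and checking that the result is empty gives a clearer failure than an initialization error from inside the OCR engine.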
Available test cases:
Test images are sourced from the shared test_documents/ directory in the repository:
images/test_hello_world.png: Simple English text
tables/simple_table.png: Basic table with English headers
Here's a basic example of how to use kreuzberg-tesseract:
use std::path::PathBuf;
use std::error::Error;
use kreuzberg_tesseract::TesseractAPI;
fn get_default_tessdata_dir() -> PathBuf {
    if cfg!(target_os = "macos") {
        let home_dir = std::env::var("HOME").expect("HOME environment variable not set");
        PathBuf::from(home_dir)
            .join("Library")
            .join("Application Support")
            .join("tesseract-rs")
            .join("tessdata")
    } else if cfg!(target_os = "linux") {
        let home_dir = std::env::var("HOME").expect("HOME environment variable not set");
        PathBuf::from(home_dir)
            .join(".tesseract-rs")
            .join("tessdata")
    } else if cfg!(target_os = "windows") {
        PathBuf::from(std::env::var("APPDATA").expect("APPDATA environment variable not set"))
            .join("tesseract-rs")
            .join("tessdata")
    } else {
        panic!("Unsupported operating system");
    }
}
fn get_tessdata_dir() -> PathBuf {
    match std::env::var("TESSDATA_PREFIX") {
        Ok(dir) => {
            let path = PathBuf::from(dir);
            println!("Using TESSDATA_PREFIX directory: {:?}", path);
            path
        }
        Err(_) => {
            let default_dir = get_default_tessdata_dir();
            println!(
                "TESSDATA_PREFIX not set, using default directory: {:?}",
                default_dir
            );
            default_dir
        }
    }
}
fn main() -> Result<(), Box<dyn Error>> {
    let api = TesseractAPI::new()?;
    // Get tessdata directory (uses default location or TESSDATA_PREFIX if set)
    let tessdata_dir = get_tessdata_dir();
    api.init(tessdata_dir.to_str().unwrap(), "eng")?;
    let width = 24;
    let height = 24;
    let bytes_per_pixel = 1;
    let bytes_per_line = width * bytes_per_pixel;
    // Initialize image data with all white pixels
    let mut image_data = vec![255u8; width * height];
    // Draw number 9 with clearer distinction
    for y in 4..19 {
        for x in 7..17 {
            // Top bar
            if y == 4 && x >= 8 && x <= 15 {
                image_data[y * width + x] = 0;
            }
            // Top curve left side
            if y >= 4 && y <= 10 && x == 7 {
                image_data[y * width + x] = 0;
            }
            // Top curve right side
            if y >= 4 && y <= 11 && x == 16 {
                image_data[y * width + x] = 0;
            }
            // Middle bar
            if y == 11 && x >= 8 && x <= 15 {
                image_data[y * width + x] = 0;
            }
            // Bottom right vertical line
            if y >= 11 && y <= 18 && x == 16 {
                image_data[y * width + x] = 0;
            }
            // Bottom bar
            if y == 18 && x >= 8 && x <= 15 {
                image_data[y * width + x] = 0;
            }
        }
    }
    // Set the image data
    api.set_image(
        &image_data,
        width.try_into().unwrap(),
        height.try_into().unwrap(),
        bytes_per_pixel.try_into().unwrap(),
        bytes_per_line.try_into().unwrap(),
    )?;
    // Set whitelist for digits only
    api.set_variable("tessedit_char_whitelist", "0123456789")?;
    // Set PSM mode to single character
    api.set_variable("tessedit_pageseg_mode", "10")?;
    // Get the recognized text
    let text = api.get_utf8_text()?;
    println!("Recognized text: {}", text.trim());
    Ok(())
}
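The set_image call above takes the width, height, bytes per pixel, and bytes per line alongside the raw buffer, and a mismatch between those numbers and the buffer length is an easy mistake to make. The following sketch shows one way to compute and validate them up front. It assumes tightly packed rows with no padding, as in the example (bytes_per_line = width * bytes_per_pixel); Layout and packed_layout are hypothetical names, not part of the crate.

```rust
/// Describes how a raw pixel buffer is laid out, using the i32 values
/// that are passed to `set_image` in the example above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Layout {
    width: i32,
    height: i32,
    bytes_per_pixel: i32,
    bytes_per_line: i32,
}

/// Computes the layout for a tightly packed buffer (no row padding, an
/// assumption of this sketch) and checks that `buffer_len` matches it.
fn packed_layout(
    width: usize,
    height: usize,
    bytes_per_pixel: usize,
    buffer_len: usize,
) -> Result<Layout, String> {
    let bytes_per_line = width
        .checked_mul(bytes_per_pixel)
        .ok_or("row size overflows usize")?;
    let expected = bytes_per_line
        .checked_mul(height)
        .ok_or("image size overflows usize")?;
    if buffer_len != expected {
        return Err(format!(
            "buffer has {buffer_len} bytes, expected {expected}"
        ));
    }
    // Convert without panicking if a dimension does not fit in i32.
    let to_i32 = |v: usize| i32::try_from(v).map_err(|_| format!("{v} does not fit in i32"));
    Ok(Layout {
        width: to_i32(width)?,
        height: to_i32(height)?,
        bytes_per_pixel: to_i32(bytes_per_pixel)?,
        bytes_per_line: to_i32(bytes_per_line)?,
    })
}
```

For the 24x24 grayscale buffer above, packed_layout(24, 24, 1, image_data.len()) yields a bytes_per_line of 24; for an RGB buffer the same call with bytes_per_pixel set to 3 yields three times the width. Returning a Result replaces the try_into().unwrap() calls with an error the caller can report.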
The API provides additional functionality for more complex OCR tasks, including thread-safe operations:
use kreuzberg_tesseract::TesseractAPI;
use std::error::Error;
use std::path::PathBuf;
use std::sync::Arc;
use std::thread;
fn main() -> Result<(), Box<dyn Error>> {
    let tessdata_dir = get_tessdata_dir();
    let api = TesseractAPI::new()?;
    // Initialize the main API
    api.init(tessdata_dir.to_str().unwrap(), "eng")?;
    api.set_variable("tessedit_pageseg_mode", "1")?;
    // Load and prepare image data
    let (image_data, width, height) = load_test_image("sample_text.png")?;
    // Share image data across threads
    let image_data = Arc::new(image_data);
    let mut handles = vec![];
    // Spawn multiple threads for parallel OCR processing
    for _ in 0..3 {
        let api_clone = api.clone(); // Clones the API with all configurations
        let image_data = Arc::clone(&image_data);
        let handle = thread::spawn(move || {
            // Set image in each thread (3 bytes per pixel, tightly packed RGB rows)
            let res = api_clone.set_image(
                &image_data,
                width as i32,
                height as i32,
                3,
                3 * width as i32,
            );
            assert!(res.is_ok());
            // Perform OCR in parallel
            let text = api_clone.get_utf8_text()
                .expect("Failed to get text");
            println!("Thread result: {}", text);
        });
        handles.push(handle);
    }
    // Wait for all threads to complete
    for handle in handles {
        handle.join().unwrap();
    }
    Ok(())
}
// Helper function to get tessdata directory
fn get_tessdata_dir() -> PathBuf {
    // ... (implementation as shown in basic example)
}
// Helper function to load test image
fn load_test_image(filename: &str) -> Result<(Vec<u8>, u32, u32), Box<dyn Error>> {
    let img = image::open(filename)?
        .to_rgb8();
    let (width, height) = img.dimensions();
    Ok((img.into_raw(), width, height))
}
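The threaded example above prints from inside each worker and panics on failure. A variation that hands each worker's result back through its JoinHandle is sketched below. To keep the snippet free of any dependency on the crate, the recognize closure is a hypothetical stand-in for the per-thread set_image and get_utf8_text calls, and ocr_in_parallel is likewise an illustrative name.

```rust
use std::sync::Arc;
use std::thread;

/// Runs `recognize` on the shared image in `workers` threads and collects
/// every outcome, including a panic in a worker, as a `Result`.
fn ocr_in_parallel<F>(
    image: Arc<Vec<u8>>,
    workers: usize,
    recognize: F,
) -> Vec<Result<String, String>>
where
    F: Fn(&[u8]) -> Result<String, String> + Send + Sync + 'static,
{
    let recognize = Arc::new(recognize);
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            // Each worker gets its own handle to the shared data and closure.
            let image = Arc::clone(&image);
            let recognize = Arc::clone(&recognize);
            thread::spawn(move || (*recognize)(image.as_slice()))
        })
        .collect();
    handles
        .into_iter()
        .map(|handle| {
            handle
                .join()
                .unwrap_or_else(|_| Err("worker thread panicked".to_string()))
        })
        .collect()
}
```

With the real API, the closure body would clone or own a TesseractAPI handle and map its errors to strings. The caller can then decide whether one failed worker should abort the whole run.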
With static linking, the crate will automatically download and compile Tesseract and Leptonica during the build process. This may take some time on the first build (5-10 minutes), but subsequent builds will use the cached libraries.
To clean the cache and force a rebuild:
CARGO_CLEAN=1 cargo build
With dynamic linking, the build is much faster (seconds instead of minutes) since it only links against system-installed libraries:
cargo build --no-default-features --features dynamic-linking
Note: Dynamic linking requires Tesseract and Leptonica to be installed on your system (see Installation section).
For more detailed information, please check the API documentation.
This project is licensed under the MIT License - see the LICENSE file for details.
This project is based on the original tesseract-rs by Cafer Can Gündoğdu. We are grateful for the foundational work that made this project possible.
We welcome contributions! Please see our Contributing Guide for details.
curl -LsSf https://astral.sh/uv/install.sh | sh
uvx prek install
cargo test
Our commit messages follow the Conventional Commits specification.
This project uses Tesseract OCR and Leptonica. We are grateful to the maintainers and contributors of these projects.