kreuzberg

Crates.io: kreuzberg
lib.rs: kreuzberg
version: 4.1.2
created_at: 2025-12-09 09:00:58.072831+00
updated_at: 2026-01-25 12:36:35.258038+00
description: High-performance document intelligence library for Rust. Extract text, metadata, and structured data from PDFs, Office documents, images, and 50+ formats with async/sync APIs.
homepage: https://kreuzberg.dev
repository: https://github.com/kreuzberg-dev/kreuzberg
max_upload_size:
id: 1975144
size: 4,047,744
Na'aman Hirschfeld (Goldziher)

documentation

https://docs.rs/kreuzberg

README

Kreuzberg

High-performance document intelligence library for Rust. Extract text, metadata, and structured information from PDFs, Office documents, images, and 56 formats.

This is the core Rust library that powers the Python, TypeScript, and Ruby bindings.

🚀 Version 4.1.2 Release: This is a pre-release version. We invite you to test the library and report any issues you encounter.

Note: The Rust crate is not currently published to crates.io for this RC. Use git dependencies or language bindings (Python, TypeScript, Ruby) instead.

Installation

[dependencies]
kreuzberg = "4.0"
tokio = { version = "1", features = ["rt", "macros"] }

PDFium Linking Options

Kreuzberg offers flexible PDFium linking strategies for different deployment scenarios.

Note: Language bindings (Python, TypeScript, Ruby, Java, Go, C#, PHP, Elixir) automatically bundle PDFium, so no configuration is needed. This section applies only to the Rust crate.

| Strategy | Feature Flag | Description | Use Case |
|---|---|---|---|
| Default (Dynamic) | None | Links to system PDFium at runtime | Development, system package users |
| Static | pdf-static | Statically links PDFium into binary | Single binary distribution, no runtime dependencies |
| Bundled | pdf-bundled | Downloads and embeds PDFium in binary | CI/CD, hermetic builds, largest binary size |
| System | pdf-system | Uses system PDFium via pkg-config | Linux distributions with PDFium package |

Example Cargo.toml configurations:

# Default (dynamic linking)
[dependencies]
kreuzberg = "4.0"

# Static linking
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf-static"] }

# Bundled in binary
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf-bundled"] }

# System library (requires PDFium installed)
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf-system"] }

For more details on feature flags and configuration options, see the Features documentation.
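
Cargo resolves feature flags at compile time, so a build can report which strategy it was compiled with. The following is a minimal sketch, assuming an application crate that declares features with the same names as the table above in its own Cargo.toml and forwards them to kreuzberg. That forwarding setup, the function name, and the precedence order are assumptions of this sketch, not documented behaviour of the crate.

```rust
/// Reports which PDFium linking strategy this build was compiled with.
/// Feature names are taken from the table above; the precedence order
/// (static, then bundled, then system) is an assumption of this sketch.
#[allow(unexpected_cfgs)]
pub fn pdfium_linking_strategy() -> &'static str {
    if cfg!(feature = "pdf-static") {
        "static"
    } else if cfg!(feature = "pdf-bundled") {
        "bundled"
    } else if cfg!(feature = "pdf-system") {
        "system"
    } else {
        // No linking feature enabled: the "Default (Dynamic)" row.
        "dynamic"
    }
}

fn main() {
    println!("PDFium linking strategy: {}", pdfium_linking_strategy());
}
```

Because `cfg!` expands to a constant `true` or `false`, the branches not selected are trivially eliminated by the compiler.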

System Requirements

ONNX Runtime (for embeddings)

If using embeddings functionality, ONNX Runtime must be installed:

# macOS
brew install onnxruntime

# Ubuntu/Debian
sudo apt install libonnxruntime libonnxruntime-dev

# Windows (MSVC)
scoop install onnxruntime
# OR download from https://github.com/microsoft/onnxruntime/releases

Without ONNX Runtime, embeddings will raise MissingDependencyError with installation instructions.

Quick Start

use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}

Async Extraction

use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}

Batch Processing

use kreuzberg::{batch_extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let files = vec!["doc1.pdf", "doc2.pdf", "doc3.pdf"];
    let results = batch_extract_file(&files, None, &config).await?;

    for result in results {
        println!("{}", result.content);
    }
    Ok(())
}

OCR with Table Extraction

use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig, TesseractConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            language: "eng".to_string(),
            tesseract_config: Some(TesseractConfig {
                enable_table_detection: true,
                ..Default::default()
            }),
        }),
        ..Default::default()
    };

    let result = extract_file_sync("invoice.pdf", None, &config)?;

    for table in &result.tables {
        println!("{}", table.markdown);
    }
    Ok(())
}
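
If the cells are needed as data rather than as rendered text, the Markdown string can be split into rows. The sketch below assumes `table.markdown` holds a pipe-delimited table with a separator row after the header, which the source does not confirm; `parse_markdown_table` is a hypothetical helper, not part of kreuzberg, and it does not handle escaped pipes (`\|`) inside cells.

```rust
/// Splits a pipe-delimited Markdown table into rows of trimmed cells.
/// Separator rows such as `|---|---:|` are dropped.
pub fn parse_markdown_table(markdown: &str) -> Vec<Vec<String>> {
    markdown
        .lines()
        .map(str::trim)
        .filter(|line| line.starts_with('|'))
        .map(|line| {
            // Remove exactly one leading and one trailing pipe.
            let inner = line.strip_prefix('|').unwrap_or(line);
            let inner = inner.strip_suffix('|').unwrap_or(inner);
            inner
                .split('|')
                .map(|cell| cell.trim().to_string())
                .collect::<Vec<String>>()
        })
        // A separator row has only non-empty cells made of '-' and ':'.
        .filter(|cells| {
            !cells
                .iter()
                .all(|c| !c.is_empty() && c.chars().all(|ch| ch == '-' || ch == ':'))
        })
        .collect()
}

fn main() {
    let markdown = "| Item | Qty |\n|------|----:|\n| Widget | 2 |\n| Gadget | 5 |";
    for row in parse_markdown_table(markdown) {
        println!("{}", row.join(", "));
    }
}
```

In the example above, each `table.markdown` value could be passed to this helper inside the loop.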

Password-Protected PDFs

use kreuzberg::{extract_file_sync, ExtractionConfig, PdfConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            passwords: Some(vec!["password1".to_string(), "password2".to_string()]),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("protected.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}

Extract from Bytes

use kreuzberg::{extract_bytes_sync, ExtractionConfig};
use std::fs;

fn main() -> kreuzberg::Result<()> {
    let data = fs::read("document.pdf")?;
    let config = ExtractionConfig::default();
    let result = extract_bytes_sync(&data, "application/pdf", &config)?;
    println!("{}", result.content);
    Ok(())
}
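
Extracting from bytes requires the caller to supply a MIME type, since there is no file name to inspect. When the original path is still known, a small lookup can provide it. This is a sketch only: `mime_type_for` is a hypothetical helper, and apart from `application/pdf`, which appears in the example above, it is an assumption that kreuzberg accepts these standard MIME strings for the corresponding formats.

```rust
use std::path::Path;

/// Maps a file extension to a standard MIME type string.
/// Returns `None` for missing or unrecognised extensions.
pub fn mime_type_for(path: &Path) -> Option<&'static str> {
    // Compare case-insensitively so "REPORT.PDF" and "report.pdf" agree.
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some("application/pdf"),
        "docx" => Some("application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
        "xlsx" => Some("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"),
        "pptx" => Some("application/vnd.openxmlformats-officedocument.presentationml.presentation"),
        "html" | "htm" => Some("text/html"),
        "txt" => Some("text/plain"),
        _ => None,
    }
}

fn main() {
    for name in ["document.pdf", "slides.pptx", "archive.xyz"] {
        match mime_type_for(Path::new(name)) {
            Some(mime) => println!("{name}: {mime}"),
            None => println!("{name}: unknown type"),
        }
    }
}
```

The returned string could then be passed as the second argument to `extract_bytes_sync`.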

Features

The crate uses feature flags for optional functionality:

[dependencies]
kreuzberg = { version = "4.0", features = ["pdf", "excel", "ocr"] }

Available Features

| Feature | Description | Binary Size |
|---|---|---|
| pdf | PDF extraction via pdfium | +25MB |
| excel | Excel/spreadsheet parsing | +3MB |
| office | DOCX, PPTX extraction | +1MB |
| email | EML, MSG extraction | +500KB |
| html | HTML to markdown | +1MB |
| xml | XML streaming parser | +500KB |
| archives | ZIP, TAR, 7Z extraction | +2MB |
| ocr | OCR with Tesseract | +5MB |
| language-detection | Language detection | +100KB |
| chunking | Text chunking | +200KB |
| quality | Text quality processing | +500KB |
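
The per-feature sizes can be added up to get a rough idea of what a given feature set costs. The sketch below copies the figures from the table and treats 1 MB as 1000 KB; it assumes the contributions are simply additive, which ignores shared dependencies, so the result is an estimate only. `estimate_added_size_kb` is a hypothetical helper, not part of the crate.

```rust
/// Approximate size contribution per feature in kilobytes,
/// copied from the table above (1 MB treated as 1000 KB).
const FEATURE_SIZES_KB: &[(&str, u32)] = &[
    ("pdf", 25_000),
    ("excel", 3_000),
    ("office", 1_000),
    ("email", 500),
    ("html", 1_000),
    ("xml", 500),
    ("archives", 2_000),
    ("ocr", 5_000),
    ("language-detection", 100),
    ("chunking", 200),
    ("quality", 500),
];

/// Sums the listed sizes for the given features.
/// Returns `None` if any feature name is not in the table.
pub fn estimate_added_size_kb(features: &[&str]) -> Option<u32> {
    features
        .iter()
        .map(|name| {
            FEATURE_SIZES_KB
                .iter()
                .find(|(feature, _)| feature == name)
                .map(|(_, kb)| *kb)
        })
        .sum()
}

fn main() {
    let selected = ["pdf", "excel", "ocr"];
    match estimate_added_size_kb(&selected) {
        Some(kb) => println!("{:?}: roughly +{} MB", selected, kb / 1000),
        None => println!("{:?}: contains an unknown feature", selected),
    }
}
```

For the `["pdf", "excel", "ocr"]` set used in the example above, the table gives 25 MB + 3 MB + 5 MB, or about 33 MB.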

Feature Bundles

kreuzberg = { version = "4.0", features = ["full"] }
kreuzberg = { version = "4.0", features = ["server"] }
kreuzberg = { version = "4.0", features = ["cli"] }

PDF Support and Linking Options

Kreuzberg supports three PDFium linking strategies. The default is bundled-pdfium (best developer experience).

| Strategy | Feature | Use Case | Binary Size | Runtime Deps |
|---|---|---|---|---|
| Bundled (default) | bundled-pdfium | Development, production | +8-15MB | None |
| Static | static-pdfium | Docker, musl, standalone binaries | +200MB | None |
| System | system-pdfium | Package managers, distros | +2MB | libpdfium.so |

Quick Start

# Default - bundled PDFium (recommended)
[dependencies]
kreuzberg = "4.0"

# Static linking (Docker, musl)
[dependencies]
kreuzberg = { version = "4.0", features = ["static-pdfium"] }

# System PDFium (package managers)
[dependencies]
kreuzberg = { version = "4.0", features = ["system-pdfium"] }

For detailed information, see the PDFium Linking Guide.

Note: Language bindings (Python, TypeScript, Ruby, Java, Go) automatically bundle PDFium. No configuration needed.

Documentation

API Documentation – Complete API reference with examples

https://docs.kreuzberg.dev – User guide and tutorials

License

MIT License - see LICENSE for details.
