| Crates.io | kreuzberg-cli |
| lib.rs | kreuzberg-cli |
| version | 4.1.2 |
| created_at | 2025-12-09 09:03:55.002219+00 |
| updated_at | 2026-01-25 12:38:52.906623+00 |
| description | Command-line interface for Kreuzberg document intelligence |
| homepage | https://kreuzberg.dev |
| repository | https://github.com/kreuzberg-dev/kreuzberg |
| id | 1975146 |
| size | 261,056 |
Command-line interface for the Kreuzberg document intelligence library.
This crate provides a production-ready CLI tool for document extraction, MIME type detection, batch processing, and cache management. It exposes the core extraction capabilities of the Kreuzberg Rust library through an easy-to-use command-line interface.
The CLI supports 56 file formats including PDF, DOCX, PPTX, XLSX, images, HTML, and more, with optional OCR support for scanned documents.
Kreuzberg Core Library (crates/kreuzberg)
↓
Kreuzberg CLI (crates/kreuzberg-cli) ← This crate
↓
Command-line interface with configuration and caching
- serve (api feature): Start REST API server for remote document processing
- mcp (mcp feature): Start Model Context Protocol server for AI integration

The CLI is tested and officially supported on Linux (x86_64/aarch64), macOS (Apple Silicon), and Windows.
All platforms receive precompiled binaries through GitHub releases and are tested in continuous integration.
cargo install --path crates/kreuzberg-cli
Or via the workspace:
cargo build --release -p kreuzberg-cli
If using embeddings functionality, ONNX Runtime must be installed:
# macOS
brew install onnxruntime
# Ubuntu/Debian
sudo apt install libonnxruntime libonnxruntime-dev
# Windows (MSVC)
scoop install onnxruntime
# OR download from https://github.com/microsoft/onnxruntime/releases
Without ONNX Runtime, embeddings will raise MissingDependencyError with installation instructions.
To enable optical character recognition for scanned documents:
# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
For .doc and .ppt file extraction:
# macOS
brew install libreoffice
# Ubuntu/Debian
sudo apt-get install libreoffice
The CLI is available for Linux (x86_64/aarch64), macOS (Apple Silicon), and Windows with consistent behavior across all platforms.
# Extract text from a PDF
kreuzberg extract document.pdf
# Extract with JSON output
kreuzberg extract document.pdf --format json
# Enable OCR for scanned documents
kreuzberg extract scanned.pdf --ocr true
# Force OCR even if text extraction succeeds
kreuzberg extract mixed.pdf --force-ocr true
# Process multiple documents in parallel
kreuzberg batch *.pdf --format json
# Process with custom configuration
kreuzberg batch documents/*.docx --config config.toml --format json
# Detect file type
kreuzberg detect unknown-file
# JSON output
kreuzberg detect unknown-file --format json
# View cache statistics
kreuzberg cache stats
# Clear the cache
kreuzberg cache clear --cache-dir /path/to/cache
# Custom cache directory
kreuzberg cache stats --cache-dir ~/.kreuzberg-cache
REST API server (requires the api feature):
# Start API server on localhost:8000
kreuzberg serve
# Custom host and port
kreuzberg serve --host 0.0.0.0 --port 3000
# With configuration file
kreuzberg serve --config kreuzberg.toml --host 127.0.0.1 --port 8080
Model Context Protocol server (requires the mcp feature):
# Start Model Context Protocol server
kreuzberg mcp
# With configuration file
kreuzberg mcp --config kreuzberg.toml
The CLI supports configuration files in TOML, YAML, or JSON formats. Configuration can be:
- Passed explicitly: --config /path/to/config.{toml,yaml,json}
- Discovered automatically: kreuzberg.{toml,yaml,json} in current and parent directories

# Basic extraction settings
use_cache = true
enable_quality_processing = true
force_ocr = false
# OCR configuration
[ocr]
backend = "tesseract"
language = "eng"
[ocr.tesseract_config]
enable_table_detection = true
psm = 6
min_confidence = 50.0
# Text chunking (useful for LLM processing)
[chunking]
max_chars = 1000
max_overlap = 200
# PDF-specific options
[pdf_options]
extract_images = true
extract_metadata = true
passwords = []
# Language detection
[language_detection]
enabled = true
min_confidence = 0.8
detect_multiple = false
# Image extraction
[images]
extract_images = true
target_dpi = 300
max_image_dimension = 4096
auto_adjust_dpi = true
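The lookup of kreuzberg.{toml,yaml,json} in the current and parent directories can be pictured as a walk up the directory tree. The following is a minimal Python sketch of that idea, not the CLI's actual implementation. It assumes the nearest directory wins and that the extensions are tried in the order toml, yaml, json. Both assumptions, and the name find_config, are illustrative only.

```python
from pathlib import Path
from typing import Optional

# Assumed search order; the README only states that these three formats are recognised.
CONFIG_NAMES = ("kreuzberg.toml", "kreuzberg.yaml", "kreuzberg.json")


def find_config(start: Path) -> Optional[Path]:
    """Return the nearest kreuzberg.* file in `start` or one of its parents, else None."""
    start = start.resolve()
    # Check the starting directory first, then each ancestor up to the filesystem root.
    for directory in (start, *start.parents):
        for name in CONFIG_NAMES:
            candidate = directory / name
            if candidate.is_file():
                return candidate
    return None
```

Calling find_config(Path.cwd()) from a project subdirectory would return the closest matching file, which can then be passed to --config explicitly if a predictable choice is preferred.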
Command-line flags override configuration file settings:
# Override OCR setting from config
kreuzberg extract document.pdf --config config.toml --ocr false
# Override chunking settings
kreuzberg extract long.pdf --chunk true --chunk-size 2000 --chunk-overlap 400
# Disable cache despite config file
kreuzberg extract document.pdf --no-cache true
# Enable language detection
kreuzberg extract multilingual.pdf --detect-language true
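The precedence rule above (flags beat the configuration file) amounts to a layered merge. The Python sketch below only illustrates that rule. The function name apply_overrides is hypothetical, and keying the flags by the config-file names (use_cache, force_ocr) is a simplifying assumption, since the real CLI maps flags such as --no-cache onto settings internally.

```python
from typing import Any, Dict, Optional


def apply_overrides(config: Dict[str, Any], flags: Dict[str, Optional[Any]]) -> Dict[str, Any]:
    """Return a copy of `config` in which every flag that was actually passed wins."""
    merged = dict(config)
    for key, value in flags.items():
        if value is not None:  # None stands for "flag not given on the command line"
            merged[key] = value
    return merged


# Settings as they might be read from config.toml (keys taken from the sample above).
file_settings = {"use_cache": True, "force_ocr": False}
# Flags as parsed from the command line; force_ocr was not passed.
cli_flags = {"use_cache": False, "force_ocr": None}

effective = apply_overrides(file_settings, cli_flags)
print(effective)
```

Settings that are absent from the command line keep their file values, so a shared config file can stay unchanged while individual runs deviate from it.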
Extract text, tables, and metadata from a document.
kreuzberg extract <PATH> [OPTIONS]
Options:
- --config <PATH>: Configuration file (TOML, YAML, or JSON)
- --mime-type <TYPE>: MIME type hint (auto-detected if not provided)
- --format <FORMAT>: Output format (text or json), default: text
- --ocr <true|false>: Enable/disable OCR
- --force-ocr <true|false>: Force OCR even if text extraction succeeds
- --no-cache <true|false>: Disable result caching
- --chunk <true|false>: Enable text chunking
- --chunk-size <SIZE>: Chunk size in characters (default: 1000)
- --chunk-overlap <SIZE>: Overlap between chunks (default: 200)
- --quality <true|false>: Enable quality processing
- --detect-language <true|false>: Enable language detection

Examples:
# Simple extraction
kreuzberg extract invoice.pdf
# With configuration and JSON output
kreuzberg extract document.pdf --config config.toml --format json
# With chunking for LLM processing
kreuzberg extract report.pdf --chunk true --chunk-size 2000
# With OCR for scanned document
kreuzberg extract scanned.pdf --ocr true --format json
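To see how --chunk-size and --chunk-overlap interact, the Python sketch below implements plain character windows in which each chunk starts size minus overlap characters after the previous one. This is an assumption made for illustration. Kreuzberg's own chunker may split on sentence or paragraph boundaries, and chunk_text is a hypothetical name.

```python
from typing import List


def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> List[str]:
    """Split `text` into windows of at most `size` characters sharing `overlap` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # distance between the starts of consecutive chunks
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # this window already reached the end of the text
            break
    return chunks
```

With the defaults, a 2,500-character document yields three chunks starting at offsets 0, 800, and 1600. A larger overlap preserves more context across chunk boundaries at the cost of more total text to process.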
Process multiple documents in parallel.
kreuzberg batch <PATHS>... [OPTIONS]
Options:
- --config <PATH>: Configuration file (TOML, YAML, or JSON)
- --format <FORMAT>: Output format (text or json), default: json
- --ocr <true|false>: Enable/disable OCR
- --force-ocr <true|false>: Force OCR even if text extraction succeeds
- --no-cache <true|false>: Disable result caching
- --quality <true|false>: Enable quality processing

Examples:
# Batch process multiple files
kreuzberg batch doc1.pdf doc2.docx doc3.xlsx
# With glob patterns
kreuzberg batch *.pdf *.docx
# With custom configuration
kreuzberg batch documents/* --config batch-config.toml --format json
# With OCR
kreuzberg batch scanned/*.pdf --ocr true --format json
Identify the MIME type of a file.
kreuzberg detect <PATH> [OPTIONS]
Options:
- --format <FORMAT>: Output format (text or json), default: text

Examples:
# Simple detection
kreuzberg detect unknown-file
# JSON output
kreuzberg detect mystery.bin --format json
Manage extraction result cache.
kreuzberg cache <COMMAND> [OPTIONS]
Subcommands:
Show cache statistics.
kreuzberg cache stats [--cache-dir <DIR>] [--format <FORMAT>]
Options:
- --cache-dir <DIR>: Cache directory (default: .kreuzberg in current directory)
- --format <FORMAT>: Output format (text or json), default: text

Clear the cache.
kreuzberg cache clear [--cache-dir <DIR>] [--format <FORMAT>]
Options:
- --cache-dir <DIR>: Cache directory (default: .kreuzberg in current directory)
- --format <FORMAT>: Output format (text or json), default: text

Examples:
# View cache statistics
kreuzberg cache stats
# Clear cache with custom directory
kreuzberg cache clear --cache-dir ~/.kreuzberg-cache
# JSON output
kreuzberg cache stats --format json
Start the REST API server (requires the api feature).
kreuzberg serve [OPTIONS]
Options:
- --host <HOST>: Host to bind to (default: 127.0.0.1)
- --port <PORT>: Port to bind to (default: 8000)
- --config <PATH>: Configuration file (TOML, YAML, or JSON)

Examples:
# Default: localhost:8000
kreuzberg serve
# Public access on port 3000
kreuzberg serve --host 0.0.0.0 --port 3000
# With custom configuration
kreuzberg serve --config server-config.toml --port 8080
Start the Model Context Protocol server (requires the mcp feature).
kreuzberg mcp [OPTIONS]
Options:
- --config <PATH>: Configuration file (TOML, YAML, or JSON)

Examples:
# Start MCP server
kreuzberg mcp
# With custom configuration
kreuzberg mcp --config mcp-config.toml
Show version information.
kreuzberg version [--format <FORMAT>]
Options:
- --format <FORMAT>: Output format (text or json), default: text

Examples:
# Display version
kreuzberg version
# JSON output
kreuzberg version --format json
The default human-readable format:
kreuzberg extract document.pdf
# Output:
# Document content here...
For programmatic integration:
kreuzberg extract document.pdf --format json
# Output:
# {
# "content": "Document content...",
# "mime_type": "application/pdf",
# "metadata": { "title": "...", "author": "..." },
# "tables": [{ "markdown": "...", "cells": [...], "page_number": 0 }]
# }
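For programmatic use, the JSON shown above can be consumed with any JSON parser. The Python sketch below works on a hard-coded sample modeled on that output, so it runs without the CLI installed. The field values are placeholders, the layout of cells (a list of rows) is an assumption, and summarize is a hypothetical helper.

```python
import json

# Sample shaped like the documented output; values are placeholders, not real CLI output.
sample = """
{
  "content": "Document content...",
  "mime_type": "application/pdf",
  "metadata": {"title": "Example", "author": "Unknown"},
  "tables": [{"markdown": "| a | b |", "cells": [["a", "b"]], "page_number": 0}]
}
"""


def summarize(raw: str) -> dict:
    """Pick a few fields out of an extraction result, tolerating missing keys."""
    result = json.loads(raw)
    tables = result.get("tables", [])
    return {
        "mime_type": result.get("mime_type"),
        "characters": len(result.get("content", "")),
        "table_pages": [table.get("page_number") for table in tables],
    }


summary = summarize(sample)
print(summary)
```

In practice the raw string would come from the standard output of kreuzberg extract document.pdf --format json rather than from a literal.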
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF |
| Images | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF |
| Web | HTML, XHTML, XML |
| Text | TXT, MD, CSV, TSV, JSON, YAML, TOML |
| Email | EML, MSG |
| Archives | ZIP, TAR, 7Z |
| Other | 30+ additional formats |
- 0: Successful execution
- Non-zero: Error occurred (check stderr for details)

Control logging verbosity with the RUST_LOG environment variable:
# Show info-level logs (default)
RUST_LOG=info kreuzberg extract document.pdf
# Show detailed debug logs
RUST_LOG=debug kreuzberg extract document.pdf
# Show only warnings and errors
RUST_LOG=warn kreuzberg extract document.pdf
# Show only errors
RUST_LOG=error kreuzberg extract document.pdf
# Show logs from specific modules
RUST_LOG=kreuzberg=debug kreuzberg extract document.pdf
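When the CLI is driven from a script, the exit-code convention and RUST_LOG can be handled in one small wrapper. The Python sketch below is generic. The name run_cli is hypothetical, and the only things it assumes about kreuzberg are those stated above: exit code 0 means success, errors are reported on stderr, and RUST_LOG controls log verbosity.

```python
import os
import subprocess
from typing import List


def run_cli(argv: List[str], log_level: str = "warn") -> str:
    """Run a command with RUST_LOG set and return its stdout; raise on a non-zero exit."""
    env = dict(os.environ, RUST_LOG=log_level)  # copy the environment, override RUST_LOG
    proc = subprocess.run(argv, env=env, capture_output=True, text=True)
    if proc.returncode != 0:
        # Surface the exit code together with whatever the command wrote to stderr.
        raise RuntimeError(f"exit code {proc.returncode}: {proc.stderr.strip()}")
    return proc.stdout


# Intended use (requires kreuzberg on PATH):
# text = run_cli(["kreuzberg", "extract", "document.pdf"], log_level="debug")
```

Raising on any non-zero code keeps failed extractions from being mistaken for empty documents in a pipeline.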
Use batch processing for multiple files instead of sequential extraction:
kreuzberg batch *.pdf # Parallel processing
Enable caching to avoid reprocessing the same documents:
# Cache is enabled by default
kreuzberg extract document.pdf
Use appropriate chunk sizes for LLM processing:
kreuzberg extract long.pdf --chunk true --chunk-size 2000
Tune OCR settings for better performance:
kreuzberg extract scanned.pdf --ocr true
# Adjust tesseract_config in configuration file for optimization
Monitor cache size and clear when needed:
kreuzberg cache stats
kreuzberg cache clear
None by default. The binary includes core extraction.
- api: Enable the REST API server (kreuzberg serve command)
- mcp: Enable Model Context Protocol server (kreuzberg mcp command)
- all: Enable all features (api + mcp)

# Build with all features
cargo build --release -p kreuzberg-cli --features all
# Build with specific features
cargo build --release -p kreuzberg-cli --features api,mcp
Ensure the file path is correct and the file is readable:
# Check if file exists
ls -l /path/to/document.pdf
# Try with absolute path
kreuzberg extract /absolute/path/to/document.pdf
Verify Tesseract is installed:
tesseract --version
# If not found:
# macOS: brew install tesseract
# Ubuntu: sudo apt-get install tesseract-ocr
# Windows: Download from https://github.com/tesseract-ocr/tesseract
Check that the configuration file has the correct format and location:
# Use explicit path
kreuzberg extract document.pdf --config /absolute/path/to/config.toml
# Or place kreuzberg.toml in current directory
ls -l kreuzberg.toml
Use chunking to reduce memory usage:
kreuzberg extract large-document.pdf --chunk true --chunk-size 1000
Ensure write access to the cache directory:
# Check permissions
ls -ld .kreuzberg
# Or use a custom directory with appropriate permissions
kreuzberg extract document.pdf --config config.toml
# In config.toml: cache_dir = "/tmp/kreuzberg-cache"
- src/main.rs: CLI implementation with command definitions and argument parsing
- Cargo.toml: Package metadata and dependencies

cargo build -p kreuzberg-cli
cargo build --release -p kreuzberg-cli
cargo build --release -p kreuzberg-cli --features all
# Run CLI tests
cargo test -p kreuzberg-cli
# With logging
RUST_LOG=debug cargo test -p kreuzberg-cli -- --nocapture
../kreuzberg/
We welcome contributions! Please see the main Kreuzberg repository for contribution guidelines.
MIT