| Crates.io | oxidize-pdf |
| lib.rs | oxidize-pdf |
| version | 1.6.9 |
| created_at | 2025-07-10 13:58:46.707927+00 |
| updated_at | 2026-01-17 17:39:45.560842+00 |
| description | A pure Rust PDF generation and manipulation library with zero external dependencies |
| homepage | https://github.com/bzsanti/oxidizePdf |
| repository | https://github.com/bzsanti/oxidizePdf |
| id | 1746485 |
| size | 11,085,129 |
A pure Rust PDF generation and manipulation library with zero external PDF dependencies. Production-ready for basic PDF functionality with validated performance of 3,000-4,000 pages/second for realistic business documents, memory safety guarantees, and a compact 5.2MB binary size.
Latest: v1.6.2 - Invoice Data Extraction:
v1.3.0 - AI/RAG Integration:
Production-Ready Features (v1.2.3-v1.2.5):
Major features (v1.1.6+):
- to_bytes()
- set_compress()
Significant improvements in PDF compatibility:
- OptimizedPdfReader with LRU cache
Note: *Success rates apply only to non-encrypted PDFs with basic features. The library provides basic PDF functionality. See Known Limitations for a transparent assessment of current capabilities and planned features.
Add oxidize-pdf to your Cargo.toml:
[dependencies]
oxidize-pdf = "1.6.8"
# Or, to enable OCR support (optional), use this line instead of the one above
oxidize-pdf = { version = "1.6.8", features = ["ocr-tesseract"] }
use oxidize_pdf::{Document, Page, Font, Color, Result};

fn main() -> Result<()> {
    // Create a new document
    let mut doc = Document::new();
    doc.set_title("My First PDF");
    doc.set_author("Rust Developer");

    // Create a page
    let mut page = Page::a4();

    // Add text
    page.text()
        .set_font(Font::Helvetica, 24.0)
        .at(50.0, 700.0)
        .write("Hello, PDF!")?;

    // Add graphics
    page.graphics()
        .set_fill_color(Color::rgb(0.0, 0.5, 1.0))
        .circle(300.0, 400.0, 50.0)
        .fill();

    // Add the page and save
    doc.add_page(page);
    doc.save("hello.pdf")?;

    Ok(())
}
use oxidize_pdf::ai::DocumentChunker;
use oxidize_pdf::parser::{PdfReader, PdfDocument};
use oxidize_pdf::Result;

fn main() -> Result<()> {
    // Load and parse PDF
    let reader = PdfReader::open("document.pdf")?;
    let pdf_doc = PdfDocument::new(reader);
    let text_pages = pdf_doc.extract_text()?;

    // Prepare page texts with page numbers
    let page_texts: Vec<(usize, String)> = text_pages
        .iter()
        .enumerate()
        .map(|(idx, page)| (idx + 1, page.text.clone()))
        .collect();

    // Create chunker: 512 tokens per chunk, 50 tokens overlap
    let chunker = DocumentChunker::new(512, 50);
    let chunks = chunker.chunk_text_with_pages(&page_texts)?;

    // Process chunks for RAG pipeline
    for chunk in chunks {
        println!("Chunk {}: {} tokens", chunk.id, chunk.tokens);
        println!(" Pages: {:?}", chunk.page_numbers);
        println!(" Position: chars {}-{}",
            chunk.metadata.position.start_char,
            chunk.metadata.position.end_char);
        println!(" Sentence boundary: {}",
            chunk.metadata.sentence_boundary_respected);

        // Send to embedding API, store in vector DB, etc.
        // let embedding = openai.embed(&chunk.content)?;
        // vector_db.insert(chunk.id, embedding, chunk.content)?;
    }

    Ok(())
}
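To make the two numbers passed to `DocumentChunker::new(512, 50)` concrete, here is a minimal, self-contained sketch of fixed-size chunking with overlap. It is not the crate's implementation: `chunk_with_overlap` is a hypothetical helper, and it assumes whitespace-separated words stand in for tokens. It does not try to respect sentence boundaries.

```rust
/// Split `text` into chunks of at most `chunk_size` words, where each chunk
/// repeats the last `overlap` words of the previous one.
/// Hypothetical helper for illustration; words approximate tokens here.
fn chunk_with_overlap(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");

    let words: Vec<&str> = text.split_whitespace().collect();
    let step = chunk_size - overlap; // how far the window advances each time
    let mut chunks = Vec::new();
    let mut start = 0;

    while start < words.len() {
        let end = (start + chunk_size).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break; // the final chunk reached the end of the text
        }
        start += step;
    }
    chunks
}

fn main() {
    let text = "one two three four five six seven eight nine ten";
    for (i, chunk) in chunk_with_overlap(text, 4, 1).iter().enumerate() {
        println!("chunk {i}: {chunk}");
    }
}
```

With a chunk size of 4 and an overlap of 1, the window advances three words at a time, so the last word of each chunk reappears as the first word of the next.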
use oxidize_pdf::Document;
use oxidize_pdf::text::extraction::{TextExtractor, ExtractionOptions};
use oxidize_pdf::text::invoice::{InvoiceExtractor, InvoiceField};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open PDF invoice
    let doc = Document::open("invoice.pdf")?;
    let page = doc.get_page(1)?;

    // Extract text from page
    let text_extractor = TextExtractor::new();
    let extracted = text_extractor.extract_text(&doc, page, &ExtractionOptions::default())?;

    // Extract structured invoice data
    let invoice_extractor = InvoiceExtractor::builder()
        .with_language("es") // Spanish invoices
        .confidence_threshold(0.7) // 70% minimum confidence
        .build();
    let invoice = invoice_extractor.extract(&extracted.fragments)?;

    // Access extracted fields
    println!("Extracted {} fields with {:.0}% overall confidence",
        invoice.field_count(),
        invoice.metadata.extraction_confidence * 100.0
    );

    for field in &invoice.fields {
        match &field.field_type {
            InvoiceField::InvoiceNumber(number) => {
                println!("Invoice: {} ({:.0}% confidence)", number, field.confidence * 100.0);
            }
            InvoiceField::TotalAmount(amount) => {
                println!("Total: €{:.2} ({:.0}% confidence)", amount, field.confidence * 100.0);
            }
            InvoiceField::InvoiceDate(date) => {
                println!("Date: {} ({:.0}% confidence)", date, field.confidence * 100.0);
            }
            _ => {}
        }
    }

    Ok(())
}
Supported Languages: Spanish (ES), English (EN), German (DE), Italian (IT)
Extracted Fields: Invoice number, dates, amounts (total/tax/net), VAT numbers, supplier/customer names, currency, line items
See docs/INVOICE_EXTRACTION_GUIDE.md for complete documentation.
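As a rough illustration of what a minimum-confidence setting such as `confidence_threshold(0.7)` implies, the sketch below filters candidate fields by score. Everything in it is hypothetical: the `Candidate` type, the `filter_by_confidence` helper, the sample values, and the choice to treat the threshold as inclusive are assumptions for illustration, not the crate's API or behavior.

```rust
/// A candidate field produced by some extraction step.
/// Hypothetical type for illustration; not part of oxidize-pdf.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    label: &'static str,
    value: String,
    confidence: f64, // expected range: 0.0 to 1.0
}

/// Keep only candidates whose confidence is at least `threshold`.
/// Treating the threshold as inclusive is an assumption.
fn filter_by_confidence(candidates: Vec<Candidate>, threshold: f64) -> Vec<Candidate> {
    candidates
        .into_iter()
        .filter(|c| c.confidence >= threshold)
        .collect()
}

fn main() {
    // Made-up sample data, only to exercise the filter.
    let candidates = vec![
        Candidate { label: "invoice_number", value: "INV-001".to_string(), confidence: 0.95 },
        Candidate { label: "total_amount", value: "1210.00".to_string(), confidence: 0.70 },
        Candidate { label: "invoice_date", value: "2025-01-15".to_string(), confidence: 0.40 },
    ];

    for field in filter_by_confidence(candidates, 0.7) {
        println!(
            "{}: {} ({:.0}% confidence)",
            field.label,
            field.value,
            field.confidence * 100.0
        );
    }
}
```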
use oxidize_pdf::{Document, Page, Font, Color, Result};

fn main() -> Result<()> {
    let mut doc = Document::new();
    doc.set_title("Custom Fonts Demo");

    // Load a custom font from file
    doc.add_font("MyFont", "/path/to/font.ttf")?;

    // Or load from bytes
    let font_data = std::fs::read("/path/to/font.otf")?;
    doc.add_font_from_bytes("MyOtherFont", font_data)?;

    let mut page = Page::a4();

    // Use standard font
    page.text()
        .set_font(Font::Helvetica, 14.0)
        .at(50.0, 700.0)
        .write("Standard Font: Helvetica")?;

    // Use custom font
    page.text()
        .set_font(Font::Custom("MyFont".to_string()), 16.0)
        .at(50.0, 650.0)
        .write("Custom Font: This is my custom font!")?;

    // Advanced text formatting with custom font
    page.text()
        .set_font(Font::Custom("MyOtherFont".to_string()), 12.0)
        .set_character_spacing(2.0)
        .set_word_spacing(5.0)
        .at(50.0, 600.0)
        .write("Spaced text with custom font")?;

    doc.add_page(page);
    doc.save("custom_fonts.pdf")?;

    Ok(())
}
use oxidize_pdf::{PdfReader, Result};

fn main() -> Result<()> {
    // Open and parse a PDF
    let mut reader = PdfReader::open("document.pdf")?;

    // Get document info
    println!("PDF Version: {}", reader.version());
    println!("Page Count: {}", reader.page_count()?);

    // Extract text from all pages
    let document = reader.into_document();
    let text = document.extract_text()?;

    for (page_num, page_text) in text.iter().enumerate() {
        println!("Page {}: {}", page_num + 1, page_text.content);
    }

    Ok(())
}
use oxidize_pdf::{Document, Page, Image, Result};
use oxidize_pdf::graphics::TransparencyGroup;

fn main() -> Result<()> {
    let mut doc = Document::new();
    let mut page = Page::a4();

    // Load a JPEG image
    let image = Image::from_jpeg_file("photo.jpg")?;

    // Add image to page
    page.add_image("my_photo", image);

    // Draw the image
    page.draw_image("my_photo", 100.0, 300.0, 400.0, 300.0)?;

    // Add watermark with transparency
    let watermark = TransparencyGroup::new().with_opacity(0.3);
    page.graphics()
        .begin_transparency_group(watermark)
        .set_font(oxidize_pdf::text::Font::HelveticaBold, 48.0)
        .begin_text()
        .show_text("CONFIDENTIAL")
        .end_text()
        .end_transparency_group();

    doc.add_page(page);
    doc.save("image_example.pdf")?;

    Ok(())
}
use oxidize_pdf::{Document, Page, Font, TextAlign, Result};

fn main() -> Result<()> {
    let mut doc = Document::new();
    let mut page = Page::a4();

    // Create text flow with automatic wrapping
    let mut flow = page.text_flow();
    flow.at(50.0, 700.0)
        .set_font(Font::Times, 12.0)
        .set_alignment(TextAlign::Justified)
        .write_wrapped("This is a long paragraph that will automatically wrap \
            to fit within the page margins. The text is justified, \
            creating clean edges on both sides.")?;

    page.add_text_flow(&flow);
    doc.add_page(page);
    doc.save("text_flow.pdf")?;

    Ok(())
}
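Automatic wrapping, as in `write_wrapped` above, means breaking a paragraph into lines that fit the available width. The following is a minimal sketch of the common greedy approach, not the crate's implementation. `wrap_greedy` is a hypothetical helper, and it assumes width is measured in characters; a PDF library would measure glyph widths for the selected font and size instead.

```rust
/// Greedy line breaking: place each word on the current line if it fits,
/// otherwise start a new line. Width is counted in characters, which is an
/// assumption for illustration. A word longer than `max_width` is placed on
/// a line of its own and will overflow.
fn wrap_greedy(text: &str, max_width: usize) -> Vec<String> {
    let mut lines = Vec::new();
    let mut current = String::new();
    let mut current_len = 0; // width of `current` in characters

    for word in text.split_whitespace() {
        let word_len = word.chars().count();
        // +1 accounts for the space that would join the word to the line
        if !current.is_empty() && current_len + 1 + word_len > max_width {
            lines.push(std::mem::take(&mut current));
            current_len = 0;
        }
        if !current.is_empty() {
            current.push(' ');
            current_len += 1;
        }
        current.push_str(word);
        current_len += word_len;
    }
    if !current.is_empty() {
        lines.push(current);
    }
    lines
}

fn main() {
    let paragraph = "This is a long paragraph that will automatically wrap \
                     to fit within the page margins.";
    for line in wrap_greedy(paragraph, 30) {
        println!("{line}");
    }
}
```

Justified alignment would then distribute the leftover width of each line (except the last) across its inter-word gaps; that step is omitted here.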
use oxidize_pdf::operations::{PdfSplitter, PdfMerger, PageRange};
use oxidize_pdf::operations::{PdfRotator, RotationAngle};
use oxidize_pdf::Result;

fn main() -> Result<()> {
    // Split a PDF
    let splitter = PdfSplitter::new("input.pdf")?;
    splitter.split_by_pages("page_{}.pdf")?; // page_1.pdf, page_2.pdf, ...

    // Merge PDFs
    let mut merger = PdfMerger::new();
    merger.add_pdf("doc1.pdf", PageRange::All)?;
    merger.add_pdf("doc2.pdf", PageRange::Pages(vec![1, 3, 5]))?;
    merger.save("merged.pdf")?;

    // Rotate pages
    let rotator = PdfRotator::new("input.pdf")?;
    rotator.rotate_all(RotationAngle::Clockwise90, "rotated.pdf")?;

    Ok(())
}
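The `"page_{}.pdf"` argument is an output filename pattern; per the inline comment, it yields `page_1.pdf`, `page_2.pdf`, and so on. A small sketch of that kind of expansion follows. `expand_pattern` is a hypothetical helper, and replacing only the first `{}` with a 1-based page number is an assumption rather than documented crate behavior.

```rust
/// Replace the first `{}` in `pattern` with the page number.
/// Hypothetical helper for illustration; not part of oxidize-pdf.
fn expand_pattern(pattern: &str, page_number: usize) -> String {
    pattern.replacen("{}", &page_number.to_string(), 1)
}

fn main() {
    // Print the filenames a three-page split would produce under this sketch.
    for page in 1..=3 {
        println!("{}", expand_pattern("page_{}.pdf", page));
    }
}
```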
use oxidize_pdf::text::tesseract_provider::{TesseractOcrProvider, TesseractConfig};
use oxidize_pdf::text::ocr::{OcrOptions, OcrProvider};
use oxidize_pdf::operations::page_analysis::PageContentAnalyzer;
use oxidize_pdf::parser::PdfReader;
use oxidize_pdf::Result;

fn main() -> Result<()> {
    // Open a scanned PDF
    let document = PdfReader::open_document("scanned.pdf")?;
    let analyzer = PageContentAnalyzer::new(document);

    // Configure OCR provider
    let config = TesseractConfig::for_documents();
    let ocr_provider = TesseractOcrProvider::with_config(config)?;

    // Find and process scanned pages
    let scanned_pages = analyzer.find_scanned_pages()?;

    for page_num in scanned_pages {
        let result = analyzer.extract_text_from_scanned_page(page_num, &ocr_provider)?;
        println!("Page {}: {} (confidence: {:.1}%)",
            page_num, result.text, result.confidence * 100.0);
    }

    Ok(())
}
Before using OCR features, install Tesseract on your system:
macOS:
brew install tesseract
brew install tesseract-lang # For additional languages
Ubuntu/Debian:
sudo apt-get install tesseract-ocr
sudo apt-get install tesseract-ocr-spa # For Spanish
sudo apt-get install tesseract-ocr-deu # For German
Windows: Download from: https://github.com/UB-Mannheim/tesseract/wiki
Explore comprehensive examples in the examples/ directory:
- recovery_corrupted_pdf.rs - Handle damaged or malformed PDFs with robust error recovery
- png_transparency_watermark.rs - Create watermarks, blend modes, and transparent overlays
- cjk_text_extraction.rs - Work with Chinese, Japanese, and Korean text
- basic_chunking.rs - Document chunking for AI/RAG pipelines
- rag_pipeline.rs - Complete RAG workflow with embeddings
Run any example:
cargo run --example recovery_corrupted_pdf
cargo run --example png_transparency_watermark
cargo run --example cjk_text_extraction
Validated Metrics (based on comprehensive benchmarking):
See PERFORMANCE_HONEST_REPORT.md for detailed benchmarking methodology and results.
Check out the examples directory for more usage patterns:
- hello_world.rs - Basic PDF creation
- graphics_demo.rs - Vector graphics showcase
- text_formatting.rs - Advanced text features
- custom_fonts.rs - TTF/OTF font loading and embedding
- jpeg_image.rs - Image embedding
- parse_pdf.rs - PDF parsing and text extraction
- comprehensive_demo.rs - All features demonstration
- tesseract_ocr_demo.rs - OCR text extraction (requires --features ocr-tesseract)
- scanned_pdf_analysis.rs - Analyze PDFs for scanned content
- extract_images.rs - Extract embedded images from PDFs
- create_pdf_with_images.rs - Advanced image embedding examples
Run examples with:
cargo run --example hello_world
# For OCR examples
cargo run --example tesseract_ocr_demo --features ocr-tesseract
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0) - see the LICENSE file for details.
AGPL-3.0 ensures that oxidize-pdf remains free and open source while protecting against proprietary use in SaaS without contribution back to the community. This license:
oxidize-pdf-core is free and open source (AGPL-3.0). For commercial products and services:
Commercial Products:
Commercial License Benefits:
For commercial licensing inquiries, please open an issue on the GitHub repository.
oxidize-pdf provides basic PDF functionality. We prioritize transparency about what works and what doesn't.
We're actively adding more examples for core features. New examples include:
- merge_pdfs.rs - PDF merging with various options
- split_pdf.rs - Different splitting strategies
- extract_text.rs - Text extraction with layout preservation
- encryption.rs - RC4 and AES encryption demonstrations
oxidize-pdf/
├── oxidize-pdf-core/ # Core PDF library (AGPL-3.0)
├── test-suite/ # Comprehensive test suite
├── docs/ # Documentation
│ ├── technical/ # Technical docs and implementation details
│ └── reports/ # Analysis and test reports
├── tools/ # Development and analysis tools
├── scripts/ # Build and release scripts
└── test-pdfs/ # Test PDF files
Commercial Products (available separately under commercial license):
See REPOSITORY_ARCHITECTURE.md for detailed information.
oxidize-pdf includes comprehensive test suites to ensure reliability:
# Run standard test suite (synthetic PDFs)
cargo test
# Run all tests including performance benchmarks
cargo test -- --ignored
# Run with local PDF fixtures (if available)
OXIDIZE_PDF_FIXTURES=on cargo test
# Run OCR tests (requires Tesseract installation)
cargo test tesseract_ocr_tests --features ocr-tesseract -- --ignored
For enhanced testing with real-world PDFs, you can optionally set up local PDF fixtures:
- tests/fixtures -> /path/to/your/pdf/collection
- (.gitignore)
Note: CI/CD always uses synthetic PDFs only for consistent, fast builds.
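The `OXIDIZE_PDF_FIXTURES=on` switch shown in the commands above suggests an opt-in check inside the test helpers. The sketch below shows one way such a gate could be written. It is an assumption about the approach, not the project's actual test code: `fixtures_enabled` is a hypothetical helper, and requiring both the exact value `on` and an existing `tests/fixtures` directory is a guess.

```rust
use std::env;
use std::path::Path;

/// Decide whether fixture-based tests should run.
/// Hypothetical helper: requires the flag to be exactly "on" and the
/// fixtures directory to exist.
fn fixtures_enabled(flag: Option<&str>, fixtures_dir: &Path) -> bool {
    flag == Some("on") && fixtures_dir.is_dir()
}

fn main() {
    // Read the switch from the environment; unset or non-UTF-8 means "off".
    let flag = env::var("OXIDIZE_PDF_FIXTURES").ok();
    let dir = Path::new("tests/fixtures");

    if fixtures_enabled(flag.as_deref(), dir) {
        println!("running with local PDF fixtures from {}", dir.display());
    } else {
        println!("fixtures disabled or missing; using synthetic PDFs only");
    }
}
```

Passing the flag value in as a parameter, rather than reading the environment inside the helper, keeps the decision easy to unit test.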
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
oxidize-pdf is under active development. Our focus areas include:
We prioritize features based on community feedback and real-world usage. Have a specific need? Open an issue to discuss!
Built with ❤️ using Rust. Special thanks to the Rust community and all contributors.