| Crates.io | parser-core |
| lib.rs | parser-core |
| version | 0.1.3 |
| created_at | 2025-03-17 01:07:51.213288+00 |
| updated_at | 2025-03-19 20:48:43.595301+00 |
| description | A library for extracting text from various file formats including PDF, DOCX, XLSX, PPTX, images via OCR, and more |
| homepage | |
| repository | https://github.com/excoffierleonard/parser |
| max_upload_size | |
| id | 1594899 |
| size | 5,326,386 |
The core engine of the parser project, providing functionality for extracting text from various file formats.
- PDF documents (.pdf)
- Office documents (.docx, .xlsx, .pptx)
- Plain text formats (.txt, .csv, .json)
- Images via OCR (.png, .jpg, .webp)

This package requires the following system libraries:
On Debian/Ubuntu:
sudo apt install libtesseract-dev libleptonica-dev libclang-dev
On macOS (Homebrew):
brew install tesseract
Otherwise, follow the installation instructions in the Tesseract GitHub repository.
Add the crate as a dependency (this updates your Cargo.toml):
cargo add parser-core
Basic usage:
use parser_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read the file into memory as raw bytes
    let data = std::fs::read("document.pdf")?;

    // Parse the document and extract its text
    let text = parse(&data)?;

    println!("Extracted text: {}", text);
    Ok(())
}
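The crate is organized around a central parse function that takes a file's raw bytes and returns the extracted text, delegating the work to a format-specific parser.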
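One plausible way for a central entry point to pick a parser is to inspect the leading "magic" bytes of the input. The sketch below is illustrative only and is not taken from parser-core: the `Format` enum, `sniff`, and `parse_sketch` are hypothetical names, and the crate may detect formats in a different way. Only the plain-text branch does real work here; the other branches stand in for the format-specific parsers.

```rust
/// Hypothetical format tags; not part of parser-core's public API.
#[derive(Debug, PartialEq)]
enum Format {
    Pdf,
    OfficeZip, // .docx, .xlsx, and .pptx are all ZIP containers
    Png,
    Jpeg,
    Webp,
    Text,
    Unknown,
}

/// Guess the format from well-known leading signature bytes.
fn sniff(data: &[u8]) -> Format {
    if data.starts_with(b"%PDF-") {
        Format::Pdf
    } else if data.starts_with(b"PK\x03\x04") {
        Format::OfficeZip
    } else if data.starts_with(b"\x89PNG\r\n\x1a\n") {
        Format::Png
    } else if data.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Format::Jpeg
    } else if data.len() >= 12 && &data[..4] == b"RIFF" && &data[8..12] == b"WEBP" {
        Format::Webp
    } else if std::str::from_utf8(data).is_ok() {
        // No binary signature matched, but the bytes are valid UTF-8.
        Format::Text
    } else {
        Format::Unknown
    }
}

/// Route the bytes to a parser based on the sniffed format (sketch only).
fn parse_sketch(data: &[u8]) -> Result<String, String> {
    match sniff(data) {
        Format::Text => String::from_utf8(data.to_vec()).map_err(|e| e.to_string()),
        Format::Unknown => Err("unsupported format".to_string()),
        other => Err(format!("{other:?} parser is not implemented in this sketch")),
    }
}

fn main() {
    let samples: [&[u8]; 3] = [b"%PDF-1.7", b"hello, world", &[0xFF, 0xFE, 0x00]];
    for sample in samples {
        println!("{:?} -> {:?}", sniff(sample), parse_sketch(sample));
    }
}
```

Sniffing the content rather than trusting a file extension fits an API that, like the basic usage example, receives only bytes and no file name.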
Each parser is implemented in its own module:
- docx.rs - Microsoft Word documents
- pdf.rs - PDF documents
- xlsx.rs - Microsoft Excel spreadsheets
- pptx.rs - Microsoft PowerPoint presentations
- text.rs - Plain text formats, including CSV and JSON
- image.rs - Image formats using OCR

Run tests with:
cargo test
Benchmark sequential vs. parallel parsing:
cargo bench
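To illustrate what a sequential versus parallel comparison measures, here is a minimal standard-library sketch. It is not the crate's benchmark: `fake_parse` is a hypothetical stand-in workload (a word count), the worker count of 4 is arbitrary, and the timings it prints depend entirely on the machine. The crate's own benchmarks may use a different harness and threading approach.

```rust
use std::thread;
use std::time::Instant;

/// Hypothetical stand-in for a real parser: counts whitespace-separated words.
fn fake_parse(data: &[u8]) -> usize {
    String::from_utf8_lossy(data).split_whitespace().count()
}

/// Process every document one after another on the current thread.
fn parse_sequential(docs: &[Vec<u8>]) -> Vec<usize> {
    docs.iter().map(|d| fake_parse(d)).collect()
}

/// Split the documents into contiguous chunks and process each chunk on its
/// own scoped thread. Results come back in the original document order.
fn parse_parallel(docs: &[Vec<u8>], workers: usize) -> Vec<usize> {
    let workers = workers.max(1);
    // Ceiling division; at least 1 because `chunks(0)` would panic.
    let chunk_len = ((docs.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        // Spawn all workers first, then join them in order.
        let handles: Vec<_> = docs
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().map(|d| fake_parse(d)).collect::<Vec<_>>()))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}

fn main() {
    // Synthetic input: 64 in-memory "documents" of repeated text.
    let docs: Vec<Vec<u8>> = (0..64)
        .map(|i| format!("document {i} ").repeat(10_000).into_bytes())
        .collect();

    let start = Instant::now();
    let seq = parse_sequential(&docs);
    let seq_time = start.elapsed();

    let start = Instant::now();
    let par = parse_parallel(&docs, 4);
    let par_time = start.elapsed();

    // Both strategies must produce identical results.
    assert_eq!(seq, par);
    println!("sequential: {seq_time:?}, parallel: {par_time:?}");
}
```

A single timed run like this is noisy; a proper benchmark harness repeats each measurement many times and reports statistics.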