| Crates.io | mupdf-basic-text-extractor |
| lib.rs | mupdf-basic-text-extractor |
| version | 0.4.0 |
| created_at | 2025-06-25 08:47:50.933848+00 |
| updated_at | 2025-07-15 08:32:27.362779+00 |
| description | Basic structured text extraction using mupdf-rs. |
| homepage | |
| repository | |
| max_upload_size | |
| id | 1725510 |
| size | 649,941 |
This crate uses the mupdf-rs bindings to do a very simple structured text extraction.
Because of the usage of mupdf itself, this extractor is AGPL-Licensed.
This module is not built for broad, generalized usage, but it may serve as a simple jumping-off point and as an example of how to use the mupdf bindings.
This module assumes the following is true for the use case:
This Basic Text Extractor is what it says on the tin: basic.
The structure for each page looks like this. NOTE: font_name is a field, but there does not appear to be a way to get the font name through the bindings (or at least I have not found one).
    Page {
        lines {
            Line {
                text_fragments {
                    Fragment {
                        text: String,
                        x: f64,
                        y: f64,
                        font_name: Option<String>,
                        font_size: f64,
                        bbox_width: f64,
                        bbox_height: f64
                    }
                }
            }
        }
    }
A Line is a series of TextFragments which share the same Y-Value. The fragments within the line are sorted by their X-value to be in proper PDF left-to-right order.
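The grouping rule described above can be sketched as follows. This is only an illustration, not the crate's actual implementation: the `Fragment` and `Line` structs here are simplified, hypothetical stand-ins that carry just the fields needed, `group_into_lines` is a made-up helper name, and ordering lines from the top of the page downward is an assumption.

```rust
// Hypothetical, simplified stand-ins for the crate's types (assumption:
// only the fields needed for grouping are included here).
#[derive(Debug, Clone, PartialEq)]
pub struct Fragment {
    pub text: String,
    pub x: f64,
    pub y: f64,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Line {
    pub text_fragments: Vec<Fragment>,
}

/// Groups fragments that share exactly the same y-value into lines and
/// sorts the fragments of each line left to right by x.
/// Assumption: lines are returned top of page first, i.e. largest y first
/// in a bottom-left-origin coordinate system.
pub fn group_into_lines(mut fragments: Vec<Fragment>) -> Vec<Line> {
    // Sort by y descending, then by x ascending. `total_cmp` gives a
    // total order on f64, so no unwrap on partial_cmp is needed.
    fragments.sort_by(|a, b| b.y.total_cmp(&a.y).then(a.x.total_cmp(&b.x)));

    let mut lines: Vec<Line> = Vec::new();
    for fragment in fragments {
        match lines.last_mut() {
            // Same y as the line being built: append to it.
            Some(line) if line.text_fragments[0].y == fragment.y => {
                line.text_fragments.push(fragment)
            }
            // First fragment, or a new y-value: start a new line.
            _ => lines.push(Line { text_fragments: vec![fragment] }),
        }
    }
    lines
}
```

The comparison uses exact equality on y, matching the "same Y-Value" wording above. Text whose baselines differ by a fraction of a point would land on separate lines, so a tolerance may be worth adding for real documents.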
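    use mupdf_basic_text_extractor::get_structured_document_from_filepath;

    // `path` is the path to a PDF file. The call returns
    // Result<Doc, Box<dyn std::error::Error>>, so propagate the error with `?`
    // (inside a function that returns a compatible Result) before iterating.
    let document = get_structured_document_from_filepath(path)?;
    for page in document.pages {
        for line in page.lines {
            for fragment in line.text_fragments {
                todo!()
            }
        }
    }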
MuPDF uses a coordinate system with its origin at the top left, and it was not clear to me what counts as the "local origin" of a text element. The x,y positions are therefore taken directly from the lower-left corner of the fragment's bounding box, and the y value is computed as the page height minus MuPDF's y value.
The output now reflects the PDF coordinate system, with 0,0 at the bottom left.
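As a rough sketch of that conversion, assuming a bounding box given in top-left-origin coordinates as two corners: the `TopLeftBBox` struct and the `to_pdf_origin` and `bbox_size` helpers below are hypothetical names for illustration and are not part of the crate or of mupdf-rs.

```rust
/// Hypothetical bounding box in a top-left-origin system, where y grows
/// downward: (x0, y0) is the top-left corner, (x1, y1) the bottom-right.
#[derive(Debug, Clone, Copy)]
pub struct TopLeftBBox {
    pub x0: f64,
    pub y0: f64,
    pub x1: f64,
    pub y1: f64,
}

/// Returns the lower-left corner of the box in bottom-left-origin (PDF)
/// coordinates.
pub fn to_pdf_origin(bbox: TopLeftBBox, page_height: f64) -> (f64, f64) {
    // In a top-left system the visually lowest edge has the larger y (y1),
    // so the flipped y is the page height minus that value.
    (bbox.x0, page_height - bbox.y1)
}

/// Width and height of the box; these do not change when the axis flips.
pub fn bbox_size(bbox: TopLeftBBox) -> (f64, f64) {
    (bbox.x1 - bbox.x0, bbox.y1 - bbox.y0)
}
```

For example, on a page 792 units tall, a box spanning y = 100 to y = 112 from the top has its lower-left corner at y = 792 - 112 = 680 measured from the bottom.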