| Crates.io | unhwp |
| lib.rs | unhwp |
| version | 0.1.6 |
| created_at | 2025-12-19 04:45:11.400708+00 |
| updated_at | 2025-12-20 08:52:19.547382+00 |
| description | A high-performance library for extracting HWP/HWPX documents into structured Markdown |
| homepage | |
| repository | https://github.com/iyulab/unhwp |
| max_upload_size | |
| id | 1994224 |
| size | 404,294 |
A high-performance Rust library for extracting HWP/HWPX Korean word processor documents into structured Markdown with assets.
Add as a dependency:

```bash
cargo add unhwp
```

Install the binary:

```bash
cargo install unhwp
```

Or download a prebuilt archive from GitHub Releases:
| Platform | Architecture | File |
|---|---|---|
| Windows | x64 | unhwp-x86_64-pc-windows-msvc.zip |
| Linux | x64 | unhwp-x86_64-unknown-linux-gnu.tar.gz |
| macOS | Intel | unhwp-x86_64-apple-darwin.tar.gz |
| macOS | Apple Silicon | unhwp-aarch64-apple-darwin.tar.gz |

```rust
use unhwp::to_markdown;

fn main() -> unhwp::Result<()> {
    // Simple text extraction
    let text = unhwp::extract_text("document.hwp")?;
    println!("{}", text);

    // Convert to Markdown
    let markdown = to_markdown("document.hwp")?;
    std::fs::write("output.md", markdown)?;

    Ok(())
}
```

unhwp provides four complementary output formats:
| Format | Method | Description |
|---|---|---|
| RawContent | `doc.raw_content()` | JSON with full metadata, styles, structure |
| RawText | `doc.plain_text()` | Pure text without formatting |
| Markdown | `to_markdown()` | Structured Markdown |
| Images | `doc.resources` | Extracted binary assets |

Get the complete document structure with all metadata:

```rust
let doc = unhwp::parse_file("document.hwp")?;
let json = doc.raw_content();

// JSON includes:
// - metadata: title, author, created, modified
// - sections: paragraphs, tables
// - styles: bold, italic, underline, font, color
// - tables: rows, cells, colspan, rowspan
// - images, equations, links
```

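The `colspan` and `rowspan` fields matter because Markdown pipe tables cannot express merged cells. The sketch below shows one way a consumer of that JSON might choose between a pipe table and an HTML table. The `Cell` struct and every function name here are hypothetical illustrations, not part of unhwp's API, and the code makes no attempt to escape HTML or `|` characters in cell text.

```rust
/// Hypothetical cell model; the field names mirror the JSON keys listed above.
struct Cell {
    text: String,
    colspan: u32,
    rowspan: u32,
}

/// Convenience constructor for the hypothetical `Cell`.
fn cell(text: &str, colspan: u32, rowspan: u32) -> Cell {
    Cell { text: text.to_string(), colspan, rowspan }
}

/// True when any cell is merged, which a Markdown pipe table cannot represent.
fn needs_html(rows: &[Vec<Cell>]) -> bool {
    rows.iter().flatten().any(|c| c.colspan > 1 || c.rowspan > 1)
}

/// Render a pipe table, treating the first row as the header.
fn render_pipe(rows: &[Vec<Cell>]) -> String {
    let mut out = String::new();
    for (i, row) in rows.iter().enumerate() {
        let cells: Vec<&str> = row.iter().map(|c| c.text.as_str()).collect();
        out.push_str(&format!("| {} |\n", cells.join(" | ")));
        if i == 0 {
            // Separator line required after the header row.
            out.push_str(&format!("|{}\n", "---|".repeat(row.len())));
        }
    }
    out
}

/// Render an HTML table, emitting span attributes only for merged cells.
fn render_html(rows: &[Vec<Cell>]) -> String {
    let mut out = String::from("<table>\n");
    for row in rows {
        out.push_str("<tr>");
        for c in row {
            out.push_str("<td");
            if c.colspan > 1 {
                out.push_str(&format!(" colspan=\"{}\"", c.colspan));
            }
            if c.rowspan > 1 {
                out.push_str(&format!(" rowspan=\"{}\"", c.rowspan));
            }
            out.push_str(&format!(">{}</td>", c.text));
        }
        out.push_str("</tr>\n");
    }
    out.push_str("</table>\n");
    out
}

/// Use a pipe table when possible and fall back to HTML for merged cells.
fn render_table(rows: &[Vec<Cell>]) -> String {
    if needs_html(rows) {
        render_html(rows)
    } else {
        render_pipe(rows)
    }
}
```

Calling `render_table` on rows whose cells all have spans of 1 yields a pipe table; a single cell built with `cell("A", 2, 1)` is enough to switch the whole table to HTML.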

```rust
use unhwp::{Unhwp, TableFallback};

let markdown = Unhwp::new()
    .with_images(true)
    .with_image_dir("./assets")
    .with_table_fallback(TableFallback::Html)
    .with_frontmatter()
    .lenient() // Skip invalid sections
    .parse("document.hwp")?
    .to_markdown()?;
```

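The `.lenient()` comment above says invalid sections are skipped. As a rough illustration of that idea only, the following sketch contrasts strict and lenient collection over a list of sections. `Section`, `decode_section`, `parse_strict`, and `parse_lenient` are hypothetical names invented for this example and say nothing about how unhwp implements lenient mode internally.

```rust
/// Hypothetical stand-in for a decoded section; not a type from unhwp.
#[derive(Debug, PartialEq)]
struct Section(String);

/// Hypothetical decoder: treats an empty input as an invalid section.
fn decode_section(raw: &str) -> Result<Section, String> {
    if raw.is_empty() {
        Err("empty section".to_string())
    } else {
        Ok(Section(raw.to_string()))
    }
}

/// Strict: the first invalid section aborts the whole parse.
fn parse_strict(raw: &[&str]) -> Result<Vec<Section>, String> {
    raw.iter().map(|r| decode_section(r)).collect()
}

/// Lenient: invalid sections are dropped, valid ones are kept in order.
fn parse_lenient(raw: &[&str]) -> Vec<Section> {
    raw.iter().filter_map(|r| decode_section(r).ok()).collect()
}
```

With the input `["intro", "", "body"]`, `parse_strict` returns an error while `parse_lenient` keeps the two valid sections. The trade-off is that lenient mode can silently lose content, so it suits bulk extraction better than archival conversion.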
unhwp provides C-ABI compatible bindings for use with P/Invoke:

```csharp
using var doc = HwpDocument.Parse("document.hwp");

// Access multiple output formats
string markdown = doc.Markdown;
string text = doc.RawText;
string json = doc.RawContent; // Full structured JSON

// Extract images
foreach (var image in doc.Images)
{
    image.SaveTo($"./images/{image.Name}");
}
```

See C# Integration Guide for complete documentation.
| Format | Container | Status |
|---|---|---|
| HWP 5.0+ | OLE/CFB | ✅ Supported |
| HWPX | ZIP/XML | ✅ Supported |
| HWP 3.x | Binary | ✅ Supported (feature: hwp3) |
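
Because the three formats use different containers, a caller can often tell them apart from the first bytes of a file. The sketch below checks the standard CFB and ZIP signatures; the HWP 3.x text prefix is an assumption based on the commonly documented file signature, and `Container` and `sniff_container` are hypothetical names, not unhwp's API. A match only identifies the container: other formats also use CFB and ZIP, so this is a pre-filter, not validation.

```rust
/// Hypothetical container classification for illustration.
#[derive(Debug, PartialEq)]
enum Container {
    Cfb,     // OLE / Compound File Binary, used by HWP 5.0+
    Zip,     // ZIP archive, used by HWPX
    Hwp3,    // legacy HWP 3.x binary
    Unknown,
}

/// Compound File Binary header signature.
const CFB_MAGIC: [u8; 8] = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];
/// ZIP local file header signature ("PK\x03\x04").
const ZIP_MAGIC: [u8; 4] = [0x50, 0x4B, 0x03, 0x04];
/// Assumed leading text of an HWP 3.x file signature.
const HWP3_PREFIX: &[u8] = b"HWP Document File";

/// Classify a file by the first bytes of its header.
fn sniff_container(header: &[u8]) -> Container {
    if header.starts_with(&CFB_MAGIC) {
        Container::Cfb
    } else if header.starts_with(&ZIP_MAGIC) {
        Container::Zip
    } else if header.starts_with(HWP3_PREFIX) {
        Container::Hwp3
    } else {
        Container::Unknown
    }
}
```

In practice the header would come from reading the first few dozen bytes of the file with `std::fs::File` and `Read::read`, then passing that slice to `sniff_container`.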
unhwp maintains document structure during conversion:

- Headings: `#`, `##`, `###`
- Inline styles: bold (`**`), italic (`*`), underline (`<u>`), strikethrough (`~~`)

| Feature | Description | Default |
|---|---|---|
| `hwp5` | HWP 5.0 binary format support | ✅ |
| `hwpx` | HWPX XML format support | ✅ |
| `hwp3` | Legacy HWP 3.x support (EUC-KR) | ❌ |
| `async` | Async I/O with Tokio | ❌ |

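
Returning to the structure-preservation markers listed before the feature table (`**`, `*`, `<u>`, `~~`), here is a minimal sketch of how a styled text run could be wrapped in those markers. The `Style` struct, `render_run`, and the nesting order are assumptions made for illustration; unhwp's actual renderer may order or escape things differently.

```rust
/// Hypothetical per-run style flags; not a type from unhwp.
#[derive(Default, Clone, Copy)]
struct Style {
    bold: bool,
    italic: bool,
    underline: bool,
    strike: bool,
}

/// Wrap `text` in Markdown/HTML markers, innermost first:
/// bold, then italic, then strikethrough, then underline.
fn render_run(text: &str, style: Style) -> String {
    let mut out = text.to_string();
    if style.bold {
        out = format!("**{out}**");
    }
    if style.italic {
        out = format!("*{out}*");
    }
    if style.strike {
        out = format!("~~{out}~~");
    }
    if style.underline {
        // Markdown has no underline syntax, so inline HTML is used.
        out = format!("<u>{out}</u>");
    }
    out
}
```

For example, a run with `bold` and `italic` both set renders as `***text***`, and a run with no flags set is returned unchanged.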

```bash
# Convert to Markdown
unhwp-cli document.hwp -o output.md

# Extract plain text
unhwp-cli document.hwp --text

# Extract with cleanup (for LLM training)
unhwp-cli document.hwp --cleanup
```

Run benchmarks:

```bash
cargo bench
```

MIT License - see LICENSE for details.