undoc

version: 0.1.10
created_at: 2025-12-20 13:17:52.488725+00
updated_at: 2025-12-21 12:57:43.695+00
description: High-performance Microsoft Office document extraction to Markdown
repository: https://github.com/iyulab/undoc
id: 1996500
size: 369,512
caveman (iyulab-caveman)


undoc

A high-performance Rust library for extracting content from Microsoft Office documents (DOCX, XLSX, PPTX) to Markdown, plain text, and JSON.

Features

  • Multi-format support: DOCX (Word), XLSX (Excel), PPTX (PowerPoint)
  • Multiple output formats: Markdown, Plain Text, JSON (with full metadata)
  • Structure preservation: Headings, lists, tables, inline formatting
  • PPTX table extraction: Full table parsing from PowerPoint slides
  • CJK text support: Smart spacing for Korean, Chinese, Japanese content
  • Asset extraction: Images, charts, and embedded media
  • Text cleanup: Multiple presets for LLM training data preparation
  • Self-update: Built-in update mechanism via GitHub releases
  • C-ABI FFI: Native library for C#, Python, and other languages
  • Parallel processing: Uses Rayon for multi-section documents

Installation

Pre-built Binaries (Recommended)

Download the latest release from GitHub Releases.

Windows (x64)

# Download and extract
Invoke-WebRequest -Uri "https://github.com/iyulab/undoc/releases/latest/download/undoc-cli-x86_64-pc-windows-msvc.zip" -OutFile "undoc.zip"
Expand-Archive -Path "undoc.zip" -DestinationPath "."

# Move to a directory in PATH (optional)
Move-Item -Path "undoc.exe" -Destination "$env:LOCALAPPDATA\Microsoft\WindowsApps\"

# Verify installation
undoc version

Linux (x64)

# Download and extract
curl -LO https://github.com/iyulab/undoc/releases/latest/download/undoc-cli-x86_64-unknown-linux-gnu.tar.gz
tar -xzf undoc-cli-x86_64-unknown-linux-gnu.tar.gz

# Install to /usr/local/bin (requires sudo)
sudo mv undoc /usr/local/bin/

# Or install to user directory
mkdir -p ~/.local/bin
mv undoc ~/.local/bin/
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

# Verify installation
undoc version

macOS

# Intel Mac (run only the download block that matches your CPU)
curl -LO https://github.com/iyulab/undoc/releases/latest/download/undoc-cli-x86_64-apple-darwin.tar.gz
tar -xzf undoc-cli-x86_64-apple-darwin.tar.gz

# Apple Silicon (M1/M2/M3)
curl -LO https://github.com/iyulab/undoc/releases/latest/download/undoc-cli-aarch64-apple-darwin.tar.gz
tar -xzf undoc-cli-aarch64-apple-darwin.tar.gz

# Install
sudo mv undoc /usr/local/bin/

# Verify
undoc version

Available Binaries

| Platform | Architecture | File |
| --- | --- | --- |
| Windows | x64 | undoc-cli-x86_64-pc-windows-msvc.zip |
| Linux | x64 | undoc-cli-x86_64-unknown-linux-gnu.tar.gz |
| macOS | Intel | undoc-cli-x86_64-apple-darwin.tar.gz |
| macOS | Apple Silicon | undoc-cli-aarch64-apple-darwin.tar.gz |

Updating

undoc includes a built-in self-update mechanism:

# Check for updates
undoc update --check

# Update to latest version
undoc update

# Force reinstall (even if on latest)
undoc update --force

Install via Cargo

If you have Rust installed:

# Install CLI
cargo install undoc-cli

# Add library to your project
cargo add undoc

CLI Usage

Quick Start

# Extract all formats (Markdown, text, JSON) + media to output directory
undoc document.docx

# Specify output directory
undoc document.docx ./output

# With text cleanup for LLM training
undoc document.docx --cleanup aggressive

Output Structure

document_output/
├── extract.md      # Markdown output with frontmatter
├── extract.txt     # Plain text output
├── content.json    # Full structured JSON
└── media/          # Extracted images and media
    └── image1.jpeg

Commands

undoc <file> [output]              # Extract all formats (default)
undoc convert <file> [OPTIONS]     # Same as above, explicit command
undoc markdown <file> [OPTIONS]    # Convert to Markdown only (alias: md)
undoc text <file> [OPTIONS]        # Convert to plain text only
undoc json <file> [OPTIONS]        # Convert to JSON only
undoc info <file>                  # Show document information
undoc extract <file> [OPTIONS]     # Extract resources only
undoc update [OPTIONS]             # Self-update to latest version
undoc version                      # Show version information

Convert to Markdown

# Basic conversion (output to stdout)
undoc markdown document.docx

# Save to file
undoc markdown document.docx -o output.md

# With YAML frontmatter
undoc markdown document.docx --frontmatter -o output.md

# With text cleanup for LLM training
undoc markdown document.docx --cleanup standard -o cleaned.md

# Table rendering options
undoc markdown spreadsheet.xlsx --table-mode html -o output.md

# Limit heading depth
undoc markdown document.docx --max-heading 3 -o output.md

Markdown Options

| Option | Description | Default |
| --- | --- | --- |
| -o, --output | Output file path | stdout |
| -f, --frontmatter | Include YAML frontmatter | false |
| --table-mode | Table rendering: markdown, html, ascii | markdown |
| --cleanup | Text cleanup: minimal, standard, aggressive | none |
| --max-heading | Maximum heading level (1-6) | 6 |
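
To make the effect of a heading limit such as `--max-heading 3` concrete, here is a minimal Rust sketch. It assumes that headings deeper than the limit are clamped to the maximum level; this README does not say whether undoc clamps, flattens, or drops them. `heading_prefix` is a hypothetical helper, not part of the undoc API.

```rust
/// Hypothetical helper (not part of undoc): returns the Markdown prefix for a
/// heading of the given level, limited to `max_heading`.
/// Assumption: deeper headings are clamped to the limit rather than dropped.
fn heading_prefix(level: u8, max_heading: u8) -> String {
    // Markdown only defines heading levels 1 through 6.
    let max = max_heading.clamp(1, 6);
    let effective = level.clamp(1, max);
    "#".repeat(effective as usize)
}
```

Under that assumption, a level-5 heading rendered with a limit of 3 would come out as `###`.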

Convert to Plain Text

# Basic extraction
undoc text document.docx

# With cleanup
undoc text document.docx --cleanup standard -o output.txt

Convert to JSON

# Pretty-printed JSON
undoc json document.docx -o output.json

# Compact JSON
undoc json document.docx --compact -o output.json

Show Document Information

undoc info document.docx

Output:

Document Information
────────────────────────────────────────
File: document.docx
Format: Docx
Sections: 5
Resources: 3
Title: My Document
Author: John Doe
Pages/Slides/Sheets: 10
Created: 2025-01-15T10:30:00Z
Modified: 2025-01-20T14:45:00Z

Content Statistics
────────────────────────────────────────
Words: 2500
Characters: 15000
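
If you need these values in a script, the `Key: value` lines in the sample output above are easy to parse. The following is a minimal Rust sketch; `parse_info` is a hypothetical helper, and it assumes the output keeps the layout shown here, which may differ between versions.

```rust
use std::collections::HashMap;

/// Parses `Key: value` lines (as in the sample `undoc info` output) into a map.
/// Lines without a ": " separator, such as section titles and rule lines,
/// are skipped.
fn parse_info(output: &str) -> HashMap<String, String> {
    output
        .lines()
        // Split on the first ": " only, so timestamps like 10:30:00 stay intact.
        .filter_map(|line| line.split_once(": "))
        .map(|(key, value)| (key.trim().to_string(), value.trim().to_string()))
        .collect()
}
```

For structured access without text parsing, the `undoc json` command described above is likely the more robust choice.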

Extract Resources

# Extract to current directory
undoc extract presentation.pptx

# Extract to specific directory
undoc extract presentation.pptx -o ./media

Self-Update

# Check for updates
undoc update --check

# Update to latest version
undoc update

# Force reinstall
undoc update --force

Examples

# Convert Word document to Markdown with frontmatter
undoc md report.docx --frontmatter -o report.md

# Convert Excel to Markdown tables
undoc md data.xlsx -o tables.md

# Convert PowerPoint to Markdown
undoc md presentation.pptx -o slides.md

# Extract all images from a document
undoc extract report.docx -o ./images

# Get document metadata
undoc info document.docx

# Convert with aggressive cleanup for AI training
undoc md document.docx --cleanup aggressive -o cleaned.md

# Batch conversion (shell)
for f in *.docx; do undoc md "$f" -o "${f%.docx}.md"; done

# Batch conversion (PowerShell)
Get-ChildItem *.docx | ForEach-Object { undoc md $_.FullName -o "$($_.BaseName).md" }
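
The same batch conversion can be driven from Rust by shelling out to the CLI. This is a minimal sketch that assumes `undoc` is on `PATH`; `markdown_path`, `conversion_command`, and `convert_all` are hypothetical helper names, and only the `undoc md <file> -o <output>` invocation comes from the examples above.

```rust
use std::path::{Path, PathBuf};
use std::process::Command;

/// Output path for a converted file: same location, `.md` extension.
fn markdown_path(input: &Path) -> PathBuf {
    input.with_extension("md")
}

/// Builds (but does not run) the `undoc md <input> -o <output>` invocation.
fn conversion_command(input: &Path) -> Command {
    let mut cmd = Command::new("undoc");
    cmd.arg("md").arg(input).arg("-o").arg(markdown_path(input));
    cmd
}

/// Converts every `.docx` file in `dir` and returns the inputs that failed.
fn convert_all(dir: &Path) -> std::io::Result<Vec<PathBuf>> {
    let mut failed = Vec::new();
    for entry in std::fs::read_dir(dir)? {
        let path = entry?.path();
        let is_docx = path
            .extension()
            .map_or(false, |ext| ext.eq_ignore_ascii_case("docx"));
        // A non-zero exit status is recorded instead of aborting the batch.
        if is_docx && !conversion_command(&path).status()?.success() {
            failed.push(path);
        }
    }
    Ok(failed)
}
```

Collecting failures instead of stopping at the first one mirrors the shell loops, which also continue past a file that fails to convert.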

Rust Library Usage

Quick Start

use undoc::{parse_file, render};

fn main() -> undoc::Result<()> {
    // Parse document
    let doc = parse_file("document.docx")?;

    // Convert to Markdown
    let options = render::RenderOptions::default();
    let markdown = render::to_markdown(&doc, &options)?;
    println!("{}", markdown);

    // Get plain text
    let text = render::to_text(&doc, &options)?;

    // Get JSON
    let json = render::to_json(&doc, render::JsonFormat::Pretty)?;

    Ok(())
}

Render Options

use undoc::render::{self, RenderOptions, CleanupPreset, TableFallback};

let options = RenderOptions::new()
    .with_frontmatter(true)
    .with_table_fallback(TableFallback::Html)
    .with_cleanup_preset(CleanupPreset::Aggressive)
    .with_max_heading(3);

let markdown = render::to_markdown(&doc, &options)?;

Working with Document Structure

use undoc::parse_file;

let doc = parse_file("document.docx")?;

// Access metadata
println!("Title: {:?}", doc.metadata.title);
println!("Author: {:?}", doc.metadata.author);
println!("Created: {:?}", doc.metadata.created);

// Iterate sections
for section in &doc.sections {
    println!("Section: {:?}", section.name);
    for element in &section.elements {
        // Process paragraphs, tables, etc.
    }
}

// Extract resources
for (id, resource) in &doc.resources {
    let filename = resource.suggested_filename(id);
    std::fs::write(&filename, &resource.data)?;
}

Format Detection

use undoc::{detect_format_from_path, detect_format_from_bytes, FormatType};

// From file path
let format = detect_format_from_path("document.docx")?;
assert_eq!(format, FormatType::Docx);

// From bytes
let data = std::fs::read("document.docx")?;
let format = detect_format_from_bytes(&data)?;
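
For intuition about what the two kinds of detection involve, here is a standalone Rust sketch that does not use undoc. `OfficeFormat`, `format_from_extension`, and `looks_like_zip` are hypothetical names, and this is not necessarily how undoc implements detection. The one fixed fact it relies on is that DOCX, XLSX, and PPTX files are ZIP containers.

```rust
use std::path::Path;

/// Hypothetical stand-in for illustration; undoc's own type is `FormatType`.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum OfficeFormat {
    Docx,
    Xlsx,
    Pptx,
}

/// Guesses the format from the file extension (case-insensitive).
fn format_from_extension(path: &Path) -> Option<OfficeFormat> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "docx" => Some(OfficeFormat::Docx),
        "xlsx" => Some(OfficeFormat::Xlsx),
        "pptx" => Some(OfficeFormat::Pptx),
        _ => None,
    }
}

/// Checks for the ZIP local-file-header signature `PK\x03\x04` that a
/// DOCX, XLSX, or PPTX file starts with.
fn looks_like_zip(data: &[u8]) -> bool {
    data.starts_with(b"PK\x03\x04")
}
```

The signature check alone cannot tell the three formats apart; doing that from bytes requires inspecting the parts inside the archive.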

C# / .NET Integration

undoc provides C-ABI compatible bindings for integration with C# and .NET applications.
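
As background, a C-ABI function that returns a string usually comes paired with a function that frees it, because the caller cannot release Rust-allocated memory itself. The Rust sketch below shows that general pattern only; `demo_convert` and `demo_free_string` are hypothetical names and are not undoc's actual exported symbols.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Hypothetical export (NOT an undoc symbol): upper-cases a C string and
/// returns a newly allocated C string that the caller must free.
/// A real export would also carry a `no_mangle` attribute so the symbol
/// keeps its name in the compiled library.
///
/// # Safety
/// `input` must be null or point to a valid NUL-terminated string.
pub unsafe extern "C" fn demo_convert(input: *const c_char) -> *mut c_char {
    if input.is_null() {
        return std::ptr::null_mut();
    }
    let text = unsafe { CStr::from_ptr(input) }
        .to_string_lossy()
        .to_uppercase();
    match CString::new(text) {
        // Ownership of the buffer passes to the caller here.
        Ok(s) => s.into_raw(),
        Err(_) => std::ptr::null_mut(),
    }
}

/// Frees a string previously returned by `demo_convert`.
///
/// # Safety
/// `ptr` must be null or a pointer obtained from `demo_convert`.
pub unsafe extern "C" fn demo_free_string(ptr: *mut c_char) {
    if !ptr.is_null() {
        // Reclaim ownership so Rust drops the allocation.
        drop(unsafe { CString::from_raw(ptr) });
    }
}
```

A managed wrapper such as the C# one described below typically hides this pairing by copying the native string into a managed string and then calling the free function.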

Getting the Native Library

Download from GitHub Releases:

| Platform | Library File |
| --- | --- |
| Windows x64 | undoc.dll |
| Linux x64 | libundoc.so |
| macOS | libundoc.dylib |

Or build from source:

cargo build --release --features ffi

C# Wrapper Usage

using System.IO;
using Undoc;

// Parse and convert to Markdown
string markdown = UndocNative.ToMarkdown("document.docx");

// Parse and convert to plain text
string text = UndocNative.ToText("document.docx");

// Parse and convert to JSON
string json = UndocNative.ToJson("document.docx");

// From byte array
byte[] data = File.ReadAllBytes("document.docx");
string markdownFromBytes = UndocNative.ToMarkdownFromBytes(data);

See bindings/csharp/Undoc.cs for the complete wrapper implementation.


Output Formats

Markdown

Structured Markdown with preserved formatting:

  • Headings: Document headings → #, ##, ###
  • Lists: Ordered and unordered with nesting
  • Tables: Markdown tables (with HTML/ASCII fallback for complex layouts)
  • Inline styles: Bold (**), italic (*), underline, strikethrough
  • Hyperlinks: Preserved as Markdown links
  • Images: Reference-style image links
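
To show what the Markdown table form looks like, here is a minimal standalone Rust sketch that renders rows of cells as a pipe table. `render_row` and `to_markdown_table` are hypothetical helpers written for this illustration; they are not undoc's renderer and ignore merged cells and the HTML/ASCII fallbacks.

```rust
/// Renders one row, padding missing cells and escaping literal pipes.
fn render_row(cells: &[String], width: usize) -> String {
    let mut line = String::from("|");
    for i in 0..width {
        let cell = cells.get(i).map(String::as_str).unwrap_or("");
        line.push(' ');
        line.push_str(&cell.replace('|', "\\|"));
        line.push_str(" |");
    }
    line
}

/// Renders rows as a Markdown pipe table; the first row is the header.
/// Returns `None` for an empty table.
fn to_markdown_table(rows: &[Vec<String>]) -> Option<String> {
    let header = rows.first()?;
    let width = header.len();
    let mut out = vec![render_row(header, width)];
    // Separator row that marks the line above it as the header.
    out.push(format!("|{}", " --- |".repeat(width)));
    out.extend(rows[1..].iter().map(|row| render_row(row, width)));
    Some(out.join("\n"))
}
```

Tables with merged or multi-line cells do not fit this shape, which is the kind of layout the HTML and ASCII fallbacks are meant for.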

Plain Text

Pure text content without formatting markers.

JSON

Complete document structure with metadata:

{
  "metadata": {
    "title": "Document Title",
    "author": "Author Name",
    "created": "2025-01-15T10:30:00Z",
    "modified": "2025-01-20T14:45:00Z"
  },
  "sections": [...],
  "resources": [...]
}

Supported Formats

| Format | Extension | Status |
| --- | --- | --- |
| Word | .docx | Supported |
| Excel | .xlsx | Supported |
| PowerPoint | .pptx | Supported |

Feature Flags

| Feature | Description | Default |
| --- | --- | --- |
| ffi | C-ABI foreign function interface | No |

# Cargo.toml - enable FFI
[dependencies]
undoc = { version = "0.1", features = ["ffi"] }

Performance

  • Parallel section/sheet/slide processing with Rayon
  • Efficient XML parsing with quick-xml
  • Memory-efficient handling of large documents
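
The idea behind per-section parallelism can be sketched in a few lines of Rust. This example uses the standard library's scoped threads in place of Rayon so that it has no dependencies; Rayon schedules work on a work-stealing thread pool instead of spawning one thread per item. `process_section` is a hypothetical placeholder for real per-section work, not undoc code.

```rust
use std::thread;

/// Hypothetical per-section work: here, just a word count.
fn process_section(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Processes each section on its own scoped thread and returns the results
/// in the same order as the input.
fn process_sections_parallel(sections: &[String]) -> Vec<usize> {
    thread::scope(|scope| {
        // Spawn first, so all sections run concurrently.
        let handles: Vec<_> = sections
            .iter()
            .map(|section| scope.spawn(move || process_section(section)))
            .collect();
        // Joining in spawn order keeps the output aligned with the input.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("section worker panicked"))
            .collect()
    })
}
```

Sections, sheets, and slides suit this pattern because each can be processed without reading the others.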

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Related Projects

  • unhwp - Korean HWP document extraction