rehuman

Crates.io: rehuman
lib.rs: rehuman
version: 0.1.0
created_at: 2025-10-27 12:26:49.802694+00
updated_at: 2025-10-27 12:26:49.802694+00
description: Unicode-safe text cleaning & typographic normalization for Rust
homepage: https://github.com/pszemraj/rehuman
repository: https://github.com/pszemraj/rehuman
id: 1902832
size: 142,550
author: Peter (pszemraj)
documentation: https://docs.rs/rehuman

Unicode-safe text cleaning & normalization for Rust.

Strip invisible characters, normalize typography, and enforce consistent formatting. Ideal for text sourced from web scraping, user input, or LLMs.

This crate is a Rust rewrite and expansion of humanize-ai-lib by Nordth.

Why rehuman?

Untrusted text often contains:

  • Zero-width spaces and control characters that break parsers
  • Mixed quote styles that defeat string matching
  • Non-breaking spaces that masquerade as regular spaces
  • Inconsistent Unicode normalization that produces duplicate keys

rehuman fixes this in a single pass with predictable, measurable output.
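
To make the failure modes above concrete, here is a small sketch that uses only the Rust standard library (it does not call rehuman). It shows that strings which look identical on screen compare as different and produce separate map keys. The helper name `distinct_keys` is hypothetical and exists only for this illustration.

```rust
use std::collections::HashMap;

/// Count how many distinct map keys a list of strings produces.
fn distinct_keys(inputs: &[&str]) -> usize {
    let mut seen: HashMap<&str, usize> = HashMap::new();
    for s in inputs {
        *seen.entry(*s).or_insert(0) += 1;
    }
    seen.len()
}

fn main() {
    // A zero-width space (U+200B) is invisible but still part of the string.
    assert_ne!("user\u{200B}name", "username");

    // A non-breaking space (U+00A0) renders like a space but is a different code point.
    assert_ne!("hello\u{00A0}world", "hello world");

    // Curly quotes (U+201C, U+201D) do not match a search for straight quotes.
    assert!(!"say \u{201C}hi\u{201D}".contains("\"hi\""));

    // Precomposed U+00E9 versus 'e' followed by combining acute U+0301:
    // both render the same, yet they hash to two different keys.
    let keys = ["caf\u{00E9}", "cafe\u{0301}"];
    println!("distinct keys: {}", distinct_keys(&keys)); // prints "distinct keys: 2"
}
```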

Installation

Library crate: add rehuman to your project with cargo add rehuman or edit Cargo.toml:

[dependencies]
rehuman = "0.1.0" # replace with the latest published version

CLI binaries: install the published release (installs both rehuman and ishuman):

cargo install rehuman

Build from Source

To build the latest development version, clone this repo and run cargo install --path .:

git clone https://github.com/pszemraj/rehuman.git
cd rehuman
cargo install --path .

Binaries are installed to ~/.cargo/bin by default.[1]

Quick Start

[!WARNING] This is an early release focused on correctness. Performance optimizations are in progress. Use --stream or StreamCleaner to stream large files.

Library

use rehuman::{clean, humanize};

let cleaned = clean("Hello\u{200B}there"); // -> "Hello there"
let humanized = humanize("“Quote”—and…more"); // -> "\"Quote\"-and...more"

[!IMPORTANT] By default, rehuman::clean removes emoji to guarantee ASCII-only output.[2]

use rehuman::clean;

// Default behavior removes emoji
let cleaned = clean("Thanks 👍"); // -> "Thanks "

To keep emoji, construct a cleaner with CleaningOptions::builder().keyboard_only(false) (or pass --keep-emoji on the CLI).

CLI

rehuman reads the input and emits cleaned text to STDOUT; your source file stays untouched unless you pass --inplace:

# Stream-clean to STDOUT and capture stats
rehuman notes.txt --stream --stats > notes.cleaned.txt

# Overwrite the original file in place
rehuman notes.txt --inplace

[!TIP] Both CLI tools act as filters, so you can drop them into pipelines:

cat notes.txt | rehuman --stream | tee notes.cleaned.txt
curl https://example.com/raw.txt | rehuman --stream --stats-json >/tmp/clean.txt
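
For readers curious what "acting as a filter" involves, the following is a minimal sketch of a line-by-line stream filter built on the standard library alone. It is not rehuman's implementation: `clean_line` is a stand-in cleaner that handles only two characters, and `filter_stream` is a hypothetical name. The sketch holds one line in memory at a time and reports its counter on STDERR so that STDOUT carries only cleaned text.

```rust
use std::io::{self, BufRead, Write};

/// Stand-in cleaner (an assumption for this sketch): drops zero-width spaces
/// and turns non-breaking spaces into plain ASCII spaces.
fn clean_line(line: &str) -> String {
    line.chars()
        .filter(|&c| c != '\u{200B}')
        .map(|c| if c == '\u{00A0}' { ' ' } else { c })
        .collect()
}

/// Read `input` one line at a time, clean it, and write it to `output`.
/// Returns the number of lines that were modified.
fn filter_stream<R: BufRead, W: Write>(mut input: R, mut output: W) -> io::Result<u64> {
    let mut changed_lines = 0;
    let mut buf = String::new();
    loop {
        buf.clear();
        // read_line keeps the trailing newline, so line endings pass through unchanged.
        if input.read_line(&mut buf)? == 0 {
            break; // end of input
        }
        let cleaned = clean_line(&buf);
        if cleaned != buf {
            changed_lines += 1;
        }
        output.write_all(cleaned.as_bytes())?;
    }
    output.flush()?;
    Ok(changed_lines)
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let stdout = io::stdout();
    let changed = filter_stream(stdin.lock(), stdout.lock())?;
    // Diagnostics go to STDERR so they never mix with the cleaned text.
    eprintln!("lines changed: {changed}");
    Ok(())
}
```

Because the function is generic over `BufRead` and `Write`, the same code can be exercised with in-memory buffers in tests and with locked standard streams in a pipeline.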

Use ishuman when you only need detection:

# Exit status 0 when clean, 1 when changes would be made (no stdout by default)
ishuman notes.txt

# Add --stats or --json to explain what would change
ishuman notes.txt --stats
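
The exit status makes ishuman easy to call from other programs. The sketch below shows one way a Rust program might run it and interpret the result, assuming the binary is on your PATH and a file named notes.txt exists. The `Verdict` enum and `interpret` function are hypothetical helpers; only the meanings of exit codes 0 and 1 come from the description above, and any other code is passed through uninterpreted.

```rust
use std::process::Command;

/// Outcome of a detection run, derived from the process exit code.
#[derive(Debug, PartialEq)]
enum Verdict {
    /// Exit status 0: the input is already clean.
    Clean,
    /// Exit status 1: cleaning would change the input.
    NeedsCleaning,
    /// Anything else (including termination by signal, where no code exists).
    Other(Option<i32>),
}

fn interpret(code: Option<i32>) -> Verdict {
    match code {
        Some(0) => Verdict::Clean,
        Some(1) => Verdict::NeedsCleaning,
        other => Verdict::Other(other),
    }
}

fn main() {
    // Spawning fails with Err if the ishuman binary is not installed.
    match Command::new("ishuman").arg("notes.txt").status() {
        Ok(status) => println!("{:?}", interpret(status.code())),
        Err(err) => eprintln!("could not run ishuman: {err}"),
    }
}
```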

Run rehuman --help or ishuman --help for the full list of flags (emoji policy, line endings, configs, streaming, etc.).

Documentation

More details are available in the docs/ folder of the repository.

Detailed Features

  • Invisible character removal: ZWSP, BOM, bidi isolates, control characters
  • Space normalization: NBSP, figure space, ideographic space → ASCII space
  • Typography fixes: curly quotes → ASCII, em/en dash → hyphen, ellipsis → three dots
  • Unicode normalization: NFC/NFD/NFKC/NFKD (unorm feature, enabled by default)
  • Whitespace controls: optional collapsing, trimming, and line-ending normalization
  • Keyboard-only enforcement: ASCII output with configurable emoji policy
  • Detailed stats: every cleaning run reports what changed
  • CLI tooling: rehuman (cleaner) and ishuman (detector) with streaming & in-place modes
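
As a rough illustration of the first three features in the list above, here is a simplified single-pass mapping written with the standard library only. It is a sketch under stated assumptions, not the crate's code: the function name `normalize_typography` is hypothetical, the character set is a small sample, and Unicode normalization, whitespace controls, and stats are left out.

```rust
/// Simplified single-pass cleanup covering a handful of characters.
fn normalize_typography(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            // Invisible characters: zero-width space and BOM are dropped.
            '\u{200B}' | '\u{FEFF}' => {}
            // NBSP, figure space, ideographic space become an ASCII space.
            '\u{00A0}' | '\u{2007}' | '\u{3000}' => out.push(' '),
            // Curly single and double quotes become straight quotes.
            '\u{2018}' | '\u{2019}' => out.push('\''),
            '\u{201C}' | '\u{201D}' => out.push('"'),
            // En dash and em dash become a hyphen.
            '\u{2013}' | '\u{2014}' => out.push('-'),
            // Ellipsis becomes three dots.
            '\u{2026}' => out.push_str("..."),
            // Other control characters are dropped, except common whitespace.
            c if c.is_control() && !matches!(c, '\n' | '\r' | '\t') => {}
            other => out.push(other),
        }
    }
    out
}

fn main() {
    let input = "\u{201C}Quote\u{201D}\u{2014}and\u{2026}more";
    println!("{}", normalize_typography(input)); // prints "Quote"-and...more
}
```

A `match` over `chars()` keeps the pass linear in the input length and makes each rule visible on its own line, which is why it is a common shape for this kind of cleanup.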

License

MIT

Footnotes

  1. You may need to add ~/.cargo/bin to your PATH if it is not already there; add export PATH="$HOME/.cargo/bin:$PATH" to your shell profile (.bashrc, .zshrc, etc.).

  2. This is a deliberate design choice given the propensity of today's LLMs to spam emoji in their outputs.
