clip-sanitize

Crates.io: clip-sanitize
lib.rs: clip-sanitize
version: 0.2.1
created_at: 2026-01-12 14:06:41.077672+00
updated_at: 2026-01-12 14:06:41.077672+00
description: Meta-library for robust text sanitization, repair, and normalization.
repository: https://github.com/5ocworkshop
id: 2037799
size: 53,074
(5ocworkshop)

clip-sanitize

The "Universal Adapter" for Text.

clip-sanitize is a robust Rust library designed to clean, repair, and normalize text when moving between disparate systems (e.g., Windows CP1252 to Linux UTF-8). It acts as a hygiene pipeline to prevent "paste-jacking", fix character encoding errors (Mojibake), and standardize line endings.

Features

  • Mojibake Repair: Automatically detects and fixes garbled text (e.g., Ã© -> é) caused by double-encoding (UTF-8 bytes misread as Windows-1252 and then re-encoded as UTF-8).
  • Hygiene Scrubbing:
    • Smart Quotes: Normalizes curly quotes (“, ”) to straight quotes (") for code compatibility.
    • Invisible Stripping: Removes zero-width spaces, byte-order marks (BOM), and other invisible characters that can cause syntax errors or hide malicious commands.
    • NBSP Normalization: Converts non-breaking spaces to standard spaces.
  • Smart Normalization:
    • Line Endings: Converts between CRLF (Windows) and LF (Linux) based on the target environment.
    • Unicode: Normalizes text to a consistent Unicode Normalization Form (NFC).
  • Performance:
    • Cow (Clone-on-Write): Zero-allocation path for text that is already clean.
    • Streaming-Friendly: Designed to work on buffers.
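
To make the Mojibake Repair feature more concrete, here is a minimal std-only sketch of the underlying idea. It is not the crate's implementation: the function names `cp1252_byte` and `repair_mojibake` are hypothetical, and the Windows-1252 table is deliberately partial. The trick is to map each garbled character back to the Windows-1252 byte it was decoded from, then check whether those bytes form valid UTF-8.

```rust
/// Map a char back to the single Windows-1252 byte it would have been decoded from.
/// Assumption: only a handful of the 0x80..=0x9F entries are listed; a real
/// implementation would carry the full table.
fn cp1252_byte(c: char) -> Option<u8> {
    match c {
        // ASCII and the 0xA0..=0xFF range coincide with the Unicode code point.
        '\u{0000}'..='\u{007F}' | '\u{00A0}'..='\u{00FF}' => Some(c as u8),
        '\u{20AC}' => Some(0x80), // euro sign
        '\u{2018}' => Some(0x91), // left single quote
        '\u{2019}' => Some(0x92), // right single quote
        '\u{201C}' => Some(0x93), // left double quote
        '\u{201D}' => Some(0x94), // right double quote
        '\u{2122}' => Some(0x99), // trade mark sign
        _ => None,
    }
}

/// Try to undo one round of "UTF-8 read as Windows-1252" double-encoding.
/// Returns None when the input does not look like Mojibake.
fn repair_mojibake(garbled: &str) -> Option<String> {
    if garbled.is_ascii() {
        return None; // pure ASCII cannot be double-encoded
    }
    // Re-encode every char as its Windows-1252 byte; bail out if any char has none.
    let bytes: Option<Vec<u8>> = garbled.chars().map(cp1252_byte).collect();
    // If the recovered bytes are valid UTF-8, they are the original text.
    String::from_utf8(bytes?).ok()
}

fn main() {
    // "cafÃ©" is "café" after its UTF-8 bytes were read as Windows-1252.
    let fixed = repair_mojibake("caf\u{00C3}\u{00A9}");
    assert_eq!(fixed.as_deref(), Some("caf\u{00E9}"));

    // Text that is already correct is rejected: a lone 0xE9 byte is not valid UTF-8.
    assert_eq!(repair_mojibake("caf\u{00E9}"), None);
}
```

This round-trip check is a heuristic: clean text that contains accented characters almost never re-encodes to valid UTF-8, which is what lets the sketch leave it alone.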

Installation

Add this to your Cargo.toml:

[dependencies]
clip-sanitize = "0.2.1"

Usage

Basic Usage

use clip_sanitize::{Sanitizer, FlowDirection};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Configured for moving text from Linux to Windows
    let sanitizer = Sanitizer::new(FlowDirection::LinuxToWindows);
    
    let input = b"Hello\nWorld";
    let (cleaned, report) = sanitizer.process(input)?;
    
    // Output is now CRLF: "Hello\r\nWorld"
    assert_eq!(cleaned, &b"Hello\r\nWorld"[..]);
    
    println!("Original Encoding: {}", report.original_encoding);
    Ok(())
}
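
For readers curious how an LF-to-CRLF conversion can combine with the zero-allocation path listed under Features, the following std-only sketch shows one possible approach. It does not reproduce the crate's internals; `to_crlf` and `to_lf` are hypothetical helper names introduced only for illustration.

```rust
use std::borrow::Cow;

/// Convert bare LF to CRLF. Borrows the input when no change is needed.
fn to_crlf(input: &[u8]) -> Cow<'_, [u8]> {
    // Fast path: look for any LF that is not already preceded by CR.
    let needs_fix = input
        .iter()
        .enumerate()
        .any(|(i, &b)| b == b'\n' && (i == 0 || input[i - 1] != b'\r'));
    if !needs_fix {
        return Cow::Borrowed(input);
    }
    let mut out = Vec::with_capacity(input.len() + 8);
    let mut prev = 0u8;
    for &b in input {
        if b == b'\n' && prev != b'\r' {
            out.push(b'\r'); // insert the missing CR
        }
        out.push(b);
        prev = b;
    }
    Cow::Owned(out)
}

/// Convert CRLF to LF. Lone CR bytes are left untouched.
fn to_lf(input: &[u8]) -> Cow<'_, [u8]> {
    if !input.windows(2).any(|w| w == b"\r\n") {
        return Cow::Borrowed(input);
    }
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        // Skip a CR only when an LF follows it directly.
        if input[i] == b'\r' && input.get(i + 1) == Some(&b'\n') {
            i += 1;
            continue;
        }
        out.push(input[i]);
        i += 1;
    }
    Cow::Owned(out)
}

fn main() {
    assert_eq!(to_crlf(b"Hello\nWorld").into_owned(), b"Hello\r\nWorld".to_vec());
    // Already-normalized input is returned without allocating.
    assert!(matches!(to_crlf(b"Hello\r\nWorld"), Cow::Borrowed(_)));
    assert_eq!(to_lf(b"a\r\nb\r\n").into_owned(), b"a\nb\n".to_vec());
}
```

Returning `Cow` lets callers treat both cases uniformly while paying for a new buffer only when the text actually changes.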

Advanced Configuration

use clip_sanitize::{Sanitizer, FlowDirection, HygieneOptions, LineEnding};

let options = HygieneOptions {
    replace_nbsps: true,
    fix_smart_quotes: false, // Keep curly quotes
    strip_invisibles: true,
};

let sanitizer = Sanitizer::new(FlowDirection::Custom)
    .repair(true)             // Fix Mojibake
    .hygiene(options)         // Custom hygiene
    .line_ending(LineEnding::Lf); // Force Linux line endings
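
As a rough illustration of what the three hygiene switches might do to a string, here is a std-only sketch. `ScrubOptions`, `Action`, `classify`, and `scrub` are hypothetical stand-ins, not the crate's types, and the set of "invisible" characters is assumed to be just the zero-width space (U+200B) and the byte-order mark (U+FEFF).

```rust
use std::borrow::Cow;

/// Stand-in for the crate's options; field names mirror the example above.
#[derive(Clone, Copy)]
struct ScrubOptions {
    replace_nbsps: bool,
    fix_smart_quotes: bool,
    strip_invisibles: bool,
}

enum Action {
    Keep,
    Replace(char),
    Drop,
}

/// Decide what to do with a single character under the given options.
fn classify(c: char, o: &ScrubOptions) -> Action {
    match c {
        '\u{00A0}' if o.replace_nbsps => Action::Replace(' '),
        '\u{201C}' | '\u{201D}' if o.fix_smart_quotes => Action::Replace('"'),
        '\u{200B}' | '\u{FEFF}' if o.strip_invisibles => Action::Drop,
        _ => Action::Keep,
    }
}

/// Apply the hygiene rules, borrowing the input when it is already clean.
fn scrub<'a>(input: &'a str, o: &ScrubOptions) -> Cow<'a, str> {
    if input.chars().all(|c| matches!(classify(c, o), Action::Keep)) {
        return Cow::Borrowed(input);
    }
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match classify(c, o) {
            Action::Keep => out.push(c),
            Action::Replace(r) => out.push(r),
            Action::Drop => {}
        }
    }
    Cow::Owned(out)
}

fn main() {
    // Same switches as the configuration above: curly quotes are kept.
    let keep_quotes = ScrubOptions {
        replace_nbsps: true,
        fix_smart_quotes: false,
        strip_invisibles: true,
    };
    let dirty = "\u{FEFF}let\u{00A0}s = \u{201C}hi\u{201D};";
    assert_eq!(scrub(dirty, &keep_quotes), "let s = \u{201C}hi\u{201D};");

    // With every switch on, the quotes become straight as well.
    let full = ScrubOptions { fix_smart_quotes: true, ..keep_quotes };
    assert_eq!(scrub(dirty, &full), "let s = \"hi\";");
}
```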

Flow Directions

  • FlowDirection::LinuxToWindows: Enforces CRLF, enables full hygiene.
  • FlowDirection::WindowsToLinux: Enforces LF, enables full hygiene.
  • FlowDirection::Custom: Uses default settings (repair + hygiene + LF) unless overridden.

License

MIT
