| Crates.io | utf16-simd |
| lib.rs | utf16-simd |
| version | 0.1.0 |
| created_at | 2025-12-30 23:53:53.919017+00 |
| updated_at | 2025-12-30 23:53:53.919017+00 |
| description | SIMD-accelerated UTF-16/UTF-16LE -> UTF-8 escaping (JSON/XML) |
| homepage | https://github.com/omerbenamram/evtx |
| repository | https://github.com/omerbenamram/evtx |
| max_upload_size | |
| id | 2013648 |
| size | 111,148 |
utf16-simd

SIMD-accelerated UTF-16/UTF-16LE → UTF-8 conversion with JSON and XML escaping.
This crate is designed for workloads like Windows Event Log (EVTX) parsing where:
It provides:
- JSON escaping (`\"`, `\\`, `\n`, …, and control chars as `\u00XX`)
- XML escaping (`&`, `<`, `>`, plus `"`/`'` for attributes)

The hot path is modeled after the SIMD string-escaping approach popularized by sonic-rs, adapted from u8 lanes to UTF-16 code-unit (u16) lanes. It is cross-validated against the Zig EVTX implementation (zig-evtx), which uses a similar ASCII-first strategy.
use utf16_simd::Scratch;
let utf16le: &[u8] = b"H\0i\0 \0\"\0!\0"; // "Hi \"!"
let num_units = utf16le.len() / 2;
let mut scratch = Scratch::new();
let out = scratch.escape_json_utf16le(utf16le, num_units, true);
assert_eq!(out, br#""Hi \"!""#);
&[u16] (wide strings) to JSON
use utf16_simd::Scratch;
let wide: &[u16] = &[b'H' as u16, b'i' as u16, b' ' as u16, 0xD83D, 0xDE00]; // "Hi 😀"
let mut scratch = Scratch::new();
let out = scratch.escape_json_utf16(wide, true);
assert_eq!(std::str::from_utf8(out).unwrap(), "\"Hi 😀\"");
Notes:
- `&[u16]` works naturally with many “wide string” crates via deref coercions.
- On Windows, the wchar crate’s `wchar_t` is `u16`, so `wch!()`/`wchz!()` output can be passed directly.

wchar / widestring
// Windows-only: `wchar_t` is UTF-16 (u16).
use wchar::wchz;
use utf16_simd::Scratch;
let wide: &[wchar::wchar_t] = wchz!("Hello"); // NUL-terminated
let wide = &wide[..wide.len() - 1]; // drop trailing NUL
let mut scratch = Scratch::new();
let json = scratch.escape_json_utf16(wide, true);
assert_eq!(std::str::from_utf8(json).unwrap(), "\"Hello\"");
use widestring::U16CString;
use utf16_simd::Scratch;
let s = U16CString::from_str("Hello").unwrap();
let mut scratch = Scratch::new();
let json = scratch.escape_json_utf16(s.as_slice(), true);
assert_eq!(std::str::from_utf8(json).unwrap(), "\"Hello\"");
std::io::Write
use std::io::Write;
use utf16_simd::Scratch;
let utf16le: &[u8] = b"A\0\n\0B\0";
let units = utf16le.len() / 2;
let mut scratch = Scratch::new();
let mut out = Vec::<u8>::new();
scratch.write_json_utf16le_to(&mut out, utf16le, units, true).unwrap();
assert_eq!(out, b"\"A\\nB\"");
| Input | API | Notes |
|---|---|---|
| `&[u8]` UTF-16LE bytes | `escape_*_utf16le(...)` | Safe for unaligned EVTX buffers. Provide `num_units`. A trailing odd byte is ignored. |
| `&[u16]` UTF-16 code units | `escape_*_utf16(...)` | Endianness-independent at the API level. Slice length is the unit count. |
Wide-string crates
| Crate/type | Works with | Notes |
|---|---|---|
| `widestring::U16Str` / `U16CString` | `&[u16]` APIs | Typically deref-coerce to `&[u16]`. |
| `wchar::wch!()`/`wchz!()` | `&[u16]` APIs on Windows | On non-Windows platforms `wchar_t` is usually `u32` → not UTF‑16. |
| Output | API | Allocations |
|---|---|---|
| `&mut [MaybeUninit<u8>]` | `escape_*` | None (caller-provided buffer) |
| borrowed `&[u8]` | `Scratch::{escape_*}` | amortized (scratch grows, then reuses) |
| `Vec<u8>` | `escape_*_into` | amortized (reuses vector allocation) |
| `io::Write` | `Scratch::{write_*_to}` | amortized (reuses scratch) |
| Target | SIMD | Selection |
|---|---|---|
| x86_64 | SSE2 | runtime feature detect (`is_x86_feature_detected!("sse2")`) |
| aarch64 | NEON | always available (baseline) |
| other | none | scalar fallback |
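
The selection rules in the table can be sketched with the standard library alone. `simd_backend` is a hypothetical helper written for illustration, not part of the crate's API; it only reports which row of the table would apply on the current machine.

```rust
/// Hypothetical helper (not part of utf16-simd): names the backend that the
/// selection rules in the table would pick on the current machine.
#[allow(unreachable_code)]
fn simd_backend() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        // Runtime probe. SSE2 belongs to the x86_64 baseline, so this is
        // expected to succeed on essentially every x86_64 CPU.
        if std::arch::is_x86_feature_detected!("sse2") {
            return "sse2";
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        // NEON is a baseline feature on aarch64, so no runtime probe is needed.
        return "neon";
    }
    // Every other target takes the scalar fallback.
    "scalar"
}
```

The `cfg` attributes remove the branches that do not apply at compile time, so only the x86_64 build pays for a runtime check.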
| Feature | Default | What it does |
|---|---|---|
| `sonic-writeext` | off | (optional) exposes `write_*_utf16le()` functions that write directly into `sonic-rs::writer::WriteExt` spare capacity |

- Escapes `"` and `\`.
- 0x08 -> \b
- 0x0C -> \f
- 0x0A -> \n
- 0x0D -> \r
- 0x09 -> \t
- Other control characters in 0x00..=0x1F become \u00XX (uppercase hex).
- Everything else is written as UTF-8 bytes (no \u{...} escaping).

ASCII table (JSON):
+--------------------+----------------+
| input              | output         |
+--------------------+----------------+
| "                  | \"             |
| \                  | \\             |
| 0x08               | \b             |
| 0x0C               | \f             |
| 0x0A               | \n             |
| 0x0D               | \r             |
| 0x09               | \t             |
| 0x00..=0x1F (rest) | \u00XX         |
| everything else    | UTF-8 bytes    |
+--------------------+----------------+
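
As a rough scalar model of the JSON table, the sketch below escapes ASCII-only input one code unit at a time. `push_json_ascii` and `escape_json_ascii` are hypothetical names used for illustration; the crate's own implementation is SIMD-based and may be organized differently.

```rust
/// Hypothetical helper: appends the JSON-escaped form of one ASCII code unit
/// (`u <= 0x7F`) to `out`, following the table above.
fn push_json_ascii(out: &mut Vec<u8>, u: u16) {
    debug_assert!(u <= 0x7F, "this sketch handles ASCII only");
    const HEX: &[u8; 16] = b"0123456789ABCDEF";
    match u as u8 {
        b'"' => out.extend_from_slice(b"\\\""),
        b'\\' => out.extend_from_slice(b"\\\\"),
        0x08 => out.extend_from_slice(b"\\b"),
        0x0C => out.extend_from_slice(b"\\f"),
        b'\n' => out.extend_from_slice(b"\\n"),
        b'\r' => out.extend_from_slice(b"\\r"),
        b'\t' => out.extend_from_slice(b"\\t"),
        // Remaining control characters: \u00XX with uppercase hex digits.
        c @ 0x00..=0x1F => {
            out.extend_from_slice(b"\\u00");
            out.push(HEX[(c >> 4) as usize]);
            out.push(HEX[(c & 0x0F) as usize]);
        }
        // Any other ASCII byte is already valid UTF-8.
        c => out.push(c),
    }
}

/// Hypothetical helper: escapes a whole ASCII-only slice of code units.
fn escape_json_ascii(units: &[u16]) -> Vec<u8> {
    // Each unit expands to at most 6 bytes (the \u00XX case).
    let mut out = Vec::with_capacity(units.len() * 6);
    for &u in units {
        push_json_ascii(&mut out, u);
    }
    out
}
```

The longest expansion here is the 6-byte `\u00XX` form, which is where a per-unit worst case of 6 bytes comes from.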
- Always escapes: `&`, `<`, `>`
- With `in_attribute=true`, also escapes: `"` and `'`

ASCII table (XML):
+-------+------------+
| input | output     |
+-------+------------+
| &     | &amp;      |
| <     | &lt;       |
| >     | &gt;       |
| "     | &quot; (*) |
| '     | &apos; (*) |
+-------+------------+
(* only when in_attribute = true)
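
The XML rules can be modeled the same way. The sketch below assumes the five standard XML predefined entities (`&amp;`, `&lt;`, `&gt;`, `&quot;`, `&apos;`) and ASCII-only input; `escape_xml_ascii` is a hypothetical name, not a function exported by the crate.

```rust
/// Hypothetical helper: XML-escapes ASCII-only code units. Quotes are only
/// escaped when `in_attribute` is true.
fn escape_xml_ascii(units: &[u16], in_attribute: bool) -> Vec<u8> {
    // The longest entities (&quot; and &apos;) are 6 bytes each.
    let mut out = Vec::with_capacity(units.len() * 6);
    for &u in units {
        debug_assert!(u <= 0x7F, "this sketch handles ASCII only");
        match (u as u8, in_attribute) {
            (b'&', _) => out.extend_from_slice(b"&amp;"),
            (b'<', _) => out.extend_from_slice(b"&lt;"),
            (b'>', _) => out.extend_from_slice(b"&gt;"),
            (b'"', true) => out.extend_from_slice(b"&quot;"),
            (b'\'', true) => out.extend_from_slice(b"&apos;"),
            // Everything else, including quotes in text content, passes through.
            (c, _) => out.push(c),
        }
    }
    out
}
```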
The crate exposes:
use utf16_simd::max_escaped_len;
let units = 10;
let cap = max_escaped_len(units, true);
assert!(cap >= units * 6 + 2);
It is a safe upper bound for all modes:
- `\u00XX` is 6 bytes.
- `&quot;` / `&apos;` are 6 bytes.
- +2 for surrounding quotes.

u16 code units
EVTX strings are stored as little-endian bytes:
bytes: [lo0 hi0 lo1 hi1 lo2 hi2 ...]
units: u0 u1 u2
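
A minimal sketch of this byte-to-unit mapping, using only the standard library. `utf16le_units` is a hypothetical helper; it reads each pair with `u16::from_le_bytes`, so the input buffer needs no particular alignment, and a trailing odd byte is dropped.

```rust
/// Hypothetical helper: yields up to `num_units` little-endian u16 code units
/// from a byte buffer that may be unaligned. A trailing odd byte is ignored.
fn utf16le_units(bytes: &[u8], num_units: usize) -> impl Iterator<Item = u16> + '_ {
    bytes
        .chunks_exact(2) // [lo, hi] pairs; a leftover single byte is skipped
        .take(num_units)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
}
```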
For ASCII in UTF-16LE:
- hi == 0
- lo <= 0x7F
- the output byte is lo (already valid UTF‑8)

The SIMD path processes 8 UTF‑16 code units at a time (128 bits):
128-bit register as 8 lanes of u16:
+----+----+----+----+----+----+----+----+
| u0 | u1 | u2 | u3 | u4 | u5 | u6 | u7 |
+----+----+----+----+----+----+----+----+
For JSON:
- `(u & 0xFF80) == 0` ⇒ `u <= 0x7F` (ASCII)
- `u == '"'`
- `u == '\\'`
- `u <= 0x1F` (control)

If all 8 lanes are ASCII and none need escaping:
- narrow 8×u16 to 8×u8

ASCII “bit test” intuition:
u <= 0x7F <=> u & 0xFF80 == 0
u <= 0x1F <=> u & 0xFFE0 == 0
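
The lane tests above can be modeled in scalar code. This is a sketch of the decision only, not the crate's SSE2/NEON implementation; `try_fast_block` is a hypothetical name. A SIMD version would compute the same four conditions as lane masks and OR them together instead of looping.

```rust
/// Scalar model of the 8-lane check (hypothetical helper): returns the 8
/// narrowed bytes if every lane is ASCII and needs no JSON escaping, or
/// `None` if the block has to take the per-lane slow path.
fn try_fast_block(lanes: &[u16; 8]) -> Option<[u8; 8]> {
    let mut needs_slow = false;
    for &u in lanes {
        let non_ascii = (u & 0xFF80) != 0; // u > 0x7F
        let control = (u & 0xFFE0) == 0; // u <= 0x1F
        let quote = u == b'"' as u16;
        let backslash = u == b'\\' as u16;
        needs_slow |= non_ascii | control | quote | backslash;
    }
    if needs_slow {
        return None;
    }
    // Narrow 8 x u16 to 8 x u8: the high byte of every lane is zero here.
    Some((*lanes).map(|u| u as u8))
}
```

When the function returns `Some`, the eight bytes can be stored to the output as they are, because plain ASCII is already valid UTF-8.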
If any lane is non-ASCII or needs escaping, we fall back to a tight per-lane loop that categorizes each code unit:
ASCII? -> maybe escape, else write byte
< 0x800 -> write 2-byte UTF-8
surrogate? -> decode pair, write 4-byte UTF-8 (else skip)
else -> write 3-byte UTF-8
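
The categorization above corresponds to a standard UTF-16 to UTF-8 transcoding loop. The sketch below leaves out escaping and only shows the size classes and surrogate handling; `transcode_units` is a hypothetical name, and skipping unpaired surrogates is taken from the "(else skip)" note above.

```rust
/// Hypothetical helper: transcodes UTF-16 code units to UTF-8 without any
/// escaping. Unpaired surrogates are skipped.
fn transcode_units(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::with_capacity(units.len() * 3);
    let mut i = 0;
    while i < units.len() {
        let u = units[i] as u32;
        i += 1;
        match u {
            // ASCII: one byte.
            0x0000..=0x007F => out.push(u as u8),
            // Two-byte UTF-8.
            0x0080..=0x07FF => {
                out.push(0xC0 | (u >> 6) as u8);
                out.push(0x80 | (u & 0x3F) as u8);
            }
            // High surrogate: peek at the next unit to complete the pair.
            0xD800..=0xDBFF => match units.get(i) {
                Some(&lo) if (0xDC00..=0xDFFF).contains(&lo) => {
                    i += 1; // consume the low surrogate as well
                    let cp = 0x10000 + ((u - 0xD800) << 10) + (lo as u32 - 0xDC00);
                    out.push(0xF0 | (cp >> 18) as u8);
                    out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
                    out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
                    out.push(0x80 | (cp & 0x3F) as u8);
                }
                _ => {} // unpaired high surrogate: skip
            },
            // Stray low surrogate: skip.
            0xDC00..=0xDFFF => {}
            // Rest of the BMP: three-byte UTF-8.
            _ => {
                out.push(0xE0 | (u >> 12) as u8);
                out.push(0x80 | ((u >> 6) & 0x3F) as u8);
                out.push(0x80 | (u & 0x3F) as u8);
            }
        }
    }
    out
}
```

Because the loop indexes into the whole slice instead of a fixed 8-unit window, a high surrogate in the last lane of one block can still see its low surrogate at the start of the next.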
Surrogate pairs at the SIMD block boundary are handled by peeking the next unit:
block N ends with: [ ... 0xD83D ] (high surrogate)
block N+1 begins: [ 0xDE00 ... ] (low surrogate)
=> consume an "extra" unit from block N+1 and emit 4 UTF-8 bytes.
In EVTX, most strings are:
- ASCII
- free of `"`/`\`/controls

So the fast path often becomes a tight “load + mask + store” loop with very few branches.
- sonic-rs (Rust): SIMD string escaping design (adapted for UTF‑16 code units). See sonic-rs’ format_string implementation in src/util/string.rs.
- zig-evtx (Zig): UTF‑16LE→UTF‑8 fused conversion + ASCII-first approach and edge-case behavior.

Links:
- https://crates.io/crates/sonic-rs
- https://docs.rs/sonic-rs
- https://github.com/cloudwego/sonic-rs
- https://github.com/omerbenamram/EVTX
- https://github.com/omerbenamram/zig-evtx

There are cargo-fuzz harnesses under utf16-simd/fuzz/ that:
Example:
# from the repo root
cargo +nightly install cargo-fuzz
cargo +nightly fuzz run json_utf16le --manifest-path utf16-simd/fuzz/Cargo.toml