string-offsets

Crates.iostring-offsets
lib.rsstring-offsets
version0.1.0
sourcesrc
created_at2024-11-13 18:28:34.755454
updated_at2024-11-13 18:28:34.755454
descriptionConverts string offsets between UTF-8 bytes, UTF-16 code units, Unicode code points, and lines.
homepage
repositoryhttps://github.com/github/rust-gems
max_upload_size
id1446924
size41,703
Greg Orzell (gorzell)

documentation

README

string-offsets

Converts string offsets between UTF-8 bytes, UTF-16 code units, Unicode code points, and lines.

Rust strings are UTF-8, but JavaScript has UTF-16 strings, and in Python, strings are sequences of Unicode code points. It's therefore necessary to adjust string offsets when communicating across programming language boundaries. [StringOffsets] does these adjustments.

Each StringOffsets instance contains offset information for a single string. Building the data structure takes O(n) time and memory, but then most conversions are O(1).

"UTF-8 Conversions with BitRank" is a blog post explaining the implementation.

Usage

Add this to your Cargo.toml:

[dependencies]
string-offsets = "0.1"

Then:

use string_offsets::StringOffsets;

let s = "☀️hello\n🗺️world\n";
let offsets = StringOffsets::new(s);

// Find offsets where lines begin and end.
assert_eq!(offsets.line_to_utf8s(0), 0..12);  // note: 0-based line numbers

// Translate string offsets between UTF-8 and other encodings.
// This map emoji is 7 UTF-8 bytes...
assert_eq!(&s[12..19], "🗺️");
// ...but only 3 UTF-16 code units...
assert_eq!(offsets.utf8_to_utf16(12), 8);
assert_eq!(offsets.utf8_to_utf16(19), 11);
// ...and only 2 Unicode characters.
assert_eq!(offsets.utf8s_to_chars(12..19), 8..10);

See the documentation for more.

Commit count: 162

cargo fmt