string-offsets

Crates.io: string-offsets
lib.rs: string-offsets
version: 0.2.0
created_at: 2024-11-13 18:28:34.755454+00
updated_at: 2025-03-28 10:13:29.230661+00
description: Converts string offsets between UTF-8 bytes, UTF-16 code units, Unicode code points, and lines.
repository: https://github.com/github/rust-gems
id: 1446924
size: 66,517
Alexander Neubeck (aneubeck)


Converts string offsets between UTF-8 bytes, UTF-16 code units, Unicode code points, and lines.

Rust strings are UTF-8, but JavaScript has UTF-16 strings, and in Python, strings are sequences of Unicode code points. It's therefore necessary to adjust string offsets when communicating across programming language boundaries. StringOffsets does these adjustments.
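To make the mismatch concrete, here is a small standard-library-only sketch (it does not use this crate) that measures one string in all three units. The helper name `lengths` is hypothetical and exists only for this illustration.

```rust
/// Returns the length of `s` measured three ways:
/// (UTF-8 bytes, UTF-16 code units, Unicode code points).
fn lengths(s: &str) -> (usize, usize, usize) {
    (
        s.len(),                  // Rust's native unit: UTF-8 bytes
        s.encode_utf16().count(), // what JavaScript's `length` counts
        s.chars().count(),        // what Python's `len()` counts
    )
}
```

For the map emoji "🗺️" (U+1F5FA followed by the variation selector U+FE0F), `lengths` returns `(7, 3, 2)`, so the same position in a string has a different numeric offset in each language.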

Each StringOffsets instance contains offset information for a single string. Building the data structure takes O(n) time and memory, but then most conversions are O(1).
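The trade-off can be illustrated with a deliberately naive sketch. This is not the crate's implementation (the blog post mentioned below describes the actual, far more compact approach); the names `utf8_to_utf16_naive` and `NaiveOffsetTable` are hypothetical. The first function rescans the string on every call, while the table pays an O(n) build cost once and then answers each lookup in O(1).

```rust
/// Naive conversion without any precomputation: every call rescans the
/// prefix, so a single lookup costs O(n).
/// Panics if `byte_offset` is not on a character boundary.
fn utf8_to_utf16_naive(s: &str, byte_offset: usize) -> usize {
    s[..byte_offset].encode_utf16().count()
}

/// Hypothetical lookup table: one entry per byte offset (plus one for the
/// end of the string). Simple, but it uses much more memory than a
/// production structure would.
struct NaiveOffsetTable {
    utf16_at_byte: Vec<usize>,
}

impl NaiveOffsetTable {
    /// Builds the table in O(n) time and memory.
    fn new(s: &str) -> Self {
        let mut utf16_at_byte = Vec::with_capacity(s.len() + 1);
        let mut utf16 = 0;
        for c in s.chars() {
            // Every byte of this character maps to the UTF-16 offset
            // at which the character starts.
            for _ in 0..c.len_utf8() {
                utf16_at_byte.push(utf16);
            }
            utf16 += c.len_utf16();
        }
        // Final entry: the offset at the very end of the string.
        utf16_at_byte.push(utf16);
        Self { utf16_at_byte }
    }

    /// O(1) lookup after the table has been built.
    fn utf8_to_utf16(&self, byte_offset: usize) -> usize {
        self.utf16_at_byte[byte_offset]
    }
}
```

Both approaches agree on every character boundary. The table trades memory for lookup speed, which is the same general trade-off the paragraph above describes.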

"UTF-8 Conversions with BitRank" is a blog post explaining the implementation.

Usage

Add this to your Cargo.toml:

[dependencies]
string-offsets = "0.2"

Then:

use string_offsets::StringOffsets;

let s = "☀️hello\n🗺️world\n";
let offsets = StringOffsets::new(s);

// Find offsets where lines begin and end.
assert_eq!(offsets.line_to_utf8s(0), 0..12);  // note: 0-based line numbers

// Translate string offsets between UTF-8 and other encodings.
// This map emoji is 7 UTF-8 bytes...
assert_eq!(&s[12..19], "🗺️");
// ...but only 3 UTF-16 code units...
assert_eq!(offsets.utf8_to_utf16(12), 8);
assert_eq!(offsets.utf8_to_utf16(19), 11);
// ...and only 2 Unicode code points.
assert_eq!(offsets.utf8s_to_chars(12..19), 8..10);
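The line range asserted above can be cross-checked with a standard-library-only sketch. The helper `line_byte_ranges` is hypothetical and not part of the crate. It assumes lines end at `\n` and that each range includes its terminating newline, which matches the `0..12` result in the example; how the crate treats other line terminators is not covered here.

```rust
use std::ops::Range;

/// Returns the UTF-8 byte range of every line in `s`, using 0-based line
/// numbers as indices. Each range includes the trailing '\n', if any.
fn line_byte_ranges(s: &str) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0;
    // `split_inclusive` keeps the '\n' at the end of each piece.
    for line in s.split_inclusive('\n') {
        let end = start + line.len();
        ranges.push(start..end);
        start = end;
    }
    ranges
}
```

For the example string, this yields `0..12` for line 0 and `12..25` for line 1.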

See the documentation for more.
