string-offsets

version	0.2.0
created_at	2024-11-13 18:28:34.755454+00
updated_at	2025-03-28 10:13:29.230661+00
description	Converts string offsets between UTF-8 bytes, UTF-16 code units, Unicode code points, and lines.
repository	https://github.com/github/rust-gems
id	1446924
size	66,517

Alexander Neubeck (aneubeck)

string-offsets

Converts string offsets between UTF-8 bytes, UTF-16 code units, Unicode code points, and lines.

Rust strings are UTF-8, but JavaScript has UTF-16 strings, and in Python, strings are sequences of Unicode code points. It's therefore necessary to adjust string offsets when communicating across programming language boundaries. StringOffsets does these adjustments.
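To make the mismatch concrete, here is a minimal sketch that measures one string in all three units. It uses only the Rust standard library, not this crate, and the helper name `lengths` is made up for illustration.

```rust
/// Returns the length of `s` as
/// (UTF-8 bytes, UTF-16 code units, Unicode code points).
///
/// `lengths` is a hypothetical helper for illustration only;
/// it is not part of the string-offsets API.
fn lengths(s: &str) -> (usize, usize, usize) {
    (
        s.len(),                  // UTF-8 bytes: how Rust indexes a &str
        s.encode_utf16().count(), // UTF-16 code units: how JavaScript indexes
        s.chars().count(),        // code points: how Python indexes
    )
}
```

For the lone map emoji `"\u{1F5FA}"` this gives `(4, 2, 1)`, so an offset that points just past it is 4 in Rust, 2 in JavaScript, and 1 in Python.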

Each StringOffsets instance contains offset information for a single string. Building the data structure takes O(n) time and memory, but then most conversions are O(1).
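For contrast, a conversion without any precomputed structure has to rescan the string on every call. The sketch below shows that naive baseline using only the standard library. The function names are hypothetical, and this is not how StringOffsets is implemented.

```rust
/// Converts a UTF-8 byte offset to a UTF-16 code unit offset by rescanning
/// the prefix of `s`. Each call costs O(byte_offset) time.
///
/// Hypothetical baseline for illustration, not part of string-offsets.
/// Panics if `byte_offset` is out of range or not on a char boundary.
fn utf8_to_utf16_naive(s: &str, byte_offset: usize) -> usize {
    s[..byte_offset].encode_utf16().count()
}

/// Converts a UTF-8 byte offset to a code point offset the same way.
/// Also O(byte_offset) per call, with the same panic conditions.
fn utf8_to_char_naive(s: &str, byte_offset: usize) -> usize {
    s[..byte_offset].chars().count()
}
```

Translating many offsets in a long string this way repeats the prefix scan each time, which is the cost that building the index once up front is meant to avoid.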

"UTF-8 Conversions with BitRank" is a blog post explaining the implementation.

Usage

Add this to your Cargo.toml:

[dependencies]
string-offsets = "0.1"

Then:

use string_offsets::StringOffsets;

let s = "☀️hello\n🗺️world\n";
let offsets = StringOffsets::new(s);

// Find offsets where lines begin and end.
assert_eq!(offsets.line_to_utf8s(0), 0..12);  // note: 0-based line numbers

// Translate string offsets between UTF-8 and other encodings.
// This map emoji is 7 UTF-8 bytes...
assert_eq!(&s[12..19], "🗺️");
// ...but only 3 UTF-16 code units...
assert_eq!(offsets.utf8_to_utf16(12), 8);
assert_eq!(offsets.utf8_to_utf16(19), 11);
// ...and only 2 Unicode code points (Rust `char`s).
assert_eq!(offsets.utf8s_to_chars(12..19), 8..10);

See the documentation for more.

Commit count: 243
