Crates.io | string-offsets |
lib.rs | string-offsets |
version | 0.1.0 |
source | src |
created_at | 2024-11-13 18:28:34.755454 |
updated_at | 2024-11-13 18:28:34.755454 |
description | Converts string offsets between UTF-8 bytes, UTF-16 code units, Unicode code points, and lines. |
repository | https://github.com/github/rust-gems |
id | 1446924 |
size | 41,703 |
Converts string offsets between UTF-8 bytes, UTF-16 code units, Unicode code points, and lines.
Rust strings are UTF-8, but JavaScript has UTF-16 strings, and in Python, strings are sequences of
Unicode code points. It's therefore necessary to adjust string offsets when communicating across
programming language boundaries. StringOffsets does these adjustments.
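To see why the adjustment is needed, the following sketch measures the same position in all three units using only the standard library. The `Offsets` struct and `offsets_at` function are hypothetical names introduced for illustration and are not part of this crate; the approach rescans the prefix on every call, so it is O(n) per query.

```rust
/// The same position expressed in three different units.
/// (Hypothetical type for illustration; not part of string-offsets.)
#[derive(Debug, PartialEq)]
struct Offsets {
    utf8: usize,  // bytes, as Rust counts
    utf16: usize, // UTF-16 code units, as JavaScript counts
    chars: usize, // Unicode code points, as Python counts
}

/// Measures the prefix of `s` that ends at byte offset `utf8`.
/// Panics if `utf8` is not on a character boundary.
fn offsets_at(s: &str, utf8: usize) -> Offsets {
    let prefix = &s[..utf8];
    Offsets {
        utf8,
        utf16: prefix.encode_utf16().count(),
        chars: prefix.chars().count(),
    }
}

fn main() {
    // Sun emoji (U+2600 U+FE0F), "hello\n", map emoji (U+1F5FA U+FE0F), "world\n".
    let s = "\u{2600}\u{FE0F}hello\n\u{1F5FA}\u{FE0F}world\n";
    // Position just before the map emoji.
    println!("{:?}", offsets_at(s, 12));
    // Position just after the map emoji.
    println!("{:?}", offsets_at(s, 19));
}
```

Byte offset 12 corresponds to UTF-16 offset 8 and code point offset 8, while byte offset 19 corresponds to 11 and 10: the three units drift apart as soon as non-ASCII text appears.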
Each StringOffsets instance contains offset information for a single string. Building the data
structure takes O(n) time and memory, but then most conversions are O(1).
"UTF-8 Conversions with BitRank" is a blog post explaining the implementation.
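The O(n) build and O(1) lookup trade-off can be illustrated with a deliberately naive lookup table. This is a sketch only and is not how the crate works internally (the blog post covers that): `NaiveUtf16Table` is a hypothetical type that stores one `u32` per input byte, which is simple but memory-hungry.

```rust
/// Hypothetical illustration, not the crate's implementation:
/// a table with one entry per UTF-8 byte offset.
struct NaiveUtf16Table {
    /// utf16_at[i] is the UTF-16 offset of the character containing byte i;
    /// the final entry is the UTF-16 length of the whole string.
    utf16_at: Vec<u32>,
}

impl NaiveUtf16Table {
    /// Builds the table in a single O(n) pass over the string.
    fn new(s: &str) -> Self {
        let mut utf16_at = Vec::with_capacity(s.len() + 1);
        let mut units = 0u32;
        for ch in s.chars() {
            // Every byte of this character maps to the offset where it starts.
            for _ in 0..ch.len_utf8() {
                utf16_at.push(units);
            }
            units += ch.len_utf16() as u32;
        }
        utf16_at.push(units); // entry for the end-of-string offset
        Self { utf16_at }
    }

    /// O(1) conversion: a single indexed read.
    /// Panics if `byte_offset` is greater than the string length.
    fn utf8_to_utf16(&self, byte_offset: usize) -> usize {
        self.utf16_at[byte_offset] as usize
    }
}

fn main() {
    let s = "\u{2600}\u{FE0F}hello\n\u{1F5FA}\u{FE0F}world\n";
    let table = NaiveUtf16Table::new(s);
    println!("{}", table.utf8_to_utf16(12));
    println!("{}", table.utf8_to_utf16(19));
}
```

The table costs four bytes of memory per byte of input, which is the kind of overhead a more compact representation is meant to avoid.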
Add this to your Cargo.toml:
[dependencies]
string-offsets = "0.1"
Then:
use string_offsets::StringOffsets;
let s = "☀️hello\n🗺️world\n";
let offsets = StringOffsets::new(s);
// Find offsets where lines begin and end.
assert_eq!(offsets.line_to_utf8s(0), 0..12); // note: 0-based line numbers
// Translate string offsets between UTF-8 and other encodings.
// This map emoji is 7 UTF-8 bytes...
assert_eq!(&s[12..19], "🗺️");
// ...but only 3 UTF-16 code units...
assert_eq!(offsets.utf8_to_utf16(12), 8);
assert_eq!(offsets.utf8_to_utf16(19), 11);
// ...and only 2 Unicode characters.
assert_eq!(offsets.utf8s_to_chars(12..19), 8..10);
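For comparison, the line lookup in the example can be approximated with the standard library alone. The `line_ranges` function below is a hypothetical helper, not part of this crate. It assumes each line's range includes its trailing newline, which matches the `0..12` result above; how the crate treats a final line without a newline is not shown here, so that branch is an assumption.

```rust
use std::ops::Range;

/// Hypothetical helper: UTF-8 byte ranges of each line in `s`,
/// where a line's range includes its trailing '\n'.
fn line_ranges(s: &str) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0;
    for (i, b) in s.bytes().enumerate() {
        if b == b'\n' {
            ranges.push(start..i + 1);
            start = i + 1;
        }
    }
    // Assumption: trailing text without a final newline counts as a line.
    if start < s.len() {
        ranges.push(start..s.len());
    }
    ranges
}

fn main() {
    let s = "\u{2600}\u{FE0F}hello\n\u{1F5FA}\u{FE0F}world\n";
    for (line, range) in line_ranges(s).iter().enumerate() {
        println!("line {line}: bytes {range:?}");
    }
}
```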
See the documentation for more.