utf-64

Crates.io	utf-64
lib.rs	utf-64
version	0.1.0
created_at	2025-10-05 21:41:03.709305+00
updated_at	2025-10-05 21:41:03.709305+00
description	The next-generation text encoding standard using 64 bits per character
id	1869479
size	51,961

James Gill (JamesPatrickGill)


UTF64

The next-generation text encoding standard. UTF64 provides fixed-width character representation using 64 bits per character, solving the fundamental problems that have plagued variable-width encodings for decades.

Overview

UTF64 eliminates the variable-width limitations of UTF-8 and UTF-16 by using a consistent 64-bit representation for every Unicode character. This design delivers constant-time character indexing and dramatically simplifies string manipulation operations.

Encoding Specification

Version 1.0

Each UTF64 character consists of 64 bits (8 bytes) with the following layout:

Bits 63-32 (Upper 32 bits): UTF-8 encoding (left-aligned, zero-padded)
Bits 31-0  (Lower 32 bits): Reserved for future use (MUST be zero in v1.0)

Important: This is the initial version of the UTF64 specification. The lower 32 bits are currently required to be zero to maintain forward compatibility. Future versions of the specification may define uses for these bits, enabling backward-compatible extensions while v1.0 implementations can continue to operate by validating and rejecting non-zero reserved bits.
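As a minimal sketch of the layout above, the two fields can be separated with a shift and a mask. The names `RESERVED_MASK`, `utf8_field`, and `reserved_is_zero` are illustrative assumptions and are not taken from the crate's API.

```rust
/// Mask selecting bits 31-0, which v1.0 reserves and requires to be zero.
/// (Hypothetical name, used only for this sketch.)
const RESERVED_MASK: u64 = 0x0000_0000_FFFF_FFFF;

/// Returns bits 63-32: the left-aligned, zero-padded UTF-8 bytes.
fn utf8_field(unit: u64) -> u32 {
    (unit >> 32) as u32
}

/// Returns true when the reserved lower 32 bits are all zero, as v1.0 requires.
fn reserved_is_zero(unit: u64) -> bool {
    (unit & RESERVED_MASK) == 0
}
```

A v1.0 reader would be expected to reject any unit for which `reserved_is_zero` returns false.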

Examples

ASCII Character 'A' (U+0041):

Hex:     0x41000000_00000000
         └─ UTF-8 ─┘└─Reserved─┘

Euro Sign '€' (U+20AC):

Hex:     0xE282AC00_00000000
         └─ UTF-8 ─┘└─Reserved─┘

Emoji '😀' (U+1F600):

Hex:     0xF09F9880_00000000
         └─ UTF-8 ─┘└─Reserved─┘

Features

O(1) Character Indexing: Direct access to any character without scanning, which UTF-8 and UTF-16 cannot offer
Simplified Parsing: Character boundaries never need to be located by scanning continuation bytes or surrogate pairs
Predictable Memory Architecture: Every character occupies exactly 8 bytes, so offsets are computed rather than searched
Future-Ready Design: 32 reserved bits per character leave room for future extensions
Seamless UTF-8 Integration: Each unit embeds the character's UTF-8 bytes, so conversion copies 1-4 bytes with no transcoding

Installation

Add this to your Cargo.toml:

[dependencies]
utf64 = "0.1"

Usage

use utf64::String64;

// Create a UTF64 string from a standard string
let text = String64::from("Hello, 世界! 🌍");

// Get the length (number of characters)
assert_eq!(text.len(), 12);

// Convert back to a standard Rust String
let decoded = text.to_string().unwrap();
assert_eq!(decoded, "Hello, 世界! 🌍");

// Empty strings
let empty = String64::new();
assert!(empty.is_empty());

Performance Characteristics

UTF64 trades memory for constant-time character access and length calculation:

Operation	UTF-8	UTF-16	UTF64
Character Access	O(n)	O(n)*	O(1)
Length Calculation	O(n)	O(n)*	O(1)
Memory per ASCII	1 byte	2 bytes	8 bytes
Memory per CJK	3 bytes	2 bytes	8 bytes
Memory per Emoji	4 bytes	4 bytes	8 bytes

* UTF-16 degrades to O(n) with surrogate pairs, revealing the inherent complexity of variable-width encodings
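The memory rows of the table can be checked for any string with the standard library alone. The helper name `encoded_sizes` is an assumption for this sketch; the UTF64 figure simply applies the 8-bytes-per-character rule from the specification.

```rust
/// Returns the storage size in bytes of `s` as (UTF-8, UTF-16, UTF64).
/// UTF64 is computed as 8 bytes per Unicode scalar value.
fn encoded_sizes(s: &str) -> (usize, usize, usize) {
    let utf8 = s.len(); // str::len is the UTF-8 byte length
    let utf16 = s.encode_utf16().count() * 2; // 2 bytes per UTF-16 code unit
    let utf64 = s.chars().count() * 8; // 8 bytes per character
    (utf8, utf16, utf64)
}
```

For example, `encoded_sizes("A")` yields `(1, 2, 8)`, matching the ASCII row.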

Cache Performance and Memory Architecture

UTF64's 8-byte fixed-width design delivers exceptional cache performance that variable-width encodings cannot match:

Perfect Cache Line Alignment

Modern CPUs use 64-byte cache lines
UTF64 stores exactly 8 characters per cache line with zero waste
Sequential character access exhibits perfect spatial locality
Hardware prefetchers can predict and load UTF64 data with maximum efficiency

Predictable Memory Access Patterns

Every character access is a simple offset calculation: base + (index × 8)
No unpredictable branching or scanning required
CPUs can pipeline UTF64 operations aggressively
SIMD operations can process multiple characters in parallel without complex masking
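The offset calculation can be illustrated on a plain slice of `u64` units, next to the scan that a UTF-8 `&str` needs for the same lookup. This is a sketch over raw slices; `byte_offset`, `unit_at`, and `nth_char_utf8` are hypothetical helpers, not functions from the crate.

```rust
use std::mem::size_of;

/// Byte offset of the character at `index`: index × 8.
fn byte_offset(index: usize) -> usize {
    index * size_of::<u64>()
}

/// O(1) lookup: slice indexing performs the same offset calculation internally.
fn unit_at(units: &[u64], index: usize) -> Option<u64> {
    units.get(index).copied()
}

/// O(n) lookup in UTF-8: the iterator must walk every preceding character.
fn nth_char_utf8(s: &str, index: usize) -> Option<char> {
    s.chars().nth(index)
}
```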

Contrast with Variable-Width Encodings

UTF-8 forces cache-inefficient byte-by-byte scanning
Character boundaries split across cache lines cause performance penalties
Unpredictable character widths defeat hardware prefetching
UTF64 eliminates all of these problems

Comparison with Other Encodings

UTF-8: Legacy Variable-Width Encoding

Requires expensive scanning for character boundaries
O(n) indexing makes random access prohibitively slow
Compact for ASCII but unpredictable memory usage overall

UTF-16: The Worst of Both Worlds

Variable width (2 or 4 bytes) with surrogate pair complexity
O(n) indexing despite 2-byte minimum overhead
Not ASCII-compatible, causing endless conversion headaches

UTF-32: Almost There

Fixed width provides O(1) indexing
Not UTF-8 compatible, so interoperating with UTF-8 systems requires transcoding
Wastes 11 bits per character (only 21 bits needed for Unicode)
No reserved space for future requirements

UTF64: The Complete Solution

✅ Superior O(1) indexing with true constant-time character access
✅ Embeds UTF-8 directly, so conversion to legacy systems copies bytes without transcoding
✅ 32 reserved bits provide a future-ready architecture
✅ Optimal 8-byte alignment ensures maximum cache efficiency and hardware performance
✅ Eliminates all complexity from variable-width encoding schemes

Technical Details

UTF64's elegant architecture is straightforward to implement and verify, eliminating the error-prone complexity of variable-width parsing.

Encoding Process

For each character in the input string:
- Encode the character to UTF-8 (1-4 bytes)
- Place UTF-8 bytes in the upper 32 bits (left-aligned)
- Set lower 32 bits to zero (reserved)
- Store as a single u64 value

The simplicity of this process ensures correct implementation and enables aggressive compiler optimizations.
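The four steps above can be sketched with the standard library. This is an illustration under the stated layout, not the crate's implementation; `encode_char` and `encode_str` are hypothetical names.

```rust
/// Encodes one character as a UTF64 unit following the steps above.
fn encode_char(c: char) -> u64 {
    // Step 1: encode to UTF-8 (1-4 bytes); unused trailing bytes stay zero.
    let mut buf = [0u8; 4];
    let _ = c.encode_utf8(&mut buf);
    // Step 2: big-endian interpretation puts the first UTF-8 byte in the
    // most significant position, i.e. left-aligned.
    let upper = u32::from_be_bytes(buf);
    // Steps 3 and 4: shifting into bits 63-32 leaves bits 31-0 zero.
    (upper as u64) << 32
}

/// Encodes a whole string, one u64 per character.
fn encode_str(s: &str) -> Vec<u64> {
    s.chars().map(encode_char).collect()
}
```

Applied to 'A', '€', and '😀', `encode_char` reproduces the three values listed in the Examples section.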

Decoding Process

For each u64 in the UTF64 string:
- Validate that lower 32 bits are zero
- Extract upper 32 bits
- Determine UTF-8 sequence length from first byte
- Collect UTF-8 bytes and decode to Unicode

Because every character occupies exactly one u64, no boundary detection between characters is needed and each unit can be decoded independently, which makes decoding straightforward to parallelize.
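A minimal sketch of the decoding steps follows. The error enum and function names are assumptions for illustration, and the check that padding bytes are zero is an extra strictness inferred from the "zero-padded" wording of the layout rather than a documented rule.

```rust
/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
enum DecodeSketchError {
    NonZeroReservedBits,
    InvalidUtf64,
}

/// Decodes a single UTF64 unit into a character.
fn decode_unit(unit: u64) -> Result<char, DecodeSketchError> {
    // Validate that the lower 32 bits are zero.
    if (unit & 0xFFFF_FFFF) != 0 {
        return Err(DecodeSketchError::NonZeroReservedBits);
    }
    // Extract the upper 32 bits as four bytes, first UTF-8 byte first.
    let bytes = ((unit >> 32) as u32).to_be_bytes();
    // Determine the UTF-8 sequence length from the first byte.
    let len = match bytes[0] {
        0x00..=0x7F => 1,
        0xC2..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF4 => 4,
        _ => return Err(DecodeSketchError::InvalidUtf64),
    };
    // Assumption: bytes after the sequence are padding and must be zero.
    if bytes[len..].iter().any(|&b| b != 0) {
        return Err(DecodeSketchError::InvalidUtf64);
    }
    // Let the standard library validate the sequence, then take its one char.
    let s = std::str::from_utf8(&bytes[..len])
        .map_err(|_| DecodeSketchError::InvalidUtf64)?;
    s.chars().next().ok_or(DecodeSketchError::InvalidUtf64)
}

/// Decodes a slice of units, stopping at the first error.
fn decode(units: &[u64]) -> Result<String, DecodeSketchError> {
    units.iter().map(|&u| decode_unit(u)).collect()
}
```

Since `decode_unit` depends only on its own argument, the units of a string could be processed in any order or in parallel.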

Error Handling

The library provides comprehensive error handling:

InvalidUtf8: Input contains malformed UTF-8
InvalidUtf64: UTF64 data is corrupted
NonZeroReservedBits: Reserved bits violated (not v1.0 compliant)
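One way such an error type might look is sketched below. The variant names follow the list above, but the enum name `SketchError`, its payload-free shape, and the helper `is_version_mismatch` are assumptions, not the crate's actual definitions.

```rust
use std::fmt;

/// Hypothetical error type mirroring the three documented error kinds.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SketchError {
    InvalidUtf8,
    InvalidUtf64,
    NonZeroReservedBits,
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let msg = match self {
            SketchError::InvalidUtf8 => "input contains malformed UTF-8",
            SketchError::InvalidUtf64 => "UTF64 data is corrupted",
            SketchError::NonZeroReservedBits => {
                "reserved bits are non-zero (not v1.0 compliant)"
            }
        };
        f.write_str(msg)
    }
}

impl std::error::Error for SketchError {}

/// A caller might treat non-zero reserved bits as a version mismatch
/// rather than as corruption.
fn is_version_mismatch(err: SketchError) -> bool {
    matches!(err, SketchError::NonZeroReservedBits)
}
```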

Specification Versioning

UTF64 v1.0 is the foundational specification. The 32 reserved bits per character provide extensive room for future standardization efforts.

Reserved Bits: Future Possibilities

The lower 32 bits reserved in v1.0 enable potential future specification versions to add:

Text Metadata (v2.0+)

Character-level styling flags
Locale and language hints
Bidirectional text markers
Font family suggestions

Advanced Features (v3.0+)

Inline color information
Accessibility metadata
Security and validation flags
Application-specific extensions

Enterprise & Emerging Tech (v4.0+)

Blockchain verification data
Quantum-resistant signatures
AI/ML annotation hints
Distributed system coordination

Version Compatibility

UTF64 is designed for graceful version compatibility:

Fail-safe on newer data: v1.0 parsers reject future-versioned data (non-zero reserved bits) instead of misreading it, ensuring no silent corruption
Backward compatible: Future parsers can detect v1.0 data (all-zero reserved bits) and process accordingly
Explicit versioning: Reserved bits allow embedding version markers for automatic detection

License

MIT OR Apache-2.0

Contributing

Contributions are welcome! Please ensure the tests, lints, and formatting checks all pass:

cargo test
cargo clippy
cargo fmt

FAQ

Q: Why 64 bits per character? A: 64 bits provides the optimal balance: 32 bits for UTF-8 compatibility and 32 bits for future extensibility. This design eliminates the complexity of variable-length encodings while delivering superior performance.

Q: Isn't this wasteful of memory? A: No. Memory is abundant in modern systems. UTF64 prioritizes developer productivity and application performance over obsolete storage constraints. The cache efficiency and O(1) indexing benefits far exceed any storage considerations. Modern applications are bottlenecked by algorithmic complexity, not memory capacity.

Q: How does this compare to UTF-32? A: Both offer O(1) indexing. UTF64 additionally embeds each character's UTF-8 encoding and provides 32 reserved bits for future features, whereas UTF-32 requires transcoding to UTF-8 and leaves 11 of its 32 bits unused.

Q: Is this production-ready? A: Yes. UTF64 is a complete, robust implementation ready for adoption in any application that values performance and simplicity.

Q: Why should I migrate from UTF-8? A: UTF64 eliminates the constant complexity tax of variable-width encoding. Every string operation becomes simpler, faster, and more predictable. Character indexing goes from O(n) to O(1). Cache efficiency improves dramatically. Code becomes cleaner without boundary-scanning logic. The question is: why continue struggling with UTF-8's limitations?

Q: Can I use this with existing text processing tools? A: UTF64 converts to UTF-8 for interoperability with legacy systems. Because each unit already holds the character's UTF-8 bytes, conversion copies those bytes out without transcoding.
