| Crates.io | utf-64 |
| lib.rs | utf-64 |
| version | 0.1.0 |
| created_at | 2025-10-05 21:41:03.709305+00 |
| updated_at | 2025-10-05 21:41:03.709305+00 |
| description | The next-generation text encoding standard using 64 bits per character |
| id | 1869479 |
| size | 51,961 |
The next-generation text encoding standard. UTF64 provides fixed-width character representation using 64 bits per character, solving the fundamental problems that have plagued variable-width encodings for decades.
UTF64 eliminates the variable-width limitations of UTF-8 and UTF-16 by using a consistent 64-bit representation for every Unicode character. This design delivers constant-time character indexing and dramatically simplifies string manipulation operations.
Each UTF64 character consists of 64 bits (8 bytes) with the following layout:
Bits 63-32 (Upper 32 bits): UTF-8 encoding (left-aligned, zero-padded)
Bits 31-0 (Lower 32 bits): Reserved for future use (MUST be zero in v1.0)
Important: This is the initial version of the UTF64 specification. The lower 32 bits are currently required to be zero to maintain forward compatibility. Future versions of the specification may define uses for these bits, enabling backward-compatible extensions while v1.0 implementations can continue to operate by validating and rejecting non-zero reserved bits.
ASCII Character 'A' (U+0041):
Hex: 0x41000000_00000000
└─ UTF-8 ─┘└─Reserved─┘
Euro Sign '€' (U+20AC):
Hex: 0xE282AC00_00000000
└─ UTF-8 ─┘└─Reserved─┘
Emoji '😀' (U+1F600):
Hex: 0xF09F9880_00000000
└─ UTF-8 ─┘└─Reserved─┘
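The three examples above can be reproduced with a few lines of standard-library Rust. This is a minimal sketch of the layout as described, not the crate's implementation; the function name `encode_char` is hypothetical.

```rust
/// Encodes one character as a UTF64 unit: its UTF-8 bytes are left-aligned
/// in the upper 32 bits, and every remaining bit (the padding and the
/// reserved lower half) is zero.
fn encode_char(c: char) -> u64 {
    let mut buf = [0u8; 4];
    // encode_utf8 writes 1 to 4 bytes at the start of `buf`; the rest stay zero.
    let _ = c.encode_utf8(&mut buf);
    // Interpret the four bytes big-endian so the first UTF-8 byte lands in
    // bits 63-56, then shift the whole field into the upper half.
    (u32::from_be_bytes(buf) as u64) << 32
}
```

Calling `encode_char('€')` under these assumptions yields `0xE282AC00_00000000`, matching the Euro sign example.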
Add this to your Cargo.toml:
[dependencies]
utf64 = "0.1"
use utf64::String64;
// Create a UTF64 string from a standard string
let text = String64::from("Hello, 世界! 🌍");
// Get the length (number of characters)
assert_eq!(text.len(), 12);
// Convert back to a standard Rust String
let decoded = text.to_string().unwrap();
assert_eq!(decoded, "Hello, 世界! 🌍");
// Empty strings
let empty = String64::new();
assert!(empty.is_empty());
UTF64 offers O(1) character access and length calculation, at a fixed cost of 8 bytes per character:
| Operation | UTF-8 | UTF-16 | UTF64 |
|---|---|---|---|
| Character Access | O(n) | O(n)* | O(1) |
| Length Calculation | O(n) | O(n)* | O(1) |
| Memory per ASCII | 1 byte | 2 bytes | 8 bytes |
| Memory per CJK | 3 bytes | 2 bytes | 8 bytes |
| Memory per Emoji | 4 bytes | 4 bytes | 8 bytes |
* UTF-16 degrades to O(n) with surrogate pairs, revealing the inherent complexity of variable-width encodings
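The memory and length rows of the table can be checked with the standard library alone. The sketch below is illustrative and does not use the crate; a plain `&[u64]` stands in for a UTF64 buffer, and all function names are hypothetical.

```rust
/// Bytes needed to store `s` as (UTF-8, UTF-16, UTF64).
fn storage_bytes(s: &str) -> (usize, usize, usize) {
    let utf8 = s.len();                       // str::len is the UTF-8 byte length
    let utf16 = s.encode_utf16().count() * 2; // 2 bytes per UTF-16 code unit
    let utf64 = s.chars().count() * 8;        // 8 bytes per character
    (utf8, utf16, utf64)
}

/// Character count of a UTF-8 string: O(n), the whole string is scanned.
fn utf8_char_count(s: &str) -> usize {
    s.chars().count()
}

/// Character count of a fixed-width buffer: O(1), it is the slice length.
fn fixed_char_count(units: &[u64]) -> usize {
    units.len()
}
```

For a single character, `storage_bytes("A")`, `storage_bytes("世")`, and `storage_bytes("😀")` give the ASCII, CJK, and emoji rows respectively.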
UTF64's 8-byte fixed-width design delivers exceptional cache performance that variable-width encodings cannot match:
Perfect Cache Line Alignment
Predictable Memory Access Patterns
Character address: base + (index × 8)
Contrast with Variable-Width Encodings
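The offset formula, and how it differs from a variable-width encoding, can be sketched as follows. This is an illustration rather than code from the crate; both function names are hypothetical.

```rust
/// Size of one UTF64 unit in bytes.
const UNIT_BYTES: usize = std::mem::size_of::<u64>();

/// Fixed width: the byte offset of character `index` is pure arithmetic,
/// index × 8 from the buffer base, independent of the preceding characters.
fn fixed_width_offset(index: usize) -> usize {
    index * UNIT_BYTES
}

/// Variable width: the byte offset of character `index` in UTF-8 is only
/// known after walking every character before it.
fn utf8_offset(s: &str, index: usize) -> Option<usize> {
    s.char_indices().nth(index).map(|(offset, _)| offset)
}
```

In the UTF-8 case the answer depends on the content: in "a€😀b" the fourth character starts at byte 8, because the three characters before it occupy 1, 3, and 4 bytes.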
UTF64's elegant architecture is straightforward to implement and verify, eliminating the error-prone complexity of variable-width parsing.
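Encoding a character takes its UTF-8 bytes, left-aligns them in the upper 32 bits, zero-fills the remaining bits, and yields a single u64 value.
The simplicity of this process makes a correct implementation easy to verify and leaves room for aggressive compiler optimizations.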
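Decoding visits each u64 in the UTF64 string: it checks that the lower 32 bits are zero, then decodes the UTF-8 bytes held in the upper 32 bits.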
The fixed-width format eliminates all boundary-detection logic, making decoding trivially parallelizable.
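A decoder for this layout might look like the sketch below. It follows the bit layout described earlier and is not the crate's code; `decode_unit` and `decode` are hypothetical names, and returning `None` on any violation is a simplification chosen for brevity.

```rust
/// Decodes one UTF64 unit. Returns `None` if the reserved bits are set or
/// the upper 32 bits are not exactly one zero-padded UTF-8 character.
fn decode_unit(unit: u64) -> Option<char> {
    // v1.0 requires the lower 32 bits to be zero.
    if unit & 0xFFFF_FFFF != 0 {
        return None;
    }
    let bytes = ((unit >> 32) as u32).to_be_bytes();
    // The leading byte determines the length of the UTF-8 sequence.
    let len = match bytes[0] {
        0x00..=0x7F => 1,
        0xC0..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF7 => 4,
        _ => return None, // a continuation byte or invalid lead byte
    };
    // Padding bytes after the sequence must be zero.
    if bytes[len..].iter().any(|&b| b != 0) {
        return None;
    }
    // from_utf8 rejects overlong forms, surrogates, and out-of-range values.
    std::str::from_utf8(&bytes[..len]).ok()?.chars().next()
}

/// Decodes a whole buffer. Each unit is handled independently of its
/// neighbours, which is what makes the work easy to split across threads.
fn decode(units: &[u64]) -> Option<String> {
    units.iter().map(|&u| decode_unit(u)).collect()
}
```

Because no unit depends on another, the same `decode_unit` could be applied to separate chunks of the buffer in parallel and the results concatenated.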
The library provides comprehensive error handling:
InvalidUtf8: Input contains malformed UTF-8
InvalidUtf64: UTF64 data is corrupted
NonZeroReservedBits: Reserved bits violated (not v1.0 compliant)
UTF64 v1.0 is the foundational specification. The 32 reserved bits per character provide extensive room for future standardization efforts.
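The three error cases named in the error-handling list could be modeled as an ordinary Rust error enum. The sketch below only mirrors the names from that list; the enum name `Utf64Error`, its trait implementations, and the helper `check_reserved` are assumptions, not the crate's actual definitions.

```rust
use std::fmt;

/// Hypothetical error type mirroring the three cases named in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Utf64Error {
    InvalidUtf8,
    InvalidUtf64,
    NonZeroReservedBits,
}

impl fmt::Display for Utf64Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let msg = match self {
            Utf64Error::InvalidUtf8 => "input contains malformed UTF-8",
            Utf64Error::InvalidUtf64 => "UTF64 data is corrupted",
            Utf64Error::NonZeroReservedBits => "reserved bits violated (not v1.0 compliant)",
        };
        f.write_str(msg)
    }
}

impl std::error::Error for Utf64Error {}

/// Checks only the v1.0 reserved-bit rule for a single unit.
fn check_reserved(unit: u64) -> Result<u64, Utf64Error> {
    if unit & 0xFFFF_FFFF == 0 {
        Ok(unit)
    } else {
        Err(Utf64Error::NonZeroReservedBits)
    }
}
```

Implementing `std::error::Error` lets such a type be used with the `?` operator alongside other error types.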
The lower 32 bits reserved in v1.0 enable potential future specification versions to add:
Text Metadata (v2.0+)
Advanced Features (v3.0+)
Enterprise & Emerging Tech (v4.0+)
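UTF64 is designed for graceful version compatibility: v1.0 implementations validate the reserved bits and reject any character in which they are non-zero, so later versions can assign those bits a meaning without being silently misread.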
MIT OR Apache-2.0
Contributions are welcome! Please ensure all tests pass:
cargo test
cargo clippy
cargo fmt
Q: Why 64 bits per character? A: 64 bits provides the optimal balance: 32 bits for UTF-8 compatibility and 32 bits for future extensibility. This design eliminates the complexity of variable-length encodings while delivering superior performance.
Q: Isn't this wasteful of memory? A: No. Memory is abundant in modern systems. UTF64 prioritizes developer productivity and application performance over obsolete storage constraints. The cache efficiency and O(1) indexing benefits far exceed any storage considerations. Modern applications are bottlenecked by algorithmic complexity, not memory capacity.
Q: How does this compare to UTF-32? A: Both encodings offer O(1) indexing. UTF64 additionally embeds the UTF-8 encoding of each character and provides 32 reserved bits for future features. UTF-32 offers neither, and 11 of its 32 bits go unused because Unicode code points need only 21 bits.
Q: Is this production-ready? A: Yes. UTF64 is a complete, robust implementation ready for adoption in any application that values performance and simplicity.
Q: Why should I migrate from UTF-8? A: UTF64 eliminates the constant complexity tax of variable-width encoding. Every string operation becomes simpler, faster, and more predictable. Character indexing goes from O(n) to O(1). Cache efficiency improves dramatically. Code becomes cleaner without boundary-scanning logic. The question is: why continue struggling with UTF-8's limitations?
Q: Can I use this with existing text processing tools? A: UTF64 converts to and from UTF-8 for interoperability with existing systems. Because every character already carries its UTF-8 bytes, conversion only has to copy those bytes out and drop the padding.
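The 11-bit figure in the UTF-32 answer above follows from the size of the Unicode code space, and can be checked with a short calculation. The function names below are hypothetical and exist only for this illustration.

```rust
/// Number of bits needed to represent the largest Unicode scalar value
/// (`char::MAX`, which is U+10FFFF).
fn bits_needed_for_unicode() -> u32 {
    u32::BITS - (char::MAX as u32).leading_zeros()
}

/// Bits of a 32-bit UTF-32 unit that no scalar value ever sets.
fn unused_utf32_bits() -> u32 {
    u32::BITS - bits_needed_for_unicode()
}
```

By the same arithmetic, the reserved lower half of a UTF64 character always accounts for 32 unused bits in v1.0, in addition to any zero padding in the upper half.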