unicode-canonical-combining-class

Crates.io: unicode-canonical-combining-class
lib.rs: unicode-canonical-combining-class
version: 0.5.0
source: src
created_at: 2021-06-04 06:36:45.677418
updated_at: 2022-09-19 23:29:04.698736
description: Fast lookup of the Canonical Combining Class property
homepage: https://github.com/yeslogic/unicode-canonical-combining-class
repository: https://github.com/yeslogic/unicode-canonical-combining-class
max_upload_size:
id: 406024
size: 59,893
Developers (prince) (github:yeslogic:developers-prince)

documentation: https://docs.rs/unicode-canonical-combining-class

README

unicode-canonical-combining-class


Fast lookup of the Unicode Canonical Combining Class property for char in Rust using Unicode 15.0 data. This crate is no-std compatible.

Usage

use unicode_canonical_combining_class::{get_canonical_combining_class, CanonicalCombiningClass};

fn main() {
    assert_eq!(get_canonical_combining_class('ཱ'), CanonicalCombiningClass::CCC129);
}

Performance & Implementation Notes

ucd-generate is used to generate tables.rs. A build script (build.rs) compiles this into a two-level lookup table. Lookup time is constant, since a lookup is just an index into each of two arrays.

The two-level approach maps a code point to a block, then to a position within that block. This allows identical second-level blocks to be deduplicated, saving space. The code is parameterised over the block size, which must be a power of 2. The value in the build script is optimal for the data set.
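The following is a minimal sketch of the two-level idea, not the crate's generated code. It builds the table at runtime from a flat array rather than in a build script, and the names `TwoLevelTable`, `sample_flat_table`, and the block size of 256 are assumptions chosen for illustration. The sample data contains only three hand-picked code points.

```rust
use std::collections::HashMap;

// Hypothetical block size; it must be a power of two so that a shift and a
// mask can split a code point into (block number, offset within block).
const SHIFT: u32 = 8;
const BLOCK_SIZE: usize = 1 << SHIFT;
const MASK: u32 = (BLOCK_SIZE as u32) - 1;
const CODE_POINTS: usize = 0x110000;

struct TwoLevelTable {
    // First level: one entry per block of code points, naming a unique block.
    block_index: Vec<u16>,
    // Second level: the unique blocks, stored back to back.
    blocks: Vec<u8>,
}

impl TwoLevelTable {
    // Split a flat per-code-point table into blocks and store each distinct
    // block only once.
    fn build(flat: &[u8]) -> Self {
        assert_eq!(flat.len() % BLOCK_SIZE, 0);
        let mut seen: HashMap<&[u8], u16> = HashMap::new();
        let mut block_index = Vec::with_capacity(flat.len() / BLOCK_SIZE);
        let mut blocks = Vec::new();
        for chunk in flat.chunks(BLOCK_SIZE) {
            let next_id = seen.len() as u16;
            let id = *seen.entry(chunk).or_insert_with(|| {
                blocks.extend_from_slice(chunk);
                next_id
            });
            block_index.push(id);
        }
        TwoLevelTable { block_index, blocks }
    }

    // Constant-time lookup: two array index operations.
    fn get(&self, c: char) -> u8 {
        let cp = c as u32;
        let block = self.block_index[(cp >> SHIFT) as usize] as usize;
        self.blocks[block * BLOCK_SIZE + (cp & MASK) as usize]
    }
}

// A tiny illustrative data set: every code point is 0 except three marks.
fn sample_flat_table() -> Vec<u8> {
    let mut flat = vec![0u8; CODE_POINTS];
    flat[0x0301] = 230; // COMBINING ACUTE ACCENT
    flat[0x0327] = 202; // COMBINING CEDILLA
    flat[0x0F71] = 129; // TIBETAN VOWEL SIGN AA
    flat
}

fn main() {
    let table = TwoLevelTable::build(&sample_flat_table());
    assert_eq!(table.get('\u{0F71}'), 129);
    assert_eq!(table.get('a'), 0);
    println!(
        "unique blocks: {} of {}",
        table.blocks.len() / BLOCK_SIZE,
        table.block_index.len()
    );
}
```

With this sample data most blocks are all zeros and collapse into a single stored block, which is where the space saving comes from.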

This approach trades off some space for faster lookups. The tables take up about 24.5KiB. Benchmarks showed this approach to be ~5–10× faster than the typical binary search approach.
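For comparison, the binary search approach mentioned above can be sketched as a search over a sorted list of code point ranges. This is an illustration only, not the code that was benchmarked; the `RANGES` table and the function name `ccc_binary_search` are hypothetical, and the table holds just a few hand-picked ranges.

```rust
use std::cmp::Ordering;

// Inclusive (start, end, class) ranges sorted by start.
// A tiny sample for illustration, not the full Unicode data.
const RANGES: &[(u32, u32, u8)] = &[
    (0x0300, 0x0314, 230),
    (0x0327, 0x0328, 202),
    (0x0F71, 0x0F71, 129),
];

// O(log n) lookup: find the range containing the code point, if any.
// Code points outside every range get class 0.
fn ccc_binary_search(c: char) -> u8 {
    let cp = c as u32;
    RANGES
        .binary_search_by(|&(start, end, _)| {
            if end < cp {
                Ordering::Less
            } else if start > cp {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .map(|i| RANGES[i].2)
        .unwrap_or(0)
}

fn main() {
    assert_eq!(ccc_binary_search('\u{0301}'), 230);
    assert_eq!(ccc_binary_search('\u{0F71}'), 129);
    assert_eq!(ccc_binary_search('a'), 0);
}
```

The range table is compact, but each lookup takes several comparisons and branches, whereas the two-level table needs only two array reads.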

It's possible there are further optimisations that could be made to eliminate some runs of repeated values in the first level array.

Commit count: 20
