unicode-joining-type

Crates.io: unicode-joining-type
lib.rs: unicode-joining-type
version: 1.0.0
source: src
created_at: 2019-12-13 03:30:03.23052
updated_at: 2024-10-30 01:57:30.47094
description: Fast lookup of the Unicode Joining Type and Joining Group properties
homepage: https://github.com/yeslogic/unicode-joining-type
repository: https://github.com/yeslogic/unicode-joining-type
max_upload_size
id: 188969
size: 82,992
Wesley Moore (wezm)

documentation

https://docs.rs/unicode-joining-type

README

unicode-joining-type


Fast lookup of the Unicode Joining Type and Joining Group properties for char in Rust using Unicode 16.0 data. This crate is no-std compatible.

Usage

use unicode_joining_type::{get_joining_type, JoiningType};
use unicode_joining_type::{get_joining_group, JoiningGroup};

fn main() {
    assert_eq!(get_joining_type('A'), JoiningType::NonJoining);
    assert_eq!(get_joining_group('ھ'), JoiningGroup::KnottedHeh);
}

Performance & Implementation Notes

ucd-generate is used to generate joining_type_tables.rs and joining_group_tables.rs. A build script (build.rs) compiles each of these into a two-level lookup table. Lookup time is constant, as a lookup is just indexing into two arrays.

The two-level approach maps a code point to a block, then to a position within that block. This allows the second-level blocks to be deduplicated, saving space. The code is parameterised over the block size, which must be a power of 2. The value in the build script is optimal for the data set.
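The idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the crate's actual build script: the names `TwoLevelTable`, `build_table` and `lookup`, the block size of 16, and the `u8`/`u16` element types are all hypothetical choices made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical block size for this sketch; it must be a power of two so that
/// a shift and a mask can split a code point into (block number, offset).
const BLOCK_SIZE: usize = 16;
const SHIFT: u32 = BLOCK_SIZE.trailing_zeros();
const MASK: usize = BLOCK_SIZE - 1;

/// First level: one entry per block of code points, naming a deduplicated block.
/// Second level: the unique blocks stored back to back.
struct TwoLevelTable {
    index: Vec<u16>,
    blocks: Vec<u8>,
}

/// Build a table from a flat array holding one property value per code point.
fn build_table(values: &[u8]) -> TwoLevelTable {
    assert!(values.len() % BLOCK_SIZE == 0, "input must be a whole number of blocks");
    let mut seen: HashMap<&[u8], u16> = HashMap::new();
    let mut index = Vec::new();
    let mut blocks = Vec::new();
    for chunk in values.chunks(BLOCK_SIZE) {
        // Identical chunks share one id, so each unique block is stored once.
        let next_id = seen.len() as u16;
        let id = *seen.entry(chunk).or_insert_with(|| {
            blocks.extend_from_slice(chunk);
            next_id
        });
        index.push(id);
    }
    TwoLevelTable { index, blocks }
}

/// Constant-time lookup: two array indexing operations.
fn lookup(table: &TwoLevelTable, code_point: usize) -> u8 {
    let block = table.index[code_point >> SHIFT] as usize;
    table.blocks[block * BLOCK_SIZE + (code_point & MASK)]
}
```

Because most runs of code points share the same property value, many chunks are identical and collapse to a single stored block, which is where the space saving comes from.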

This approach trades off some space for faster lookups. The joining type tables take up about 26 KiB and the joining group tables about 6.75 KiB. Benchmarks showed this approach to be ~5–10× faster than the typical binary search approach.
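For comparison, a binary search lookup over sorted code point ranges might look like the sketch below. It is not the code that was benchmarked; the function name `lookup_binary_search`, the `RANGES` table and the values in it are placeholders invented for illustration and are not real Unicode data.

```rust
use std::cmp::Ordering;

/// Hypothetical table of (first, last, value) ranges, sorted by code point
/// and non-overlapping. The values are arbitrary placeholders.
const RANGES: &[(u32, u32, u8)] = &[
    (0x0600, 0x0605, 1),
    (0x0620, 0x0620, 2),
    (0x0622, 0x0625, 3),
];

/// Look up `cp` by binary search, returning `default` when no range contains it.
/// This takes O(log n) comparisons, versus two array reads for a two-level table.
fn lookup_binary_search(ranges: &[(u32, u32, u8)], cp: u32, default: u8) -> u8 {
    ranges
        .binary_search_by(|&(first, last, _)| {
            if last < cp {
                Ordering::Less
            } else if first > cp {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .map(|i| ranges[i].2)
        .unwrap_or(default)
}
```

The range table is more compact, but each lookup involves several data-dependent branches, which is likely part of why the indexed tables come out ahead in the benchmarks.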

There is still room for further size reduction, for example by eliminating repeated block mappings at the end of the first-level block array.
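One way this reduction could work is sketched below. It is not implemented in the crate; `trim_trailing` and `lookup_block` are hypothetical names, and the `u16` block id type is an assumption.

```rust
/// Drop the run of identical entries at the end of a first-level array.
/// Returns the shortened array and the block id that the removed tail mapped
/// to, or `None` if the array was empty.
fn trim_trailing(index: &[u16]) -> (&[u16], Option<u16>) {
    let last = match index.last() {
        Some(&last) => last,
        None => return (index, None),
    };
    // Keep everything up to and including the last entry that differs from `last`.
    let keep = index
        .iter()
        .rposition(|&id| id != last)
        .map_or(0, |i| i + 1);
    (&index[..keep], Some(last))
}

/// Resolve a block number against the trimmed array. Any block number past the
/// end falls back to the shared tail id, at the cost of one bounds check.
fn lookup_block(trimmed: &[u16], tail: u16, block_number: usize) -> u16 {
    trimmed.get(block_number).copied().unwrap_or(tail)
}
```

Since the upper part of the code point range mostly maps to the same default block, the trimmed tail could be long, but the extra bounds check would need benchmarking against the current unconditional indexing.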

Commit count: 34
