[![github]](https://github.com/dtolnay/unicode-ident) [![crates-io]](https://crates.io/crates/unicode-ident) [![docs-rs]](https://docs.rs/unicode-ident)

[github]: https://img.shields.io/badge/github-8da0cb?style=for-the-badge&labelColor=555555&logo=github
[crates-io]: https://img.shields.io/badge/crates.io-fc8d62?style=for-the-badge&labelColor=555555&logo=rust
[docs-rs]: https://img.shields.io/badge/docs.rs-66c2a5?style=for-the-badge&labelColor=555555&logo=docs.rs

Implementation of [Unicode Standard Annex #31][tr31] for determining which
`char` values are valid in programming language identifiers.

[tr31]: https://www.unicode.org/reports/tr31/

This crate is a better optimized implementation of the older `unicode-xid`
crate. This crate uses less static storage, and is able to classify both
ASCII and non-ASCII codepoints with better performance, 2–10×
faster than `unicode-xid`.

## Comparison of performance

The following table shows a comparison between five Unicode identifier
implementations.

- `unicode-ident` is this crate;
- [`unicode-xid`] is a widely used crate run by the "unicode-rs" org;
- `ucd-trie` and `fst` are two data structures supported by the
  [`ucd-generate`] tool;
- [`roaring`] is a Rust implementation of Roaring bitmap.

The *static storage* column shows the total size of `static` tables that the
crate bakes into your binary, measured in 1000s of bytes.

The remaining columns show the **cost per call** to evaluate whether a
single `char` has the XID\_Start or XID\_Continue Unicode property,
comparing across different ratios of ASCII to non-ASCII codepoints in the
input data.

[`unicode-xid`]: https://github.com/unicode-rs/unicode-xid
[`ucd-generate`]: https://github.com/BurntSushi/ucd-generate
[`roaring`]: https://github.com/RoaringBitmap/roaring-rs

| | static storage | 0% nonascii | 1% | 10% | 100% nonascii |
|---|---|---|---|---|---|
| **`unicode-ident`** | 10.4 K | 0.96 ns | 0.95 ns | 1.09 ns | 1.55 ns |
| **`unicode-xid`** | 11.8 K | 1.88 ns | 2.14 ns | 3.48 ns | 15.63 ns |
| **`ucd-trie`** | 10.3 K | 1.29 ns | 1.28 ns | 1.36 ns | 2.15 ns |
| **`fst`** | 144 K | 55.1 ns | 54.9 ns | 53.2 ns | 28.5 ns |
| **`roaring`** | 66.1 K | 2.78 ns | 3.09 ns | 3.37 ns | 4.70 ns |

Source code for the benchmark is provided in the *bench* directory of this
repo and may be repeated by running `cargo criterion`.

## Comparison of data structures

#### unicode-xid

They use a sorted array of character ranges, and do a binary search to look
up whether a given character lands inside one of those ranges.

```rust
# const _: &str = stringify! {
static XID_Continue_table: [(char, char); 763] = [
    ('\u{30}', '\u{39}'),  // 0-9
    ('\u{41}', '\u{5a}'),  // A-Z
# "
    …
# "
    ('\u{e0100}', '\u{e01ef}'),
];
# };
```

The static storage used by this data structure scales with the number of
contiguous ranges of identifier codepoints in Unicode. Every table entry
consumes 8 bytes, because it consists of a pair of 32-bit `char` values.

In some ranges of the Unicode codepoint space, this is quite a sparse
representation – there are some ranges where tens of thousands of
adjacent codepoints are all valid identifier characters. In other places,
the representation is quite inefficient. A character like `µ` (U+00B5)
which is surrounded by non-identifier codepoints consumes 64 bits in the
table, while it would be just 1 bit in a dense bitmap.

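The range lookup described above can be sketched roughly as follows. This is only an illustration of the technique, not the actual `unicode-xid` code: the `TOY_RANGES` table and the `in_ranges` helper are hypothetical names, and the table holds a handful of made-up entries rather than the real 763.

```rust
use std::cmp::Ordering;

/// Hypothetical toy table of inclusive (low, high) ranges, sorted by codepoint.
/// The real table is far longer; these entries are only for illustration.
static TOY_RANGES: [(char, char); 4] = [
    ('0', '9'),
    ('A', 'Z'),
    ('a', 'z'),
    ('\u{b5}', '\u{b5}'), // a lone codepoint still occupies a full 8-byte entry
];

/// Binary search for the range containing `ch`, if any.
fn in_ranges(ch: char, table: &[(char, char)]) -> bool {
    table
        .binary_search_by(|&(lo, hi)| {
            if hi < ch {
                Ordering::Less // this range lies entirely below `ch`
            } else if lo > ch {
                Ordering::Greater // this range lies entirely above `ch`
            } else {
                Ordering::Equal // lo <= ch <= hi
            }
        })
        .is_ok()
}
```

With this toy table, `in_ranges('q', &TOY_RANGES)` finds the `a`-`z` entry, while `in_ranges(' ', &TOY_RANGES)` falls below every range and returns `false`. Each step of the search is a data-dependent comparison, and every entry is a pair of 4-byte `char` values.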
On a system with 64-byte cache lines, binary searching the table touches 7
cache lines on average. Each cache line fits only 8 table entries.
Additionally, the branching performed during the binary search is probably
mostly unpredictable to the branch predictor.

Overall, the crate ends up being about 10× slower on non-ASCII input
compared to the fastest crate.

A potential improvement would be to pack the table entries more compactly.
Rust's `char` type is a 21-bit integer padded to 32 bits, which means every
table entry is holding 22 bits of wasted space, adding up to 3.9 K. They
could instead fit every table entry into 6 bytes, leaving out some of the
padding, for a 25% improvement in space used. With some cleverness it may be
possible to fit in 5 bytes or even 4 bytes by storing a low char and an
extent, instead of low char and high char. I don't expect that performance
would improve much but this could be the most efficient for space across all
the libraries, needing only about 7 K to store.

#### ucd-trie

Their data structure is a compressed trie set specifically tailored for
Unicode codepoints. The design is credited to Raph Levien in
[rust-lang/rust#33098].

[rust-lang/rust#33098]: https://github.com/rust-lang/rust/pull/33098

```rust
pub struct TrieSet {
    tree1_level1: &'static [u64; 32],
    tree2_level1: &'static [u8; 992],
    tree2_level2: &'static [u64],
    tree3_level1: &'static [u8; 256],
    tree3_level2: &'static [u8],
    tree3_level3: &'static [u64],
}
```

It represents codepoint sets using a trie to achieve prefix compression. The
final states of the trie are embedded in leaves or "chunks", where each
chunk is a 64-bit integer. Each bit position of the integer corresponds to
whether a particular codepoint is in the set or not.
These chunks are not
just a compact representation of the final states of the trie, but are also
a form of suffix compression. In particular, if multiple ranges of 64
contiguous codepoints have the same Unicode properties, then they all map to
the same chunk in the final level of the trie.

Being tailored for Unicode codepoints, this trie is partitioned into three
disjoint sets: tree1, tree2, tree3. The first set corresponds to codepoints
\[0, 0x800), the second \[0x800, 0x10000) and the third \[0x10000,
0x110000). These partitions conveniently correspond to the space of 1 or 2
byte UTF-8 encoded codepoints, 3 byte UTF-8 encoded codepoints and 4 byte
UTF-8 encoded codepoints, respectively.

Lookups in this data structure are significantly more efficient than binary
search. A lookup touches either 1, 2, or 3 cache lines based on which of the
trie partitions is being accessed.

One possible performance improvement would be for this crate to expose a way
to query based on a UTF-8 encoded string, returning the Unicode property
corresponding to the first character in the string. Without such an API, the
caller is required to tokenize their UTF-8 encoded input data into `char`,
hand the `char` into `ucd-trie`, only for `ucd-trie` to undo that work by
converting back into the variable-length representation for trie traversal.

#### fst

Uses a [finite state transducer][fst]. This representation is built into
[ucd-generate] but I am not aware of any advantage over the `ucd-trie`
representation.
In particular `ucd-trie` is optimized for storing Unicode
properties while `fst` is not.

[fst]: https://github.com/BurntSushi/fst
[ucd-generate]: https://github.com/BurntSushi/ucd-generate

As far as I can tell, the main thing that causes `fst` to have large size
and slow lookups for this use case relative to `ucd-trie` is that it does
not specialize for the fact that only 21 of the 32 bits in a `char` are