Crates.io | simd_cesu8 |
lib.rs | simd_cesu8 |
version | 1.0.1 |
source | src |
created_at | 2024-07-19 07:27:00.346647 |
updated_at | 2024-07-19 08:07:59.118794 |
description | An extremely fast, SIMD accelerated, encoding and decoding library for CESU-8 and Modified UTF-8. |
homepage | https://github.com/seancroach/simd_cesu8 |
repository | https://github.com/seancroach/simd_cesu8 |
max_upload_size | |
id | 1308369 |
size | 539,423 |
simd_cesu8
An extremely fast, SIMD-accelerated, encoding and decoding library for CESU-8 and Modified UTF-8.
Add this to your Cargo.toml
file:
[dependencies.simd_cesu8]
version = "1"
features = ["nightly"]
It is extremely recommended to use the nightly feature, as it enables the fastest implementations of the library. If you don't want to use nightly, you can remove the feature and the library will still work, but will be slower.
The documentation can be found on docs.rs.
For quick access to certain functions, I highly recommend pressing the "All Items" button in the top left corner of the documentation page.
The minimum supported Rust version is 1.79.0.
These benchmarks were run with an AMD Ryzen 9 7950X3D on WSL2 with Rust
1.81.0-nightly. Each of the operations were measured with an input exactly equal
to 16,380 bytes. Data sets were randomly generated with the same seed for each
operation. Each set contained 1,000 values, that were endlessly repeated as
input. The benchmarks were run with the criterion
crate. The tables got
created via critcmp
. The source code for the benchmarks can be found in the
benches
directory. Any call to simd_cesu8
was replaced with the cesu8
equivalent when collecting the data for the cesu8
benchmarks. The data
collected can also be found in the benches
directory.
We compare simd_cesu8
to the cesu8
library, as it's the most popular
library for CESU-8 and MUTF-8 encoding and decoding in Rust. I originally
started this project, however, because there's a few minor semantic errors in
the cesu8
library that I wanted to fix myself.
To understand the benchmarks, it's first a good idea to familiarize yourself with the datasets used to benchmark the encoding operations:
ascii_non_null_strings
are strings exactly 16,380 bytes long with only ASCII
characters and no null bytes.ascii_null_alternating_strings
are strings exactly 16,380 bytes long with
only ASCII characters and null bytes alternating, where every odd byte is
non-null and every even byte is null.interspersed_strings
are strings exactly 16,380 bytes long with every 20
bytes in the following pattern:
null_strings
are strings exactly 16,380 bytes long with only null bytes.utf8_clamped_width_2_strings
are strings exactly 16,380 bytes long with only
2-byte UTF-8 characters.utf8_clamped_width_3_strings
are strings exactly 16,380 bytes long with only
3-byte UTF-8 characters.utf8_clamped_width_4_strings
are strings exactly 16,380 bytes long with only
4-byte UTF-8 characters.simd_cesu8::encode
simd_cesu8::encode
is around %2,500 faster than cesu8::to_cesu8
in optimal
conditions. In suboptimal conditions, simd_cesu8::encode
is around %50 faster
than cesu8::to_cesu8
. The main reason for the speedup, beyond SIMD, is
that simd_cesu8
doesn't quickly check for just ASCII characters, but instead,
it checks for the start of 4-byte characters. This is a significant speedup over
the traditional ASCII hot paths. Other speedups come in the form of cumulative,
tiny optimizations that add up to a significant speedup.
simd_cesu8::mutf8::encode
Similar to simd_cesu8::encode
, simd_cesu8::mutf8::encode
is around %1800
faster than cesu8::to_java_cesu8
in optimal conditions. In suboptimal
conditions, simd_cesu8::mutf8::encode
is around %50 faster than
cesu8::to_java_cesu8
. The same reason for the speedup applies here, but
instead of checking for the start of 4-byte characters, we're also checking for
any null bytes with SIMD. Other speedups come in the form of cumulative, tiny
optimizations that add up to a significant speedup.
To understand the benchmarks, it's first a good idea to familiarize yourself with the datasets used to benchmark the decoding operations:
ascii_non_null_bytes
is a set of UTF-8 strings exactly 16,380 bytes long
with only ASCII characters and no null bytes.ascii_null_alternating_bytes
is a set of UTF-8 strings exactly 16,380 bytes
long with only ASCII characters and null bytes alternating, where every odd
byte is non-null and every even byte is null.interspersed_bytes
is a set of UTF-8 strings exactly 16,380 bytes long with
every 20 bytes in the following pattern:
interspersed_cesu8_bytes
is a set of CESU-8 strings exactly 16,380 bytes
long with every 12 bytes in the following pattern:
interspersed_mutf8_bytes
is a set of MUTF-8 strings exactly 16,380 bytes
long with every 26 bytes in the following pattern:
mutf8_null_bytes
is a set of MUTF-8 strings exactly 16,380 bytes long with
only null encoded as a 2-byte sequence.null_bytes
is a set of UTF-8 strings exactly 16,380 bytes long with only
null bytes.surrogate_pairs_bytes
is a set of CESU-8 strings exactly 16,380 bytes long
with only 6-byte surrogate pairs.utf8_clamped_width_2_bytes
is a set of UTF-8 strings exactly 16,380 bytes
long with only 2-byte UTF-8 characters.utf8_clamped_width_3_bytes
is a set of UTF-8 strings exactly 16,380 bytes
long with only 3-byte UTF-8 characters.utf8_clamped_width_4_bytes
is a set of UTF-8 strings exactly 16,380 bytes
long with only 4-byte UTF-8 characters.simd_cesu8::decode
In optimal conditions, simd_cesu8::decode
can be around %1,400 faster than
cesu8::to_cesu8
. However, simd_cesu8::decode
is slower
than cesu8::to_cesu8
when decoding surrogate pairs. The main reason is
that simd_cesu8
supports lossy decoding, but the internals of that seems to
introduce a decent amount of overhead. simd_cesu8
is faster in all ways
that cesu8
when LTO is enabled. However, I didn't run the benchmarks with
LTO enabled because it seems disingenuous to do so as LTO shouldn't be a
requirement for a library to be fast.
simd_cesu8::mutf8::decode
Much like simd_cesu8::decode
, simd_cesu8::mutf8::decode
can be around %1,400
faster than cesu8::to_java_cesu8
. But, the pitfalls are equally the same, as
simd_cesu8::mutf8::decode
is slower than cesu8::to_java_cesu8
when decoding
surrogate pairs. The other catch is that we're slightly slower than cesu8
when
decoding null bytes. The same reasoning applies here, as the overhead is higher.
cesu8
library. It's a genuinely good library that
helped me years ago learn some more complex Rust concepts.simdutf8
for arguably one of the biggest improvements to
the borrow paths in this library.simdnbt
for the inspiration of using portable_simd
.Licensed under either of
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.