Crates.io | simdutf8 |
lib.rs | simdutf8 |
version | 0.1.5 |
source | src |
created_at | 2021-04-20 19:11:04.474217 |
updated_at | 2024-09-22 09:18:22.610907 |
description | SIMD-accelerated UTF-8 validation. |
homepage | https://github.com/rusticstuff/simdutf8 |
repository | https://github.com/rusticstuff/simdutf8 |
max_upload_size | |
id | 387267 |
size | 129,856 |
Blazingly fast API-compatible UTF-8 validation for Rust using SIMD extensions, based on the implementation from simdjson. Originally ported to Rust by the developers of simd-json.rs, but now heavily improved.
This library has been thoroughly tested with sample data as well as fuzzing and there are no known bugs.
basic API for the fastest validation, optimized for valid UTF-8
compat API as a fully compatible replacement for std::str::from_utf8()
Add the dependency to your Cargo.toml file:
[dependencies]
simdutf8 = "0.1.5"
Use simdutf8::basic::from_utf8() as a drop-in replacement for std::str::from_utf8().
use simdutf8::basic::from_utf8;
println!("{}", from_utf8(b"I \xE2\x9D\xA4\xEF\xB8\x8F UTF-8!").unwrap());
If you need detailed information on validation failures, use simdutf8::compat::from_utf8() instead.
use simdutf8::compat::from_utf8;
let err = from_utf8(b"I \xE2\x9D\xA4\xEF\xB8 UTF-8!").unwrap_err();
assert_eq!(err.valid_up_to(), 5);
assert_eq!(err.error_len(), Some(2));
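The two accessors in the example above are enough to build lossy decoding. The following is a minimal sketch, not code from this crate: it uses std::str::from_utf8() as a stand-in so that it runs without extra dependencies, and the function name decode_lossy is hypothetical. Because the compat flavor is described as API-compatible, importing simdutf8::compat::from_utf8 instead should work the same way.

```rust
use std::str::from_utf8;

/// Decodes `input`, replacing every invalid byte sequence with U+FFFD.
/// `decode_lossy` is a hypothetical helper, not part of simdutf8.
fn decode_lossy(mut input: &[u8]) -> String {
    let mut out = String::with_capacity(input.len());
    loop {
        match from_utf8(input) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                // Everything before `valid_up_to()` is known to be valid.
                let (valid, rest) = input.split_at(err.valid_up_to());
                out.push_str(from_utf8(valid).expect("prefix is valid UTF-8"));
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and continue after it.
                    Some(len) => input = &rest[len..],
                    // `None` means the input ended in the middle of a sequence.
                    None => return out,
                }
            }
        }
    }
}
```

For the invalid input used above, this sketch keeps "I " and the heart, emits one replacement character for the two stray bytes, and then continues with " UTF-8!".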
Use the basic API flavor for maximum speed. It is fastest on valid UTF-8, but only checks for errors after processing the whole byte sequence and does not provide detailed information if the data is not valid UTF-8. simdutf8::basic::Utf8Error is a zero-sized error struct.
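Since the basic error carries no details, callers can only branch on success or failure. The sketch below illustrates that style with a helper that is generic over the validator. The names char_count and char_count_std are hypothetical, and std::str::from_utf8() is passed in so the example runs on its own; simdutf8::basic::from_utf8 takes a byte slice and returns a Result of the same shape, so it is expected to fit the same parameter.

```rust
use std::str::from_utf8;

/// Counts the characters of `input` if `validate` accepts it.
/// The error type `E` is never inspected, which matches an error
/// struct that carries no information.
fn char_count<E>(
    validate: impl for<'a> Fn(&'a [u8]) -> Result<&'a str, E>,
    input: &[u8],
) -> Option<usize> {
    validate(input).ok().map(|text| text.chars().count())
}

/// Same helper, wired to the standard library validator.
fn char_count_std(input: &[u8]) -> Option<usize> {
    char_count(from_utf8, input)
}
```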
The compat flavor is fully API-compatible with std::str::from_utf8(). In particular, simdutf8::compat::from_utf8() returns a simdutf8::compat::Utf8Error, which has valid_up_to() and error_len() methods. The first is useful for verification of streamed data. The second is useful e.g. for replacing invalid byte sequences with a replacement character.
It also fails early: errors are checked on the fly as the string is processed and once an invalid UTF-8 sequence is encountered, it returns without processing the rest of the data. This comes at a slight performance penalty compared to the basic API even if the input is valid UTF-8.
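To make the streaming use of valid_up_to() concrete, here is a minimal sketch of a chunk-wise validator. It is an illustration under assumptions, not the crate's streaming API: the type ChunkValidator is hypothetical, and std::str::from_utf8() stands in for the compat function, which is described as having the same interface. An error_len() of None signals that the chunk ended in the middle of a multi-byte sequence, so those bytes are carried over to the next chunk.

```rust
use std::str::from_utf8;

/// Validates UTF-8 that arrives in arbitrary chunks (hypothetical helper).
struct ChunkValidator {
    /// Bytes of a multi-byte sequence cut off at the end of the last chunk.
    pending: Vec<u8>,
}

impl ChunkValidator {
    fn new() -> Self {
        ChunkValidator { pending: Vec::new() }
    }

    /// Feeds one chunk. Returns false once the stream is known to be invalid.
    fn feed(&mut self, chunk: &[u8]) -> bool {
        self.pending.extend_from_slice(chunk);
        // Discard the borrowed &str so that `pending` can be modified below.
        match from_utf8(&self.pending).map(|_| ()) {
            Ok(()) => {
                self.pending.clear();
                true
            }
            Err(err) => match err.error_len() {
                // A definite error inside the data.
                Some(_) => false,
                // Incomplete sequence at the end: keep only that tail.
                None => {
                    self.pending.drain(..err.valid_up_to());
                    true
                }
            },
        }
    }

    /// Call at end of stream: leftover bytes mean a truncated sequence.
    fn finish(self) -> bool {
        self.pending.is_empty()
    }
}
```

At most three bytes are ever carried over, so the extra copying stays small compared to the chunk itself.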
The fastest implementation is selected at runtime using the std::is_x86_feature_detected! macro, unless the CPU targeted by the compiler supports the fastest available implementation. So if you compile with RUSTFLAGS="-C target-cpu=native" on a recent x86-64 machine, the AVX 2 implementation is selected at compile-time and runtime selection is disabled.
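The general pattern described here can be sketched as follows. This is a simplified illustration, not the crate's actual dispatch code: the function select_implementation and the strings it returns are hypothetical, and a real dispatcher would call validation functions rather than return names.

```rust
/// Names the implementation a dispatcher in this style would pick.
/// Hypothetical helper for illustration only.
fn select_implementation() -> &'static str {
    // Compile-time selection: the compiler already targets AVX 2
    // (for example via -C target-cpu=native), so no runtime check is needed.
    if cfg!(all(target_arch = "x86_64", target_feature = "avx2")) {
        return "avx2 (compile time)";
    }
    // Runtime selection, only compiled on x86 and x86-64.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("avx2") {
            return "avx2 (runtime)";
        }
        if std::is_x86_feature_detected!("sse4.2") {
            return "sse4.2 (runtime)";
        }
    }
    "scalar fallback"
}
```

The result depends on the build flags and on the machine running the code, which is why no fixed output is shown.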
For no-std support (compiled with --no-default-features) the implementation is always selected at compile time based on the targeted CPU. Use RUSTFLAGS="-C target-feature=+avx2" for the AVX 2 implementation or RUSTFLAGS="-C target-feature=+sse4.2" for the SSE 4.2 implementation.
On ARM64 (aarch64), the SIMD implementation is used automatically since Rust 1.61.
For wasm32 support, the implementation is selected at compile time based on the presence of the simd128 target feature. Use RUSTFLAGS="-C target-feature=+simd128" to enable the WASM SIMD implementation. WASM, at the time of this writing, doesn't have a way to detect SIMD through WASM itself. Although this capability is available in various WASM host environments (e.g., wasm-feature-detect in the web browser), there is no portable way from within the library to detect this.
See this document for more details.
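Compile-time selection of this kind rests on Rust's cfg mechanism. The sketch below only shows how a build can tell whether simd128 was enabled; both function names are hypothetical and nothing here is taken from the crate's source.

```rust
/// True only if this build targets wasm32 with the simd128 feature enabled,
/// e.g. via RUSTFLAGS="-C target-feature=+simd128".
const fn wasm_simd_compiled_in() -> bool {
    cfg!(all(target_arch = "wasm32", target_feature = "simd128"))
}

/// Names the code path such a build would contain (hypothetical labels).
fn wasm_implementation() -> &'static str {
    if wasm_simd_compiled_in() {
        "simd128"
    } else {
        "scalar fallback"
    }
}
```

Because the decision is fixed when the module is compiled, a module built with simd128 will not load on a host without WASM SIMD support; shipping two builds and choosing between them in the host environment is one way to handle that.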
If you want to be able to call a SIMD implementation directly, use the public_imp feature flag. The validation implementations are then accessible in the simdutf8::{basic, compat}::imp hierarchy. Traits facilitating streaming validation are available there as well.
Do not use opt-level = "z", which prevents inlining and makes the code quite slow.
This crate's minimum supported Rust version is 1.38.0.
The benchmarks were run with criterion; the tables were created with critcmp. Source code and data are in the bench directory.
The naming schema is id-charset/size. 0-empty is the empty byte slice, x-error/66536 is a 64KiB slice where the very first character is invalid UTF-8. Library versions are simdutf8 v0.1.2 and simdjson v0.9.2. When comparing with simdjson, simdutf8 is compiled with #[inline(never)].
Configurations:
Simdutf8 is up to 23 times faster than the std library on valid non-ASCII, up to four times faster on pure ASCII.
Simdutf8 is up to eleven times faster than the std library on valid non-ASCII, up to four times faster on pure ASCII.
Simdutf8 is faster than simdjson on almost all inputs.
There is a small performance penalty to continuously checking the error status while processing data, but detecting errors early provides a huge benefit for the x-error/66536 benchmark.
For inputs shorter than 64 bytes validation is delegated to core::str::from_utf8() except for the direct-access functions in simdutf8::{basic, compat}::imp.
The SIMD implementation is mostly similar to the one in simdjson, except that it has additional optimizations for the pure ASCII case. It also uses prefetch with AVX 2 on x86, which leads to slightly better performance with some Intel CPUs on synthetic benchmarks.
For the compat API, we need to check the error status vector on each 64-byte block instead of just aggregating it. If an error is found, the last bytes of the previous block are checked for a cross-block continuation and then std::str::from_utf8() is run to find the exact location of the error.
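The control flow described above can be modelled in scalar code. The sketch below is only an approximation under stated assumptions: there is no SIMD and no error status vector, std::str::from_utf8() plays the role of the per-block check, and the names BLOCK and validate_blockwise are hypothetical. It does show the two ideas from the text: stopping at the first bad block, and backing up to the start of a sequence that straddles a block boundary before looking for the exact error position.

```rust
use std::str::from_utf8;

/// Block size taken from the description above.
const BLOCK: usize = 64;

/// Returns Ok(()) for valid UTF-8, otherwise Err((valid_up_to, error_len))
/// with the same meaning as the accessors on std's Utf8Error.
fn validate_blockwise(input: &[u8]) -> Result<(), (usize, Option<usize>)> {
    // Start of the region that is not validated yet; always a char boundary.
    let mut start = 0;
    while start < input.len() {
        let end = usize::min(start + BLOCK, input.len());
        match from_utf8(&input[start..end]) {
            Ok(_) => start = end,
            // The block ends inside a multi-byte sequence and more data
            // follows: restart the next block at the sequence's first byte.
            Err(e) if e.error_len().is_none() && end < input.len() => {
                start += e.valid_up_to();
            }
            // A real error: report its position relative to the whole input
            // and skip all remaining blocks.
            Err(e) => return Err((start + e.valid_up_to(), e.error_len())),
        }
    }
    Ok(())
}
```

A full 64-byte block can end with at most three bytes of an incomplete sequence, so the restart in the second arm always makes progress.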
Care is taken that all functions are properly inlined up to the public interface.
Thanks to the authors of simdjson for coming up with the high-performance SIMD implementation and in particular to Daniel Lemire for his feedback. It was very helpful.
Thanks to the authors of the simdjson Rust port who did most of the heavy lifting of porting the C++ code to Rust.
This code is dual-licensed under the Apache License 2.0 and the MIT License.
It is based on code distributed with simd-json.rs, the Rust port of simdjson, which is dual-licensed under the MIT license and Apache 2.0 license as well.
simdjson itself is distributed under the Apache License 2.0.
John Keiser, Daniel Lemire, Validating UTF-8 In Less Than One Instruction Per Byte, Software: Practice and Experience 51 (5), 2021