Crates.io | slipstream |
lib.rs | slipstream |
version | 0.2.1 |
source | src |
created_at | 2020-06-23 19:27:02.412298 |
updated_at | 2023-02-10 20:13:22.099022 |
description | SIMD library usable by the masses |
homepage | |
repository | http://github.com/vorner/splitstream |
max_upload_size | |
id | 257256 |
size | 145,687 |
This library helps writing code in a way that incentives the compiler to optimize the results better (without really doing anything itself).
Modern compilers, including rustc
, are able to come up with impressive ways to
speed up the resulting code, using techniques like loop unrolling and
autovectorization, routinely outperforming what one would hand-craft.
Nevertheless, each optimisation has some assumptions that must be proven to hold
before it can be applied.
This library offers „vector“ types, like u16x8
, which act in a very similar
way as little fixed-sized arrays (in this case it would be [u16; 8]
), but with
arithmetics defined for them. They also enforce alignment of the whole vectors.
Therefore, one can write the algorithm in a way that works on these groups of
data and make it easier for the compiler to prove the assumptions. This can
result in multiple factor speed ups by giving the compiler these proofs „for
free“ and allowing it to apply aggressive optimizations.
The API is inspired by the packed_simd
and faster
crates, but as it
relies on the autovectorizer instead of using explicit SIMD instructions, it
works on stable rust, allows speed ups even on platforms that don't have
explicit SIMD support from the rust standard library (or no SIMD support at
all).
The downside is the optimizations are not guaranteed. While it oftentimes produces results competitive or even better than hand-crafted vectorized code, a small change to surrounding code can also lead to much worse results. You're advised to apply this to only tight loops with enough data to crunch and to measure the performance.
It goes well together with function multiversioning, see for example the
multiversion
crate.
More details can be found in the documentation, including tips for effective use and what to try if the performance isn't as good as expected.
As a very simple example, imagine that the crux of the application's performance is summing a huge array of floats and we have this code:
fn compute(d: &[f32]) -> f32 {
d.iter().sum()
}
Now, one could rewrite it to something like this, using manual vectorization:
use core::arch::x86_64 as arch;
unsafe fn compute_sse(d: &[f32]) -> f32 {
let mut result = arch::_mm_setzero_ps();
let iter = data.chunks_exact(4);
let remainder = iter.remainder().iter().sum::<f32>();
for v in iter {
result = arch::_mm_add_ps(result, arch::_mm_loadu_ps(v.as_ptr()));
}
let result: [f32; 4] = mem::transmute(result);
let result = result.iter().sum::<f32>() + remainder;
}
And while this does result in significant speedup, it's also much less readable, one has to allow using unsafe through the application logic and is not portable (it won't run on anything that's not Intel and it won't take advantage of newer and better vector instructions even there). These downside usually make it not worth pursuing for more complex algorithms.
Using slipstream
, one can also write this:
fn compute_slipstream(d: &[f32]) -> f32 {
// Will split the data into vectors of 4 lanes, padding the last one with
// the lanes from the provided parameter.
d.vectorize_pad(f32x4::default())
// Sum the vectors into a final vector
.sum::<f32x4>()
// Sum the lanes of the vectors together.
.horizontal_sum()
}
This is still longer and more complex than the original, but seems much more manageable than the manual version. It's also portable and might provide some speedup on platforms that don't have any vector instructions. Using the right annotations on the function, one is also able to generate multiple versions and dispatch the one that takes advantage of the newest and shiniest instructions the CPU supports at runtime.
Corresponding benchmarks on i5-8265U suggest that this version comes close to the manual one. Indeed, there are similar variants that are even faster.
test sum::basic ... bench: 11,707,693 ns/iter (+/- 261,428)
test sum::manual_sse_convert ... bench: 3,000,906 ns/iter (+/- 535,041)
test sum::vectorize_pad_default ... bench: 3,141,834 ns/iter (+/- 81,376)
Note: to re-run the benchmarks as above, use type V = f32x4
in
benches/utils.rs
.
Warning: Floats are not associative. The first, manual, version may produce slightly different results because of rounding errors.
It is an open source library and help in developing it is welcome. There are some areal where Your contribution would be especially appreciated:
f32x4 -> f64x4
).f32x2::splat(-1.0) * f32x2::new([1, 2])
, but it would be more comfortable
if it could be just written as -1.0 * f32x2::new([1, 2])
.vectorize_pad
method seems surprisingly slow,
ideally it would produce code with comparable speed to vectorize
.unsafe
code. This was
oftentimes written that way because of performance ‒ for example, initializing
the GenericArray
from an iterator prevented a lot of optimisations and led
to significantly inferior performance. Optimally, each such unsafe
code
would get replaced by safe code, or would get a comment explaining/proving
that it is indeed safe.If you want to work on anything bigger, it's a good idea to open an issue on the repository to both discuss it first and to reserve the task.
Licensed under either of
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.