staged-sg-filter

Crates.io	staged-sg-filter
lib.rs	staged-sg-filter
version	0.2.3
source	src
created_at	2024-06-06 05:04:42.349783
updated_at	2024-07-22 19:07:59.292254
description	A staged programming implementation for Savitzky-Golay filters. Loops go brrr.
homepage
repository	https://github.com/miguelraz/staged-sg-filter
max_upload_size
id	1263412
size	342,647

Miguel Raz Guzmán Macedo (miguelraz)

documentation

README

Staged Filters

A Savitzky-Golar filter that is fast, baby.
All (N,M) parameters are precomputed and pulled in at compile time.
rayon support is available via a rayon feature flag
Still some SIMD perf left on the table - newer versions will focus on perf
Remember to compile this with RUSTFLAGS="-C target-cpu=native".

This code is based on another code I adapted in Julia with much help from others, see StagedFilters.jl.

Example

Benchmarks

The other savgol-rs implementation offers this speed:

use savgol_rs::*;
fn main() {
    let input = SavGolInput {
        data: &vec![10.0; 500_000],
        window_length: 3,
        poly_order: 1,
        derivative: 0,
    };
    let result = savgol_filter(&input);
    let data = result.unwrap();
    println!("{:?}", &data[0..10]);
}

takes about 52s, whereas this crate

use staged_sg_filter::sav_gol;

fn main() {
    let n = 100_000_000;
    let v = vec![10.0; n];
    let mut buf = vec![0.0; n];
    sav_gol::<1, 1>(&mut buf, &v);

    println!("{:?}", &buf[0..10]);
}

runs in about 200ms in 20x the data size. We're roughly churning through about 100_000_000/0.2 ≈ 5e8 elements per second or 5e8 * 10^-9 ≈ 0.5 elements per nanosecond.

This can still be improved by about a 4x factor, which is the current speed of the Julia code.

Notes

It's called "staged" because the computation is done in "stages", which allows the compiler to optimize the code a lot more - namely, the use of const generics in Rust provide more opportunities for profitable loop unrolling and proper SIMD lane-width usage.

You are expected to have FMA and AVX2 compatible hardware (at least). Compile with RUSTFLAGS="-C target-cpu=native" cargo run --release for best performance.

Decent efforts have been made to ensure

minimal dependencies and fast builds
auto-vectorization fires off with the help of cargo-remark
as much computation is pushed to compile time with the use of precomputed coefficients and const generics
the hot path is allocation and panic-free

Algorithm

Calculate the coefficients of interest in Julia, copy/paste them into coeffs/_f32.rs appropriately and declare them as const.
Do a fixed-size rolling window dot_product with half the elements of the dot product as the coeffs obtained previously.
Update each element of a buffer
Parallelize with Rayon

TODO

rayon support
Just calculate NxM up to 12x12 and cache that
fma support
f32/f64 float support
SIMD support
GPU support / ping Manuel Drehwald
no_std support see(Effective Rust link)
support derivatives (stretch goal - sponsor me???)

Commit count: 0