extsort-iter

Crates.ioextsort-iter
lib.rsextsort-iter
version0.3.1
sourcesrc
created_at2022-09-24 17:59:36.997963
updated_at2023-09-30 10:59:55.161968
descriptionexternal sorting for all types for all iterators
homepage
repositoryhttps://github.com/fegies/extsort-iter
max_upload_size
id673140
size82,125
Felix Giese (fegies)

documentation

https://github.com/fegies/extsort-iter

README

extsort-iter license Rust docs.rs Crates.io

Sort iterators of any type as long as they fit on your disk!

extsort-iter is a library that allows you to externally sort any iterator with no serialization steps required.

Conceptually, we are taking your input iterator, buffering it, sorting the runs, moving the data to disk before streaming it back in order.

Why this crate?

Because we offer a very nice high level API and perform our sort without any (de)serialization overhead.

Usage

let data = [1,42,3,41,5];

// lets pretend this is an iterator that will not fit in memory completely
let source = data.into_iter(); 
let config = ExtsortConfig::default();
let iterator = data.external_sort(config);

// do whatever you want with the sorted iterator
for item in iterator { 
    println!("{}", item);
}

// alternatively you can provide a comparison function
// (here one that sorts 42 to the beginning because it is the best number)
let iterator = data.external_sort_by(config, |a,b| if a == 42 {
    Ordering::Less
}else {
    a.cmp(b)
});

// or you can provide a key extraction function to sort by
// (here we are sorting by the number of trailing ones)
let iterator = data.external_sort_by_key(config, |a| a.trailing_ones());

// if you can spare the memory, you should increase the size of the in-memory sort buffer.
// it defaults to 10MB.
// larger buffer sizes will drastically improve your sort performance, because only the
// in-memory sort part is parallelized and the IO becomes more sequential
let config = ExtsortConfig::with_buffer_size(1_073_741_824);

If you enable the parallel_sort feature, parallel versions of all sort function will become available that sort the run buffer in parallel using rayon:

// normal sort by the ord impl
let iterator = data.par_external_sort(config);

// sort using a comparison function
let iterator = data.par_external_sort_by(config, |a,b| if a == 42 {
    Ordering::Less
}else {
    a.cmp(b)
});

// parallel sort using key extraction function
let iterator = data.par_external_sort_by_key(config, |a| a.trailing_ones());

If you enable the compress_lz4_flex feature, you can enable transparent compression of your data for IO purposes:

let config = ExtsortConfig::default().compress_lz4_flex();

// and then just sort as normal
let iterator = data.par_external_sort(config);

When not to use this crate

When your source iterator is big because each item owns large amounts of heap memory. That means the following case will result in memory exhaustion:

let data = "somestring".to_owned();
let iterator = std::iter::from_fn(|| Some(data.clone())).take(1_000_000);
let sorted  = iterator.external_sort(ExtsortConfig::default_for::<String>());

The reason for that is that we are not dropping the values from the source iterator until they are yielded by the result iterator.

You can think of it as buffering the entire input iterator, with the values themselves living on disk but all memory the values point to still living on the heap.

Unsafe Usage

This crate uses unsafe code to view a run buffer as a byteslice to copy it to disk and to read data from disk back to memory and treat is as our values again.

All unsafe code is limited to the file_run module, is fairly well documented and tested and the testsuite passes when run with miri, so we are as sure as we can reasonably be about the code being correct and sound.

Commit count: 36

cargo fmt