rscompress

A compression library in Rust with focus on scientific data.

## Disclaimer This is a rewrite and merge of several compression algorithms developed during my time as a phd student: - https://github.com/ucyo/huffman - https://github.com/ucyo/pzip-cli - https://github.com/ucyo/pzip-bwt - https://github.com/ucyo/pzip-redux - https://github.com/ucyo/pzip-huffman - https://github.com/ucyo/pzip - https://github.com/ucyo/rust-compress - https://github.com/ucyo/adaptive-lossy-compression - https://github.com/ucyo/information-spaces - https://github.com/ucyo/cframework - https://github.com/ucyo/xor-and-residual-calculation - https://github.com/ucyo/climate-data-analysis The dissertation can be downloaded from https://doi.org/10.5445/IR/1000105055 ## Architecture The library is split into one base and four supporting libraries. The base library orchestrates the supporting libraries. All compression algorithms follow the same basic structure: 1. Decorrelate data using transformations 2. Approximate the data, if lossy compression is needed 3. Code the data Additionally, check if each step executed as expected. ``` +----------------+ lossless +----------+ | | | | Start +------> | Transformation | +------------> | Coding | +------> End | | | | +----------------+ +----------+ + ^ | | | lossy | | | v | | +---------------+ | | | | | Approximation | +------------------------+ | | +---------------+ ``` This library will follow the same principles. ### Transformations Transformations are algorithms which represent the same information using a different alphabet. Good transformation algorithms eliminate redundant information in the data. A mathematical function can be seen as a transformation of a series of data. The series `1 1 2 3 5 8 13 21 ..` can expressed as `f(x) = f(x-1) + f(x-2)`. We mapped the information represented in alphabet A (integers) to an alphabet B (letters + integers) which is more compact. It is important to note that all transformations must have two properties: - Applying a transformation algorithm to data, does not loose information. - All transformation algorithms are reversible, such that the original representation can be reconstructed from the new alphabet. ### Approximations Approximations are algorithms which loose information for the sake of better compression. Given a threshold `theta` (this can be absolute or relative), the algorithm maps the data from alphabet A to B with an information lose within the expected threshold. An example for an approximation is the `~=` operator known from primary school e.g. `1/3 ~= 0.3`. Approximations have the following properties: - Applying an approximation algorithm to data, results in information loss. - Approximation algorithms are not reversible. - The information loss is guaranteed to be within the threshold `theta` ### Codings Codings are algorithms where the actual compression happens. The information is being saved on disk as compact as possible. Examples are [Huffman](https://en.wikipedia.org/wiki/Huffman_coding) or [Arithmetic](https://en.wikipedia.org/wiki/Arithmetic_coding) coding. ### Checksums Checksums are algorithms to check the integrity of the data at each step e.g. [Adler-32](https://en.wikipedia.org/wiki/Adler-32).