| Crates.io | byte-size |
| --- | --- |
| lib.rs | byte-size |
| version | 0.2.7 |
| source | src |
| created_at | 2023-02-01 16:12:16.36848 |
| updated_at | 2023-02-06 01:21:18.005085 |
| description | An effective short string shrinker with total disregard for speed, memory usage and executable size |
| homepage | |
| repository | https://github.com/ray33ee/byte-size |
| max_upload_size | |
| id | 773893 |
| size | 5,168,029 |
A short string compressor/decompressor that can store 20,000+ words in three bytes or less.
Similar to smaz, byte-size can compress small strings, something that conventional compression algorithms struggle with. However, byte-size typically compresses better than smaz, certainly for very common words: of the 10,000 most common words, fewer than 1% compressed better with smaz. byte-size also represents numbers, repeated sequences and non-alphanumeric characters more efficiently than smaz. It can encode Unicode characters, though not very efficiently: if your text includes a few Unicode characters it should still compress well, but if your strings are mostly Unicode characters, other schemes such as Unishox are a better fit.
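For a feel of the intended workflow, here is a minimal usage sketch. The `compress` and `decompress` function names are assumptions for illustration, not necessarily the crate's real API; consult the documentation for the actual entry points.

```rust
// Hypothetical usage -- `compress` and `decompress` are assumed names,
// not confirmed byte-size API; see the crate docs for the real functions.
fn main() {
    let input = "the quick brown fox jumps over the lazy dog";

    // Shrink the short string into a compact byte sequence.
    let packed: Vec<u8> = byte_size::compress(input);
    assert!(packed.len() < input.len());

    // Round-trip back to the original text.
    let unpacked: String = byte_size::decompress(&packed);
    assert_eq!(unpacked, input);
}
```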
byte-size uses several tables with over 18,000 entries in total. This incurs a large runtime memory and binary size cost, but if you have the memory available, the more effective compression is worth it.
Using the example strings taken directly from smaz:

[Insert examples]

Every example compresses to fewer bytes with byte-size than with smaz.
At its core, byte-size is just two tables: one of roughly a thousand of the most commonly used lemmas (each expressible as 2 bytes) and another of tens of thousands of lemmas (each expressible as 3 bytes).

On top of that, a handful of commonly used 2- and 3-byte sequences are expressible as just 1 byte; these serve as lemma prefixes/suffixes, or can be combined to construct words that appear in neither table.
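To make the multi-table idea concrete, here is a minimal, self-contained sketch of this style of encoding. The tables, marker bytes and code assignments below are invented for illustration and are not byte-size's actual tables or wire format:

```rust
// Toy sketch of a three-table, variable-length dictionary encoding.
// All tables and byte ranges here are invented for illustration; the
// real byte-size tables and code assignments differ.

// Common short sequences, each encoded as a single byte (its index).
const ONE_BYTE: &[&str] = &["th", "ing", " a"];
// Common lemmas, each encoded as two bytes (marker + index).
const TWO_BYTE: &[&str] = &["the", "quick", "fox"];
// Rarer lemmas, each encoded as three bytes (marker + 16-bit index).
const THREE_BYTE: &[&str] = &["quixotic", "foxglove"];

const TWO_BYTE_MARKER: u8 = 0xFE;
const THREE_BYTE_MARKER: u8 = 0xFF;

fn encode_word(word: &str, out: &mut Vec<u8>) {
    // Prefer the cheapest representation: whole-word lemma lookups first.
    if let Some(i) = TWO_BYTE.iter().position(|w| *w == word) {
        out.extend_from_slice(&[TWO_BYTE_MARKER, i as u8]);
    } else if let Some(i) = THREE_BYTE.iter().position(|w| *w == word) {
        out.push(THREE_BYTE_MARKER);
        out.extend_from_slice(&(i as u16).to_be_bytes());
    } else {
        // Fall back to stitching the word together from one-byte sequence
        // codes and literal UTF-8 bytes (a real encoder would search more
        // cleverly, and would need escape codes to keep literal bytes from
        // colliding with table codes).
        let mut rest = word;
        'outer: while !rest.is_empty() {
            for (i, seq) in ONE_BYTE.iter().enumerate() {
                if let Some(tail) = rest.strip_prefix(seq) {
                    out.push(i as u8); // single-byte code
                    rest = tail;
                    continue 'outer;
                }
            }
            let ch = rest.chars().next().unwrap();
            let mut buf = [0u8; 4];
            out.extend_from_slice(ch.encode_utf8(&mut buf).as_bytes());
            rest = &rest[ch.len_utf8()..];
        }
    }
}

fn main() {
    let mut out = Vec::new();
    encode_word("the", &mut out);   // whole-word hit in the two-byte table
    encode_word("thing", &mut out); // built from the "th" and "ing" codes
    println!("{:?}", out);          // [254, 0, 0, 1]
}
```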
There are 3 lists:

- the single-byte sequences
- the two-byte lemmas
- the three-byte lemmas

These lists are stored in the package root directory and may be modified; the encoder and decoder will pick up the changes. Each list is a file with one lemma per line, percent-encoded so that non-printable characters and Unicode sequences can be represented.
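As a sketch, loading one of these lists with the percent-encoding crate could look like this (the file name `words_2_byte.txt` is a placeholder; check the repository for the real list file names):

```rust
use percent_encoding::percent_decode_str;

/// Parse a lemma list: one percent-encoded lemma per line.
fn parse_lemma_list(contents: &str) -> Vec<String> {
    contents
        .lines()
        .map(|line| {
            percent_decode_str(line)
                .decode_utf8()
                .expect("each line must decode to valid UTF-8")
                .into_owned()
        })
        .collect()
}

fn main() -> std::io::Result<()> {
    // `words_2_byte.txt` is a placeholder name; the real list files live
    // in the package root directory under their own names.
    let contents = std::fs::read_to_string("words_2_byte.txt")?;
    let lemmas = parse_lemma_list(&contents);
    println!("loaded {} lemmas", lemmas.len());
    Ok(())
}
```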
The Snaz encoding is as follows: