# Binary Ensemble Compression (BEN) This library is built as an analogue to [PCompress](https://github.com/mggg/pcompress) and is designed to help improve storage of general ensembles of redistricting plans. More specifically, this package is designed to take canonicalized JSONL files which store ensembles using lines of the form ``` {"assignment": , "sample": } ``` and compress them into pure binary files. ## Usage You may install the binary-ensemble package from the cargo package manager using ``` cargo install binary-ensemble ``` [Here](./example/small_example.jsonl) is a link to a small example file that you can use to see what the binary-ensemble package is capable of. Then you can run the package in one of the following modes (assuming that `~/.cargo/bin` is in your path): * Encode ``` ben -m encode small_example.jsonl # Outputs small_example.jsonl.ben ``` * XEncode ``` ben -m x-encode small_example.jsonl # Outputs small_example.jsonl.xben ``` * Decode ``` ben -m decode small_example.jsonl.ben -o re_small_example.jsonl # Outputs re_small_example.jsonl ``` * XDecode ``` ben -m x-decode -p small_example.jsonl.xben # Prints decoding to the console ``` * Read ``` ben -m read -n 4 small_example.jsonl # Outputs [1,1,1,2,2,2,3,2,3,1,4,4,4,3,3,4] ``` * XZCompress ``` ben -m xz-compress small_example.jsonl # Outputs small_example.jsonl.xz ``` * XZDecompress ``` ben -m xz-decompress small_example.jsonl.xz # Outputs small_example.jsonl ``` There is also a `reben` CLI tool that is available through this package, but there is more information about that in the [Relabeling](#relabeling-for-smaller-files) section ## How it works There isn't actually a whole lot of complexity to the algorithm that we employ here; the power of the compression comes from the fact that we use some of the expected information regarding the assignment vectors to shrink the size of the compressed vector substantially. The BEN compression format is a bit-level compression algorithm (as compared to a byte-level compression seen in most compression applications) which works in two stages: 1. Run length-encoding (RLE) 2. Bit compatification The first step is pretty simple: given an assignment like ``` [1,1,1,2,2,2,2,3,1,3,3,3] ``` We can encode this vector using an ordered pair $(value,\, length)$ to get the vector ``` [(1,3), (2,4), (3,1), (1,1), (3,3)] ``` While not very efficient in the above example, in the majority of districting plans, if we order the nodes in the assignment vector according to something geographical, (e.g. GEOID), then the savings can be substantial. The BEN standard then takes this vector and compactifies it into the following series of bytes: ``` 00000010_ <- the maximum number of bits needed to store the assignment values (2) 00000011_ <- the maximum number of bits needed to store the length values (3) 00000000 00000000 00000000 00000100_ <- the number of bytes needed to store the entire vector (4) 01011_101 00_11001_0 1001_1101 1_0000000 <- the bit-packed assignment vector ``` ### Relabeling for Smaller Files Since `ben` uses RLE under the hood, anything that can be done to improve the likelihood of long runs of the same assignment in the output assignment vector will improve the compression substantially. In general, this just means that we need to sort the nodes in the JSON file in such a way that they have a high likelihood of being grouped together in the final assignment. Of course, there is no BEST way to do this, but, generally, sorting the JSON file according to geographic markers is a pretty good way to go. Consider, for example [this](./example/CO_small.json) dual-graph file for the state of Colorado. This is a block-level dual-graph file containing ~140k nodes in no particular order. If we then use this dual-graph file to generate 100k example plans and store the result in an XBEN file, we end up with something like [this](./example/100k_CO_chain.jsonl.xben). While this file is substantially smaller than the original JSONL file (which clocks in at a whopping 27GB), it is still not as small as we might like. However, we can improve the size of these files with a little bit of relabeling. Before we can see the gains that relabeling buy us, we will need to extract the XBEN file back into a BEN file so that we can work with it (WARNING: The BEN file that this will generate is ~7GB, but we will fix that soon), so we will need to run the command ``` ben -m decode 100k_CO_chain.jsonl.xben ``` The first thing that we can do is change up the labeling of our plans so that districts are labeled in the order that they appear in the assignment vector. For example, if we have the assignment vectors ``` [2,2,3,3,1,1,4,4] [2,2,3,3,4,4,1,1] ``` we, as humans can see that these are just the same assignments with some numbers switched around. However, the XBEN compressor (which uses LZMA2 for compression) does not have the context that we have, so, it is necessary for us to help it along a little bit. Here, we can make use of the `reben` (short for relabeling-BEN) CLI tool to do this for us: ``` reben -m ben 100k_CO_chain.jsonl.ben ``` This generally produces an improvement on the XBEN compression without fundamentally altering anything about the underlying data (beyond the relabeling), so it's generally recommended that to run things through `reben` before compressing into an XBEN format. In our running example, we can then compress this file back down to an XBEN format using ``` ben -m x-encode 100k_CO_chain_canonicalized_assignments.jsonl.ben ``` DON'T ACTUALLY DO THIS, IT WILL TAKE OVER AN HOUR!!! **NOTE:** Decoding is fast, but encoding with high compression does take time (maybe an hour or 2 with how big this ensemble is, but that is only because this file is on census blocks. Files with VTDs tend to only take 10 minutes or so.) However, this is not the only thing that we can do to make the file smaller. An often more effective strategy is to sort the file by using some geographical information before we run the chain. Since nearby geographical regions tend to be placed into the same districts as each other, sorting the nodes in the original JSON file according to something like GEOID tends to produce exceptionally short run-length encodings (and thus, exceptionally small BEN files). However, it is not always the case that we have the foresight to do this, so the question then becomes "can we sort the vectors after we have run the simulation already?" to which the answer is "of course!" This is where the other aspects of `reben`come into play. If we would like to produce a new mapping for our dual-graph files so that they are sorted according to some key value then we may use the command ``` reben -m ben -s -k ``` In our example, the CO_small.json file has the GEOID20 key that we would like to sort on, so we call the command ``` reben -m ben -s CO_small.json -k GEOID20 100k_CO_chain_canonicalized_assignments.jsonl.ben ``` This will produce the files * 100k_CO_chain_canonicalized_assignments_sorted_by_GEOID20.jsonl.ben (~550Mb) * CO_small_sorted_by_GEOID20_map.json (a map file containing the new data) * CO_small_sorted_by_GEOID20.json (a dual-graph file with the nodes shifted around) Notice, our BEN file has now shrunk from ~7Gb to around 0.5Gb, which is pretty good! Now, we can further compress this file using the `x-encode` mode of the `ben` CLI ``` ben -m x-encode 100k_CO_chain_canonicalized_assignments_sorted_by_GEOID20.jsonl.ben ``` And this will produce the file `100k_CO_chain_canonicalized_assignments_sorted_by_GEOID20.jsonl.xben` which will only be ~6Mb! That is over a 1000x improvement over the original BEN file, and over a 4500x improvement on the JSONL file! ### Assumptions The BEN compressor does make some assumptions about the data that the user should be aware of: * When using the `ben` CLI tool, it is assumed that the assignments in the assignment vector are stored in the same order as the nodes in some JSON dual-graph file. While this seems to be the standard, it is incumbent on the user to make sure that they know which dual-graph file / node labeling produced these assignment vectors. * When the samples are encoded into BEN formatting, the decoded samples will always start at 1. * The maximum value of any assignment value is assumed to be 65535 (the largest number that can be stored in 16 bits) and, likewise, the longest a single assignment run length is assumed to be 65535. In all practical applications, this should not cause any issues unless the user is specifically looking at ways to split Idaho into congressional districts at the census block level between the years 2010-2020 and they make the decision to sort the dual-graph file according to the congressional assignments (and if you are doing that, 1. why? and 2. maybe sort the dual-graph file in some other way that is more meaningful using `reben` first). None of the other states can cause any issues for any of the state-wide races. * The computer that is applying the BEN and XBEN encoding and decoding algorithms is assumed to have sufficient memory to store an entire assignment vector. This should not be an issue since any computer that is trying to extract information from these files is presumably also doing analysis with the ensemble of plans, but it is worth mentioning. ## More on the XBEN format The XBEN (short for eXtreme-BEN) format is the ideal format to use for large data storage and for the transferring of ensembles from user-to-user. Compared to BEN, XBEN uses an implementation of [LZMA2](https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm) to further take advantage of the repetition often present across ensembles of redistricting plans to further reduce the size of the file. This, as one might think from the name, the Lempel-Ziv Markov Algorithm (named because it uses a Markov Algorithm under the hood, not because it is good at compressing data arising from a Markov process), is particularly good at improving the compression ration of ensembles generated by Markov methods like GerryChain or Forest Recom. It also works well at generally improving the compression of BEN files. The LZMA2 algorithm is based off of the [LZ-77](https://en.wikipedia.org/wiki/LZ77_and_LZ78#LZ77) algorithm which replaces repeated occurrences of data with a single copy of that data appearing earlier in the data stream. LZMA2 then uses a Markov process to dynamically determine the best Variable-Length Encoding (VLE) that can be assigned to the most frequently occurring sequential bytes of data that appear within a data stream, and then encodes the data using these codes (for a simplified idea of what is going on here, see [Huffman coding](https://en.wikipedia.org/wiki/Huffman_coding)). Of course, in order for the LZMA2 to work, we need for the data to be encoded using bytes instead of the bit-packing method employed by BEN, so, when converting from a BEN file to an XBEN file (the bit packing tends to produce data that looks like a random series of bits which is generally incompressable), we actually unpack each assignment vector into an intermediate format known as BEN32, which is an RLE that uses 32 bits to encode each assignment vector (using a big-endian formatted u16 for the value and a big-endian formatted u16 for the length and a null u32 as the separator). The LZMA2 algorithm is then able to make use of the repetition of the bytes across the BEN32 file to substantially improve the compression. **NOTE:** The decompression from the BEN to the BEN32 format only processes one RLE assignment vector at a time, so the memory requirement for the extra compression using LZMA2 is not that large. ## When to use ben vs xben The BEN file format is designed to be a usable format for working with ensembles of plans. That is to say, it comes with some auxiliary functionality that allows for the user to read off the assignment vector for a particular sample number easily, and thus can be used to "replay" an ensemble coming from a Markov chain if desired. The XBEN file format is designed to be used for storage only. The `ben` CLI tool has been built with an emphasis on fast decompression, so any file that is stored as an XBEN file can quickly (in under 5 minutes) be extracted into a usable BEN format. Of course, the trade-off for this is that the compression itself is fairly slow, and can sometimes take several hours to finish if the data is not relabeled to improve the efficiency. However, considering any method used to create an ensemble of plans is likely to take several hours anyway, the additional compression time to get a small XBEN file should be mostly inconsequential in the grand scheme of things. ## Limitations Since the BEN format and CLI tool is designed to work with general ensembles of districting plans, it does come with some limitations. First and foremost, while BEN excels at the storage of ensembles of plans built on census blocks, the compression ratios tend to be smaller when considering ensembles of plans for things like VTDs or Tracts. In practice, since the assignment vectors for these plans do not tend to be very long (maybe 10-20k), this is not that big of an issue, but it is worth keeping that in mind. In the event that an exceptionally small file is needed for compressing a districting ensemble arising from a Markov Chain Monte Carlo method (e.g. Recom or Forest Recom) on larger subunits like tracts, the [PCompress](https://github.com/mggg/pcompress) format, which employs a byte-level delta encoding, is a good alternative choice.