{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial: Compressing Natural Language With an Autoregressive Machine Learning Model and `constriction`\n", "\n", "- **Author:** Robert Bamler, University of Tuebingen\n", "- **Initial Publication Date:** Jan 7, 2022\n", "\n", "This is an interactive jupyter notebook.\n", "You can read this notebook [online](https://github.com/bamler-lab/constriction/blob/main/examples/python/03-tutorial-autoregressive-nlp-compression.ipynb) but if you want to execute any code, we recommend [downloading](https://github.com/bamler-lab/constriction/raw/main/examples/python/03-tutorial-autoregressive-nlp-compression.ipynb) it.\n", "\n", "More examples, tutorials, and reference materials are available at <https://bamler-lab.github.io/constriction/>.\n", "\n", "## Introduction\n", "\n", "This notebook teaches you how to losslessly compress and decompress data with `constriction` using an autoregressive machine learning model.\n", "We will use the simple character-based recurrent neural network toy model for natural language from the [Practical PyTorch Series](https://github.com/spro/practical-pytorch/blob/master/char-rnn-generation/char-rnn-generation.ipynb).\n", "You won't need any fancy hardware (like a GPU) to follow along with this tutorial since the model is so simplistic that it can be trained on a normal consumer PC in about 30 minutes.\n", "\n", "The compression method we develop in this tutorial is for demonstration purposes only and not meant as a proposal for practical text compression.\n", "While you will find that our method can compress text that is similar to the training data very well (with a 40% reduction in bitrate compared to `bzip2`), you'll also find that it generalizes poorly to other kinds of text and that compression and decompression are excruciatingly slow.\n", "The poor generalization performance is not a fundamental issue of the presented approach; it is just a result of using a very simplistic model combined with a very small training set.\n", "The runtime performance, too, would improve to *some* degree if we used a better-suited model and if we ported the implementation to a compiled language and avoided permanent data copying between Python, PyTorch, and `constriction`.\n", "However, even so, autoregressive models tend to be quite runtime-inefficient in general due to poor parallelizability.\n", "An alternative to autoregressive models for exploiting correlations in data compression is the so-called bits-back technique.\n", "An example of bits-back coding with `constriction` is provided in [this problem set](https://robamler.github.io/teaching/compress22/problem-set-07.zip) (with [solutions](https://robamler.github.io/teaching/compress21/problem-set-07-solutions.zip))."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Prepare Your Environment\n", "\n", "We'll use the PyTorch deep learning framework to train and evaluate our entropy model.\n", "Follow the [installation instructions](https://pytorch.org/get-started/locally/) (you'll save a *lot* of download time and disk space if you install the CPU-only version).\n", "Then restart your jupyter kernel and test if you can import PyTorch:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import torch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, make sure you have a recent enough version of `constriction` and a few other small Python packages:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install constriction~=0.3.5 tqdm~=4.62.3 unidecode~=1.3.2" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "**Restart your jupyter kernel once the above installation is done.**\n", "Then, let's get started." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Get Some Training Data\n", "\n", "We'll train our text model on the same [100,000 character Shakespeare sample](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) that was used in the [Practical PyTorch Series](https://github.com/spro/practical-pytorch/blob/master/char-rnn-generation/char-rnn-generation.ipynb).\n", "Different from the Practical PyTorch Series, however, we'll split the data set into a training set and a test set so that we can test our compression method on text on which the model wasn't explicitly trained.\n", "\n", "Start with downloading the full data set:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2022-01-08 21:49:29-- https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 1115394 (1,1M) [text/plain]\n", "Saving to: ‘shakespeare_all.txt’\n", "\n", "shakespeare_all.txt 100%[===================>] 1,06M 6,37MB/s in 0,2s \n", "\n", "2022-01-08 21:49:29 (6,37 MB/s) - ‘shakespeare_all.txt’ saved [1115394/1115394]\n", "\n" ] } ], "source": [ "!wget -O shakespeare_all.txt https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's split the data set into, say, 90% training data and 10% test data by randomly assigning each line of the full data set to either one of those two subsets.\n", "In a real scientific project, you should usually split into more than just two sets (e.g., add an additional \"validation\" set that you then use to tune model hyperparameters).\n", "But we'll keep it simple for this tutorial." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total number of non-empty lines in the data set: 32777\n", "File `shakespeare_train.txt` has 29524 lines (90.1%).\n", "File `shakespeare_test.txt` has 3253 lines (9.9%).\n" ] } ], "source": [ "import numpy as np\n", "\n", "target_test_set_ratio = 0.1 # means that about 10% of the lines will end up in the test set.\n", "train_count, test_count = 0, 0\n", "rng = np.random.RandomState(830472) # (always set a random seed to make results reproducible)\n", "\n", "with open(\"shakespeare_all.txt\", \"r\") as in_file, \\\n", " open(\"shakespeare_train.txt\", \"w\") as train_file, \\\n", " open(\"shakespeare_test.txt\", \"w\") as test_file:\n", " for line in in_file.readlines():\n", " if line == \"\\n\" or line == \"\":\n", " continue # Let's leave out empty lines\n", " if rng.uniform() < target_test_set_ratio:\n", " test_file.write(line)\n", " test_count += 1\n", " else:\n", " train_file.write(line)\n", " train_count += 1\n", "\n", "total_count = train_count + test_count\n", "print(f\"Total number of non-empty lines in the data set: {total_count}\")\n", "print(f\"File `shakespeare_train.txt` has {train_count} lines ({100 * train_count / total_count:.1f}%).\")\n", "print(f\"File `shakespeare_test.txt` has {test_count} lines ({100 * test_count / total_count:.1f}%).\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Train The Model\n", "\n", "We'll use the toy model described in [this tutorial](https://github.com/spro/practical-pytorch/blob/master/char-rnn-generation/char-rnn-generation.ipynb).\n", "Luckily, someone has already extracted the relevant code blocks from the tutorial and built a command line application around it, so we'll just use that.\n", "\n", "Clone the repository into a subdirectory `char-rnn.pytorch` right next to this notebook:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cloning into 'char-rnn.pytorch'...\n", "remote: Enumerating objects: 54, done.\u001b[K\n", "remote: Total 54 (delta 0), reused 0 (delta 0), pack-reused 54\u001b[K\n", "Receiving objects: 100% (54/54), 11.79 KiB | 5.89 MiB/s, done.\n", "Resolving deltas: 100% (30/30), done.\n" ] } ], "source": [ "!git clone https://github.com/spro/char-rnn.pytorch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At the time of writing, the `char-rnn.pytorch` repository seems to have a small bug that will result in an exception being raised when training the model (a pull request that fixes it [exists](https://github.com/spro/char-rnn.pytorch/pull/10)).\n", "But we can fix it easily.\n", "Open the file `char-rnn.pytorch/train.py` in a text editor, look for the line in the `train` function that reads\n", "\n", "```python\n", " return loss.data[0] / args.chunk_len\n", "```\n", "\n", "and replace \"`[0]`\" with \"`.item()`\" so that the line now reads:\n", "\n", "```python\n", " return loss.data.item() / args.chunk_len\n", "```\n", "\n", "If you can't find the original line then the bug has probably been fixed in the meantime.\n", "\n", "Now, start the training procedure.\n", "**This will take about 30 minutes** but you'll only have to do it once since the trained model will be saved to a file." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training for 2000 epochs...\n", " 5%|██ | 99/2000 [01:57<23:18, 1.36it/s][1m 58s (100 5%) 1.8286]\n", "Whird soul the dight.\n", "What cit sust that a fantelfore ain tres, saine's mess honous herward 'y borrast \n", "\n", " 10%|███▉ | 199/2000 [03:53<58:27, 1.95s/it][3m 54s (200 10%) 1.5965]\n", "What gracious. When shall am intters?\n", "GLOUCESTER:\n", "So for his wing that worrick thee\n", "To at heart, and h \n", "\n", " 15%|█████▉ | 299/2000 [05:48<21:45, 1.30it/s][5m 49s (300 15%) 1.5209]\n", "Where is these bluest?\n", "GLOUCESTER: he will of have mercight.\n", "My honours must dedther shall smerd, free \n", "\n", " 20%|███████▉ | 399/2000 [07:36<20:12, 1.32it/s][7m 37s (400 20%) 1.5081]\n", "What is grace be such receeds are counseal do limptrance,\n", "And make out the rebed you grant and in the \n", "\n", " 25%|█████████▉ | 499/2000 [09:22<20:48, 1.20it/s][9m 23s (500 25%) 1.4705]\n", "Which have subbood for a pursup,\n", "Therefore for him and true what and care, the king's stard himsure;\n", "C \n", "\n", " 30%|███████████▉ | 599/2000 [11:09<18:21, 1.27it/s][11m 10s (600 30%) 1.4360]\n", "What shall not love the me\n", "Master than the scrodned pale soon in my suitment\n", "Suchards younger'd from y \n", "\n", " 35%|█████████████▉ | 699/2000 [12:34<15:55, 1.36it/s][12m 35s (700 35%) 1.4343]\n", "Wherefore, prepare, and a supply of Very country,\n", "And thou wixed petroping, then you must not a power. \n", "\n", " 40%|███████████████▉ | 799/2000 [14:20<14:26, 1.39it/s][14m 21s (800 40%) 1.3995]\n", "What you may she hath lie.\n", "I have golden to him the handle hope and to fear.\n", "CLAUDIO:\n", "That accoudselve \n", "\n", " 45%|█████████████████▉ | 899/2000 [16:05<34:41, 1.89s/it][16m 6s (900 45%) 1.4318]\n", "What, old for your loss;\n", "And yet come was he think to the sead, you give heaven\n", "My send of blessed to \n", "\n", " 50%|███████████████████▉ | 999/2000 [17:52<28:03, 1.68s/it][17m 53s (1000 50%) 1.4020]\n", "When, being follow in this:\n", "Whilst thou duke of this new sudden?\n", "So to his son!\n", "QUEEN:\n", "I pray you, goo \n", "\n", " 55%|█████████████████████▍ | 1099/2000 [19:40<15:33, 1.04s/it][19m 41s (1100 55%) 1.4087]\n", "Where is heard of those deliver\n", "As with pithis earth is since thou know and I\n", "They from her reports at \n", "\n", " 60%|███████████████████████▍ | 1199/2000 [21:44<20:06, 1.51s/it][21m 45s (1200 60%) 1.4010]\n", "Why more honour's man,\n", "In thee you am that say Eas thou hast that\n", "Then come what shall be all the ligh \n", "\n", " 65%|█████████████████████████▎ | 1299/2000 [23:47<14:21, 1.23s/it][23m 48s (1300 65%) 1.3946]\n", "Which you had we seems,\n", "And what do I since it is in eye:\n", "If I beg I set our holy too soul,\n", "And make a \n", "\n", " 70%|███████████████████████████▎ | 1399/2000 [25:43<11:09, 1.11s/it][25m 44s (1400 70%) 1.4027]\n", "Wherefore too then all.\n", "CAMILLO:\n", "From those shall say you wicke to his vantad\n", "Hath cell your susport y \n", "\n", " 75%|█████████████████████████████▏ | 1499/2000 [27:20<05:26, 1.53it/s][27m 21s (1500 75%) 1.3744]\n", "Where so? 
Fly, I will you do hollow.\n", "LUCIO:\n", "But to know the letter's stord composition have mine\n", "The f \n", "\n", " 80%|███████████████████████████████▏       | 1599/2000 [28:27<03:44, 1.79it/s][28m 28s (1600 80%) 1.3940]\n", "What mayst true?\n", "ARIEL:\n", "GLOUCESTER:\n", "How nobles confess, what ere I will doon this dumbly?\n", "JULIET:\n", "No, \n", "\n", " 85%|█████████████████████████████████▏     | 1699/2000 [29:47<03:29, 1.44it/s][29m 48s (1700 85%) 1.3568]\n", "When I procuir than bearing crown.\n", "MONTAGUE:\n", "We are time he out of me to princely own.\n", "I am sin in who \n", "\n", " 90%|███████████████████████████████████    | 1799/2000 [30:57<01:41, 1.97it/s][30m 58s (1800 90%) 1.3615]\n", "Whwife duke unsle?\n", "Here is now here by the cause.\n", "Clown:\n", "The court of spident up thee from the wars' t \n", "\n", " 95%|█████████████████████████████████████  | 1899/2000 [31:57<00:56, 1.78it/s][31m 57s (1900 95%) 1.3728]\n", "What call the send him yours?\n", "And though they be so molongs thee in her peery\n", "Is ring proved me from m \n", "\n", "100%|██████████████████████████████████████▉| 1999/2000 [32:52<00:00, 1.88it/s][32m 52s (2000 100%) 1.3478]\n", "Which a charms for rain cannot here.\n", "Is with her hands, thro bear my gates thee,\n", "So fain your soul, fo \n", "\n", "100%|███████████████████████████████████████| 2000/2000 [32:52<00:00, 1.01it/s]\n", "Saving...\n", "Saved as shakespeare_train.pt\n" ] } ], "source": [ "!python char-rnn.pytorch/train.py shakespeare_train.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above training script should print a few lines of text after each completed 5% of training progress.\n", "The generated text snippets are random samples from the trained model.\n", "You should be able to observe that the quality of the sampled text improves as training proceeds.\n", "\n", "At the end of the training cycle, the generated text will not be perfect, but that's OK for the purpose of compression.\n", "When we use the trained model to compress some new text below, we won't blindly follow the model's predictions as in the randomly generated text here.\n", "Instead, we'll compare the model's predictions with the actual text that we want to compress, and we'll then essentially encode the difference.\n", "Thus, model predictions don't need to be perfect; as long as they're better than a completely random (uniform) guess, they will allow us to reduce the bitrate."
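, "\n", "\n", "To make this intuition concrete, here is a small back-of-the-envelope sketch (the 20% probability below is made up purely for illustration) that compares the cost of encoding a character under a model prediction with the cost of a uniform guess over the model's 100-character alphabet:\n", "\n", "```python\n", "import numpy as np\n", "\n", "prob_of_true_char = 0.2  # hypothetical probability that the model assigns to the true next character\n", "cost_model = -np.log2(prob_of_true_char)  # ~2.3 bits when encoding with this prediction\n", "cost_uniform = np.log2(100)               # ~6.6 bits when encoding with a uniform distribution\n", "print(f\"model: {cost_model:.2f} bits, uniform: {cost_uniform:.2f} bits\")\n", "```\n", "\n", "Any prediction that assigns more than 1% probability to the true next character therefore costs fewer bits than the uniform baseline for that character."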
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: While the Model is Being Trained ...\n", "\n", "While you wait for the model to be trained, let's start thinking about how we'll use the trained model for our compression method.\n", "\n", "### Overall Encoder/Decoder Design\n", "\n", "The model is an autoregressive model that processes one character after the other.\n", "Each step takes the previous character as input, updates some internal hidden state, and outputs a probability distribution over all possible values for the next character.\n", "\n", "This autoregressive model architecture pretty much dictates how our compression method must operate:\n", "\n", "- The *encoder* reads the message that we want to compress character by character and updates the autoregressive model with each step.\n", "  It uses the output of the model at each step to define an entropy model for encoding the next character.\n", "- The *decoder* decodes one character at a time and uses it to perform the exact same model updates as the encoder did.\n", "  The entropy model used for each decoding step is defined by the model output from the previous step.\n", "\n", "The very first model update on both the encoder and the decoder side is a bit subtle because we don't yet have a \"previous\" character to provide as input for this first update.\n", "We'll just make up a fake character that we pretend exists before the start of the actual message, and we'll inject this fake character at the very beginning into both the encoder and the decoder.\n", "Let's use the newline character `\"\\n\"` for this purpose since this seems like a character that could indeed naturally precede the beginning of any message.\n", "\n", "Since encoding and decoding iterate over the characters in the same order, we'll need to use an entropy coder with *queue* semantics (i.e., \"first in first out\").\n", "The `constriction` library provides two entropy coders with queue semantics: [Range Coding](https://bamler-lab.github.io/constriction/apidoc/python/stream/queue.html) and [Huffman Coding](https://bamler-lab.github.io/constriction/apidoc/python/symbol.html).\n", "We'll use Range Coding here since it has better compression performance.\n", "At the end of this notebook you'll find an empirical comparison to Huffman Coding (which will turn out to perform worse).\n", "\n", "### How to Implement Iterative Model Updates\n", "\n", "So how are we going to perform these iterative model updates described above in practice?\n", "Let's just start from how it's done in the implementation of the model (which you cloned into the subdirectory `char-rnn.pytorch` in Step 3 above) and adapt it to our needs.\n", "The file `generate.py` seems like a good place to look for inspiration since generating random text from the model is not all that different from what we're trying to do: both have to update the model character by character, continuously obtaining a probability distribution over the next character.\n", "The only difference is that `generate.py` uses the obtained probability distribution to sample from it while we'll use it to define an entropy model.\n", "\n", "The function `generate` in `char-rnn.pytorch/generate.py` shows us how to do all the steps we need.\n", "The function takes a model as argument, which is—confusingly—called `decoder`.\n", "The function body starts by initializing a hidden state as follows:\n", "\n", "```python\n", "    hidden = decoder.init_hidden(1)\n", "```\n", "\n", "After some more initializations, the 
function enters a loop that iteratively updates the model as follows:\n", "\n", "```python\n", " output, hidden = decoder(inp, hidden)\n", "```\n", "\n", "where `inp` is the \"input\", i.e., the previous character, represented (for technical reasons) as an integer PyTorch tensor of length `1`.\n", "The above model update returns `output` and the updated `hidden` state.\n", "On the next line of code, `output` gets scaled by some temperature and exponentiated before it is interpreted as an (unnormalized) probability distribution by being passed into `torch.multinomial`.\n", "Thus, `output` seems to be a tensor of logits, defining the probability distribution for the next character.\n", "Finally, after drawing a sample `top_i` from the `torch.multinomial` distribution, the function maps the sampled integer to a character as follows:\n", "\n", "```python\n", " predicted_char = all_characters[top_i]\n", "```\n", "\n", "Thus, there seems to be some fixed string `all_characters` that has the character with integer representation `i` at its `i`'th position.\n", "Let's bring this string into scope (and while we're at it, let's also import some other stuff we'll need below):" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import constriction\n", "import torch\n", "import numpy as np\n", "from tqdm import tqdm\n", "\n", "# Add subdirectory `char-rnn.pytorch` to Python's path so we can import some stuff from it.\n", "import sys\n", "import os\n", "sys.path.append(os.path.join(os.getcwd(), \"char-rnn.pytorch\"))\n", "from model import *\n", "from helpers import read_file, all_characters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try it out:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n", "\u000b\f\n" ] } ], "source": [ "print(all_characters)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Implement The Encoder\n", "\n", "We now have everything we need to know to implement the encoder and decoder.\n", "Let's start with the encoder, and define a function `compress_file` that takes the path to a file, compresses it, and writes the output to a different file.\n", "Since our compression method will turn out to be very slow (see introduction), we'll also introduce an optional argument `max_chars` that will allow the caller to compress only the first `max_chars` characters from the input file and stop after that.\n", "The function `compress_file` will also expect a `model` argument where the caller will have to pass in the trained model (this was called `decoder` in the file `generate.py` discussed above, but we'll call it `model` here to avoid confusion)." 
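, "\n", "\n", "Before we write `compress_file`, here is a minimal, self-contained sketch of the `constriction` calls we're about to use, applied to a single symbol with a made-up probability distribution (the full functions below do exactly this inside a loop, with the probabilities coming from the model):\n", "\n", "```python\n", "import numpy as np\n", "import constriction\n", "\n", "# Made-up, unnormalized probabilities over a toy 4-symbol alphabet:\n", "unnormalized_probs = np.array([4.0, 2.0, 1.0, 0.5], dtype=np.float64)\n", "entropy_model = constriction.stream.model.Categorical(unnormalized_probs)\n", "\n", "encoder = constriction.stream.queue.RangeEncoder()\n", "encoder.encode(2, entropy_model)  # encode the symbol with index 2\n", "compressed = encoder.get_compressed()\n", "\n", "decoder = constriction.stream.queue.RangeDecoder(compressed)\n", "assert decoder.decode(entropy_model) == 2  # round-trips back to the original symbol\n", "```"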
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def compress_file(model, in_filename, out_filename, max_chars=None):\n", "    message, _ = read_file(in_filename) # (`read_file` defined in `char-rnn.pytorch/helpers.py`)\n", "    if max_chars is not None:\n", "        message = message[:max_chars] # Truncate message to at most `max_chars` characters.\n", "\n", "    # Initialize the hidden state and model input as discussed above:\n", "    hidden = model.init_hidden(1) # (same as in `generate.py` discussed above)\n", "    input_char = torch.tensor([all_characters.index('\\n')], dtype=torch.int64) # \"fake\" character that we pretend precedes the message\n", "\n", "    # Instantiate an empty Range Coder onto which we'll accumulate compressed data:\n", "    encoder = constriction.stream.queue.RangeEncoder()\n", "\n", "    # Iterate over the message and encode it character by character, updating the model as we go:\n", "    for char in tqdm(message):\n", "        output, hidden = model(input_char, hidden) # update the model (as in `generate.py`)\n", "\n", "        # Turn the `output` into an entropy model and encode the character with it:\n", "        logits = output.data.view(-1)\n", "        logits = logits - logits.max() # \"Log-Sum-Exp trick\" for numerical stability\n", "        unnormalized_probs = logits.exp().numpy().astype(np.float64)\n", "        entropy_model = constriction.stream.model.Categorical(unnormalized_probs)\n", "        char_index = all_characters.index(char)\n", "        encoder.encode(char_index, entropy_model)\n", "\n", "        # Prepare for next model update:\n", "        input_char[0] = char_index\n", "\n", "    # Save the compressed data to a file\n", "    print(f\"Compressed {len(message)} characters into {encoder.num_bits()} bits ({encoder.num_bits() / len(message):.2f} bits per character).\")\n", "    compressed = encoder.get_compressed()\n", "    if sys.byteorder != \"little\":\n", "        # Let's always save data in the same byte order so compressed files can be transferred across computer architectures.\n", "        compressed.byteswap(inplace=True)\n", "    compressed.tofile(out_filename)\n", "    print(f'Wrote compressed data to file \"{out_filename}\".')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The main part that distinguishes our function `compress_file` from the function `generate` in `generate.py` discussed above is that, after each model update, we use the model to encode an (already given) character rather than to sample a character.\n", "There are two slightly subtle steps here:\n", "first, we subtract the constant `logits.max()` from all elements of `logits` before exponentiating them.\n", "Such a global shift in logit-space has no effect (apart from rounding errors) on the resulting probability distribution since it will correspond to a global scaling factor after exponentiation.\n", "We perform this operation out of an abundance of caution to prevent numerical overflow when we exponentiate `logits` on the next line.\n", "Second, we construct the `Categorical` entropy model from a tensor of *unnormalized* probabilities.\n", "That's OK according to [the documentation](https://bamler-lab.github.io/constriction/apidoc/python/stream/model.html#constriction.stream.model.Categorical) since `constriction` has to make sure the distribution is exactly normalized in fixed-point arithmetic anyway.\n", "\n", "### Implement The Decoder\n", "\n", "Let's also implement a function `decompress_file` that recovers the message from its compressed representation so that we can prove that the encoder did not discard any information.\n",
"The decoder operates very similarly to the encoder, except that it starts by loading compressed data from a file instead of the message, and it initializes a `RangeDecoder` from it, from which it then decodes one symbol at a time in the iteration.\n", "One important difference from the encoder is that, with our current autoregressive model, the decoder cannot detect the end of the message.\n", "Therefore, we have to provide the message length (`num_chars`) as an argument to the decoder function.\n", "In a real deployment, you'll probably want to either transmit the message length as an explicit initial symbol or add an \"End of File\" sentinel symbol to the alphabet (`all_characters`) and append it to the message on the encoder side to signal to the decoder that it should stop processing." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def decompress_file(model, in_filename, out_filename, num_chars):\n", "    # Load the compressed data into a `RangeDecoder`:\n", "    compressed = np.fromfile(in_filename, dtype=np.uint32)\n", "    if sys.byteorder != \"little\":\n", "        compressed.byteswap(inplace=True) # restores native byte order (\"endianness\").\n", "    print(f\"Loaded {32 * len(compressed)} bits of compressed data.\")\n", "    decoder = constriction.stream.queue.RangeDecoder(compressed)\n", "\n", "    # Initialize the hidden state and model input exactly as in the encoder:\n", "    hidden = model.init_hidden(1) # (same as in `generate.py` discussed above)\n", "    input_char = torch.tensor([all_characters.index('\\n')], dtype=torch.int64) # \"fake\" character that we pretend precedes the message\n", "\n", "    # Decode the message character by character, updating the model as we go:\n", "    with open(out_filename, \"w\") as out_file:\n", "        for _ in tqdm(range(num_chars)):\n", "            # Update model and obtain (unnormalized) probabilities, exactly as in the encoder:\n", "            output, hidden = model(input_char, hidden)\n", "            logits = output.data.view(-1)\n", "            logits = logits - logits.max()\n", "            unnormalized_probs = logits.exp().numpy().astype(np.float64)\n", "            entropy_model = constriction.stream.model.Categorical(unnormalized_probs)\n", "\n", "            # This time, use the `entropy_model` for *decoding* to obtain the next character:\n", "            char_index = decoder.decode(entropy_model)\n", "            char = all_characters[char_index]\n", "            out_file.write(char)\n", "\n", "            # Prepare for next model update, exactly as in the encoder:\n", "            input_char[0] = char_index\n", "\n", "    print(f'Wrote decompressed data to file \"{out_filename}\".')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5: Try It Out\n", "\n", "If you've followed along and taken the time to understand the encoder/decoder implementation in Step 4 above, then the model should have finished training by now.\n", "Load it into memory:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "model = torch.load(\"shakespeare_train.pt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's check whether our implementation can indeed compress and decompress a text file with this model.\n", "We'll compress the *test* subset of our data set so that we test our method on text that was not used for training (albeit, admittedly, the test data is very similar to the training data since both were written by the same author):" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [
"100%|██████████| 10000/10000 [00:08<00:00, 1205.50it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Compressed 10000 characters into 20640 bits (2.06 bits per character).\n", "Wrote compressed data to file \"shakespeare_test.txt.compressed\".\n" ] } ], "source": [ "compress_file(model, \"shakespeare_test.txt\", \"shakespeare_test.txt.compressed\", max_chars=10_000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you didn't change anything in the training schedule, then you should get a bitrate of about 2.1 bits per character.\n", "Before we compare this bitrate to that of general-purpose compression methods, let's first verify that the compression method is actually correct.\n", "Decode the compressed file again:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loaded 20640 bits of compressed data.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 10000/10000 [00:07<00:00, 1275.40it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Wrote decompressed data to file \"shakespeare_test.txt.decompressed\".\n" ] } ], "source": [ "decompress_file(model, \"shakespeare_test.txt.compressed\", \"shakespeare_test.txt.decompressed\", 10_000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a quick peek at the original and reconstructed text:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==> shakespeare_test.txt <==\n", "Before we proceed any further, hear me speak.\n", "All:\n", "First Citizen:\n", "We are accounted poor citizens, the patricians good.\n", "would yield us but the superfluity, while it were\n", "but they think we are too dear: the leanness that\n", "speak this in hunger for bread, not in thirst for revenge.\n", "report fort, but that he pays himself with being proud.\n", "First Citizen:\n", "Come, come.\n", "\n", "==> shakespeare_test.txt.decompressed <==\n", "Before we proceed any further, hear me speak.\n", "All:\n", "First Citizen:\n", "We are accounted poor citizens, the patricians good.\n", "would yield us but the superfluity, while it were\n", "but they think we are too dear: the leanness that\n", "speak this in hunger for bread, not in thirst for revenge.\n", "report fort, but that he pays himself with being proud.\n", "First Citizen:\n", "Come, come.\n" ] } ], "source": [ "!head shakespeare_test.txt shakespeare_test.txt.decompressed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The beginnings certainly look similar.\n", "But let's check more thoroughly.\n", "Remember that we only encoded and decoded the first 10,000 characters of the test data, so that's what we have to compare to (note: the test data turns out to be pure ASCII, so the first 10,000 characters map exactly to the first 10,000 bytes):\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "!head -c 10000 shakespeare_test.txt | diff - shakespeare_test.txt.decompressed # If this prints no output we're good."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 6: Evaluation\n", "\n", "There's a lot we could analyze now:\n", "\n", "- How do the bitrates of our method compare to general-purpose compression methods like `gzip`, `bzip2`, and `xz`?\n", "- How well does our method generalize to other text, ranging from other English text by a different author all the way to text in a different language?\n", "- Where do the encoder and decoder spend most of their runtime?\n", "\n", "We'll just address the first question here and leave the others to the reader.\n", "Let's compress the same first 10,000 characters of the test data with `gzip`, `bzip2`, and `xz` (if installed on your system):" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "!head -c 10000 shakespeare_test.txt | gzip --best > shakespeare_test.txt.gzip\n", "!head -c 10000 shakespeare_test.txt | bzip2 --best > shakespeare_test.txt.bz2\n", "!head -c 10000 shakespeare_test.txt | xz --best > shakespeare_test.txt.xz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then let's compare the sizes of the compressed files:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-rw-r-- 1 robamler robamler 4302 Jan 8 21:51 shakespeare_test.txt.bz2\n", "-rw-rw-r-- 1 robamler robamler 2580 Jan 8 21:50 shakespeare_test.txt.compressed\n", "-rw-rw-r-- 1 robamler robamler 10000 Jan 8 21:50 shakespeare_test.txt.decompressed\n", "-rw-rw-r-- 1 robamler robamler 4814 Jan 8 21:51 shakespeare_test.txt.gzip\n", "-rw-rw-r-- 1 robamler robamler 4788 Jan 8 21:51 shakespeare_test.txt.xz\n" ] } ], "source": [ "!ls -l shakespeare_test.txt.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Despite using a very simple model that we took from a tutorial completely unrelated to data compression, our compression method reduces the bitrate compared to `bzip2` by 40%.\n", "Of course, we shouldn't read too much into this since we trained the model on data that is very similar to the test data." 
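, "\n", "\n", "If you want to translate these file sizes into bits per character, a quick back-of-the-envelope calculation looks like this (the byte counts are taken from the `ls -l` output above; your exact numbers may differ slightly depending on how your model trained):\n", "\n", "```python\n", "num_chars = 10_000\n", "sizes_in_bytes = {\"ours\": 2580, \"bzip2\": 4302, \"gzip\": 4814, \"xz\": 4788}  # from `ls -l` above\n", "for method, size in sizes_in_bytes.items():\n", "    print(f\"{method}: {8 * size / num_chars:.2f} bits per character\")\n", "print(f\"Reduction vs. bzip2: {100 * (1 - sizes_in_bytes['ours'] / sizes_in_bytes['bzip2']):.0f}%\")\n", "```"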
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bonus 1: Getting It *Almost* Right and Yet Fatally Wrong\n", "\n", "The [API reference for `constriction`'s entropy models](https://bamler-lab.github.io/constriction/apidoc/python/stream/model.html) highlights that entropy models are brittle: even tiny discrepancies in rounding operations between encoder and decoder can have catastrophic effects for entropy coding.\n", "The models provided by `constriction` are implemented in exact fixed-point arithmetic to allow for well-defined and consistent rounding operations when, e.g., inverting the CDF.\n", "However, `constriction` can only guarantee consistent rounding operations in its internal operations.\n", "You have to ensure yourself that any probabilities you provide to `constriction` are *exactly* the same on the encoder and decoder side.\n", "\n", "The following example illustrates how even tiny discrepancies between rounding operations on the encoder and decoder side can completely derail the entropy coder.\n", "Recall that, in both functions `compress_file` and `decompress_file` above, we subtract `logits.max()` from `logits` before we exponentiate them.\n", "This can prevent numerical overflow in the exponentiation, but one might expect that it otherwise has no effect on the resulting probability distribution since it only leads to a global scaling of `unnormalized_probs`, which should drop out once `constriction` normalizes the probabilities—*except* that this is not quite correct:\n", "even if there's no numerical overflow, the different scaling affects all subsequent rounding operations that are implicitly performed by the CPU in any floating-point operation.\n", "In and of themselves, these implicit rounding operations are not a big issue and are unavoidable in floating-point calculations.\n", "However, they do become an issue when they are done inconsistently between the encoder and the decoder, as we show in the next example.\n", "\n", "Let's keep the encoder as it is, but let's slightly modify the decoder by commenting out the line that subtracts `logits.max()` from `logits`, as highlighted by the string `<--- COMMENTED OUT` in the following example:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def decompress_file_almost_right(model, in_filename, out_filename, num_chars):\n", "    # Load the compressed data into a `RangeDecoder`:\n", "    compressed = np.fromfile(in_filename, dtype=np.uint32)\n", "    if sys.byteorder != \"little\":\n", "        compressed.byteswap(inplace=True) # restores native byte order (\"endianness\").\n", "    print(f\"Loaded {32 * len(compressed)} bits of compressed data.\")\n", "    decoder = constriction.stream.queue.RangeDecoder(compressed)\n", "\n", "    # Initialize the hidden state and model input exactly as in the encoder:\n", "    hidden = model.init_hidden(1) # (same as in `generate.py` discussed above)\n", "    input_char = torch.tensor([all_characters.index('\\n')], dtype=torch.int64) # \"fake\" character that we pretend precedes the message\n", "\n", "    # Decode the message character by character, updating the model as we go:\n", "    with open(out_filename, \"w\") as out_file:\n", "        for _ in tqdm(range(num_chars)):\n", "            # Update model and obtain (unnormalized) probabilities, exactly as in the encoder:\n", "            output, hidden = model(input_char, hidden)\n", "            logits = output.data.view(-1)\n", "            # logits = logits - logits.max()   <--- COMMENTED OUT\n", "            unnormalized_probs = logits.exp().numpy().astype(np.float64)\n", "            
entropy_model = constriction.stream.model.Categorical(unnormalized_probs)\n", " \n", " # This time, use the `entropy_model` for *decoding* to obtain the next character:\n", " char_index = decoder.decode(entropy_model)\n", " char = all_characters[char_index]\n", " out_file.write(char)\n", "\n", " # Prepare for next model update, exactly as in the encoder:\n", " input_char[0] = char_index\n", "\n", " print(f'Wrote decompressed data to file \"{out_filename}\".')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we use this slightly modified decoder to decode data that was encoded with the original encoder `compress_file`, we *might* run into an issue.\n", "\n", "**Note:** The following example may or may not work on your setup, depending on the random seeds used for training the model.\n", "But if it fails, as in my setup below, then it will fail catastrophically and either throw an error (as it does here) or (if we're not quite as lucky) silently continue decoding but decode complete gibberish after some point in the stream." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loaded 20640 bits of compressed data.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " 9%|▉ | 884/10000 [00:00<00:08, 1110.52it/s]\n" ] }, { "ename": "AssertionError", "evalue": "Tried to decode from compressed data that is invalid for the employed entropy model.", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/tmp/ipykernel_124562/1285329760.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdecompress_file_almost_right\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmodel\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"shakespeare_test.txt.compressed\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"shakespeare_test.txt.decompressed_wrong\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m10_000\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/tmp/ipykernel_124562/2534290001.py\u001b[0m in \u001b[0;36mdecompress_file_almost_right\u001b[0;34m(model, in_filename, out_filename, num_chars)\u001b[0m\n\u001b[1;32m 22\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 23\u001b[0m \u001b[0;31m# This time, use the `entropy_model` for *decoding* to obtain the next character:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 24\u001b[0;31m \u001b[0mchar_index\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdecoder\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdecode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mentropy_model\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 25\u001b[0m \u001b[0mchar\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mall_characters\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mchar_index\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 26\u001b[0m \u001b[0mout_file\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwrite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mchar\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mAssertionError\u001b[0m: Tried to decode from compressed data that is invalid for the employed entropy model." 
] } ], "source": [ "decompress_file_almost_right(model, \"shakespeare_test.txt.compressed\", \"shakespeare_test.txt.decompressed_wrong\", 10_000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that we were able to decode the first 883 characters just fine (you might get a different number here).\n", "Issues due to tiny discrepancies in rounding operations are very unlikely to occur, but when they occur in an entropy coder, they are fatal.\n", "We actually got lucky here: it's also possible that the decoder does not detect any errors but that it starts decoding complete gibberish after some point (in fact, had we used an ANS coder instead of a Range Coder, then decoding would have been infallible but could still produce wrong results when misused).\n", "\n", "Due to this brittleness, entropy models have to be implemented with care, and we consider `constriction`'s implementations of entropy models an important part of the library, in addition to `constriction`'s entropy coders.\n", "Yet, as the above example shows, even the most careful implementation of an entropy model cannot protect from errors when misused." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bonus 2: Huffman Coding\n", "\n", "Our above compression method uses Range Coding for the entropy coder.\n", "The `constriction` library also provides another entropy coder with \"queue\" semantics: Huffman coding.\n", "The API for Huffman coding is somewhat different to that of Range Coding because of the very different nature of the two algorithms, but it's easy to port our encoder and decoder to Huffman coding:\n", "\n", "### Encoder" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def compress_file_huffman(model, in_filename, out_filename, max_chars=None):\n", " message, _ = read_file(in_filename) # (`read_file` defined in `char-rnn.pytorch/helpers.py`)\n", " if max_chars is not None:\n", " message = message[:max_chars] # Truncate message to at most `max_chars` characters.\n", "\n", " # Initialize the hidden state and model input as discussed above:\n", " hidden = model.init_hidden(1) # (same as in `generate.py` discussed above)\n", " input_char = torch.tensor([all_characters.index('\\n')], dtype=torch.int64) # \"fake\" character that we pretend precedes the message\n", "\n", " # Instantiate an empty `QueueEncoder` onto which we'll accumulate compressed data:\n", " encoder = constriction.symbol.QueueEncoder() # <-- CHANGED LINE\n", "\n", " # Iterate over the message and encode it character by character, updating the model as we go:\n", " for char in tqdm(message):\n", " output, hidden = model(input_char, hidden) # update the model (as in `generate.py`)\n", "\n", " # Turn the `output` into an entropy model and encode the character with it:\n", " logits = output.data.view(-1)\n", " logits = logits - logits.max() # \"Log-Sum-Exp trick\" for numerical stability\n", " unnormalized_probs = logits.exp().numpy().astype(np.float64)\n", " codebook = constriction.symbol.huffman.EncoderHuffmanTree(unnormalized_probs) # <-- CHANGED LINE\n", " char_index = all_characters.index(char)\n", " encoder.encode_symbol(char_index, codebook) # <-- CHANGED LINE\n", "\n", " # Prepare for next model update:\n", " input_char[0] = char_index\n", "\n", " # Save the compressed data to a file\n", " compressed, num_bits = encoder.get_compressed() # <-- CHANGED LINE\n", " print(f\"Compressed {len(message)} characters into {num_bits} bits ({num_bits / len(message):.2f} bits per 
character).\")\n", "    if sys.byteorder != \"little\":\n", "        # Let's always save data in the same byte order so compressed files can be transferred across computer architectures.\n", "        compressed.byteswap(inplace=True)\n", "    compressed.tofile(out_filename)\n", "    print(f'Wrote compressed data to file \"{out_filename}\".')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Decoder" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def decompress_file_huffman(model, in_filename, out_filename, num_chars):\n", "    # Load the compressed data into a `QueueDecoder`:\n", "    compressed = np.fromfile(in_filename, dtype=np.uint32)\n", "    if sys.byteorder != \"little\":\n", "        compressed.byteswap(inplace=True) # restores native byte order (\"endianness\").\n", "    print(f\"Loaded {32 * len(compressed)} bits of compressed data.\")\n", "    decoder = constriction.symbol.QueueDecoder(compressed) # <-- CHANGED LINE\n", "\n", "    # Initialize the hidden state and model input exactly as in the encoder:\n", "    hidden = model.init_hidden(1) # (same as in `generate.py` discussed above)\n", "    input_char = torch.tensor([all_characters.index('\\n')], dtype=torch.int64) # \"fake\" character that we pretend precedes the message\n", "\n", "    # Decode the message character by character, updating the model as we go:\n", "    with open(out_filename, \"w\") as out_file:\n", "        for _ in tqdm(range(num_chars)):\n", "            # Update model and obtain (unnormalized) probabilities, exactly as in the encoder:\n", "            output, hidden = model(input_char, hidden)\n", "            logits = output.data.view(-1)\n", "            logits = logits - logits.max()\n", "            unnormalized_probs = logits.exp().numpy().astype(np.float64)\n", "            codebook = constriction.symbol.huffman.DecoderHuffmanTree(unnormalized_probs) # <-- CHANGED LINE\n", "\n", "            # This time, use the `codebook` for *decoding* to obtain the next character:\n", "            char_index = decoder.decode_symbol(codebook) # <-- CHANGED LINE\n", "            char = all_characters[char_index]\n", "            out_file.write(char)\n", "\n", "            # Prepare for next model update, exactly as in the encoder:\n", "            input_char[0] = char_index\n", "\n", "    print(f'Wrote decompressed data to file \"{out_filename}\".')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Try It Out Again" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 10000/10000 [00:05<00:00, 1689.45it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Compressed 10000 characters into 23278 bits (2.33 bits per character).\n", "Wrote compressed data to file \"shakespeare_test.txt.compressed-huffman\".\n" ] } ], "source": [ "compress_file_huffman(model, \"shakespeare_test.txt\", \"shakespeare_test.txt.compressed-huffman\", max_chars=10_000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For comparison: when we used a Range Coder we got a better bitrate of only about 2.1 bits per character."
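, "\n", "\n", "This gap is expected: Huffman coding assigns every symbol a codeword with a whole number of bits, so it can waste up to almost one bit per symbol relative to the information content, whereas Range Coding effectively spreads fractional bits across symbols. The following small sketch (with a made-up three-symbol distribution, for which the optimal Huffman code assigns the codewords `0`, `10`, and `11`) illustrates the effect:\n", "\n", "```python\n", "import numpy as np\n", "\n", "probs = np.array([0.9, 0.05, 0.05])                 # made-up distribution over three symbols\n", "entropy = -np.sum(probs * np.log2(probs))           # ~0.57 bits per symbol (what a Range Coder approaches)\n", "huffman_lengths = np.array([1, 2, 2])               # codeword lengths of the optimal Huffman code\n", "expected_huffman = np.sum(probs * huffman_lengths)  # 1.1 bits per symbol\n", "print(f\"entropy: {entropy:.2f} bits/symbol, Huffman: {expected_huffman:.2f} bits/symbol\")\n", "```"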
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loaded 23296 bits of compressed data.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 10000/10000 [00:07<00:00, 1358.89it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Wrote decompressed data to file \"shakespeare_test.txt.decompressed-huffman\".\n" ] } ], "source": [ "decompress_file_huffman(model, \"shakespeare_test.txt.compressed-huffman\", \"shakespeare_test.txt.decompressed-huffman\", 10_000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Verify correctness again:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "!head -c 10000 shakespeare_test.txt | diff - shakespeare_test.txt.decompressed-huffman # If this prints no output we're good." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "We've discussed how you can use `constriction`'s entropy coders with an entropy model that is an autoregressive machine-learning model.\n", "Autoregressive models allow you to model correlations between symbols and to exploit them to improve compression performance.\n", "An alternative method for exploiting correlations in data compression is the so-called bits-back technique, which applies to latent variable models that tend to parallelize better than autoregressive models.\n", "An example of bits-back coding with `constriction` is provided in [this problem set](https://robamler.github.io/teaching/compress21/problem-set-05.zip) (with [solutions](https://robamler.github.io/teaching/compress21/problem-set-05-solutions.zip))." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0rc2 (main, Sep 12 2022, 16:20:24) [GCC 12.2.0]" }, "vscode": { "interpreter": { "hash": "ead1b95f633dc9c51826328e1846203f51a198c6fb5f2884a80417ba131d4e82" } } }, "nbformat": 4, "nbformat_minor": 2 }