Htscodecs
=========

See the NEWS file for a list of updates and version details.

[![Build Status](https://api.cirrus-ci.com/github/jkbonfield/htscodecs.svg?branch=master)](https://cirrus-ci.com/github/jkbonfield/htscodecs)

This repository implements the custom CRAM codecs used for "EXTERNAL"
block types.  These consist of two variants of the rANS codec (8-bit
and 16-bit renormalisation, with run-length encoding and bit-packing
also supported in the latter), a dynamic arithmetic coder, and custom
codecs for name/ID compression and quality score compression derived
from fqzcomp.


They come with small command line test tools that act both as
compression exploration programs and as part of the test harness.


Building
--------

If building from git, you'll need to recreate the configure script
using autoconf.  "autoreconf -i" should work if you have the
appropriate tools.

From then on, it follows the normal "./configure; make" or
"mkdir build; cd build; ../configure; make" rule.

The library can be used as a git sub-module or as a completely
separate entity.  If you are attempting to make use of these codecs
within your own library, such as we do within Staden io_lib, it may be
useful to configure this with `--disable-shared --with-pic`.


Testing
-------

There is a "make check" rule.  If you're using a modern clang you can
also cd to the tests directory and do "make fuzz" to build some fuzz
testing targets, but you'll likely need to modify Makefile.am first as
this has some hard-coded local paths.

We also provide test data and some command line tools to demonstrate
usage of the compression codecs.  These are also in the tests
directory.  Example usage:

    ./fqzcomp_qual -s 1 < dat/q40+dir > /tmp/q40.comp
    ./fqzcomp_qual  -d < /tmp/q40.comp > /tmp/q40.uncomp
    awk '{print $1}' dat/q40+dir | md5sum;    # f91473032dd6942e72abec0868f17161
    awk '{print $1}' /tmp/q40.uncomp | md5sum;# f91473032dd6942e72abec0868f17161

The fqzcomp test format is one quality string per line, with an
optional additional parameter (0 or 1) to indicate READ1 or READ2 flag
status.

There is a larger set of test data in the htscodecs-corpus repository
(https://github.com/jkbonfield/htscodecs-corpus).  If this is cloned
into the tests subdirectory of htscodecs then the htscodecs "make
check" will also use that larger data set for testing purposes.


API
---

Many functions just take an input buffer and size and return an output
buffer, setting `*out_size` to the size of the returned data.  NULL is
returned on error.  The buffer is malloced and is expected to be freed
by the caller.  These are the `*compress` and `*uncompress` functions.
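
As a rough illustration of this calling convention, the sketch below
round-trips a buffer through a compress/uncompress pair passed in as
function pointers.  The pointer types copy the prototypes listed later
in this README.  `round_trip`, `copy_compress` and `copy_uncompress`
are hypothetical names: the two `copy_*` functions are stand-ins that
just copy the data so the sketch builds without the library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Pointer types matching the prototypes in this README, e.g.
 * rans_compress_4x16 / rans_uncompress_4x16. */
typedef unsigned char *(*compress_fn)(unsigned char *in, unsigned int in_size,
                                      unsigned int *out_size, int order);
typedef unsigned char *(*uncompress_fn)(unsigned char *in, unsigned int in_size,
                                        unsigned int *out_size);

/* HYPOTHETICAL stand-ins (not part of htscodecs): they copy the input
 * into a malloced buffer so this sketch runs without the library. */
static unsigned char *copy_compress(unsigned char *in, unsigned int in_size,
                                    unsigned int *out_size, int order) {
    (void)order;
    unsigned char *out = malloc(in_size ? in_size : 1);
    if (!out) return NULL;
    memcpy(out, in, in_size);
    *out_size = in_size;
    return out;
}

static unsigned char *copy_uncompress(unsigned char *in, unsigned int in_size,
                                      unsigned int *out_size) {
    return copy_compress(in, in_size, out_size, 0);
}

/* Compress then uncompress and compare with the input.
 * Returns 0 on a successful round trip, -1 on any failure. */
static int round_trip(compress_fn comp, uncompress_fn uncomp,
                      unsigned char *in, unsigned int in_size, int order) {
    assert(comp && uncomp && in);
    unsigned int comp_size = 0, uncomp_size = 0;

    unsigned char *c = comp(in, in_size, &comp_size, order);
    if (!c) return -1;                  /* NULL signals an error */

    unsigned char *u = uncomp(c, comp_size, &uncomp_size);
    free(c);                            /* the caller owns returned buffers */
    if (!u) return -1;

    int ok = uncomp_size == in_size && memcmp(u, in, in_size) == 0;
    free(u);
    return ok ? 0 : -1;
}
```

With the library linked in, the same helper should accept
`rans_compress_4x16` and `rans_uncompress_4x16` (or the `arith_`
equivalents) in place of the stand-ins.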

A second variant sometimes exists where the output buffer is
optionally allocated by the caller (it may be NULL in which case it
has the same operation as above).  If specified, `*out_size` must also
be set to the allocated size of `out`.  These are the `compress_to`
and `uncompress_to` functions.

The compression side sometimes needs additional options.  For the rANS
and arithmetic coder this is the "order".  Values of 0 and 1 are
simple order-0 and order-1 entropy encoders, but this is a bit field
and the more advanced codecs have additional options to pass in order
(so it should really be renamed to flags).  See below.  Fqzcomp
requires more input data - also see below.  In all cases, sufficient
information is stored in the compressed byte stream such that the
decompression will work without needing these input parameters.

Finally the various `compress_bound` functions give the size of buffer
needed to be allocated when compressing a block of data.
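
A minimal sketch of the caller-allocated variant, combining a
`compress_bound` function with a `compress_to` function.  The pointer
types copy the `rans_compress_bound_4x16` and `rans_compress_to_4x16`
prototypes listed below.  `compress_own_buffer`, `copy_bound` and
`copy_compress_to` are hypothetical names, and the `copy_*` stand-ins
only copy data so that the sketch builds without the library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned int (*bound_fn)(unsigned int size, int order);
typedef unsigned char *(*compress_to_fn)(unsigned char *in, unsigned int in_size,
                                         unsigned char *out, unsigned int *out_size,
                                         int order);

/* HYPOTHETICAL stand-ins (not part of htscodecs).  Unlike the real
 * functions, copy_compress_to does not allocate when out is NULL. */
static unsigned int copy_bound(unsigned int size, int order) {
    (void)order;
    return size + 1;
}

static unsigned char *copy_compress_to(unsigned char *in, unsigned int in_size,
                                       unsigned char *out, unsigned int *out_size,
                                       int order) {
    (void)order;
    if (!out || *out_size < in_size) return NULL;
    memcpy(out, in, in_size);
    *out_size = in_size;
    return out;
}

/* Compress into a buffer that we allocate ourselves.  Returns the
 * buffer (to be freed by the caller) with *comp_size set to the number
 * of bytes used, or NULL on failure. */
static unsigned char *compress_own_buffer(bound_fn bound, compress_to_fn comp_to,
                                          unsigned char *in, unsigned int in_size,
                                          int order, unsigned int *comp_size) {
    assert(bound && comp_to && in && comp_size);
    unsigned int alloc = bound(in_size, order);
    unsigned char *out = malloc(alloc ? alloc : 1);
    if (!out) return NULL;

    *comp_size = alloc;  /* on input: allocated size; on output: bytes used */
    if (!comp_to(in, in_size, out, comp_size, order)) {
        free(out);
        return NULL;
    }
    return out;
}
```

This assumes the codec writes into the supplied buffer whenever one is
given, as described above, so only the buffer we allocated is freed.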


### Static rANS 4x8 (introduced in CRAM v3.0)

```
#include "htscodecs/rANS_static.h"

unsigned char *rans_compress(unsigned char *in, unsigned int in_size,
                             unsigned int *out_size, int order);
unsigned char *rans_uncompress(unsigned char *in, unsigned int in_size,
                               unsigned int *out_size);
```

No (un)compress_to functions exist for this older codec.


### Static rANS 4x16 with bit-pack/RLE (CRAM v3.1)

```
#include "htscodecs/rANS_static4x16.h"

unsigned int rans_compress_bound_4x16(unsigned int size, int order);
unsigned char *rans_compress_to_4x16(unsigned char *in,  unsigned int in_size,
                                     unsigned char *out, unsigned int *out_size,
                                     int order);
unsigned char *rans_compress_4x16(unsigned char *in, unsigned int in_size,
                                  unsigned int *out_size, int order);
unsigned char *rans_uncompress_to_4x16(unsigned char *in,  unsigned int in_size,
                                       unsigned char *out, unsigned int *out_size);
unsigned char *rans_uncompress_4x16(unsigned char *in, unsigned int in_size,
                                    unsigned int *out_size);
```

### Adaptive arithmetic coding (CRAM v3.1)

```
#include "htscodecs/arith_dynamic.h"

unsigned char *arith_compress(unsigned char *in, unsigned int in_size,
                              unsigned int *out_size, int order);

unsigned char *arith_uncompress(unsigned char *in, unsigned int in_size,
                                unsigned int *out_size);

unsigned char *arith_compress_to(unsigned char *in,  unsigned int in_size,
                                 unsigned char *out, unsigned int *out_size,
                                 int order);

unsigned char *arith_uncompress_to(unsigned char *in, unsigned int in_size,
                                   unsigned char *out, unsigned int *out_sz);

unsigned int arith_compress_bound(unsigned int size, int order);
```

### Name tokeniser (CRAM v3.1)

```
#include "htscodecs/tokenise_name3.h"

uint8_t *encode_names(char *blk, int len, int level, int use_arith,
                      int *out_len, int *last_start_p);

uint8_t *decode_names(uint8_t *in, uint32_t sz, uint32_t *out_len);
```

This differs from the general purpose entropy encoders as it takes a
specific type of data.  The names should be newline or nul separated
for `encode_names`.  `decode_names` will always return nul terminated
names, so you may need to swap these to newlines if you do round-trip
tests.
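
For round-trip tests on newline separated input, the swap could look
something like the sketch below.  `nul_to_newline` and `names_match`
are hypothetical helpers, not part of the library, and the sketch
assumes the decoded length covers the final terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Replace every nul byte in buf[0..len-1] with a newline.
 * Returns the number of bytes changed. */
static size_t nul_to_newline(char *buf, size_t len) {
    assert(buf != NULL || len == 0);
    size_t swapped = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\0') {
            buf[i] = '\n';
            swapped++;
        }
    }
    return swapped;
}

/* Compare newline separated original names with decoded nul separated
 * names.  Note this modifies decoded in place.  Returns 1 on a match. */
static int names_match(const char *orig, size_t orig_len,
                       char *decoded, size_t decoded_len) {
    if (orig_len != decoded_len) return 0;
    nul_to_newline(decoded, decoded_len);
    return memcmp(orig, decoded, orig_len) == 0;
}
```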

The compression level controls how hard it tries to find the optimum
compression method per internal token column.  By default it'll use
the rANS 4x16 codec, but with non-zero `use_arith` it'll use the
adaptive arithmetic coder instead.

If non-NULL, `last_start_p` can be used to point to a partial name if
an arbitrary block of names was supplied that doesn't end on a whole
read name. (Is this useful?  Probably not.)


### FQZComp Qual (CRAM v3.1)


```
#include "htscodecs/fqzcomp_qual.h"

#define FQZ_FREVERSE 16
#define FQZ_FREAD2 128

typedef struct {
    int num_records;
    uint32_t *len;    // of size num_records
    uint32_t *flags;  // of size num_records
} fqz_slice;

char *fqz_compress(int vers, fqz_slice *s, char *in, size_t uncomp_size,
                   size_t *comp_size, int strat, fqz_gparams *gp);
char *fqz_decompress(char *in, size_t comp_size, size_t *uncomp_size,
                     int *lengths, int nlengths);
```

This is derived from the quality compression in fqzcomp.  The input
buffer is a concatenated block of quality strings, without any
separator.  In order to achieve maximum compression it needs to know
where the record boundaries are, so the per-record lengths must be
passed in via the `fqz_slice` struct.

The sum of the len fields should match the input uncomp_size field.
Note the len fields may not actually be the length of the original
sequences as some CRAM features may add additional quality values
(eg the "B" feature).
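
One way to prepare this input, sketched here with a hypothetical
`concat_quals` helper, is to concatenate the quality strings while
recording each length.  The `len` array can then be assigned to
`fqz_slice.len` and the returned total passed as `uncomp_size`, which
keeps the two consistent by construction.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Concatenate nrec nul-terminated quality strings into one malloced
 * buffer with no separators.  Fills len[i] with each string length and
 * *total with their sum.  Returns NULL on allocation failure. */
static char *concat_quals(const char *const *quals, int nrec,
                          uint32_t *len, size_t *total) {
    assert(quals && len && total);
    size_t sum = 0;
    for (int i = 0; i < nrec; i++) {
        len[i] = (uint32_t)strlen(quals[i]);
        sum += len[i];
    }

    char *buf = malloc(sum ? sum : 1);
    if (!buf) return NULL;

    size_t off = 0;
    for (int i = 0; i < nrec; i++) {
        memcpy(buf + off, quals[i], len[i]);
        off += len[i];
    }
    *total = sum;
    return buf;
}
```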

It can also be beneficial to supply per-record flags so fqzcomp can
determine whether orientation (complement strand) helps and whether
the READ1 vs READ2 quality distributions differ.  These are just
sub-fields from BAM FLAG.
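
Since the two `FQZ_` values match the BAM FLAG bits for reverse strand
(16) and second read in pair (128), a per-record flag can be derived
by masking.  `bam_flag_to_fqz` is a hypothetical helper; whether the
encoder ignores other bits is not stated here, so the masking is a
conservative assumption rather than a requirement.

```c
#include <assert.h>
#include <stdint.h>

/* These are defined in htscodecs/fqzcomp_qual.h; repeated here only so
 * that the sketch builds without the header. */
#ifndef FQZ_FREVERSE
#define FQZ_FREVERSE 16
#endif
#ifndef FQZ_FREAD2
#define FQZ_FREAD2 128
#endif

/* Keep only the strand and READ2 bits of a BAM FLAG value. */
static uint32_t bam_flag_to_fqz(uint32_t bam_flag) {
    return bam_flag & (uint32_t)(FQZ_FREVERSE | FQZ_FREAD2);
}
```

For example, BAM FLAG 147 (paired, proper pair, reverse strand, second
in pair) reduces to 144, while FLAG 99 reduces to 0.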

The fqz_gparams will normally be passed in as NULL and the encoder
will automatically select parameters.  If you wish to fine tune the
compression methods, see the fqz_params and fqz_gparams structures in
the header file.  You may also find the fqz_qual_stats() utility
function helpful for gathering statistics on your quality values.

For decompression, the lengths array is optional and may be specified
as NULL.  If passed in, it must be of size nlengths and it will be
filled out with the decoded length of each quality string.  Note that
regardless of whether lengths is NULL or not, the buffer returned will
be concatenated values, so the buffer alone gives no way to tell where
one record finishes and the next starts.  (CRAM itself knows this via
other means.)
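
If a lengths array was supplied, the record boundaries can be
recovered from it.  The sketch below uses a hypothetical
`record_offsets` helper to turn the lengths into start offsets within
the returned buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Fill offsets[i] with the start of record i in the concatenated
 * buffer.  Returns the total length, or (size_t)-1 if any length is
 * negative. */
static size_t record_offsets(const int *lengths, int nlengths,
                             size_t *offsets) {
    assert(lengths && offsets);
    size_t off = 0;
    for (int i = 0; i < nlengths; i++) {
        if (lengths[i] < 0) return (size_t)-1;
        offsets[i] = off;
        off += (size_t)lengths[i];
    }
    return off;
}
```

Record `i` then occupies `lengths[i]` bytes starting at `offsets[i]`,
and the returned total can be checked against `uncomp_size` as a
sanity test.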