![Publish](https://github.com/sstadick/perbase/workflows/Publish/badge.svg)
![Rust](https://github.com/sstadick/perbase/workflows/Rust/badge.svg)
[![API docs](https://img.shields.io/badge/API-documentation-blue.svg)](https://docs.rs/perbase)
[![Crates.io](https://img.shields.io/crates/v/perbase.svg)](https://crates.io/crates/perbase)
[![Conda](https://anaconda.org/anaconda/anaconda/badges/installer/conda.svg)](https://anaconda.org/bioconda/perbase)
A highly parallelized utility for analyzing metrics at a per-base level.
If a metric is missing, or performance is lacking. Please file a bug/feature ticket in issues.
## Why?
Why `perbase` when so many other tools are out there? `perbase` leverages Rust's concurrency system to automagically parallelize over your input regions. This leads to orders of magnitude faster runtimes that scale with the compute resources that you have available. Additionally, `perbase` aims to be more accurate than other tools. E.g.: `perbase` counts DELs toward depth, `bam-readcount` does not, `perbase` does not count REF_SKIPs toward depth, `sambamba` does.
## Installation
```bash
conda install -c bioconda perbase
# OR
cargo install perbase
```
You can also download a binary from the [releases](https://github.com/sstadick/perbase/releases) page.
## Tools
### base-depth
The `base-depth` tool walks over every position in the BAM/CRAM file and calculates the depth, as well as the number of each nucleotide at the given position. Additionally, it counts the numbers of Ins/Dels at each position.
The output columns are as follows:
| Column | Description |
| -------------- | -------------------------------------------------------------------------------------------------- |
| REF | The reference sequence name |
| POS | The position on the reference sequence |
| REF_BASE | The reference base at the position, column excluded if no reference was supplied |
| DEPTH | The total depth at the position SUM(A, C, T, G, DEL) |
| A | Total A nucleotides seen at this position |
| C | Total C nucleotides seen at this position |
| G | Total G nucleotides seen at this position |
| T | Total T nucleotides seen at this position |
| N | Total N nucleotides seen at this position |
| INS | Total insertions that start at the base to the right of this position |
| DEL | Total deletions covering this position |
| REF_SKIP | Total reference skip operations covering this position |
| FAIL | Total reads failing filters that covered this position (their bases were not counted toward depth) |
| NEAR_MAX_DEPTH | Flag to indicate if this position came within 1% of the max depth specified |
```bash
perbase base-depth ./test/test.bam
```
Example output
```text
REF POS REF_BASE DEPTH A C G T N INS DEL REF_SKIP FAIL NEAR_MAX_DEPTH
chr1 709636 T 16 0 0 0 16 0 0 0 0 0 false
chr1 709637 T 16 0 4 0 12 0 0 0 0 0 false
chr1 709638 A 16 16 0 0 0 0 0 0 0 0 false
chr1 709639 G 16 0 0 16 0 0 0 0 0 0 false
chr1 709640 A 16 16 0 0 0 0 0 0 0 0 false
chr1 709641 A 16 16 0 0 0 0 0 0 0 0 false
chr1 709642 G 16 0 0 16 0 0 0 0 0 0 false
chr1 709643 G 16 0 0 16 0 0 0 0 0 0 false
chr1 709644 T 16 0 0 0 16 0 0 0 0 0 false
chr1 709645 G 16 0 0 16 0 0 0 0 0 0 false
```
If the `--mate-fix` flag is passed, each position will first check if there are any mate overlaps and choose the mate with the hightest MAPQ, breaking ties by choosing the first mate that passes filters. Mates that are discarded are not counted toward `FAIL` or `DEPTH`.
If the `--reference-fasta` is supplied, the `REF_BASE` field will be filled in. The reference must be indexed an match the BAM/CRAM header of the input.
The output can be compressed and indexed as follows:
```bash
perbase base-depth -Z ./test/test.bam -o output.tsv.gz
tabix -S 1 -s 1 -b 2 -e 2 ./output.tsv.gz
# Query all positions overlapping region
tabix output.tsv.gz chr1:5-10
```
Usage:
```text
Calculate the depth at each base, per-nucleotide
USAGE:
perbase base-depth [FLAGS] [OPTIONS]
FLAGS:
-Z, --bgzip
Optionally bgzip the output
-h, --help
Prints help information
-k, --keep-zeros
Keep positions even if they have 0 depth
-m, --mate-fix
Fix overlapping mates counts, see docs for full details
-M, --skip-merging-intervals
Skip mergeing togther regions specified in the optional BED or BCF/VCF files.
**NOTE** If this is set it could result in duplicate output entries for regions that overlap. **NOTE** This
may cause issues with downstream tooling.
-V, --version
Prints version information
-z, --zero-base
Output positions as 0-based instead of 1-based
OPTIONS:
-B, --bcf-file
A BCF/VCF file containing positions of interest. If specified, only bases from the given positions will be
reported on
-b, --bed-file
A BED file containing regions of interest. If specified, only bases from the given regions will be reported
on
-C, --channel-size-modifier
The fraction of a gigabyte to allocate per thread for message passing, can be greater than 1.0 [default:
0.15]
-c, --chunksize
The ideal number of basepairs each worker receives. Total bp in memory at one time is (threads - 2) *
chunksize [default: 1000000]
-L, --compression-level
The level to use for compressing output (specified by --bgzip) [default: 2]
-T, --compression-threads
The number of threads to use for compressing output (specified by --bgzip) [default: 4]
-F, --exclude-flags
SAM flags to exclude, recommended 3848 [default: 0]
-f, --include-flags
SAM flags to include [default: 0]
-D, --max-depth
Set the max depth for a pileup. If a positions depth is within 1% of max-depth the `NEAR_MAX_DEPTH` output
field will be set to true and that position should be viewed as suspect [default: 100000]
-Q, --min-base-quality-score
Minium base quality for a base to be counted toward [A, C, T, G]. If the base is less than the specified
quality score it will instead be counted as an `N`. If nothing is set for this no cutoff will be applied
-q, --min-mapq
Minimum MAPQ for a read to count toward depth [default: 0]
-o, --output