perbase

Crates.io	perbase
lib.rs	perbase
version	1.2.0
created_at	2020-09-28 14:16:11.470257+00
updated_at	2025-12-11 17:29:47.351595+00
description	Fast and correct perbase BAM/CRAM analysis.
homepage
repository	https://github.com/sstadick/perbase
max_upload_size
id	293682
size	1,207,557

Seth (sstadick)

documentation

https://docs.rs/perbase

README

A highly parallelized utility for analyzing metrics at a per-base level.

If a metric is missing, or performance is lacking. Please file a bug/feature ticket in issues.

Why?

Why perbase when so many other tools are out there? perbase leverages Rust's concurrency system to automagically parallelize over your input regions. This leads to orders of magnitude faster runtimes that scale with the compute resources that you have available. Additionally, perbase aims to be more accurate than other tools. E.g.: perbase counts DELs toward depth, bam-readcount does not, perbase does not count REF_SKIPs toward depth, sambamba does.

Installation

conda install -c bioconda perbase
# OR
cargo install perbase
# OR
brew install perbase

You can also download a binary from the releases page.

Tools

base-depth

The base-depth tool walks over every position in the BAM/CRAM file and calculates the depth, as well as the number of each nucleotide at the given position. Additionally, it counts the numbers of Ins/Dels at each position.

Note on empty SEQ records: BAM records with empty SEQ fields (represented as * in SAM format) are handled gracefully. These reads still count toward depth, but since the actual nucleotide cannot be determined, they are counted as N.

The output columns are as follows:

Column	Description
REF	The reference sequence name
POS	The position on the reference sequence
REF_BASE	The reference base at the position, column excluded if no reference was supplied
DEPTH	The total depth at the position SUM(A, C, T, G, N, R, Y, S, W, K, M, DEL)
A	Total A nucleotides seen at this position
C	Total C nucleotides seen at this position
G	Total G nucleotides seen at this position
T	Total T nucleotides seen at this position
N	Total N nucleotides seen at this position
R	Total R nucleotides seen at this position
Y	Total Y nucleotides seen at this position
S	Total S nucleotides seen at this position
W	Total W nucleotides seen at this position
K	Total K nucleotides seen at this position
M	Total M nucleotides seen at this position
INS	Total insertions that start at the base to the right of this position
DEL	Total deletions covering this position
REF_SKIP	Total reference skip operations covering this position
FAIL	Total reads failing filters that covered this position (their bases were not counted toward depth)
COUNT_OF_MATE_RESOLUTIONS	Total number times that mate resolution needed to be done
NEAR_MAX_DEPTH	Flag to indicate if this position came within 1% of the max depth specified

perbase base-depth ./test/test.bam

Example output

REF     POS     DEPTH   A       C       G       T       N       R       Y       S       W       K       M       INS     DEL     REF_SKIP        FAIL    COUNT_OF_MATE_RESOUTIONS        NEAR_MAX_DEPTH
chr1    1       1       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       false
chr1    2       1       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       false
chr1    3       1       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       false
chr1    4       1       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       false
chr1    5       2       2       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       false
chr1    6       2       2       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       false
chr1    7       2       2       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       false
chr1    8       2       2       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       false
chr1    9       2       2       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       false

If the --mate-fix flag is passed, each position will first check if there are any mate overlaps and choose the mate with the hightest MAPQ, breaking ties by choosing the first mate that passes filters. Mates that are discarded are not counted toward FAIL or DEPTH.

If the --reference-fasta is supplied, the REF_BASE field will be filled in. The reference must be indexed and match the BAM/CRAM header of the input.

The output can be compressed and indexed as follows:

perbase base-depth -Z ./test/test.bam -o output.tsv.gz
tabix -S 1 -s 1 -b 2 -e 2 ./output.tsv.gz
# Query all positions overlapping region
tabix output.tsv.gz chr1:5-10

If the --mate-fix flag is passed, each position will first check if there are any mate overlaps and resolve based on the mate-resolution-strategy, mates that are discarded are not counted toward FAIL or DEPTH.

All strategies first check user-based read filters. If one mate fails filters, the other is chosen. If both fail, the first mate is chosen by default. For reads that are deletions / ref skips or lack a base call, all strategies fall back to the Original strategy (MAPQ → first in pair).

Strategy	Priority 1	Priority 2	Priority 3 (Tie-breaker)	Notes
BaseQualMapQualFirstInPair	Higher base quality	Higher MAPQ	First mate in pair	Standard quality-first approach
BaseQualMapQualIUPAC	Higher base quality	Higher MAPQ	IUPAC code (e.g., A+G→R)	Returns ambiguity codes for ties
BaseQualMapQualN	Higher base quality	Higher MAPQ	N (unknown base)	Conservative, marks ambiguous as N
MapQualBaseQualFirstInPair	Higher MAPQ	Higher base quality	First mate in pair	Prioritizes mapping confidence
MapQualBaseQualIUPAC	Higher MAPQ	Higher base quality	IUPAC code (e.g., A+G→R)	Mapping-first with ambiguity codes
MapQualBaseQualN	Higher MAPQ	Higher base quality	N (unknown base)	Mapping-first, conservative, marks ambiguous as N
IUPAC	—	—	IUPAC code	Always returns IUPAC code for different bases, same bases return themselves (A+A→A)
N	—	—	N or base	Returns N for different bases, same bases return themselves (A+A→A)
Original	Higher MAPQ	First mate in pair	First mate (default)	Simple MAPQ-based strategy

IUPAC Ambiguity Codes

When IUPAC strategies are used, the following codes are returned for base combinations:

Base 1	Base 2	IUPAC Code	Meaning
A	G	R	puRine (A or G)
C	T	Y	pYrimidine (C or T)
G	C	S	Strong (G or C)
A	T	W	Weak (A or T)
G	T	K	Keto (G or T)
A	C	M	aMino (A or C)
Any	Same	Original	Identical bases return themselves
Any	Other	N	Any combination not listed above

Strategy Selection Guide

Use BaseQual strategies when base quality is the most reliable indicator of accuracy
Use MapQual strategies when mapping quality is more trustworthy (e.g., repetitive regions)
Use IUPAC variants when you want to preserve ambiguity information for downstream analysis
Use N variants when you prefer conservative base calling
Use FirstInPair variants when you want deterministic results without ambiguity codes
Use IUPAC/N strategies when you don't trust quality scores and want base-only decisions
Use Original for backwards compatibility or simple MAPQ-based selection

Usage:

perbase-base-depth 0.10.3
Seth Stadick <sstadick@gmail.com>
Calculate the depth at each base, per-nucleotide

USAGE:
    perbase base-depth [FLAGS] [OPTIONS] <reads>

FLAGS:
    -Z, --bgzip                     
            Optionally bgzip the output

    -h, --help                      
            Prints help information

    -k, --keep-zeros                
            Keep positions even if they have 0 depth

    -m, --mate-fix                  
            Fix overlapping mates counts, see docs for full details

    -M, --skip-merging-intervals    
            Skip merging together regions specified in the optional BED or BCF/VCF files.
            
            **NOTE** If this is set it could result in duplicate output entries for regions that overlap. **NOTE** This
            may cause issues with downstream tooling.
    -V, --version                   
            Prints version information

    -z, --zero-base                 
            Output positions as 0-based instead of 1-based


OPTIONS:
    -B, --bcf-file <bcf-file>
            A BCF/VCF file containing positions of interest. If specified, only bases from the given positions will be
            reported on
    -b, --bed-file <bed-file>
            A BED file containing regions of interest. If specified, only bases from the given regions will be reported
            on
    -C, --channel-size-modifier <channel-size-modifier>
            The fraction of a gigabyte to allocate per thread for message passing, can be greater than 1.0 [default:
            0.15]
    -c, --chunksize <chunksize>
            The ideal number of basepairs each worker receives. Total bp in memory at one time is (threads - 2) *
            chunksize [default: 1000000]
    -L, --compression-level <compression-level>
            The level to use for compressing output (specified by --bgzip) [default: 2]

    -T, --compression-threads <compression-threads>
            The number of threads to use for compressing output (specified by --bgzip) [default: 4]

    -F, --exclude-flags <exclude-flags>                          
            SAM flags to exclude, recommended 3848 [default: 0]

    -f, --include-flags <include-flags>                          
            SAM flags to include [default: 0]

    -M, --mate-resolution-strategy <mate-resolution-strategy>
            If `mate_fix` is true, select the method to use for mate fixing [default: original]

    -D, --max-depth <max-depth>
            Set the max depth for a pileup. If a positions depth is within 1% of max-depth the `NEAR_MAX_DEPTH` output
            field will be set to true and that position should be viewed as suspect [default: 100000]
    -Q, --min-base-quality-score <min-base-quality-score>
            Minium base quality for a base to be counted toward [A, C, T, G]. If the base is less than the specified
            quality score it will instead be counted as an `N`. If nothing is set for this no cutoff will be applied
    -q, --min-mapq <min-mapq>
            Minimum MAPQ for a read to count toward depth [default: 0]

    -o, --output <output>                                        
            Output path, defaults to stdout

        --ref-cache-size <ref-cache-size>
            Number of Reference Sequences to hold in memory at one time. Smaller will decrease mem usage [default: 10]

    -r, --ref-fasta <ref-fasta>                                  
            Indexed reference fasta, set if using CRAM

    -t, --threads <threads>                                      
            The number of threads to use [default: 10]


ARGS:
    <reads>    
            Input indexed BAM/CRAM to analyze

only-depth

The only-depth tool walks over the input BAM/CRAM file and calculates the depth over all positions specified by either a BED file or in the BAM/CRAM header. Adjacent positions that have the same depth will be merged together to form a non-inclusive range (see example output).

There are two distinct modes that only-depth can run in, gated by the --fast-mode flag. When running in fast-mode, only depth over the area a read covers is only determined by the reads start and end postions, and no cigar related info is taken into account. --mate-fix may still be used in this mode, and areas where mates overlap will not be counted twice.

Without the --fast-mode flag, the depth at each position is determined in a manner similar to base-depth where DEL will count toward depth, but REF_SKIP will not. Additionally, any reads that fail the --exclude-flags will not be counted toward depth. Lastly, --mate-fix can be applied to avoid counting regions twice where mates may overlap.

Regarding mate fixes, perbase will make "fixes" based only on the counted regions in a read. For example, if you have a read that goes from "chr1:0-1000" with a CIGAR of "25M974N1M", and the mate aligns nicely at "chr1:45-70" with CIGAR "25M", the mate will count toward the depth over "chr1:45-74". This is in contrast to other tools that will reject the mate even though it overlaps a region of R1 that is not counted toward depth.

For the fastest possible output, use only-depth --fast-mode.

Note that it is possible that two adjacent positions may not merge if they fall at a --chunksize boundary. If this is an issue you can set the --chunksize to the size of the largest contig in question. At a future date this may be fixed or a post processing tool may be provided to fix it. For most use cases this should not be a problem. Additionally, you can pipe into merge-adjacent which will fix it as well. EX: perbase only-depth -m file.bam | perbase merge-adjacent > out.tsv.

Example output of perbase only-depth --mate-fix --zero-base ./test/test.bam:

REF     POS     END     DEPTH
chr2    0       4       1
chr2    4       9       2
chr2    9       12      3
chr2    12      14      2
chr2    14      17      3
chr2    17      19      4
chr2    19      23      5
chr2    23      34      4
chr2    34      39      3
chr2    39      49      1
chr2    49      54      2
chr2    54      64      3
chr2    64      74      4
chr2    74      79      3
chr2    79      84      2
chr2    84      89      1

If a BED-like output is needed, --bed-format -z flags can be set, which will write a 0-based, no-header TSV output with an empty 4th column and the depth in the 5th column.

Usage:

Calculate the only the depth at each base

USAGE:
    perbase only-depth [FLAGS] [OPTIONS] <reads>

FLAGS:
        --bed-format                
            Output BED-like output format with the depth in the 5th column. Note, `-z` can be used with this to change
            coordinates to 0-based to be more BED-like
    -Z, --bgzip                     
            Optionally bgzip the output

    -x, --fast-mode                 
            Calculate depth based only on read starts/stops, see docs for full details

    -h, --help                      
            Prints help information

    -k, --keep-zeros                
            Keep positions even if they have 0 depth

    -m, --mate-fix                  
            Fix overlapping mates counts, see docs for full details

    -n, --no-merge                  
            Skip merging adjacent bases that have the same depth

    -M, --skip-merging-intervals    
            Skip mergeing togther regions specified in the optional BED or BCF/VCF files.
            
            **NOTE** If this is set it could result in duplicate output entries for regions that overlap. **NOTE** This
            may cause issues with downstream tooling.
    -V, --version                   
            Prints version information

    -z, --zero-base                 
            Output positions as 0-based instead of 1-based


OPTIONS:
    -B, --bcf-file <bcf-file>
            A BCF/VCF file containing positions of interest. If specified, only bases from the given positions will be
            reported on. Note that it may be more efficient to calculate depth over regions if your positions are
            clustered tightly together
    -b, --bed-file <bed-file>
            A BED file containing regions of interest. If specified, only bases from the given regions will be reported
            on
    -C, --channel-size-modifier <channel-size-modifier>
            The fraction of a gigabyte to allocate per thread for message passing, can be greater than 1.0 [default:
            0.001]
    -c, --chunksize <chunksize>
            The ideal number of basepairs each worker receives. Total bp in memory at one time is (threads - 2) *
            chunksize [default: 1000000]
    -L, --compression-level <compression-level>
            The level to use for compressing output (specified by --bgzip) [default: 2]

    -T, --compression-threads <compression-threads>
            The number of threads to use for compressing output (specified by --bgzip) [default: 4]

    -F, --exclude-flags <exclude-flags>                    
            SAM flags to exclude, recommended 3848 [default: 0]

    -f, --include-flags <include-flags>                    
            SAM flags to include [default: 0]

    -q, --min-mapq <min-mapq>                              
            Minimum MAPQ for a read to count toward depth [default: 0]

    -o, --output <output>                                  
            Output path, defaults to stdout

    -r, --ref-fasta <ref-fasta>                            
            Indexed reference fasta, set if using CRAM

    -t, --threads <threads>                                
            The number of threads to use [default: 32]


ARGS:
    <reads>    
            Input indexed BAM/CRAM to analyze

merge-adjacent

merge-adjacent is a utility to merge overlapping regions in a BED-like file.

It will take a file with four columns and no header as long as the columns are like:

<contig>\t<start>\t<stop>\t<depth>\n

Or it can take files with three columns with headers that are like

<REF|chrom>\t<POS|chromStart>\t<END|chromEnd>\t<DEPTH|COV>

The END|chromEnd column is optional.

perbase-merge-adjacent 0.7.5-alpha.0
Seth Stadick <sstadick@gmail.com>
Merge adjacent intervals that have the same depth. Input must be sorted like: `sort -k1,1 -k2,2n in.bed > in.sorted.bed`

Generally accepts any file with no header tha is <chrom>\t<start>\t<stop>\t<depth>. The <stop> is optional. See
documentation for explaination of headers that are accepted.

USAGE:
    perbase merge-adjacent [FLAGS] [OPTIONS] [in-file]

FLAGS:
    -Z, --bgzip        
            Optionally bgzip the output

    -h, --help         
            Prints help information

    -n, --no-header    
            Indicate if the input file does not have a header

    -V, --version      
            Prints version information


OPTIONS:
    -T, --compression-level <compression-level>
            The level to use for compressing output (specified by --bgzip) [default: 2]

    -T, --compression-threads <compression-threads>
            The number of threads to use for compressing output (specified by --bgzip) [default: 32]

    -o, --output <output>                              
            The output location, defaults to STDOUT


ARGS:
    <in-file>    
            Input bed-like file, defaults to STDIN

EX:

perbase only-depth indexed.bam | perbase merge-adjacent > out.tsv

Similar Projects

Commit count: 163

perbase

documentation

README

Why?

Installation

Tools

base-depth

IUPAC Ambiguity Codes

Strategy Selection Guide

Usage:

only-depth

merge-adjacent

Similar Projects

cargo fmt