fastats

version: 0.1.0
created_at: 2025-08-17 12:28:23.303866+00
updated_at: 2025-08-17 12:28:23.303866+00
description: CLI to generate FASTA file statistics (masking, GC content, etc.).
homepage: https://github.com/roland-ewald/fastats
repository: https://github.com/roland-ewald/fastats
id: 1799368
size: 59,209
Roland Ewald (roland-ewald)

fastats

CLI to generate statistics from FASTA files:

  • Generates BED files for non-masked (A|C|G|T), soft-masked (a|c|g|t), and hard-masked regions (n|N), per sequence.
  • Prints overall statistics (GC content, ratios of masked bases) to stdout and writes them to a JSON file.
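
The three masking classes can be illustrated with a small Python sketch. This is not the fastats implementation; the names `classify_base` and `count_classes` and the class labels are hypothetical, and the rules are taken only from the list above (with `n` and `N` both treated as hard-masked).

```python
from collections import Counter

NON_MASKED = set("ACGT")
SOFT_MASKED = set("acgt")
HARD_MASKED = set("nN")


def classify_base(base: str) -> str:
    """Return the masking class of a single sequence character."""
    if base in NON_MASKED:
        return "non_masked"
    if base in SOFT_MASKED:
        return "soft_masked"
    if base in HARD_MASKED:
        return "hard_masked"
    # Anything else, e.g. ambiguous IUPAC codes such as R or Y.
    return "other"


def count_classes(sequence: str) -> Counter:
    """Count how many characters of the sequence fall into each class."""
    return Counter(classify_base(base) for base in sequence)


counts = count_classes("ACGTacgtNNnnRY")
print(dict(counts))
```

Because every character lands in exactly one class, the four counts always add up to the sequence length.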

Details

CLI to generate FASTA file statistics (masking, GC content, etc.).

Usage: fastats [OPTIONS] <FASTA_FILE>

Arguments:
  <FASTA_FILE>  

Options:
  -o, --output-dir <OUTPUT_DIR>
          The output directory for the BED and summary files. [default: .]
  -q, --quiet
          Do not print results on stdout.
      --ignore-iupac
          Enable this to avoid failing when encountering a sequence character that is not in ('A', 'C', 'T', 'G', 'N', 'a', 'c', 't', 'g', 'n').
      --no-bed-output
          Do not store masking regions into BED files.
      --match-regex <SEQUENCE_MATCH_REGEX>
          Regular expression to focus the analysis on sequences matching a specific regular expression. [default: .*]
  -h, --help
          Print help
  -V, --version
          Print version

Sample output

BED files per sequence

For each sequence, fastats generates BED files that report the non-masked, soft-masked, and hard-masked regions. They use the simple three-column BED format (sequence name, 0-based start, exclusive end). Sample output:

chr9 0 10000
chr9 40529470 40529480
...
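
As a rough illustration of how such intervals arise, the following Python sketch collapses consecutive bases of the same class into BED-style intervals. It is a sketch under assumptions, not the tool's actual code: `masking_intervals`, `toy_classify`, and the sequence name `chrToy` are made up for the example.

```python
from itertools import groupby


def toy_classify(base):
    # Hypothetical classifier mirroring the three classes described above.
    if base in "ACGT":
        return "non_masked"
    if base in "acgt":
        return "soft_masked"
    return "hard_masked"  # n or N in this toy example


def masking_intervals(name, sequence, classify):
    """Yield (class, name, start, end) for each maximal run of one class.

    Coordinates follow the BED convention: 0-based start, exclusive end.
    """
    pos = 0
    for cls, run in groupby(sequence, key=classify):
        length = sum(1 for _ in run)
        yield cls, name, pos, pos + length
        pos += length


intervals = list(masking_intervals("chrToy", "NNNNACGTacNN", toy_classify))
for cls, name, start, end in intervals:
    # A real run would write each class to its own BED file.
    print(cls, name, start, end, sep="\t")
```

With half-open coordinates, the length of a region is simply `end - start`, and adjacent regions share a boundary value without overlapping.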

Summary statistics

Summary statistics are printed to stdout and written to a summary.json file in the output directory. Sample output:

[
  {
    "sequence_name": "sample_sequence",
    "non_masked_bases": 304,
    "soft_masked_bases": 36936,
    "hard_masked_bases": 0,
    "non_masked_ratio": 0.00816326530612245,
    "soft_masked_ratio": 0.9918367346938776,
    "hard_masked_ratio": 0.0,
    "gc_content": 0.4293233082706767,
    "other_iupac_bases": 0,
    "sequence_length": 37240,
    "checksum_sha256": "4b2a8b27c0f83f7d72600e33af490149d027b3e6c1e81987730a7561cde563a8"
  },
  ...
]
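
The fields of a summary entry can be cross-checked with a few lines of Python. The sketch below embeds the sample values shown above and assumes that each ratio is the corresponding base count divided by `sequence_length`, which is consistent with the sample but not stated explicitly; `check_entry` is a hypothetical helper name.

```python
import json
import math

# Values copied from the sample output above (checksum and GC content omitted).
summary_json = """
[
  {
    "sequence_name": "sample_sequence",
    "non_masked_bases": 304,
    "soft_masked_bases": 36936,
    "hard_masked_bases": 0,
    "non_masked_ratio": 0.00816326530612245,
    "soft_masked_ratio": 0.9918367346938776,
    "hard_masked_ratio": 0.0,
    "other_iupac_bases": 0,
    "sequence_length": 37240
  }
]
"""


def check_entry(entry):
    """Return True if the counts and ratios of one entry are consistent."""
    total = entry["sequence_length"]
    counted = (
        entry["non_masked_bases"]
        + entry["soft_masked_bases"]
        + entry["hard_masked_bases"]
        + entry["other_iupac_bases"]
    )
    if counted != total:
        return False
    if total == 0:
        return True  # ratios are undefined for empty sequences
    for kind in ("non_masked", "soft_masked", "hard_masked"):
        expected = entry[f"{kind}_bases"] / total
        if not math.isclose(entry[f"{kind}_ratio"], expected, abs_tol=1e-12):
            return False
    return True


entries = json.loads(summary_json)
all_ok = all(check_entry(entry) for entry in entries)
print(all_ok)
```

For a real run, the embedded string could be replaced by the contents of the summary.json file written to the output directory.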

Usage examples

Get sorted list of sequence names

fastats hg38.fasta | jq -r '.[].sequence_name' | sort

Calculate the overall sequence length

fastats hg38.fasta | jq '.[].sequence_length' | paste -sd+ | bc

Print stats for all sequences without a _ in the name

fastats hg38.fasta --match-regex "^[^_]*$"
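
Whether a pattern has to match the whole sequence name or only part of it is not stated here, so anchoring the expression with `^` and `$` is the safer choice. The following Python sketch (using Python's `re` module, not fastats itself; the helper `select` is hypothetical) shows why: under search semantics, an unanchored `[^_]*` matches every name, because it can match the empty string.

```python
import re

names = ["chr1", "chr9", "chr1_KI270706v1_random", "chrUn_GL000195v1"]


def select(pattern, names, full_match):
    """Return the names accepted by the pattern.

    full_match=True requires the pattern to cover the entire name;
    full_match=False accepts a match anywhere in the name.
    """
    regex = re.compile(pattern)
    matcher = regex.fullmatch if full_match else regex.search
    return [name for name in names if matcher(name)]


unanchored_search = select(r"[^_]*", names, full_match=False)
anchored_search = select(r"^[^_]*$", names, full_match=False)
whole_name = select(r"[^_]*", names, full_match=True)

print(unanchored_search)  # every name is accepted
print(anchored_search)    # only names without an underscore
print(whole_name)         # same result as the anchored search
```

The anchored pattern gives the intended result under both interpretations, which is why it is the more robust form to pass on the command line.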

Notes

  • The base n is counted as hard-masked, not soft-masked. Each base therefore falls into exactly one category, and the counts of non-masked, soft-masked, hard-masked, and unsupported IUPAC code bases add up to the overall sequence length.

  • Ambiguous IUPAC codes (i.e., any code except N, A, C, G, or T) are not supported. To ingest sequences containing such IUPAC codes, use --ignore-iupac.
