| Crates.io | fastats |
| lib.rs | fastats |
| version | 0.1.0 |
| created_at | 2025-08-17 12:28:23.303866+00 |
| updated_at | 2025-08-17 12:28:23.303866+00 |
| description | CLI to generate FASTA file statistics (masking, GC content, etc.). |
| homepage | https://github.com/roland-ewald/fastats |
| repository | https://github.com/roland-ewald/fastats |
| id | 1799368 |
| size | 59,209 |
CLI to generate statistics from FASTA files:
Reports non-masked (A|C|G|T), soft-masked (a|c|g|t), and hard-masked regions (n|N), per sequence.
Writes results to stdout and JSON.
CLI to generate FASTA file statistics (masking, GC content, etc.).
Usage: fastats [OPTIONS] <FASTA_FILE>
Arguments:
  <FASTA_FILE>
Options:
  -o, --output-dir <OUTPUT_DIR>
          The output directory for the BED and summary files. [default: .]
  -q, --quiet
          Do not print results on stdout.
      --ignore-iupac
          Enable this to avoid failing when encountering a sequence character that is not in ('A', 'C', 'T', 'G', 'N', 'a', 'c', 't', 'g', 'n').
      --no-bed-output
          Do not store masking regions into BED files.
      --match-regex <SEQUENCE_MATCH_REGEX>
          Regular expression to focus the analysis on sequences matching a specific regular expression. [default: .*]
  -h, --help
          Print help
  -V, --version
          Print version
For each sequence, BED files reporting the non-masked, soft-masked, and hard-masked regions are written. They use the simple three-column BED format. Sample output:
chr9 0 10000
chr9 40529470 40529480
...
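To work with these BED files downstream, the regions can be read back and their lengths summed. The following is a minimal Python sketch, not part of fastats; the helper names `parse_bed3` and `covered_bases` are hypothetical. It relies on the standard BED convention that intervals are 0-based and half-open, so each region covers `end - start` bases.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]


def parse_bed3(lines: Iterable[str]) -> List[Interval]:
    """Parse three-column BED lines into (chrom, start, end) tuples."""
    intervals = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def covered_bases(intervals: Iterable[Interval]) -> int:
    """Total number of bases covered, assuming 0-based half-open intervals."""
    return sum(end - start for _, start, end in intervals)


# The two regions from the sample output above.
sample = ["chr9 0 10000", "chr9 40529470 40529480"]
regions = parse_bed3(sample)
print(covered_bases(regions))  # 10010
```

If the regions of one masking class do not overlap, this sum should equal the corresponding `*_bases` count reported in the summary for that sequence.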
Summary statistics are printed to stdout and written to a summary.json file.
Sample output:
[
{
"sequence_name": "sample_sequence",
"non_masked_bases": 304,
"soft_masked_bases": 36936,
"hard_masked_bases": 0,
"non_masked_ratio": 0.00816326530612245,
"soft_masked_ratio": 0.9918367346938776,
"hard_masked_ratio": 0.0,
"gc_content": 0.4293233082706767,
"other_iupac_bases": 0,
"sequence_length": 37240,
"checksum_sha256": "4b2a8b27c0f83f7d72600e33af490149d027b3e6c1e81987730a7561cde563a8"
},
...
]
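The fields of a summary record can be cross-checked against each other. The Python sketch below is not part of fastats, and `check_record` is a hypothetical helper. It assumes that each `*_ratio` is the matching `*_bases` count divided by `sequence_length`, which is consistent with the sample values above but is not stated explicitly.

```python
import json

# The sample record shown above, embedded so the sketch is self-contained.
SAMPLE = """
[
  {
    "sequence_name": "sample_sequence",
    "non_masked_bases": 304,
    "soft_masked_bases": 36936,
    "hard_masked_bases": 0,
    "non_masked_ratio": 0.00816326530612245,
    "soft_masked_ratio": 0.9918367346938776,
    "hard_masked_ratio": 0.0,
    "gc_content": 0.4293233082706767,
    "other_iupac_bases": 0,
    "sequence_length": 37240,
    "checksum_sha256": "4b2a8b27c0f83f7d72600e33af490149d027b3e6c1e81987730a7561cde563a8"
  }
]
"""


def check_record(rec: dict, tol: float = 1e-9) -> bool:
    """Return True if the counts and ratios of one record are consistent."""
    length = rec["sequence_length"]
    counted = (
        rec["non_masked_bases"]
        + rec["soft_masked_bases"]
        + rec["hard_masked_bases"]
        + rec["other_iupac_bases"]
    )
    # All four base classes together must account for every base.
    if counted != length:
        return False
    if length == 0:
        return True
    # Assumption: ratio = count / sequence_length.
    for kind in ("non_masked", "soft_masked", "hard_masked"):
        expected = rec[f"{kind}_bases"] / length
        if abs(rec[f"{kind}_ratio"] - expected) > tol:
            return False
    return True


records = json.loads(SAMPLE)
for rec in records:
    print(rec["sequence_name"], check_record(rec))
```

To check real output, replace `SAMPLE` with the contents of the summary.json file in the output directory.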
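Print the names of all sequences:
fastats hg38.fasta | jq '.[].sequence_name'
Sum the lengths of all sequences:
fastats hg38.fasta | jq '.[].sequence_length' | paste -sd+ | bc
Analyze only sequences without _ in the name:
fastats hg38.fasta --match-regex "[^_]*"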
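The effect of such a filter can be previewed on a list of names before running the tool. This Python sketch is only an illustration: it assumes that the pattern has to match the entire sequence name (otherwise `[^_]*` would match every name via an empty match), and it uses Python's `re` module, whose syntax differs from the Rust regex syntax in some details. The function `select_sequences` and the example names are hypothetical.

```python
import re
from typing import Iterable, List


def select_sequences(names: Iterable[str], pattern: str) -> List[str]:
    """Keep the names that the pattern matches in full (assumed semantics)."""
    regex = re.compile(pattern)
    return [name for name in names if regex.fullmatch(name)]


# Illustrative hg38-style sequence names.
names = ["chr1", "chr9", "chr1_KI270706v1_random", "chrUn_GL000195v1"]

print(select_sequences(names, "[^_]*"))  # ['chr1', 'chr9']
print(select_sequences(names, ".*"))     # the default pattern keeps all names
```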
Note that the base n is counted as hard-masked, not soft-masked, so the sum of all non-masked, soft-masked, hard-masked, and unsupported IUPAC code bases equals the overall sequence length.
Ambiguous IUPAC codes (i.e., any code except N, A, C, G, or T) are not supported. To ingest sequences containing such IUPAC codes, use --ignore-iupac.
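The classification rules described above can be summarized in a few lines of Python. This is a sketch of the documented behaviour, not the actual Rust implementation; the function `count_bases` and its `ignore_iupac` parameter are hypothetical names that mirror the `--ignore-iupac` flag and the field names of the summary output.

```python
NON_MASKED = set("ACGT")
SOFT_MASKED = set("acgt")
HARD_MASKED = set("nN")  # lowercase n counts as hard-masked, not soft-masked


def count_bases(sequence: str, ignore_iupac: bool = False) -> dict:
    """Count bases per masking class, mirroring the summary field names."""
    counts = {
        "non_masked_bases": 0,
        "soft_masked_bases": 0,
        "hard_masked_bases": 0,
        "other_iupac_bases": 0,
    }
    for base in sequence:
        if base in NON_MASKED:
            counts["non_masked_bases"] += 1
        elif base in SOFT_MASKED:
            counts["soft_masked_bases"] += 1
        elif base in HARD_MASKED:
            counts["hard_masked_bases"] += 1
        elif ignore_iupac:
            # Any other character is only tolerated when explicitly requested.
            counts["other_iupac_bases"] += 1
        else:
            raise ValueError(f"unsupported sequence character: {base!r}")
    counts["sequence_length"] = len(sequence)
    return counts


result = count_bases("ACGTacgtnNRY", ignore_iupac=True)
print(result)
# The four classes always add up to the sequence length.
assert sum(v for k, v in result.items() if k.endswith("_bases")) == result["sequence_length"]
```

Without `ignore_iupac=True`, the same input raises an error at the first ambiguous code (`R`), which corresponds to the default failure described above.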