| Crates.io | chromsize |
| lib.rs | chromsize |
| version | 0.0.33 |
| created_at | 2024-08-15 17:25:31.285474+00 |
| updated_at | 2025-12-19 11:57:11.033336+00 |
| description | just get your chrom sizes |
| homepage | https://github.com/alejandrogzi/chromsize |
| repository | https://github.com/alejandrogzi/chromsize |
| max_upload_size | |
| id | 1339053 |
| size | 55,993 |
annoyed to have to create an index and cut it?
have to look for that old script every time?
got you. just get your chrom sizes. very fast.
but first, how is this better than any other option? yeah, just check the image below.
googled 'get chromosome sizes from fasta', grab every command/tool I found and benchmarked it. surprisingly, you can lose 14 seconds of your life just waiting for those chrom sizes to be calculated. crazy.
What's new on v.0.0.33?
- --fasta now is --sequence, change based on .fa/.fa.gz/.2bit inputs
- implementation of stdin mode!
- --sequence now defaults to -, so omitting it reads stdin; help text mentions stdin.
- SIMD newline/CR counting via bytecount
- Switched gzip inflate to flate2 with the faster zlib-ng-compat backend and enlarged output buffering
Usage: chromsize [--sequence <SEQUENCE>] --output <OUTPUT> [-t <THREADS>]
Arguments:
-s, --sequence <SEQUENCE>: Sequence file (FASTA/2bit, use '-' or omit to read stdin) [default: -]
-o, --output <OUTPUT>: path to chrom.sizes
Options:
-t, --threads <THREADS>: number of threads [default: your max ncpus]
-a, --accession-only only keep the accession id part of the header (stop after blank)
--help: print help
--version: print version
examples:
cat genome.fa | chromsize -o tmp.chromsizeszcat genome.fa.gz | chromsize -o tmp.chromsizes -s -chromsize -s genome.fa -o tmp.chromsizesinput format (FASTA/2bit) and gzip compression are auto-detected (from files or stdin) even if the extension is missing.
to install rust and use chromsize on your system follow this steps:
get installer: curl https://sh.rustup.rs -sSf | sh on unix, or go here for other options
run cargo install chromsize (make sure ~/.cargo/bin is in your $PATH before running it)
use chromsize with the required arguments
use chromsize;
fn main() {
let input = PathBuf::new("/path/to/genome.fa");
let output = PathBuf::new("/path/to/chrom.sizes");
let sizes: Vec<(String, u64)> = chromsize::get_sizes(&input).expect("failed to read sequence");
chromsize::writer(&sizes, &output).expect("failed to write chrom sizes");
}
build the port to install it as a pkg:
git clone https://github.com/alejandrogzi/chromsize.git && cd chromsize/py-chromsize
hatch shell
maturin develop --release
use it as a binary wrapper [for .fa and .2bit]:
import chromsize as cs
input = "/path/to/genome.fa"
output = "/path/to/chrom.sizes"
cs.write_chromsizes(input, output)
or just get them directly
import chromsize as cs
input = "/path/to/genome.fa"
sizes = cs.get_chromsizes(input)
>>> print(sizes)
[
('chr1', 123),
('chr2', 456),
...
]
to build chromsize from this repo, do:
git clone https://github.com/alejandrogzi/chromsize.git && cd chromsizecargo run --release -- -s <SEQUENCE> -o <OUTPUT>to build the development container image:
git clone https://github.com/alejandrogzi/chromsize.git && cd chromsizestart docker or systemctl start dockerdocker image build --tag chromsize .docker run --rm -v "[dir_where_your_fa_is]:/dir" chromsize -s /dir/<SEQUENCE> -o /dir/<OUTPUT>to use chromsize through Conda just:
conda install chromsize -c bioconda or conda create -n chromsize -c bioconda chromsizeNextflow (not available yet)
do not believe me? run the benchmark on your own:
ASSEMBLIES const with the .fa you've downloadcargo run release --bin chromsize-benchmark -- -d /dir/where/my/fastas/are -a show-output ignore-failurehere is all the info and metadata from my experiment:
| Tool | Command | Reference | Discussion |
|---|---|---|---|
| seqkit | seqkit fx2tab --length --name --header-line {assembly} > chrom.sizes |
1 | 2 |
| chromsize | target/release/chromsize -s {assembly} -o chrom.sizes |
3 | |
| pyfaidx | faidx {assembly} -i chromsizes > chrom.sizes |
4 | 5 |
| samtools | samtools faidx {assembly} && wait | cut -f1,2 {assembly}.fai > chrom.sizes |
6 | 5 |
| faSize | faSize -detailed -tab {assembly} > chrom.sizes |
7 | |
| awk1 | awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' {assembly} > chrom.sizes |
8 | 9 |
| awk2 | awk '/^>/{if (l!=") print l; print; l=0; next}{l+=length($0)}END{print l}' {assembly} > chrom.sizes |
8 | 9 |
| bioawk1 | bioawk -c fastx '{print > $name ORS length($seq)}' {assembly} > chrom.sizes |
10 | 9 |
| awk3 | cat {assembly} | awk '$0 ~ > {if (NR > 1) {print c;} c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' > chrom.sizes |
8 | 11 |
| bioawk2 | bioawk -c fastx '{ print $name, length($seq) }' < {assembly} > chrom.sizes |
10 | 2 |
| Species | Assembly | Size (Gb) | chromsize | seqKit | awk1 | awk2 | awk3 | bioawk1 | bioawk2 | faSize | pyfaidx | samtools |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S. cerevisiae | R64 | 0.01 | 0.004 | 0.016 (X 4.0) | 0.043 (X 10.7) | 0.043 (X 10.7) | 0.05 (X 12.5) | 0.03 (X 7.5) | 0.03 (X 7.5) | 0.054 (X 13.5) | 0.101 (X 25.2) | 0.064 (X 16.0) |
| C. elegans | ce11 | 0.10 | 0.02 | 0.103 (X 5.1) | 0.409 (X 20.4) | 0.408 (X 20.4) | 0.492 (X 24.6) | 0.274 (X 13.7) | 0.274 (X 13.7) | 0.426 (X 21.3) | 0.225 (X 11.2) | 0.472 (X 23.6) |
| D. melanogaster | dm6 | 0.14 | 0.028 | 0.147 (X 5.2) | 0.581 (X 20.7) | 0.583 (X 20.8) | 0.714 (X 25.5) | 0.426 (X 15.2) | 0.418 (X 14.9) | 0.633 (X 22.6) | 0.337 (X 12.0) | 0.667 (X 23.8) |
| D. rerio | danRer11 | 1.37 | 0.22 | 0.742 (X 3.4) | 6.815 (X 31.0) | 6.803 (X 30.9) | 8.216 (X 37.3) | 3.946 (X 17.9) | 3.95 (X 18.0) | 7.202 (X 32.7) | 3.029 (X 13.8) | 7.633 (X 34.7) |
| C. familiaris | canFam4 | 2.48 | 0.311 | 1.209 (X 3.9) | 10.158 (X 32.7) | 10.124 (X 32.6) | 12.206 (X 39.2) | 6.55 (X 21.1) | 6.518 (X 21.0) | 10.671 (X 34.3) | 4.741 (X 15.2) | 11.394 (X 36.6) |
| H. sapiens | GRCh38 | 3.10 | 0.43 | 1.696 (X 3.9) | 12.393 (X 28.8) | 12.432 (X 28.9) | 13.681 (X 31.8) | 7.414 (X 17.2) | 7.284 (X 16.9) | 13.102 (X 30.5) | 6.37 (X 14.8) | 14.074 (X 32.7) |
| B. bombina | aBomBom1 | 9.80 | 1.554 | 8.501 (X 5.5) | 41.676 (X 26.8) | 41.696 (X 26.8) | 49.064 (X 31.6) | 24.202 (X 15.6) | 24.374 (X 15.7) | 43.856 (X 28.2) | 19.755 (X 12.7) | 45.387 (X 29.2) |
| A. mexicanum | AmbMex60DD | 28.20 | 3.327 | 14.375 (X 4.3) | 118.923 (X 35.7) | 118.422 (X 35.6) | 137.781 (X 41.4) | 57.626 (X 17.3) | 57.591 (X 17.3) | 121.257 (X 36.4) | 54.82 (X 16.5) | 128.374 (X 38.6) |
| P. annectens | PAN1.0 | 40.10 | 4.606 | 18.664 (X 4.1) | 167.85 (X 36.4) | 165.701 (X 36.0) | 196.833 (X 42.7) | 91.747 (X 19.9) | 91.924 (X 20.0) | 170.475 (X 37.0) | 77.707 (X 16.9) | 181.562 (X 39.4) |
| Tool | Cores | Time |
|---|---|---|
| seqkit | 16 | 18.993 s ± 0.132 s |
| chromsize | default (max_cpus: 16) | 7.631 s ± 0.010 s |
| seqkit | default (4) | 18.525 s ± 0.520 s |
| chromsize | 4 | 8.035 s ± 0.077 s |
| seqkit | 2 | 18.535 s ± 0.376 s |
| chromsize | 2 | 8.284 s ± 0.030 s |