Crates.io | chromsize |
lib.rs | chromsize |
version | 0.0.2 |
source | src |
created_at | 2024-08-15 17:25:31.285474 |
updated_at | 2024-08-16 06:22:09.996748 |
description | just get your chrom sizes |
homepage | https://github.com/alejandrogzi/chromsize |
repository | https://github.com/alejandrogzi/chromsize |
max_upload_size | |
id | 1339053 |
size | 30,893 |
annoyed to have to create an index and cut it?
have to look for that old script every time?
got you. just get your chrom sizes. very fast.
but first, how is this better than any other option? yeah, just check the image below.
googled 'get chromosome sizes from fasta', grabbed every command/tool I found and benchmarked them all. surprisingly, you can lose 14 seconds of your life just waiting for those chrom sizes to be calculated. crazy.
What's new on v.0.0.2?
- now reads .gz!
- CI implementation
Usage: chromsize --fasta <FASTA> --output <OUTPUT> [-t <THREADS>]
Arguments:
-f, --fasta <FASTA>: FASTA file
-o, --output <OUTPUT>: path to chrom.sizes
Options:
-t, --threads <THREADS>: number of threads [default: your max ncpus]
--help: print help
--version: print version
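for reference, here is a minimal single-threaded sketch of what "getting chrom sizes" means: sum the sequence-line lengths under each FASTA header. this is not how chromsize is implemented (the function name chrom_sizes is made up for illustration, and it skips .gz input and threading); it only shows the computation the tool performs.

```rust
use std::io::{BufRead, BufReader, Read};

/// Sum sequence-line lengths per FASTA record, keeping records in file order.
/// The record name is the header text up to the first whitespace.
/// Hypothetical helper for illustration; not part of the chromsize crate.
fn chrom_sizes<R: Read>(reader: R) -> std::io::Result<Vec<(String, u64)>> {
    let mut sizes: Vec<(String, u64)> = Vec::new();
    for line in BufReader::new(reader).lines() {
        let line = line?;
        let line = line.trim_end(); // drop a trailing '\r' from CRLF files
        if let Some(header) = line.strip_prefix('>') {
            // New record: start a fresh counter for this sequence name.
            let name = header.split_whitespace().next().unwrap_or("").to_string();
            sizes.push((name, 0));
        } else if let Some(last) = sizes.last_mut() {
            // Sequence line: add its length to the current record.
            last.1 += line.len() as u64;
        }
    }
    Ok(sizes)
}

fn main() -> std::io::Result<()> {
    let fasta = ">chr1 test\nACGT\nAC\n>chr2\nGGG\n";
    for (name, size) in chrom_sizes(fasta.as_bytes())? {
        println!("{name}\t{size}");
    }
    Ok(())
}
```

on the toy input above this prints chr1 with size 6 and chr2 with size 3.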
to install rust and use chromsize on your system follow these steps:
- get installer: curl https://sh.rustup.rs -sSf | sh on unix, or go here for other options
- run cargo install chromsize (make sure ~/.cargo/bin is in your $PATH before running it)
- use chromsize with the required arguments
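use chromsize;
use std::path::PathBuf;

fn main() {
    // PathBuf::new() takes no arguments; build the paths with PathBuf::from
    let input = PathBuf::from("/path/to/fasta.fa");
    let output = PathBuf::from("/path/to/chrom.sizes");

    // collect (name, length) pairs, then write them to chrom.sizes
    let sizes: Vec<(String, u64)> = chromsize::chromsize(&input);
    chromsize::write(sizes, &output);
}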
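if you would rather handle the Vec<(String, u64)> yourself, the sketch below writes it in the conventional two-column, tab-separated chrom.sizes layout. write_sizes is a hypothetical helper, not part of the crate, and it is an assumption that chromsize::write produces the same layout.

```rust
use std::io::{self, Write};

/// Write one `name<TAB>size` line per sequence.
/// Hypothetical helper; assumes the usual two-column chrom.sizes layout.
fn write_sizes<W: Write>(sizes: &[(String, u64)], mut out: W) -> io::Result<()> {
    for (name, size) in sizes {
        writeln!(out, "{name}\t{size}")?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let sizes: Vec<(String, u64)> = vec![
        ("chr1".to_string(), 123),
        ("chr2".to_string(), 456),
    ];
    // Any writer works here: stdout, a File, or an in-memory buffer.
    write_sizes(&sizes, io::stdout().lock())
}
```

taking a generic writer keeps the helper easy to test against an in-memory buffer before pointing it at a real file.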
build the port to install it as a pkg:
git clone https://github.com/alejandrogzi/chromsize.git && cd chromsize/py-chromsize
hatch shell
maturin develop --release
use it as a binary wrapper:
import chromsize as cs
input = "/path/to/fasta.fa"
output = "/path/to/chrom.sizes"
cs.write_chromsizes(input, output)
or just get them directly
import chromsize as cs
input = "/path/to/fasta.fa"
sizes = cs.get_chromsizes(input)
>>> print(sizes)
[
('chr1', 123),
('chr2', 456),
...
]
to build chromsize from this repo, do:
git clone https://github.com/alejandrogzi/chromsize.git && cd chromsize
cargo run --release -- -f <FASTA> -o <OUTPUT>
to build the development container image:
git clone https://github.com/alejandrogzi/chromsize.git && cd chromsize
start docker
or systemctl start docker
docker image build --tag chromsize .
docker run --rm -v "[dir_where_your_fa_is]:/dir" chromsize -f /dir/<INPUT> -o /dir/<OUTPUT>
Conda (not available yet)
to use chromsize through Conda just:
conda install chromsize -c bioconda
or conda create -n chromsize -c bioconda chromsize
Nextflow (not available yet)
do not believe me? run the benchmark on your own:
- update the ASSEMBLIES const with the .fa files you've downloaded
- cargo run --release --bin chromsize-benchmark -- -d /dir/where/my/fastas/are -a show-output ignore-failure
here is all the info and metadata from my experiment:
Tool | Command | Reference | Discussion |
---|---|---|---|
seqkit | seqkit fx2tab --length --name --header-line {assembly} > chrom.sizes | 1 | 2 |
chromsize | target/release/chromsize -f {assembly} -o chrom.sizes | 3 | |
pyfaidx | faidx {assembly} -i chromsizes > chrom.sizes | 4 | 5 |
samtools | samtools faidx {assembly} && wait \| cut -f1,2 {assembly}.fai > chrom.sizes | 6 | 5 |
faSize | faSize -detailed -tab {assembly} > chrom.sizes | 7 | |
awk1 | awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' {assembly} > chrom.sizes | 8 | 9 |
awk2 | awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' {assembly} > chrom.sizes | 8 | 9 |
bioawk1 | bioawk -c fastx '{print ">" $name ORS length($seq)}' {assembly} > chrom.sizes | 10 | 9 |
awk3 | cat {assembly} \| awk '$0 ~ ">" {if (NR > 1) {print c;} c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' > chrom.sizes | 8 | 11 |
bioawk2 | bioawk -c fastx '{ print $name, length($seq) }' < {assembly} > chrom.sizes | 10 | 2 |
Species | Assembly | Size (Gb) | chromsize | seqKit | awk1 | awk2 | awk3 | bioawk1 | bioawk2 | faSize | pyfaidx | samtools |
---|---|---|---|---|---|---|---|---|---|---|---|---|
S. cerevisiae | R64 | 0.01 | 0.004 | 0.016 (X 4.0) | 0.043 (X 10.7) | 0.043 (X 10.7) | 0.05 (X 12.5) | 0.03 (X 7.5) | 0.03 (X 7.5) | 0.054 (X 13.5) | 0.101 (X 25.2) | 0.064 (X 16.0) |
C. elegans | ce11 | 0.10 | 0.02 | 0.103 (X 5.1) | 0.409 (X 20.4) | 0.408 (X 20.4) | 0.492 (X 24.6) | 0.274 (X 13.7) | 0.274 (X 13.7) | 0.426 (X 21.3) | 0.225 (X 11.2) | 0.472 (X 23.6) |
D. melanogaster | dm6 | 0.14 | 0.028 | 0.147 (X 5.2) | 0.581 (X 20.7) | 0.583 (X 20.8) | 0.714 (X 25.5) | 0.426 (X 15.2) | 0.418 (X 14.9) | 0.633 (X 22.6) | 0.337 (X 12.0) | 0.667 (X 23.8) |
D. rerio | danRer11 | 1.37 | 0.22 | 0.742 (X 3.4) | 6.815 (X 31.0) | 6.803 (X 30.9) | 8.216 (X 37.3) | 3.946 (X 17.9) | 3.95 (X 18.0) | 7.202 (X 32.7) | 3.029 (X 13.8) | 7.633 (X 34.7) |
C. familiaris | canFam4 | 2.48 | 0.311 | 1.209 (X 3.9) | 10.158 (X 32.7) | 10.124 (X 32.6) | 12.206 (X 39.2) | 6.55 (X 21.1) | 6.518 (X 21.0) | 10.671 (X 34.3) | 4.741 (X 15.2) | 11.394 (X 36.6) |
H. sapiens | GRCh38 | 3.10 | 0.43 | 1.696 (X 3.9) | 12.393 (X 28.8) | 12.432 (X 28.9) | 13.681 (X 31.8) | 7.414 (X 17.2) | 7.284 (X 16.9) | 13.102 (X 30.5) | 6.37 (X 14.8) | 14.074 (X 32.7) |
B. bombina | aBomBom1 | 9.80 | 1.554 | 8.501 (X 5.5) | 41.676 (X 26.8) | 41.696 (X 26.8) | 49.064 (X 31.6) | 24.202 (X 15.6) | 24.374 (X 15.7) | 43.856 (X 28.2) | 19.755 (X 12.7) | 45.387 (X 29.2) |
A. mexicanum | AmbMex60DD | 28.20 | 3.327 | 14.375 (X 4.3) | 118.923 (X 35.7) | 118.422 (X 35.6) | 137.781 (X 41.4) | 57.626 (X 17.3) | 57.591 (X 17.3) | 121.257 (X 36.4) | 54.82 (X 16.5) | 128.374 (X 38.6) |
P. annectens | PAN1.0 | 40.10 | 4.606 | 18.664 (X 4.1) | 167.85 (X 36.4) | 165.701 (X 36.0) | 196.833 (X 42.7) | 91.747 (X 19.9) | 91.924 (X 20.0) | 170.475 (X 37.0) | 77.707 (X 16.9) | 181.562 (X 39.4) |
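
the (X n) values in the table appear to be each tool's time divided by the chromsize time in the same row. that reading is an assumption, but it matches the numbers. a minimal sketch that recomputes a few factors for the H. sapiens row (fold is a made-up helper name):

```rust
/// Fold slowdown of a tool relative to the chromsize baseline in the same row.
/// Hypothetical helper for illustration.
fn fold(tool_time: f64, baseline_time: f64) -> f64 {
    tool_time / baseline_time
}

fn main() {
    // H. sapiens (GRCh38) row: chromsize took 0.43
    let chromsize = 0.43;
    let others = [("seqkit", 1.696), ("pyfaidx", 6.37), ("samtools", 14.074)];
    for (tool, time) in others {
        println!("{tool}: X {:.1}", fold(time, chromsize));
    }
}
```

this gives X 3.9 for seqkit, X 14.8 for pyfaidx and X 32.7 for samtools, the same factors listed in that row.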
Tool | Cores | Time |
---|---|---|
seqkit | 16 | 18.993 s ± 0.132 s |
chromsize | default (max_cpus: 16) | 7.631 s ± 0.010 s |
seqkit | default (4) | 18.525 s ± 0.520 s |
chromsize | 4 | 8.035 s ± 0.077 s |
seqkit | 2 | 18.535 s ± 0.376 s |
chromsize | 2 | 8.284 s ± 0.030 s |