| Crates.io | thaf |
| lib.rs | thaf |
| version | 0.0.3 |
| created_at | 2025-07-08 20:40:17.949011+00 |
| updated_at | 2025-07-22 07:23:07.617886+00 |
| description | Extracts transcript sequences and gene maps from genome FASTA files using GFF3 annotations. |
| homepage | |
| repository | https://github.com/bourumir-wyngs/thaf |
| max_upload_size | |
| id | 1743541 |
| size | 65,395 |
thaf is a command-line tool to extract transcript sequences from a genome FASTA file based on GFF3 annotation files. It can also generate transcript-to-gene mapping files compatible with tools such as Salmon.
thaf \
-f <INPUT_GFF3> \
-d <DNA_FASTA> \
-t <OUTPUT_FASTA> \
[-g <GENEMAP_FILE>] \
[-e <FEATURES>]
-f, --gff3 <INPUT_GFF3>: Path to the input GFF3 annotation file.
-d, --dna <DNA_FASTA>: Path to the input genome FASTA file.
-t, --transcriptome <OUTPUT_FASTA>: Path to the output transcriptome FASTA file.
-g, --genemap <GENEMAP_FILE>: Path to the output TSV file for transcript-to-gene mapping.
-e, --features <FEATURES>: Comma-separated list of GFF3 features to extract (default: exon).

thaf \
-f annotations.gff3 \
-d genome.fa \
-t transcriptome.fa \
-g genemap.tsv \
-e CDS
This will produce:
transcriptome.fa: FASTA file containing extracted transcript sequences.
genemap.tsv: Tab-separated file mapping transcripts to genes.

This small project was inspired by a segmentation fault encountered while using one of the popular tools, and by the lack of any readily available tool capable of producing even a simple genemap table.
Sequence boundaries, exon order, and reverse-complementation have been validated against outputs from gffread, which unfortunately does not produce a genemap. thaf checks for obvious inconsistencies, such as overlapping exons or exons belonging to different strands or chromosomes.
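To illustrate the kind of processing described above, here is a minimal sketch in plain Rust of how a transcript can be assembled from exons: sort by start, reject overlaps, concatenate, and reverse-complement for the minus strand. This is not thaf's actual code. The `Exon` type and the `splice` and `revcomp` functions are hypothetical names used only for this example, and thaf itself relies on rust-bio for overlap detection and reverse-complementation.

```rust
/// Hypothetical exon record: 1-based, inclusive coordinates, as in GFF3.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Exon {
    start: usize,
    end: usize,
}

/// Reverse-complement an ASCII DNA sequence.
/// Output is upper-case; anything other than A/C/G/T becomes N.
fn revcomp(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|b| match b.to_ascii_uppercase() {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            _ => b'N',
        })
        .collect()
}

/// Build a transcript from the exons of one chromosome (illustrative only).
fn splice(chrom: &[u8], exons: &[Exon], minus_strand: bool) -> Result<Vec<u8>, String> {
    // Order exons by genomic start, regardless of their order in the annotation.
    let mut sorted = exons.to_vec();
    sorted.sort_by_key(|e| e.start);

    // Reject overlapping exons: each exon must start after the previous one ends.
    for pair in sorted.windows(2) {
        if pair[1].start <= pair[0].end {
            return Err(format!("overlapping exons: {:?} and {:?}", pair[0], pair[1]));
        }
    }

    // Concatenate exon sequences, converting 1-based inclusive to 0-based slices.
    let mut seq = Vec::new();
    for e in &sorted {
        if e.start == 0 || e.start > e.end || e.end > chrom.len() {
            return Err(format!("exon out of range: {:?}", e));
        }
        seq.extend_from_slice(&chrom[e.start - 1..e.end]);
    }

    // Minus-strand transcripts are the reverse complement of the spliced sequence.
    Ok(if minus_strand { revcomp(&seq) } else { seq })
}
```

For the toy chromosome `AACCGGTTAC` with exons 5..7 and 1..2, `splice` yields `AAGGT` on the plus strand and `ACCTT` on the minus strand, while exons 1..4 and 3..6 are rejected as overlapping.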
Unlike gffread, thaf loads the entire genome into memory. As a result, it cannot handle extremely large genomes, such as that of the fern Tmesipteris oblanceolata (~160 Gbp). However, a typical workstation with 32 GB of RAM is enough for processing the crop and plant genomes we commonly work with, and the simpler algorithm should make the code easier to maintain.
We are grateful to the rust-bio package, which provides exon overlap detection and reverse-complement functionality.