thaf

Crates.io: thaf
lib.rs: thaf
version: 0.0.3
created_at: 2025-07-08 20:40:17.949011+00
updated_at: 2025-07-22 07:23:07.617886+00
description: Extracts transcript sequences and gene maps from genome FASTA files using GFF3 annotations.
repository: https://github.com/bourumir-wyngs/thaf
id: 1743541
size: 65,395
Bourumir (bourumir-wyngs)


GFF3 Parser and Transcriptome Extractor

Overview

thaf is a command-line tool to extract transcript sequences from a genome FASTA file based on GFF3 annotation files. It can also generate transcript-to-gene mapping files compatible with tools such as Salmon.

Features

  • Parses GFF3 annotation files to identify transcript regions.
  • Extracts 'exon' features by default; other feature types can be selected with the -e switch.
  • Extracts transcript sequences directly from genome FASTA files.
  • Handles forward and reverse strands automatically.
  • Generates transcript-to-gene mapping files.
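
The extraction and strand handling listed above can be illustrated with a minimal sketch. This is not thaf's actual code: the Feature struct and the splice_transcript and reverse_complement functions are hypothetical names, and the sketch assumes GFF3 coordinates (1-based, both ends inclusive) and a chromosome already held in memory as bytes.

```rust
/// One annotated feature (for example an exon) in GFF3 coordinates:
/// 1-based, with both `start` and `end` inclusive.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy)]
pub struct Feature {
    pub start: usize,
    pub end: usize,
}

/// Complement of a single nucleotide, preserving case.
/// Anything that is not A, C, G or T becomes N.
fn complement(base: u8) -> u8 {
    match base {
        b'A' => b'T',
        b'C' => b'G',
        b'G' => b'C',
        b'T' => b'A',
        b'a' => b't',
        b'c' => b'g',
        b'g' => b'c',
        b't' => b'a',
        _ => b'N',
    }
}

/// Reverse-complements a sequence: reverse the order, complement each base.
pub fn reverse_complement(seq: &[u8]) -> Vec<u8> {
    seq.iter().rev().map(|&b| complement(b)).collect()
}

/// Concatenates the features of one transcript in ascending genomic order
/// and reverse-complements the result when the transcript is on the
/// reverse strand. Returns `None` if a feature lies outside the chromosome.
pub fn splice_transcript(
    chrom: &[u8],
    features: &[Feature],
    reverse: bool,
) -> Option<Vec<u8>> {
    let mut sorted = features.to_vec();
    sorted.sort_by_key(|f| f.start);

    let mut seq = Vec::new();
    for f in &sorted {
        if f.start == 0 || f.start > f.end || f.end > chrom.len() {
            return None;
        }
        // Convert 1-based inclusive coordinates to a 0-based half-open slice.
        seq.extend_from_slice(&chrom[f.start - 1..f.end]);
    }
    Some(if reverse { reverse_complement(&seq) } else { seq })
}
```

With the chromosome AACCGGTTAC and exons at 1-3 and 7-9, this sketch yields AACTTA for the forward strand and TAAGTT for the reverse strand.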

Usage

Command-Line Arguments

thaf \
  -f <INPUT_GFF3> \
  -d <DNA_FASTA> \
  -t <OUTPUT_FASTA> \
  [-g <GENEMAP_FILE>] \
  [-e <FEATURES>]

Required Arguments

  • -f, --gff3 <INPUT_GFF3>: Path to the input GFF3 annotation file.
  • -d, --dna <DNA_FASTA>: Path to the input genome FASTA file.
  • -t, --transcriptome <OUTPUT_FASTA>: Path to the output transcriptome FASTA file.

Optional Arguments

  • -g, --genemap <GENEMAP_FILE>: Path to the output TSV file for transcript-to-gene mapping.
  • -e, --features <FEATURES>: Comma-separated list of GFF3 features to extract (default: exon).

Example

thaf \
  -f annotations.gff3 \
  -d genome.fa \
  -t transcriptome.fa \
  -g genemap.tsv \
  -e CDS

This will produce:

  • transcriptome.fa: FASTA file containing extracted transcript sequences.
  • genemap.tsv: Tab-separated file mapping transcripts to genes.
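
As an illustration of what a transcript-to-gene table contains, the sketch below derives one mapping line from one GFF3 record. It rests on assumptions not stated in this README: that transcripts are annotated as mRNA features, that the gene is the feature named in the standard GFF3 Parent attribute, and that the output has the transcript in the first column and the gene in the second. The attribute and genemap_line functions are hypothetical and may not match how thaf does it.

```rust
/// Looks up `key` in a GFF3 attribute column such as
/// "ID=tx1;Parent=gene1". Returns the raw value, without URL-decoding.
pub fn attribute<'a>(attributes: &'a str, key: &str) -> Option<&'a str> {
    attributes.split(';').find_map(|pair| {
        let (k, v) = pair.trim().split_once('=')?;
        (k == key).then_some(v)
    })
}

/// Turns one GFF3 line into a "transcript<TAB>gene" line.
/// Returns `None` for comments, malformed lines, non-mRNA features
/// (assumption: transcripts are typed `mRNA`), or missing attributes.
pub fn genemap_line(gff3_line: &str) -> Option<String> {
    if gff3_line.starts_with('#') {
        return None;
    }
    // A GFF3 record has exactly nine tab-separated columns:
    // seqid, source, type, start, end, score, strand, phase, attributes.
    let cols: Vec<&str> = gff3_line.trim_end().split('\t').collect();
    if cols.len() != 9 || cols[2] != "mRNA" {
        return None;
    }
    let transcript = attribute(cols[8], "ID")?;
    let gene = attribute(cols[8], "Parent")?;
    Some(format!("{transcript}\t{gene}"))
}
```

Applying genemap_line to every line of the annotation and keeping the Some results gives a two-column TSV of the kind described above.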

This small project was prompted by a segmentation fault encountered while using one of the popular tools, and by the lack of any readily available tool capable of producing even a simple genemap table.

Sequence boundaries, exon order, and reverse-complementation have been validated against outputs from gffread, which unfortunately does not produce a genemap. thaf checks for obvious inconsistencies, such as overlapping exons or exons belonging to different strands or chromosomes.
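
The consistency checks mentioned above could look roughly like the following sketch. It uses only the standard library rather than rust-bio, so it is a stand-in for the idea and not the tool's implementation; the Exon and Inconsistency types and the check_exons function are hypothetical.

```rust
/// One exon of a transcript in GFF3 terms (1-based, inclusive coordinates).
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct Exon {
    pub seqid: String,
    pub strand: char,
    pub start: u64,
    pub end: u64,
}

/// The kinds of problem this sketch reports.
#[derive(Debug, PartialEq)]
pub enum Inconsistency {
    MixedSeqid,
    MixedStrand,
    Overlap { first_end: u64, next_start: u64 },
}

/// Checks that all exons of one transcript share a chromosome and a strand
/// and that no two of them overlap.
pub fn check_exons(exons: &[Exon]) -> Result<(), Inconsistency> {
    if exons.is_empty() {
        return Ok(());
    }
    let first = &exons[0];
    if exons.iter().any(|e| e.seqid != first.seqid) {
        return Err(Inconsistency::MixedSeqid);
    }
    if exons.iter().any(|e| e.strand != first.strand) {
        return Err(Inconsistency::MixedStrand);
    }

    // After sorting by start, any overlap shows up between neighbours.
    let mut sorted: Vec<&Exon> = exons.iter().collect();
    sorted.sort_by_key(|e| e.start);
    for pair in sorted.windows(2) {
        // Coordinates are inclusive, so sharing a single base is an overlap.
        if pair[1].start <= pair[0].end {
            return Err(Inconsistency::Overlap {
                first_end: pair[0].end,
                next_start: pair[1].start,
            });
        }
    }
    Ok(())
}
```

Sorting first keeps the overlap test to a single pass over adjacent pairs, which is sufficient because an exon that overlaps a later one must also overlap its immediate successor.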

Unlike gffread, thaf loads the entire genome into memory. As a result, it cannot handle extremely large genomes, such as that of the fern Tmesipteris oblanceolata (~160 Gbp). However, a typical workstation with 32 GB of RAM is enough for processing the crop and plant genomes we commonly work with, and the simpler algorithm should make the code easier to maintain.

We are grateful to the rust-bio package, which provides exon overlap detection and reverse-complement functionality.
