orphos-core

Crates.ioorphos-core
lib.rsorphos-core
version0.1.0
created_at2025-11-07 17:55:53.709331+00
updated_at2025-11-07 17:55:53.709331+00
descriptionCore library for Orphos, a tool for finding protein-coding genes in microbial genomes.
homepage
repositoryhttps://github.com/FullHuman/orphos
max_upload_size
id1921947
size3,128,991
Floriel (Ffloriel)

documentation

https://docs.rs/orphos-core

README

Orphos Core

Crates.io Documentation License: GPL v3

Core library for Orphos, a high-performance Rust implementation of Prodigal (prokaryotic gene prediction algorithms). This crate provides the foundational gene-finding capabilities for identifying protein-coding genes in microbial genomes.

Overview

orphos-core implements an unsupervised machine learning approach for finding genes in prokaryotic genomes. It uses dynamic programming and statistical models trained on genomic features to predict gene locations with high accuracy.

Key Features

  • 🚀 High Performance: Multi-threaded processing using Rayon for parallel analysis
  • 🔒 Type Safety: Compile-time guarantees using type-state pattern for training states
  • 🧬 Dual Modes: Single genome mode for complete genomes and metagenomic mode for fragments
  • 📊 Multiple Output Formats: GenBank, GFF3, GCA, and SCO formats
  • 🎯 Accurate: Advanced start codon recognition with Shine-Dalgarno detection
  • 💾 Memory Efficient: Optimized data structures for large-scale genomic analysis

Installation

Add to your Cargo.toml:

[dependencies]
orphos-core = "0.1.0"

Quick Start

Basic Usage

use orphos_core::{OrphosAnalyzer, config::OrphosConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create analyzer with default configuration
    let mut analyzer = OrphosAnalyzer::new(OrphosConfig::default());
    
    // Analyze a FASTA file
    let results = analyzer.analyze_file("genome.fasta")?;
    
    println!("Found {} genes", results.genes.len());
    println!("{}", results.output);
    
    Ok(())
}

Analyzing Sequences Directly

use orphos_core::{OrphosAnalyzer, config::OrphosConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut analyzer = OrphosAnalyzer::new(OrphosConfig::default());
    
    let sequence = "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA...";
    let results = analyzer.analyze_sequence(sequence, Some("MyGenome".to_string()))?;
    
    for gene in &results.genes {
        println!("Gene at {}..{} on {} strand", 
            gene.start, gene.end, 
            if gene.strand == bio::bio_types::strand::Strand::Forward { "+" } else { "-" }
        );
    }
    
    Ok(())
}

Custom Configuration

use orphos_core::config::{OrphosConfig, OutputFormat};

let config = OrphosConfig {
    closed_ends: true,           // Complete genome (circular)
    mask_n_runs: true,           // Mask stretches of N's
    output_format: OutputFormat::Gff,
    num_threads: Some(4),        // Use 4 threads
    ..Default::default()
};

let mut analyzer = OrphosAnalyzer::new(config);

Metagenomic Mode

For analyzing short contigs or mixed community samples:

use orphos_core::config::OrphosConfig;

let config = OrphosConfig {
    metagenomic: true,
    ..Default::default()
};

let mut analyzer = OrphosAnalyzer::new(config);
let results = analyzer.analyze_file("metagenome.fasta")?;

Module Organization

  • config: Configuration options and output format settings
  • engine: Main analysis engine with training and prediction logic
  • types: Core data structures (Gene, Training, error types)
  • results: Gene prediction results and sequence information
  • sequence: Sequence encoding, I/O, and processing utilities
  • algorithms: Gene-finding algorithms including:
    • Dynamic programming for gene prediction
    • Gene optimization and overlap resolution
    • Scoring functions for connections
  • node: Gene node management, creation, and scoring
  • training: Training algorithms for Shine-Dalgarno and non-SD models
  • output: Output formatters for GenBank, GFF, GCA, and SCO
  • bitmap: Efficient sequence encoding utilities
  • metagenomic: Metagenomic mode presets and models

Output Formats

GenBank (.gbk)

Rich annotation format with full feature information:

LOCUS       MyGenome               4641652 bp    DNA     linear   BCT
FEATURES             Location/Qualifiers
     CDS             190..255
                     /gene="1"
                     /protein_id="MyGenome_1"
                     /translation="MTKRSAAAAAAVAAGMTSA"

GFF3 (.gff)

Standard genome annotation format:

##gff-version 3
MyGenome    Orphos  CDS  190  255  .  +  0  ID=MyGenome_1;

GCA (.gca)

Tab-delimited gene coordinate annotation.

SCO (.sco)

Simple coordinate output with minimal information.

Configuration Options

Option Type Default Description
metagenomic bool false Enable metagenomic mode for fragments
closed_ends bool false Treat sequences as complete genomes
mask_n_runs bool false Mask runs of N characters
force_non_sd bool false Disable Shine-Dalgarno detection
quiet bool false Suppress informational output
output_format OutputFormat Genbank Output format selection
translation_table Option<u8> None NCBI genetic code table (1-25)
num_threads Option<usize> None Number of parallel threads

Error Handling

All operations return Result<T, OrphosError> with detailed error types:

use orphos_core::types::OrphosError;

match analyzer.analyze_file("genome.fasta") {
    Ok(results) => println!("Success: {} genes", results.genes.len()),
    Err(OrphosError::SequenceTooShort { length, min }) => {
        eprintln!("Sequence too short: {} bp (minimum: {} bp)", length, min);
    }
    Err(OrphosError::IoError(e)) => {
        eprintln!("I/O error: {}", e);
    }
    Err(e) => eprintln!("Error: {}", e),
}

Contributing

Contributions are welcome! Please see the main Orphos repository for contribution guidelines.

License

This project is licensed under the GNU General Public License v3.0 or later - see the LICENSE file for details.

Citation

If you use Orphos in your research, please cite:

@software{orphos,
  title = {Orphos: High-Performance Prokaryotic Gene Prediction},
  author = {Floriel Fedry},
  year = {2025},
  url = {https://github.com/FullHuman/orphos}
}

Related Projects

Acknowledgments

This implementation is based on Prodigal, originally developed by Doug Hyatt. Orphos provides a modern, type-safe Rust implementation while maintaining compatibility with the original algorithms.

Support

Commit count: 0

cargo fmt