Crates.io | seq_io |
lib.rs | seq_io |
version | |
source | src |
created_at | 2017-07-31 11:59:44.300844 |
updated_at | 2025-01-28 21:57:17.675295 |
description | Fast FASTA and FASTQ readers |
homepage | |
repository | https://github.com/markschl/seq_io |
max_upload_size | |
id | 25803 |
size | 0 |
This library provides an(other) attempt at parsing the sequence formats FASTA and FASTQ, as well as writing them.
Features:
The FASTA parser reads multi-line files, and records can be written back as multi-line FASTA. Sequence lines can be iterated over without any allocation or copying. The FASTQ parser does not support records whose sequence or quality spans multiple lines.
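To make the single-line FASTQ restriction concrete, here is a minimal sketch of a strict four-line record parser that uses only the standard library. It is not the parser from this crate: `RefRecord` and `parse_record` are hypothetical names chosen for illustration, and the sketch assumes `\n` line endings.

```rust
/// A FASTQ record whose fields borrow from the input (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct RefRecord<'a> {
    pub head: &'a [u8],
    pub seq: &'a [u8],
    pub qual: &'a [u8],
}

/// Parses one strict four-line record from the start of `input`.
/// Returns the record and the remaining input, or `None` if the
/// record is malformed (for example, if the sequence spans two lines).
pub fn parse_record(input: &[u8]) -> Option<(RefRecord<'_>, &[u8])> {
    // Split off exactly four lines; the fifth item is the untouched rest.
    let mut lines = input.splitn(5, |&b| b == b'\n');
    let head = lines.next()?.strip_prefix(b"@")?;
    let seq = lines.next()?;
    let sep = lines.next()?;
    let qual = lines.next()?;
    let rest = lines.next().unwrap_or(&[]);
    // The third line must be the '+' separator, and the quality line
    // must have the same length as the sequence line.
    if !sep.starts_with(b"+") || seq.len() != qual.len() {
        return None;
    }
    Some((RefRecord { head, seq, qual }, rest))
}
```

Because every field is a slice into `input`, nothing is copied. A record with a wrapped sequence fails the separator check and is rejected, which mirrors the restriction described above.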
Documentation for the stable version (0.3.x)
The v0.4 branch contains code for a new version, which includes a FASTX reader. Although it works and has been tested to some extent, there will be further large changes, which are not quite ready yet.
Documentation for development version (0.4.0-alpha.x)
Reads FASTA sequences from STDIN and writes them to STDOUT if they are long enough. Otherwise it prints a message. This should be very fast because the sequence is never copied into a new allocation: seq_lines() iterates over the sequence lines in the reader's buffer.
use seq_io::fasta::{Reader, Record};
use std::io;

fn main() {
    let mut reader = Reader::new(io::stdin());
    let mut stdout = io::stdout();

    // `while let` instead of `for`: each record borrows from the reader
    while let Some(result) = reader.next() {
        let record = result.unwrap();
        // determine sequence length by summing up the line lengths
        let seqlen = record.seq_lines()
            .fold(0, |l, seq| l + seq.len());
        if seqlen > 100 {
            // write the record with lines wrapped at 80 characters
            record.write_wrap(&mut stdout, 80).unwrap();
        } else {
            eprintln!("{} is only {} long", record.id().unwrap(), seqlen);
        }
    }
}
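The idea behind seq_lines() can be illustrated without the crate. The sketch below assumes that the raw, possibly multi-line sequence of a record is available as one byte slice. The functions `seq_lines` and `seq_len` defined here are standard-library stand-ins written for illustration, not the crate's implementation.

```rust
/// Iterates over the lines of a (possibly multi-line) sequence.
/// Every item is a sub-slice of `raw_seq`, so nothing is copied.
pub fn seq_lines(raw_seq: &[u8]) -> impl Iterator<Item = &[u8]> {
    raw_seq
        .split(|&b| b == b'\n')
        // tolerate Windows line endings
        .map(|line| line.strip_suffix(b"\r").unwrap_or(line))
        // skip the empty piece after a trailing newline
        .filter(|line| !line.is_empty())
}

/// Sequence length without line terminators, computed without allocating.
pub fn seq_len(raw_seq: &[u8]) -> usize {
    seq_lines(raw_seq).map(|line| line.len()).sum()
}
```

Summing the line lengths this way corresponds to the `fold` in the example above: the sequence is measured in place instead of first being joined into a new buffer.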
Records borrow their data directly from the internal buffered reader, which is why the while let loop is required: a regular Iterator (and therefore a for loop) cannot yield items that borrow from the iterator itself. By default, the buffer grows automatically if a record is too large to fit in. How it grows can be configured, and it is also possible to set a size limit. Iterators over owned records are also provided.
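The borrowing pattern and a bounded growth rule can be sketched with the standard library alone. Everything below is hypothetical and only illustrates the concepts: `LendingLines`, `next_line`, `total_len` and `grow_to` are made-up names, the reader yields plain lines rather than FASTA records, and the growth rule is one possible policy, not necessarily the crate's default.

```rust
use std::io::{self, BufRead};

/// Hands out lines that borrow from its own reusable buffer.
pub struct LendingLines<R> {
    inner: R,
    buf: Vec<u8>,
}

impl<R: BufRead> LendingLines<R> {
    pub fn new(inner: R) -> Self {
        LendingLines { inner, buf: Vec::new() }
    }

    /// The returned slice borrows from `self`, so it must be dropped
    /// before the next call. `Iterator::next` cannot express this,
    /// hence the `while let` loop in `total_len`.
    pub fn next_line(&mut self) -> Option<io::Result<&[u8]>> {
        self.buf.clear(); // keeps the allocation, drops the old line
        match self.inner.read_until(b'\n', &mut self.buf) {
            Ok(0) => None, // end of input
            Ok(_) => {
                let mut line = &self.buf[..];
                if line.ends_with(b"\n") {
                    line = &line[..line.len() - 1];
                }
                if line.ends_with(b"\r") {
                    line = &line[..line.len() - 1];
                }
                Some(Ok(line))
            }
            Err(e) => Some(Err(e)),
        }
    }
}

/// Sums the lengths of all lines without copying any of them.
pub fn total_len<R: BufRead>(input: R) -> io::Result<usize> {
    let mut reader = LendingLines::new(input);
    let mut total = 0;
    while let Some(line) = reader.next_line() {
        total += line?.len();
    }
    Ok(total)
}

/// Hypothetical growth rule: double the buffer size, but refuse to
/// grow beyond `limit` (returns `None` in that case).
pub fn grow_to(current: usize, limit: usize) -> Option<usize> {
    let new_size = current.max(1).checked_mul(2)?;
    if new_size <= limit { Some(new_size) } else { None }
}
```

With this rule, `grow_to(1024, 4096)` yields `Some(2048)`, whereas `grow_to(4096, 4096)` yields `None`. A reader using such a rule would then report a record that exceeds the limit as an error instead of allocating more memory.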
Note: It may be worth evaluating whether link-time optimization (LTO) improves performance. In general, however, the library should be reasonably fast without LTO, too.
The parallel module contains functions for sending FASTQ/FASTA records to a thread pool where expensive calculations are done. Sequences are processed in batches (RecordSet) because sending across channels has a performance impact. FASTA/FASTQ records can be accessed in both the 'worker' function and (after processing) a function running in the main thread.
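The batching idea can be sketched with standard-library threads and channels. This is not the API of the parallel module: `gc_counts_parallel` and `gc_count` are hypothetical names, G/C counting stands in for an expensive calculation, and only the results (not the records themselves) are passed back to the calling thread.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Stand-in for an expensive per-record calculation.
fn gc_count(seq: &[u8]) -> usize {
    seq.iter()
        .filter(|&&b| matches!(b, b'G' | b'C' | b'g' | b'c'))
        .count()
}

/// Sends records to `n_workers` threads in batches of `batch_size` and
/// returns one result per record, in input order.
pub fn gc_counts_parallel(
    records: &[Vec<u8>],
    batch_size: usize,
    n_workers: usize,
) -> Vec<usize> {
    assert!(batch_size > 0 && n_workers > 0);
    // One channel message per batch instead of one per record.
    let (batch_tx, batch_rx) = mpsc::channel::<(usize, &[Vec<u8>])>();
    let batch_rx = Arc::new(Mutex::new(batch_rx));
    let (result_tx, result_rx) = mpsc::channel::<(usize, Vec<usize>)>();

    thread::scope(|scope| {
        for _ in 0..n_workers {
            let batch_rx = Arc::clone(&batch_rx);
            let result_tx = result_tx.clone();
            scope.spawn(move || loop {
                // The lock is held only while receiving, not while computing.
                let msg = batch_rx.lock().unwrap().recv();
                let (index, batch) = match msg {
                    Ok(item) => item,
                    Err(_) => break, // channel closed: no more batches
                };
                let counts: Vec<usize> =
                    batch.iter().map(|s| gc_count(s)).collect();
                if result_tx.send((index, counts)).is_err() {
                    break;
                }
            });
        }
        // Only the workers keep result senders from here on.
        drop(result_tx);

        for item in records.chunks(batch_size).enumerate() {
            batch_tx.send(item).expect("all workers exited early");
        }
        drop(batch_tx); // signals the workers to finish

        // Batches may arrive out of order; restore the input order.
        let mut batches: Vec<(usize, Vec<usize>)> = result_rx.iter().collect();
        batches.sort_by_key(|(index, _)| *index);
        batches.into_iter().flat_map(|(_, counts)| counts).collect()
    })
}
```

A larger `batch_size` means fewer channel operations per record, which is the reason for batching given above. The trade-off is coarser load balancing between the workers.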
seq_io was inspired by fastq_rs.
The FASTQ reader from this crate performs similarly to the fastq-rs reader. The rust-bio readers are slower due to allocations, copying, and UTF-8 validity checks.
All comparisons were run on a set of 100,000 auto-generated, synthetic sequences with lengths normally distributed around 500 bp. The sequences were loaded into memory. The parsers from this crate (seq_io) are compared with fastq-rs (fastq_rs) and Rust-Bio (bio). The bars represent the throughput in GB/s (+/- standard error of the mean). The benchmarks were run on a Thinkpad X1 Carbon (i7-5500U) with a fixed frequency of 2.3 GHz using Rust 1.31 nightly.
Explanation of labels:
read_record_set() (involves some copying).