---
title: 'Rasusa: Randomly subsample sequencing reads to a specified coverage'
tags:
  - Rust
  - bioinformatics
  - genomics
  - fastq
  - fasta
  - subsampling
  - random
authors:
  - name: Michael B. Hall
    orcid: 0000-0003-3683-6208
    affiliation: 1
affiliations:
 - name: European Molecular Biology Laboratory, European Bioinformatics Institute EMBL-EBI, Hinxton, UK
   index: 1
date: 18 October 2021
bibliography: paper.bib
---

# Summary

A fundamental requirement for many applications in genomics is the sequencing of genetic
material (DNA/RNA). Different sequencing technologies exist, but all aim to accurately
reproduce the sequence of nucleotides (the individual units of DNA and RNA) in the
genetic material under investigation. The result of such efforts is a text file
containing the individual fragments of genetic material - termed "reads" - represented
as strings of letters (A, C, G, and T/U).

The amount of data in one of these read files depends on how much genetic material was
present and how long the sequencing device was operated. Read depth (coverage) is a
measure of the volume of genetic data contained in a read file. For example, coverage of
5x indicates that, on average, each nucleotide in the original genetic material is represented five
times in the read file.

Many of the computational methods employed in genomics are affected by coverage;
counterintuitively, more is not always better. For example, because sequencing devices
are not perfect, reads inevitably contain errors. As such, higher coverage increases the
number of errors and potentially makes them look like alternative sequences.
Furthermore, for some applications, too much coverage can cause a degradation in
computational performance via increased runtimes or memory usage.

We present Rasusa, a software program that randomly subsamples a given read file to a
specified coverage. Rasusa is written in the Rust programming language and is much
faster than current solutions for subsampling read files. In addition, it provides an
ergonomic command-line interface and allows users to specify a desired coverage or a
target number of nucleotides.


# Statement of need

Read subsampling is a useful mechanism for creating artificial datasets, allowing
exploration of a computational method's performance as data becomes more scarce. In
addition, the coverage of a sample can have a significant impact on a variety of
computational methods, such as RNA-seq [@Baccarella2018], taxonomic classification
[@Gweon2019], antimicrobial resistance detection [@Gweon2019], and genome assembly
[@Maio2019] - to name a few.

There is limited available software for subsampling read files. Assumably, most
researchers use custom scripts for this purpose. However, two existing programs for
subsampling are Filtlong [@filtlong] and Seqtk [@seqtk]. Unfortunately, neither of these
tools provides subsampling to a specified coverage "out of the box".

Filtlong is technically a filtering tool, not a subsampling one. It scores each read
based on its length and quality and outputs the highest-scoring subset. Additionally,
minimum and maximum read lengths can be specified, along with the size of the subset
required. Ultimately, the subset produced by Filtlong is not necessarily representative
of the original reads but is biased towards those with the greatest length or quality.
While this may sound like a good thing, in some applications, such as genome assembly,
it has been shown that a random subsample produces superior results to a filtered subset
[@Maio2019].

Seqtk does do random subsampling via the sample subcommand. However, the only option
available is to specify the number of reads required. Thus, it is up to the user to
determine the number of reads required to reach the desired coverage. While this serves
for Illumina sequencing data, which generally have uniform(ish) read lengths, it does
not work for other modalities like PacBio and Nanopore, where read lengths vary
significantly.

Rasusa provides a random subsample of a read file (FASTA or FASTQ), with two ways of
specifying the size of the subset. One method takes a genome size and the desired
coverage, while the other takes a target number of bases (nucleotides). In the genome
size and coverage option, we multiply the genome size by the coverage to obtain the
target number of bases for the subset. As such, the resulting read file will have, on 
average, the amount of coverage requested. In addition, Rasusa allows setting a random seed
to allow reproducible subsampling. Other features include user control over whether the
output is compressed and specifying the compression algorithm and level.

Rasusa is 21 and 1.2 times faster than Filtlong and Seqtk, respectively.

# Availability

Rasusa is open-source and available under an MIT license at
https://github.com/mbhall88/rasusa.

# Acknowledgements

We acknowledge contributions from Pierre Marijon and suggestions from Zamin Iqbal. In
addition, MBH is funded by the EMBL International PhD Programme.

# References