\documentclass{bioinfo} \copyrightyear{2022} \pubyear{2022} \usepackage{graphicx} \usepackage{hyperref} \usepackage{url} \usepackage{tabularx} \usepackage{amsmath} \usepackage[ruled,vlined]{algorithm2e} \newcommand\mycommfont[1]{\footnotesize\rmfamily{\it #1}} \SetCommentSty{mycommfont} \SetKwComment{Comment}{$\triangleright$\ }{} \usepackage{natbib} \bibliographystyle{apalike} \DeclareMathOperator*{\argmax}{argmax} \begin{document} \firstpage{1} \title[Aligning proteins to genomes with miniprot]{Protein-to-genome alignment with miniprot} \author[Li]{Heng Li$^{1,2}$} \address{$^1$Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA 02215, USA, $^2$Harvard Medical School, 10 Shattuck St, Boston, MA 02215, USA} \maketitle \begin{abstract} \section{Motivation:} Protein-to-genome alignment is critical to annotating genes in non-model organisms. While there are a few tools for this purpose, all of them were developed over ten years ago and did not incorporate the latest advances in alignment algorithms. They are inefficient and could not keep up with the rapid production of new genomes and quickly growing protein databases. \section{Results:} Here we describe miniprot, a new aligner for mapping protein sequences to a complete genome. Miniprot integrates recent techniques such as k-mer sketch and SIMD-based dynamic programming. It is tens of times faster than existing tools while achieving comparable accuracy on real data. \section{Availability and implementation:} \href{https://github.com/lh3/miniprot}{https://github.com/lh3/miniprot} \section{Contact:} hli@ds.dfci.harvard.edu \end{abstract} \section{Introduction} Sequencing technologies have been rapidly evolving in recent years. The advent of long-read sequencing, especially accurate long-read sequencing~\citep{Wenger:2019ab}, have enabled high-quality genome assembly at scale~\citep{Nurk:2020we,Cheng:2021aa,Cheng:2022aa}. After we sequence and assemble the genome of a new species, the immediate next step is to annotate genes. There are three ways to annotate gene structures: \emph{ab initio} gene prediction, aligning RNA-seq data from the same species and mapping known genes with cross-species alignment. While \emph{ab initio} gene prediction works well for bacterial genomes, it is error-prone for Eukaryotic genomes that may contain large introns. In a recent benchmark~\citep{Scalzitti:2020wg}, all the evaluated gene finders miss $\sim$50\% nucleotides in annotated exons and predict $\sim$50\% extra sequences not in exons. If we have RNA-seq data, we can map short or long RNA-seq reads~\citep{Dobin:2013kx,Li:2018ab} and reconstruct transcripts from the alignment~\citep{Kovaka:2019wf}. This will give much more accurate gene structures than \emph{ab initio} gene prediction. Unfortunately, RNA sequencing adds extra cost and may miss genes lowly expressed in the tissues being sequenced. We still rely on cross-species alignment to derive a complete gene set and to transfer known functional annotations to the new genome. For very closely related genomes, we can reconstruct gene structures from whole-genome alignment~\citep{Fiddes:2018wn} or from the alignment of gene regions~\citep{Shumate:2020ty}. These methods would not work well for genomes at longer evolutionary distances because intron sequences are less conserved and this will affect the quality of the alignment. Aligning the more conserved coding regions~\citep{Li:2007aa,Gotoh:2008aa} may alleviate the issue. However, for distantly related species, even coding nucleotide sequences are not conserved well. Just as we almost exclusively use protein sequences to reconstruct the phylogeny of distant homologs, Ensembl~\citep{Aken:2016wr} and other gene annotation pipelines~\citep{Haas:2008tv,Holt:2011tt,Bruna:2021ug} also heavily rely on protein-to-genome alignment especially when the annotation of closely related species is not available. There are several protein-to-genome aligners that pinpoint exact splice sites: GeneWise~\citep{Birney:1997vr,Birney:2004uy}, Exonerate~\citep{Slater:2005aa}, GeneSeqer~\citep{Usuka:2000vi}, GenomeThreader~\citep{DBLP:journals/infsof/GremmeBSK05}, genBlastG~\citep{She:2011aa}, ProSplign~\citep{Kapustin:2008tq} and Spaln2~\citep{Gotoh:2008aa,Iwata:2012aa}. They all take a protein and a nucleotide sequence as input and output spliced alignment between them. GeMoMa~\citep{Keilwagen:2019wz} additionally requires gene structures in the source genome as input. It aligns exons without splicing and then connects exon alignments. As gene structures are conserved across species, this strategy simplifies alignment and potentially reduces spurious hits but it would not easily work for proteins from a variety of species. Among the tools above, only Spaln2, GenomeThreader and GeMoMa are practical for whole-genome alignment. They can align several hundred proteins per CPU hour and may take a couple of days to align a few hundred thousand proteins often needed to annotate a genome without closely homology. Protein-to-genome alignment is time consuming. It is challenging to develop a fast and accurate protein-to-genome alignment algorithm. The core of such alignment is a dynamic programming (DP) that jointly considers affine gap penalties, introns and frameshift. It is perhaps the most complex DP for pairwise alignment. In addition, as we will show later, a successful aligner functions like a gene finder and has to properly model splice signals, which is not a trivial task, either. On top of these, we need to fit these complex methods to an efficient implementation with modern computing techniques. This is partly why we have over a hundred short-read mappers~\citep{Alser:2021tk} but only three protein-to-genome mappers capable of whole-genome alignment. In this article, we will describe miniprot, a new protein-to-genome aligner developed from scratch. We will demonstrate its performance and accuracy on real data along with the few existing algorithms. \begin{methods} \section{Methods} Miniprot broadly follows the seed-chain-extend strategy used by minimap2~\citep{Li:2018ab}. It indexes the genome k-mers in all six open reading frames (ORFs) on both strands. During alignment, miniprot extracts k-mers on a query protein, finds seed anchors and then performs chaining. It closes unaligned regions between anchors and extends from terminal anchors with dynamic programming (DP). \subsection{Notations of strings} For a string $T$, let $|T|$ be its length and $T[i]$, $i=1,\ldots,|T|$, be the $i$-th symbol in $T$. $T[i,j]$, $1\le i\le j\le|T|$, is the substring starting at $i$ and ending at $j$ inclusively. In this article, $T$ denotes the genome sequence over the nucleotide alphabet and $P$ denotes the protein sequence over the amino acid alphabet. \subsection{Reduced alphabet} There are twenty amino acids. We need at least five bits to encode each amino acid. To encode protein sequences more compactly, we reduce the amino acid alphabet using the SE-B(14) scheme by ~\citet{Edgar:2004aa}, except that we merge N and D. More exactly, we map amino acid groups to integers as follows: A$\to$0, ST$\to$1, RK$\to$2, H$\to$3, ND$\to$4, EQ$\to$5, C$\to$6, P$\to$7, G$\to$8, IV$\to$10, LM$\to$11, FY$\to$12, W$\to$13, $\ast$$\to$14 and X$\to$15, where $\ast$ denotes the stop codon and X denotes an amino acid. Under this encoding, if two amino acid groups only differ at the lowest bit (e.g. group `A' and `ST'), the two groups tend to be similar. We may flip the lowest bit of an integer to generate more seeds and thus to increase the seeding sensitivity. We did not use this strategy as miniprot seems reasonably sensitive on real data. \subsection{Indexing the genome} Miniprot only indexes a subset of k-mers in the genome. Suppose $\phi(a)$ maps an amino acid $a$ to a 4-bit integer with the scheme described above. The integer encoding of a $k$-long protein sequence $P$ can be recursively defined as $\phi(P)=\phi(P[1,k-1])\times16+\phi(P[k])$. $\phi(P)$ has $4k$ bits. Let $B=\psi(\phi(P))$ where $\psi(\cdot)$ is an invertible integer hash function~\citep{Li:2016aa} over $[0,2^{4k})$. Then $B$ is also an integer with $4k$ bits. By default, miniprot only indexes $B$ if the lowest bit of $B$ is 0. We thus sample 50\% of k-mers in average with a high-quality hash function $\psi(\cdot)$. Internally, miniprot treats each genome sequence and its reverse complement as two independent sequences. It enumerates all ORFs of 30 amino acids or longer and samples 6-mers from translated ORFs with the strategy above. For each selected k-mer $R$ at position $x$, miniprot stores $(\psi(\phi(R)), \lfloor x/256\rfloor)$ in a hash table with the key being $\psi(\phi(R))$ and the value being an array of positions. We do not retain the base resolution at the indexing step such that we can use 32-bit integers to store positions for a genome up to $2^{39}$ ($=2^{32}\cdot 256/2$) base pairs in size. Without binning, miniprot would have to use 64-bit integers to store positions in a human genome, which would double the index size. \subsection{Chaining} The miniprot chaining algorithm is similar to the minimap2 algorithm. However, because the miniprot index does not keep the exact genome positions, the gap size calculation needs to be modified. For completeness, we will describe the full chaining equation here. Let 2-tuple $(x,y)$ denote a seed match, also known as an anchor, between binned position $x$ on the genome and residue position $y$ on the protein. Suppose $(x_i,y_i)$ and $(x_j,y_j)$ are two anchors with $x_i\le x_j$ and $y_i256(\Delta x+1)$}\\ 0 & \mbox{otherwise} \end{array}\right. \end{equation} with $\Delta x=x_j-x_i$ and $\Delta y=y_j-y_i$. When $g(i,j)=0$, we do not know if there is a gap due to binning. Meanwhile, $g(i,j)>0$ indicates a definitive insertion to the genome and $g(i,j)<0$ indicates a definitive deletion. Given a list of anchors sorted by genomic position $x$, let $f(j)$ be the maximal chaining score up to the $j$-th anchor in the list. $f(j)$ can be calculated with \begin{equation} f(j)=\max\big\{\max_{1\le i