The \eslmod{alphabet} module contains routines for digitizing alphabetic biosequences. It is convenient to represent nucleotides and amino acids as array indices 0..3 or 0..19, respectively, for efficiency and other reasons. It is also convenient to index the residues in a biosequence in 1..L coordinates instead of the C language's 0..L-1 array representation, partly for human readability, and also because some codes (dynamic programming alignment algorithms, for example) have boundary conditions where initializing a boundary at coordinate 0 is convenient. Real biosequences do not consist of just four or twenty different canonical symbols, though. The \eslmod{alphabet} module provides mechanisms for dealing with several other biosequence coding issues: \begin{itemize} \item Degenerate residue symbols representing uncertainties, including both IUPAC/IUBMB standard one-letter nomenclature and nonstandard extensions such as the use of \ccode{J} to mean isoleucine or leucine (\ccode{I|L}) in protein sequences determined by mass spec; \item Symbols for ``any residue'' (N in DNA/RNA; X in amino acid sequences) and ``not a residue'' (a ``translated'' stop codon '*' in protein sequence; \item Standard and nonstandard symbols for unusual residues, such as selenocysteine (\ccode{U}) and pyrrolysine (\ccode{O}) in proteins; \item \emph{Ad hoc} symbols representing modified residues, such as the slew of one-letter codes used to annotate posttranscriptionally modified nucleotides in the Sprinzl tRNA database \citep{Sprinzl98}; \item Case-insensitivity of input sequences, for instance allowing both \ccode{a} and \ccode{A} to mean alanine in amino acid sequences; \item Tolerating common malpractices in the field, like the use of \ccode{X} instead of \ccode{N} as a degeneracy code in nucleic acid sequence; \item The semantic difference between a gap symbol and a missing data symbol in sequence alignments. \end{itemize} The \eslmod{alphabet} module provides standard defaults for protein, RNA, and DNA alphabets which follow both community standards and IUPAC/IUBMB nomenclature for representing sequence residues in one-letter ASCII characters. Additionally, the design of the \eslmod{alphabet} module is flexible enough to allow an application to customize its own alphabet, to deal with these issues almost any way it chooses. Table~\ref{tbl:alphabet_api} lists the functions in the \eslmod{alphabet} API. Easel maintains alphabet information in an \ccode{ESL\_ALPHABET} structure. An application usually creates its alphabet once, possibly even as a global variable. A digitized sequence \ccode{dsq} is an \ccode{unsigned char *} array of length \ccode{L+2}, where \ccode{dsq[1..L]} are digitized residues, and \ccode{dsq[0]} and \ccode{dsq[L+1]} are sentinel bytes (of value \ccode{eslSENTINEL}, 127). \begin{table}[hbp] \begin{center} \begin{tabular}{ll}\hline \multicolumn{2}{c}{\textbf{The \ccode{ESL\_ALPHABET} object}}\\ \ccode{esl\_alphabet\_Create()} & Create alphabet of standard type. \\ \ccode{esl\_alphabet\_CreateCustom()} & Create a custom alphabet. \\ \ccode{esl\_alphabet\_SetEquiv()} & Define an equivalent symbol. \\ \ccode{esl\_alphabet\_SetCaseInsensitive()} & Make an alphabet's input map case insensitive. \\ \ccode{esl\_alphabet\_SetDegeneracy()} & Define degenerate symbol in custom alphabet. \\ \ccode{esl\_alphabet\_Destroy()} & Frees an alphabet object. \\ \multicolumn{2}{c}{\textbf{Digitized sequences}}\\ \ccode{esl\_abc\_CreateDsq()} & Digitize a sequence into new \ccode{dsq} space. \\ \ccode{esl\_abc\_Digitize()} & Digitize a sequence into existing \ccode{dsq} space. \\ \multicolumn{2}{c}{\textbf{Other functions}}\\ \ccode{esl\_abc\_\{I,F,D\}AvgScore()} & Calculate avg score of degenerate residue.\\ \ccode{esl\_abc\_\{I,F,D\}ExpectScore()} & Calculate expected score of degenerate residue.\\ \ccode{esl\_abc\_Type()} & Convert internal alphabet type code to output string.\\ \ccode{esl\_abc\_\{F,D\}Count()} & Count a digital symbol towards a countvector.\\ \ccode{esl\_abc\_DigitizeSymbol()} & Returns digital code for one ASCII character.\\ \ccode{esl\_abc\_XIsDegenerate()} & Returns TRUE given code for a degeneracy.\\ \ccode{esl\_abc\_XIsBasic()} & Returns TRUE given code for a fundamental residue.\\ \ccode{esl\_abc\_XIsGap()} & Returns TRUE given code for a gap.\\ \ccode{esl\_abc\_CIsDegenerate()} & Returns TRUE given a degenerate character.\\ \ccode{esl\_abc\_CIsBasic()} & Returns TRUE given a fundamental residue.\\ \ccode{esl\_abc\_CIsGap()} & Returns TRUE given a gap character.\\ \hline \end{tabular} \end{center} \caption{The \eslmod{alphabet} API.} \label{tbl:alphabet_api} \end{table} \subsection{An example of using the alphabet API} Figure~\ref{fig:alphabet_example} shows an example of creating a DNA alphabet and digitizing a short DNA sequence. \begin{figure} \input{cexcerpts/alphabet_example} \caption{An example of using the \eslmod{alphabet} module.} \label{fig:alphabet_example} \end{figure} \begin{itemize} \item A standard biosequence alphabet is created using \ccode{esl\_alphabet\_Create(type)}, where \ccode{type} can be \ccode{eslDNA}, \ccode{eslRNA}, or \ccode{eslAMINO}. \item An input sequence \ccode{seq} of length \ccode{L} is digitized according to alphabet \ccode{a}, creating a newly allocated digital sequence \ccode{dsq}, by calling \ccode{esl\_dsq\_Create(a, seq, L, \&dsq)}. The caller must free \ccode{dsq} using \ccode{free(dsq)}. Alternatively, if the caller has already allocated \ccode{L+2} (or more) bytes in \ccode{dsq}, it can call \ccode{esl\_dsq\_Set(a, seq, L, dsq)}, which is the non-allocating version of \ccode{esl\_dsq\_Create()}. \item For an input sequence of length \ccode{L}, the digitized sequence \ccode{dsq} is a \ccode{char *} array of \ccode{L+2} bytes. \ccode{dsq[0]} and \ccode{dsq[L+1]} contain a sentinel byte of value \ccode{eslSENTINEL} (127). Positions \ccode{1..L} hold the residues, where values \ccode{0..3} encode \ccode{ACGT} in DNA sequences, \ccode{0..3} encode \ccode{ACGU} in RNA sequences, and \ccode{0..19} encode \ccode{AC..WY} in amino acid sequences. \item Both sequence-digitizing functions return \ccode{eslEINVAL} if the input sequence contains characters that are not in the alphabet. Because input sequences are often provided by a user (not the program), this is a common error that the application must check for. \end{itemize} \subsection{Concepts and terminology} A \esldef{symbol} is a 7-bit ASCII input character, representing a residue, gap, nonresidue, or degeneracy. A \esldef{code} is the digital internal representation of the symbol as an \ccode{unsigned char} in the range $0..127$, suitable for use as an array index. The \eslmod{alphabet} module translates input symbols into internal digital codes. We distinguish between an input alphabet, an internal alphabet, and a canonical alphabet. The \esldef{input alphabet} consists of all the symbols that Easel allows in an input biosequence character string. The \esldef{internal alphabet} is the standardized one-letter alphabet that Easel deals with. The \esldef{canonical alphabet} is the fundamental set of 4 nucleotides or 20 amino acids. Easel deals with all of the complications of sequence encoding using two concepts, equivalency and degeneracy. \esldef{Equivalency} defines how the input alphabet maps to the internal alphabet. \esldef{Degeneracy} defines how the internal alphabet maps to the canonical alphabet. Equivalent residues are symbols that are accepted in an input sequence character string and silently translated into an appropriate internal code. Characters in the input alphabet are mapped many-to-one to the internal alphabet using an \esldef{input map}. One use of equivalency is to map both lower and upper case input to the same internal symbol. Another use is to allow several different input characters to mean a gap, 'any' symbol, or 'nonresidue' symbol. Another use is to silently accept and ``fix'' nonstandard but common input ``errors'', such as tolerating the use of X to mean N in nucleic acid sequences. Degenerate residues are codes in the internal alphabet that are mapped one-to-many onto canonical residue codes, using a \esldef{degeneracy map}. In addition to mapping the degeneracy codes onto the canonical alphabet, the degeneracy mechanism is also used to deal with unusual and modified residues. Selenocysteine, for instance, is represented by default as a \ccode{U}, but treated as a degenerate code for \ccode{C} (cysteine). The rationale for this will be described in more detail below. \subsubsection{The internal alphabet} Easel's internal alphabet is a string (\ccode{a->sym}) of length \ccode{Kp}, which contains: \begin{itemize} \item the \ccode{K} symbols of the canonical alphabet; \item a standard gap symbol; \item (optionally) any other degenerate, unusual, or modified residue codes; \item a mandatory ``any'' symbol (a completely degenerate residue); \item a mandatory ``not-a-residue'' symbol; \item a mandatory ``missing data'' symbol. \end{itemize} Residues \ccode{0..K-1} must be the canonical alphabet. Residue \ccode{K} must be the gap symbol. Residues \ccode{K+1..Kp-4} must be the degenerate and modified residue symbols (there can be zero of these). Residue \ccode{Kp-3} must be the completely degenerate symbol (such as \ccode{X} for protein sequence or \ccode{N} for nucleic acid sequence); all alphabets must have such a symbol. Residue \ccode{Kp-2} must be the not-a-residue symbol. Residue \ccode{Kp-1} must be the missing data symbol. Because the 'any' symbol, 'not-a-residue' symbol, and the two kinds of gap symbols are mandatory in any alphabet, \ccode{Kp} $\geq$ \ccode {K+4}. Aside from these constraints, symbols may occur in any order. The digital code used for each residue is then the index of a residue in this string, \ccode{0..Kp-1}. The only other value that can appear in a digitized sequence is \ccode{eslSENTINEL} (127), the sentinel byte in positions \ccode{0} and \ccode{L+1} of a digitized sequence of length \ccode{L}. The rationale for the ordering is the following. Most applications will define residue scores in vectors and matrices that are smaller than the full range of the internal alphabet; for instance, it's common to only have \ccode{K} scores for the canonical residues. As much as possible, we want array indices to be the same whether we're accessing the full internal alphabet or a smaller score vector or matrix. So: we expect many applications to have score vectors or matrices that only contain the \ccode{K} canonical residues, so the canonical residues go first. We expect some applications to treat gaps as an extra symbol, and provide \ccode{K+1} position-specific scores or a \ccode{K+1} $\times$ \ccode{K+1} score matrix, so the gap character is next. We expect a few applications to optimize degeneracy scoring by precalculating them in \ccode{Kp-2} vectors or $Kp-2 \times Kp-2$ matrices, so the degeneracies go next (the gap character at $K$ might then go unused in the score vectors and matrices, but that's a minor inefficiency). The 'any' symbol should be at a predictable position in the degeneracy list, so it's arbitrarily placed at the end, in position \ccode{Kp-3}. The most robust applications will also handle the not-a-residue symbol (they may see translated stop codons), so it's next. Finally, the missing data symbol is expected to always require special handling when it occurs, rather than appearing in a score vector or matrix, so it's put last. \subsection{The standard alphabets: DNA, RNA, protein} The three standard internal alphabets are: \begin{table}[h] \begin{tabular}{llccrr} \textbf{Type} & \ccode{sym} & \textbf{equivs} & \textbf{gaps} & \ccode{K} & \ccode{Kp} \\ \ccode{eslRNA} & \ccode{ACGU-RYMKSWHBVDN*\~} & T=U; X=N & \ccode{-\_.} & 4 & 18 \\ \ccode{eslDNA} & \ccode{ACGT-RYMKSWHBVDN*\~} & U=T; X=N & \ccode{-\_.} & 4 & 18 \\ \ccode{eslAMINO} & \ccode{ACDEFGHIKLMNPQRSTVWY-BJZOUX*\~} & & \ccode{-\_.} & 20 & 29 \\ \end{tabular} \end{table} The \ccode{sym} string contains all the symbols that can be handled internally, and all the residues that can be represented when a digitized sequence is converted back to text. An application might still convert some characters for its own purposes before displaying an alphabetic string; for instance, to use different gap symbols for insertions versus deletions, or to use upper/lower case conventions to represent match/insert positions in a profile HMM alignment. The standard DNA and RNA alphabets follow published IUBMB recommendations (``Nomenclature for incompletely specified bases in nucleic acid'' \citep{IUBMB85}), with an addition of X as an equivalence for N (acquiescing to the \emph{de facto} BLAST filter standard of using X's to mask residues), and equivalencing T to U in RNA sequences (and vice versa in DNA). The one-letter code for amino acids follows section 3AA-21 of the IUPAC recommendations \citep{IUPAC84}. The code is augmented by U for selenocysteine, as recommended in 1999 by the JCBN/NC-IUBMB Newsletter (\url{http://www.chem.qmul.ac.uk/iubmb/newsletter/1999/item3.html}). It is also augmented by O for pyrrolysine and J for a leucine/isoleucine ambiguity (from a mass spectrometry experiment), following usage in the RESID database (\url{http://www.ebi.ac.uk/RESID/}). \subsection{Degenerate residues} The symbols from \ccode{K+1..Kp-4} in the internal alphabet are all treated as degenerate residues. When creating a custom alphabet, each degenerate symbol is initialized by calling \ccode{esl\_alphabet\_SetDegeneracy(alphabet, c, syms)} to assign degenerate alphabetic symbol \ccode{c} to the alphabetic symbols in the string \ccode{syms}. For example, \ccode{esl\_alphabet\_SetDegeneracy(a, 'R', \"AG\")} assigns R (purine) to mean A or G. For the standard biosequence alphabets, this is done automatically to define the proper degeneracy codes. For amino acid alphabets, the default code is: \begin{cchunk} esl_alphabet_SetDegeneracy(a, 'B', "ND"); esl_alphabet_SetDegeneracy(a, 'J', "IL"); esl_alphabet_SetDegeneracy(a, 'Z', "QE"); \end{cchunk} For RNA alphabets, the default code is: \begin{cchunk} esl_alphabet_SetDegeneracy(a, 'R', "AG"); esl_alphabet_SetDegeneracy(a, 'Y', "CU"); esl_alphabet_SetDegeneracy(a, 'M', "AC"); esl_alphabet_SetDegeneracy(a, 'K', "GU"); esl_alphabet_SetDegeneracy(a, 'S', "CG"); esl_alphabet_SetDegeneracy(a, 'W', "AU"); esl_alphabet_SetDegeneracy(a, 'H', "ACU"); esl_alphabet_SetDegeneracy(a, 'B', "CGU"); esl_alphabet_SetDegeneracy(a, 'V', "ACG"); esl_alphabet_SetDegeneracy(a, 'D', "AGU"); \end{cchunk} For DNA alphabets, the calls are is the same as for RNA code, but with \ccode{T} in place of \ccode{U}. \subsubsection{Implementation: the degeneracy map} The alphabet's degeneracy map is implemented in an array \ccode{a->degen[0..Kp-1][0..K-1]} of 1/0 (TRUE/FALSE) flags. \ccode{a->degen[x][y] == TRUE} indicates that the residue set $D(x)$ for degeneracy code \ccode{x} contains base residue \ccode{y}. \ccode{a->ndegen[x]} contains the cardinality $|D(x)|$, how many base residues are represented by degeneracy code \ccode{x}. For the two kinds of gap symbols, the degeneracy map is empty; all flags are FALSE and the cardinality is 0. Because character \ccode{Kp-3} in the internal alphabet is automatically assumed to be an ``any'' character (such as 'N' for DNA or RNA, 'X' for protein), \ccode{a->degen[Kp-3][i] = 1} for all $i=0..K-1$, and \ccode{a->ndegen[Kp-3] = K}. The storage of the degeneracy map is a little wasteful. We really only need rows \ccode{a->degen[K+1..Kp-3]}, but optimizing this would create some index translation hassles, and it doesn't seem worth it. \subsection{Equivalent residues} The concept of equivalent residues allows an input symbol to be mapped to a different internal symbol. One use of equivalence is to map both lower and upper case input to the same internal representation. Another use is to allow several different input characters to mean a gap. Another use is to silently accept and ``fix'' nonstandard but common input ``errors'', such as the use of T instead of U in RNA sequences (or vice versa in DNA), or the use of X instead of N as an ambiguity code in nucleic acid sequences. The call \ccode{esl\_alphabet\_SetEquiv(a, 'U', 'T')}, for example, makes an alphabet interpret \ccode{U} as a \ccode{T} (encoding both as \ccode{3}, in the case of the standard DNA and RNA alphabets). All three standard alphabets accept \ccode{\_} or \ccode{.} symbols as equivalences for the standard gap symbol \ccode{-}. An application can define additional gap characters, such as \ccode{,}, by calling \ccode{esl\_alphabet\_SetSynonym(a, ',', '-')} on one of the standard alphabets to define additional equivalences (that is, you don't have to create a custom alphabet to add new equivalences). \ccode{esl\_alphabet\_SetCaseInsensitive()} maps both upper case and lower case input alphabetic characters map to their equivalent in the internal alphabet in a case-insensitive manner. This function works only on residues that have already been declared to be part of the alphabet, so when defining a custom alphabet, it must be called after all individual equivalences have been defined. The standard alphabets are always set to be case insensitive. \subsubsection{Implementation of equivalent residues: the input map} Internally, an \textbf{input map}, \ccode{a->inmap[0..127]}, specifies how an input ASCII 7-bit text symbol is converted to digital code. \ccode{a->inmap['T'] = 3} in the standard DNA alphabet, for example, and the call \ccode{esl\_alphabet\_SetSynonym(a, 'U', 'T')} sets \ccode{a->inmap['U'] = a->inmap['T']}. The elements in input maps are of type \ccode{unsigned char}. Legal values are 0..127 (values that can be cast to the \ccode{unsigned char} codes in a digitized sequence) and two additional flags with negative values, \ccode{eslILLEGAL\_CHAR} (255) and \ccode{eslIGNORED\_CHAR} (254). \subsection{Unusual or modified residues} In addition to the canonical 4 or 20 residues and their ambiguity codes, there are many unusual and/or modified residues. For instance, there are many posttranscriptional or posttranslational modifications on residues in RNAs and proteins. Some databases try to capture this information in a single-letter alphabetic code, such as the Sprinzl transfer RNA database \cite{Sprinzl98}. Additionally, and perhaps more importantly, proteins are known to contain at least two additional genetically encoded amino acids, selenocysteine and pyrrolysine. Selenocysteine is represented by a \ccode{U} according to the IUPAC standard, and pyrrolysine is represented by a \ccode{O} in the RESID database at EBI. Unusual one-letter residue codes pose a tradeoff issue for sequence analysis applications. On the one hand, an application should recognize symbols for unusual or modified residues, and be able to represent them both internally and in any sequence output. For example, no application should read an input selenocysteine residue (\ccode{U}) and output it as a cysteine (\ccode{C}) -- this changes the original sequence and causes data corruption.\footnote{However, at least one the main public protein databases (UniProt) has already chosen to replace all selenocysteines with \ccode{C} and all pyrrolysines with \ccode{K}, for fear of breaking legacy sequence analysis software. So, this data corruption is already a fact of life.} On the other hand, most sequence analysis applications would not want to take the trouble to define a canonical alphabet larger than the usual 4 or 20 residues, and then have to parameterize that alphabet, just to be able to handle a few rare residues. (Pyrrolysine, for example, has only been found in a handful of proteins in a few Archaea.) It is useful to be able to deal with probability parameters and scores only for the canonical alphabet. However (on yet another hand!) in some cases one \emph{would} want to write a specialized application that parameterizes unusual residues as part of its canonical alphabet -- for instance, an application for analyzing posttranscriptional tRNA modifications, for example. Therefore, Easel must not force an input selenocysteine or pyrrolysine (or any other unusual residue) to be recoded as an arbitrary symbol (such as cysteine or lysine). That is, unusual symbols cannot be treated as equivalences, but must be allowed to be part of the internal alphabet. However, Easel \emph{can} allow unusual symbols to be treated as noncanonical, and \emph{score} them as some other arbitrary residue, as a reasonable approximation. Thus for most purposes, unusual symbols are best handled as a special kind of degeneracy, with a one-to-one degeneracy map from the unusual symbol to the ``closest'' canonical residue. Therefore, the default amino acid alphabet accepts selenocysteine (\ccode{U}) and pyrrolysine symbols (\ccode{O}) and represents them in the internal alphabet, and maps them as ``degeneracies'' onto cysteine (\ccode{C}) and lysine (\ccode{K}), respectively. When that behavior is not suitable, an application can also define any custom alphabet it chooses, as described below. \subsection{Creating a custom alphabet} Figure~\ref{fig:alphabet_example2} shows an example of creating a customized 22-letter amino acid alphabet that includes the \ccode{U} selenocysteine code and the \ccode{O} pyrrolysine code. \begin{figure} \input{cexcerpts/alphabet_example2} \caption{An example of creating a custom alphabet.} \label{fig:alphabet_example2} \end{figure} \subsection{Scoring degenerate residues} To score a degenerate residue code $x$, Easel provides two strategies. One set of functions assigns an average score: \[ S(x) = \frac{\sum_{y \in D(x)} S(y) } { |D(x)| }, \] where $D(x)$ is the set of residues $y$ represented by degeneracy code $x$ (for example, $D(\mbox{R}) = \{ \mbox{A,G} \}$), $| D(x) |$ is the number of residues that the degeneracy code includes, and $S(y)$ is the score of a base residue $y$. Because scores $S(y)$ are commonly kept as integers, floats, or doubles, depending on the application, three functions are provided that differ only in the storage type of the scores: \ccode{esl\_abc\_IAvgScore(a,x,sc)}, \ccode{esl\_abc\_FAvgScore(a,x,sc)}, and \ccode{esl\_abc\_DAvgScore(a,x,sc)} calculate and return the average score of residue \ccode{x} in alphabet \ccode{a} given a base score vector \ccode{sc[0]..sc[K-1]} for integers, floats, and doubles, respectively. A second set of functions assigns an expected score, weighted by an expected occurrence probability $p_y$ of the residues $y$ (often the random background frequencies): \[ S(x) = \frac{\sum_{y \in D(x)} p_y S(y) } { \sum_{y \in D(x)} p_y }, \] These three functions are \ccode{esl\_abc\_IExpectScore(a,x,sc,p)}, \ccode{esl\_abc\_FExpectScore(a,x,sc,p)}, and \ccode{esl\_abc\_DExpectScore(a,x,sc,p)}. For the integer and float versions, the probability vector is in floats; for the double version, the probability vector is in doubles. For efficiency reasons, an application might choose to preculate scores for all possible degenerate codes it might see. HMMER, for example, turns probability vectors of length \ccode{K} into score residues of length \ccode{Kp}. An application might also choose to score residues on-the-fly, using score vectors of length \ccode{K}. Each input residue \ccode{x} would then have to be tested to see if it is degenerate, before scoring it appropriately. \ccode{esl\_abc\_IsBasic(a, x)} returns \ccode{TRUE} if \ccode{x} is in the basic set of \ccode{K} residues in alphabet \ccode{a}, and \ccode{FALSE} otherwise. Similarly, \ccode{esl\_abc\_IsGap(a,x)} tests whether $x$ is a gap, and \ccode{esl\_abc\_IsDegenerate(a,x)} tests whether $x$ is a degenerate residue.