Stockholm format was developed by the Pfam Consortium to support
extensible markup and annotation of multiple sequence alignments.

Why yet another alignment file format? Most importantly, the existing
formats of popular multiple alignment software (e.g. CLUSTAL, GCG MSF,
PHYLIP) do not support rich documentation and markup of the
alignment. And since there is not yet a standard accepted format for
multiple sequence alignment files, we don't feel too guilty about
inventing a new one.

\subsection{A minimal Stockholm file}

\begin{cchunk}
# STOCKHOLM 1.0

seq1  ACDEF...GHIKL
seq2  ACDEF...GHIKL
seq3  ...EFMNRGHIKL

seq1  MNPQTVWY
seq2  MNPQTVWY
seq3  MNPQT...
//
\end{cchunk}

The first line in the file must be \ccode{\# STOCKHOLM x.y}, where
\ccode{x.y} is a major/minor version number for the format
specification. This line allows a parser to instantly identify the
file format. There is currently only one version of Stockholm format,
\ccode{1.0}.

In the alignment, each line contains a name followed by the aligned
sequence. Neither the name nor the aligned sequence may contain
whitespace characters. Stockholm does not enforce any other character
conventions on the name or the aligned sequence. Typically, gaps
(indels) are indicated in an aligned sequence by a dash or period, but
Stockholm format does not require this. The alignment ends with a line
containing only \ccode{//}.

If the alignment is too long to fit on one line, the alignment may be
split into multiple blocks, with blocks separated by blank lines. The
number of sequences, their order, and their names must be the same in
every block. Within a given block, each (sub)sequence (and any
associated \ccode{\#=GR} and \ccode{\#=GC} markup, see below) is of
equal length, called the \emph{block length}. Block lengths may differ
from block to block; the block length must be at least one residue,
and there is no maximum.

Any line starting with a \ccode{\#} is considered to be a comment, and
is ignored. Other blank lines in the file are ignored.
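The rules above can be sketched as a small reader. This is only an illustrative Python sketch, not the parser used by HMMER or Pfam; the function name \ccode{parse\_stockholm} is hypothetical, and it assumes a single alignment held in memory as a string. It checks the header line, skips comments and blank lines, concatenates each sequence across blocks, and stops at the \ccode{//} line.

```python
def parse_stockholm(text):
    """Return (names, seqs) for one Stockholm alignment given as a string.

    Hypothetical helper for illustration only: names is a list in order of
    first appearance; seqs maps each name to its full aligned sequence.
    """
    lines = text.splitlines()
    # The first line must identify the format and its version.
    if not lines or not lines[0].startswith("# STOCKHOLM "):
        raise ValueError("missing '# STOCKHOLM x.y' header line")

    names = []
    seqs = {}
    for line in lines[1:]:
        line = line.strip()
        if line == "//":
            break                      # end of the alignment
        if not line or line.startswith("#"):
            continue                   # blank lines, comments, and markup
        fields = line.split()
        if len(fields) != 2:
            # Neither the name nor the sequence may contain whitespace.
            raise ValueError(f"expected 'name sequence', got: {line!r}")
        name, chunk = fields
        if name not in seqs:
            names.append(name)
            seqs[name] = ""
        seqs[name] += chunk            # join this block onto earlier blocks
    return names, seqs
```

Run on the example file above, this should give three sequences of 21 aligned columns each. The sketch does not verify that every block lists the same sequences in the same order, nor that all lines within a block have the same block length; a stricter reader would check both.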
All other annotation is added using a tag/value comment style. The
tag/value format is inherently extensible, and readily made
backwards-compatible; unrecognized tags will simply be ignored. Extra
annotation includes consensus and individual RNA or protein secondary
structure, sequence weights, a reference coordinate system for the
columns, and database source information including name, accession
number, and coordinates (for subsequences extracted from a longer
source sequence). See below for details.

It is usually easy to convert other alignment formats into a least
common denominator Stockholm format. For instance, SELEX, GCG's MSF
format, and the output of the CLUSTALW multiple alignment program are
all closely related interleaved formats.

\subsection{Syntax of Stockholm markup}

There are four types of Stockholm markup annotation, for per-file,
per-sequence, per-column, and per-residue annotation:

\begin{sreitems}{\emcode{\#=GR <seqname> <tag> <.....s.....>}}
\item [\emcode{\#=GF <tag> <text>}]
Per-file annotation. \ccode{<text>} is a free format text line
of annotation type \ccode{<tag>}. For example, \ccode{\#=GF DATE
April 1, 2000}. Can occur anywhere in the file, but usually
all the \ccode{\#=GF} markups occur in a header.

\item [\emcode{\#=GS <seqname> <tag> <text>}]
Per-sequence annotation. \ccode{<text>} is a free format text line
of annotation type \ccode{<tag>} associated with the sequence
named \ccode{<seqname>}. For example, \ccode{\#=GS seq1
SPECIES\_SOURCE Caenorhabditis elegans}. Can occur anywhere
in the file, but in single-block formats (e.g. the Pfam distribution)
will typically follow on the line after the sequence itself, and
in multi-block formats (e.g. HMMER output), will typically
occur in the header preceding the alignment but following the
\ccode{\#=GF} annotation.

\item [\emcode{\#=GC <tag> <..s..>}]
Per-column annotation. \ccode{<..s..>} is an aligned text line
of annotation type \ccode{<tag>}.
\ccode{\#=GC} lines are associated with a sequence alignment block;
\ccode{<..s..>} is aligned to the residues in the alignment block,
and has the same length as the rest of the block.
Typically \ccode{\#=GC} lines are placed at the end of each block.

\item [\emcode{\#=GR <seqname> <tag> <..s..>}]
Per-residue annotation. \ccode{<..s..>} is an aligned text line
of annotation type \ccode{<tag>}, associated with the sequence named
\ccode{<seqname>}.
\ccode{\#=GR} lines are associated with one sequence in a sequence
alignment block; \ccode{<..s..>} is aligned to the residues in that
sequence, and has the same length as the rest of the block.
Typically \ccode{\#=GR} lines are placed immediately following the
aligned sequence they annotate.
\end{sreitems}

\subsection{Semantics of Stockholm markup}

Any Stockholm parser will accept syntactically correct files, but is
not obligated to do anything with the markup lines. It is up to the
application whether it will attempt to interpret the meaning (the
semantics) of the markup in a useful way. At the two extremes are the
Belvu alignment viewer and the HMMER profile hidden Markov model
software package.

Belvu simply reads Stockholm markup and displays it, without trying to
interpret it at all. The tag types (\ccode{\#=GF}, etc.) are
sufficient to tell Belvu how to display the markup: whether it is
attached to the whole file, sequences, columns, or residues.

HMMER uses Stockholm markup to pick up a variety of information from
the Pfam multiple alignment database. The Pfam consortium therefore
agrees on additional syntax for certain tag types, so HMMER can parse
some markups for useful information. This additional syntax is imposed
by Pfam, HMMER, and other software of mine, not by Stockholm format
per se. You can think of Stockholm as akin to XML, and what my
software reads as akin to an XML DTD, if you're into that sort of
structured data format lingo.
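To make the split between syntax and semantics concrete, here is a minimal Python sketch of the purely syntactic step: recognizing the four markup types and splitting each line into its fields without interpreting any tag. The names \ccode{FIELD\_COUNTS} and \ccode{parse\_markup} are hypothetical, and the field layout it assumes (a tag then a value for \ccode{\#=GF} and \ccode{\#=GC}; a sequence name, a tag, then a value for \ccode{\#=GS} and \ccode{\#=GR}) follows the descriptions in the syntax section above.

```python
# Number of whitespace-separated fields expected after each marker
# (an assumption drawn from the syntax descriptions, not an official table).
FIELD_COUNTS = {"#=GF": 2, "#=GS": 3, "#=GC": 2, "#=GR": 3}


def parse_markup(line):
    """Split one markup line into a dict; return None for any other line."""
    fields = line.strip().split(None, 1)
    if len(fields) != 2 or fields[0] not in FIELD_COUNTS:
        return None                      # plain comment, sequence, or blank
    kind, rest = fields
    n = FIELD_COUNTS[kind]
    # Split off only the leading fields so free text keeps its spaces.
    parts = rest.split(None, n - 1)
    if len(parts) != n:
        raise ValueError(f"malformed {kind} line: {line!r}")
    if n == 3:
        seqname, tag, value = parts
    else:
        seqname = None
        tag, value = parts
    return {"kind": kind[2:], "seqname": seqname, "tag": tag, "value": value}
```

A viewer in the style of Belvu could stop here and display \ccode{value} according to \ccode{kind}. A program that interprets particular tags would add a second layer that dispatches on \ccode{tag}.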
The Stockholm markup tags that are parsed semantically by my software
are as follows:

\subsubsection{Recognized \#=GF annotations}

\begin{sreitems}{\emcode{TC  <f> <f>}}
\item [\emcode{ID  <s>}]
Identifier. \ccode{<s>} is a name for the alignment;
e.g. ``rrm''. One word. Unique in file.

\item [\emcode{AC  <s>}]
Accession. \ccode{<s>} is a unique accession number for the
alignment; e.g. ``PF00001''. Used by the Pfam database, for instance.
Often an alphabetical prefix indicating the database
(e.g. ``PF'') followed by a unique numerical accession.
One word. Unique in file.

\item [\emcode{DE  <s>}]
Description. \ccode{<s>} is a free format line giving
a description of the alignment; e.g.
``RNA recognition motif proteins''. One line. Unique in file.

\item [\emcode{AU  <s>}]
Author. \emcode{<s>} is a free format line listing the
authors responsible for an alignment; e.g.
``Bateman A''. One line. Unique in file.

\item [\emcode{GA  <f> <f>}]
Gathering thresholds. Two real numbers giving HMMER bit score
per-sequence and per-domain cutoffs used in gathering the
members of Pfam full alignments. See Pfam and HMMER
documentation for more detail.

\item [\emcode{NC  <f> <f>}]
Noise cutoffs. Two real numbers giving HMMER bit score
per-sequence and per-domain cutoffs, set according to the
highest scores seen for unrelated sequences when gathering
members of Pfam full alignments. See Pfam and HMMER
documentation for more detail.

\item [\emcode{TC  <f> <f>}]
Trusted cutoffs. Two real numbers giving HMMER bit score
per-sequence and per-domain cutoffs, set according to the
lowest scores seen for true homologous sequences that
were above the GA gathering thresholds, when gathering
members of Pfam full alignments. See Pfam and HMMER
documentation for more detail.
\end{sreitems}

\subsubsection{Recognized \#=GS annotations}

\begin{sreitems}{\emcode{WT  <f>}}
\item [\emcode{WT  <f>}]
Weight. \ccode{<f>} is a nonnegative real number giving the
relative weight for a sequence, usually used to compensate
for biased representation by downweighting similar sequences.
Usually the weights average 1.0 (i.e. the weights sum to
the number of sequences in the alignment) but this is not required.
Either every sequence must have a weight annotated, or none
of them can.

\item [\emcode{AC  <s>}]
Accession. \ccode{<s>} is a database accession number for
this sequence. (Contrast to \ccode{\#=GF AC} markup, which gives
an accession for the whole alignment.) One word.

\item [\emcode{DE  <s>}]
Description. \ccode{<s>} is one line giving a description for
this sequence. (Contrast to \ccode{\#=GF DE} markup, which gives
a description for the whole alignment.)
\end{sreitems}

\subsubsection{Recognized \#=GC annotations}

\begin{sreitems}{\emcode{SA\_cons}}
\item [\emcode{RF}]
Reference line. Any character is accepted as a markup for a
column. The intent is to allow labeling the columns with some
sort of mark.

\item [\emcode{SS\_cons}]
Secondary structure consensus. For protein alignments,
DSSP codes or gaps are accepted as markup: \ccode{[HGIEBTSCX.-\_]},
where H is alpha helix, G is 3/10-helix, I is $\pi$-helix,
E is extended strand, B is a residue in an isolated $\beta$-bridge,
T is a turn, S is a bend, C is a random coil or loop, and X is
unknown (for instance, a residue that was not resolved in a
crystal structure). For RNA alignments, the annotation is in WUSS
format. Minimally, the symbols \ccode{<} and \ccode{>} indicate
the two sides of a base pair, \ccode{.} indicates a single-stranded
position, and RNA pseudoknots are represented by alphabetic
characters, with upper case letters representing the 5' side of the
helix and lower case letters representing the 3' side. Note that
this limits the annotation to a maximum of 26 pseudoknots per
sequence.

\item [\emcode{SA\_cons}]
Surface accessibility consensus. 0-9, gap symbols, or X are
accepted as markup. 0 means $<$10\% accessible residue surface
area, 1 means $<$20\%, and so on up to 9, which means $<$100\%.
X means unknown structure.
\end{sreitems}

\subsubsection{Recognized \#=GR annotations}

\begin{sreitems}{\emcode{SA}}
\item [\emcode{SS}]
Secondary structure of an individual sequence. Uses the same symbols
as \ccode{\#=GC SS\_cons} above.

\item [\emcode{SA}]
Surface accessibility of an individual sequence. Uses the same symbols
as \ccode{\#=GC SA\_cons} above.
\end{sreitems}
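As an illustration of how the RNA secondary structure markup described under \ccode{\#=GC SS\_cons} can be interpreted, the following Python sketch recovers base-paired columns from an annotation string. It handles only the minimal symbols named above (\ccode{<} and \ccode{>} for nested pairs, upper and lower case letters for pseudoknots) and treats every other character as single-stranded; full WUSS notation has more symbols than this. The function name \ccode{base\_pairs} is hypothetical.

```python
def base_pairs(ss):
    """Return sorted 0-based (i, j) column pairs from a minimal WUSS string.

    Illustrative sketch only: '<' pairs with '>', and an upper case letter
    (5' side of a pseudoknot) pairs with the same letter in lower case.
    """
    stacks = {}   # opening symbol -> positions still waiting for a partner
    pairs = []
    for j, ch in enumerate(ss):
        if ch == "<" or "A" <= ch <= "Z":
            stacks.setdefault(ch, []).append(j)
        elif ch == ">" or "a" <= ch <= "z":
            opener = "<" if ch == ">" else ch.upper()
            if not stacks.get(opener):
                raise ValueError(f"unmatched {ch!r} at column {j}")
            # The most recent unpaired opener closes first (nested helices).
            pairs.append((stacks[opener].pop(), j))
        # Any other character is treated as single-stranded.
    if any(stacks.values()):
        raise ValueError("unclosed 5' symbols remain")
    return sorted(pairs)
```

For example, \ccode{base\_pairs("<<..AA..>>..aa")} pairs columns 0 and 9, 1 and 8, 4 and 13, and 5 and 12. Keeping one stack per opening symbol is what lets the pseudoknot letters cross the \ccode{<>} helix without being confused with it. The function is meant for RNA annotation only, since protein DSSP codes are also upper case letters.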