The NCBI BLAST databases are generated by the program \emph{formatdb}. The three files needed by Easel are index file, sequence file and header file.For protein databases these files end with the extension ".pin", ".psq" and ".phr" respectively. For DNA databases the extensions are ".nin", ".nsq" and ".nhr" respectively. The index file contains information about the database, i.e. version number, database type, file offsets, etc. The sequence file contains residues for each of the sequences. Finally, the header file contains the header information for each of the sequences. This document describes the structure of the NCBI version 4 database. If these files cannot be found, an alias file, extensions ".nal" or ".pal", is processed. The alias file is used to specify multiple volumes when databases are larger than 2 GB. \subsection{Index File (*.pin, *.nin)} The index file contains format information about the database. The layout of the version 4 index file is below: \bigskip \begin{center} \begin{tabular}{|l|l|p{3.5in}|} \hline Version & Int32 & Version number. \\ \hline Database type & Int32 & 0-DNA 1-Protein. \\ \hline Title length & Int32 & Length of the title string (\emph{T}). \\ \hline Title & Char[\emph{T}] & Title string. \\ \hline Timestamp length & Int32 & Length of the timestamp string (\emph{S}). \\ \hline Timestamp &Char[\emph{S}] & Time of the database creation. The length of the timestamp \emph{S} is increased to force 8 byte alignment of the remaining integer fields. The timestamp is padded with NULs to achieve this alignment. \\ \hline Number of sequences & Int32 & Number of sequences in the database (\emph{N}) \\ \hline Residue count & Int64 & Total number of residues in the database. Note: Unlike other integer fields, this field is stored in little endian. \\ \hline Maximum sequence & Int32 & Length of the longest sequence in the database \\ \hline Header offset table & Int32[\emph{N+1}] & Offsets into the sequence's header file (*.phr, *nhr). \\ \hline Sequence offset table & Int32[\emph{N+1}] & Offsets into the sequence's residue file (*.psq, *.nsq). \\ \hline Ambiguity offset table & Int32[\emph{N+1}] & Offset into the sequence's residue file (*.nsq). Note: This table is only in DNA databases. If the sequence does not have any ambiguity residues, then the offset points to the beginning of the next sequence. \\ \hline \end{tabular} \end{center} \bigskip The integer fields are stored in big endian format, except for the residue count which is stored in little endian. The two string fields, timestamp and title are preceded by a 32 bit length field. The title string is not NUL terminated. If the end of the timestamp field does not end on an offset that is a multiple of 8 bytes, NUL characters are padded to the end of the string to bring it to a multiple of 8 bytes. This forces all the following integer fields to be aligned on a 4-byte boundary for performance reasons. The length of the timestamp field reflects the NUL padding if any. The header offset table is a list of offsets to the beginning of each sequence's header. These are offsets into the header file (*.phr, *.nhr). The size of the header can be calculated by subtracting the offset of the next header from the current header. The sequence offset table is a list of offsets to the beginning of each sequence's residue data. These are offsets into the sequence file (*.psq, *.nsq). The size of the sequence can be calculated by subtracting the offset of the next sequence from the current sequence. Since one more offset is stored than the number of sequences in the database, no special code is needed in calculating the header size or sequence size for the last entry in the database. \subsection{Protein Sequence File (*.pin)} The sequence file contains the sequences, one after another. The sequences are in a binary format separated by a NUL byte. Each residue is encoded in eight bits. \bigskip \begin{center} \begin{tabular}{|c|c|c|c|c|c|c|c|} \hline Amino acid & Value & Amino acid & Value & Amino acid & Value & Amino acid & Value \\ \hline - & 0 & G & 7 & N & 13 & U & 24 \\ \hline A & 1 & H & 8 & O & 26 & V & 19 \\ \hline B & 2 & I & 9 & P & 14 & W & 20 \\ \hline C & 3 & J & 27 & Q & 15 & X & 21 \\ \hline D & 4 & K & 10 & R & 16 & Y & 22 \\ \hline E & 5 & L & 11 & S & 17 & Z & 23 \\ \hline F & 6 & M & 12 & T & 18 & * & 25 \\ \hline \end{tabular} \end{center} \subsection{DNA Sequence File (*.nsq)} The sequence file contains the sequences, one after another. The sequences are in a binary format but unlike the protein sequence file, the sequences are not separated by a NUL byte. The sequence is first compressed using two bits per residue then followed by an ambiguity correction table if necessary. If the sequence does not have an ambiguity table, the sequence's ambiguity index points to the beginning of the next sequence. \subsubsection{Two-bit encoding} The sequence is encoded first using two bits per nucleotide. \bigskip \begin{center} \begin{tabular}{|c|c|c|} \hline Nucleotide & Value & Binary \\ \hline A & 0 & 00 \\ \hline C & 1 & 01 \\ \hline G & 2 & 10 \\ \hline T or U & 3 & 11 \\ \hline \end{tabular} \end{center} \bigskip Any ambiguous residues are replaced by an 'A', 'C', 'G' or 'T' in the two bit encoding. To calculate the number of residues in the sequence, the least significant two bits in the last byte of the sequence needs to be examined. These last two bits indicate how many residues, if any, are encoded in the most significant bits of the last byte. \subsubsection{Ambiguity Table} To correct a sequence containing any degenerate residues, an ambiguity table follows the two bit encoded string. The start of the ambiguity table is pointed to by the ambiguity table index in the index file, "*.nin". The first four bytes contains the number of 32 bit words in the correction table. If the most significant bit is set in the count, then two 32 bit entries will be used for each correction. The 64 bit entries are used for sequence with more than 16 million residues. Each correction contains three pieces of information, the actual encoded nucleotide, how many nucleotides in the sequence are replaced by the correct nucleotide and finally the offset into the sequences to apply the correction. For 32 bit entries, the first 4 most significant bits encodes the nucleotide. Their bit pattern is true of their representation, i.e. the value of 'H' is equal to ('A'~or~'T'~or~'C'). \bigskip \begin{center} \begin{tabular}{|c|c|c|c|c|c|c|c|} \hline Nucleotide & Value & Nucleotide & Value & Nucleotide & Value & Nucleotide & Value \\ \hline - & 0 & G & 4 & T & 8 & K & 12 \\ \hline A & 1 & R & 5 & W & 9 & D & 13 \\ \hline C & 2 & S & 6 & Y & 10 & B & 14 \\ \hline M & 3 & V & 7 & H & 11 & N & 15 \\ \hline \end{tabular} \end{center} \bigskip The next field is the repeat count which is four bits wide. One is added to the count giving it the range of 1 -- 256. The last 24 bits is the offset into the sequence where the replacement starts. The first residue start at offset zero, the second at offset one, etc. With a 24 bit size, the offset can only address sequences around 16 million residues long. To address larger sequences, 64 bit entries are used. For 64 bit entries, the order of the entries stays the same, but their sizes change. The nucleotide remains at four bits. The repeat count is increased to 12 bits giving it the range of 1 -- 4096. The offset size is increased to 48 bits. \subsection{Header File (*.phr, *.nhr)} The header file contains the headers for each sequence, one after another. The sequences are in a binary encoded ASN.1 format. The length of a header can be calculated by subtracting the offset of the next sequence from the current sequence offset. The ASN.1 definition for the headers can be found in the NCBI toolkit in the following files: asn.all and fastadl.asn. The parsing of the header can be done with a simple recursive descent parser. The five basic types defined in the header are: \begin{itemize} \item Integer -- a variable length integer value. \item VisibleString -- a variable length string. \item Choice -- a union of one or more alternatives. \item Sequence -- an ordered collection of one or more types. \item SequenceOf -- an ordered collection of zero or more occurrences of a given type. \end{itemize} \subsubsection{Integer} The first byte of an encoded integer is a hex \verb+02+. The next byte is the number of bytes used to encode the integer value. The remaining bytes are the actual value. The value is encoded most significant byte first. \subsubsection{VisibleString} The first byte of a visible string is a hex \verb+1A+. The next byte starts encoding the length of the string. If the most significant bit is off, then the lower seven bits encode the length of the string, i.e. the string has a length less than 128. If the most significant bit is on, then the lower seven bits is the number of bytes that hold the length of the string, then the bytes encoding the string length, most significant bytes first. Following the length are the actual string characters. The strings are not NUL terminated. \subsubsection{Choice} The first byte indicates which selection of the choice. The choices start with a hex value \verb+A0+ for the first item, \verb+A1+ for the second, etc. The selection is followed by a hex \verb+80+. Two NUL bytes follow the choice. \subsubsection{Sequence} The first two bytes are a hex \verb+3080+. The header is then followed by the encoded sequence types. The first two bytes indicates which type of the sequence is encoded. This index starts with the hex value \verb+A080+ for the first item, \verb+A180+ for the second, etc. then followed by the encoded item and finally two NUL bytes, \verb+0000+, to indicate the end of that type. The next type in the sequence is then encoded. If an item is optional and is not defined, then none of it is encoded including the index and NUL bytes. This is repeated until the entire sequence has been encoded. Two NUL bytes then mark the end of the sequence. \subsubsection{SequenceOf} The first two bytes are a hex \verb+3080+. Then the lists of objects are encoded. Two NUL bytes encode the end of the list.