The \eslmod{buffer} module provides an abstract layer for building input parsers. Different types of input -- including files, standard input, piped output from executed commands, C strings, and raw memory -- can be handled efficiently in a single API and a single object, an \ccode{ESL\_BUFFER}.
%The API is summarized in Table~\ref{tbl:buffer_api}.

The main rationale for \eslmod{buffer} is to enable multipass parsing of any input, even a nonrewindable stream or pipe. A canonical problem in sequence file parsing is that we need to know both the format (FASTA or Genbank, for instance) and the alphabet (protein or nucleic acid, for instance) in order to parse Easel-digitized sequence data records. To write ``smart'' parsers that automagically determine the file format and alphabet, so programs work transparently on lots of different file types without users needing to specify them, we need three-pass parsing: one pass to read raw data and determine the format, a second pass to parse the format for sequence data and determine its alphabet, and finally the actual parsing of digitized sequences. Multiple pass parsing of a nonrewindable stream, such as standard input or the output of a \ccode{gunzip} call, isn't possible without extra support. The \eslmod{buffer} module standardizes that support for all Easel input.

\subsection{Examples of using the buffer API}

Here's an example of using \eslmod{buffer} to read a file line by line:

\input{cexcerpts/buffer_example}

This shows how to open an input, get each line sequentially, do something to each line (here, count the number of x's), and close the input.

To compile this example, then run it on a file (any file would do, but here, \ccode{esl\_buffer.c} itself):

\user{gcc -I.
-o esl\_buffer\_example -DeslBUFFER\_EXAMPLE esl\_buffer.c easel.c -lm}
\user{./esl\_buffer\_example esl\_buffer.c}
\response{Counted 181 x's in 3080 lines.}

The most important thing to notice here is that the \ccode{esl\_buffer\_Open()} function implements a standard Easel idiom for finding input sources. If the \ccode{filename} argument is a single dash '-', it will read from \ccode{stdin}. If the \ccode{filename} argument ends in \ccode{.gz}, it will assume the file is a \ccode{gzip}-compressed input, and it will decompress it on the fly with \ccode{gzip -dc} before reading it. If it does not find the \ccode{filename} relative to the current directory, and if the second argument, \ccode{envvar} (here \ccode{"TESTDIR"}), is non-\ccode{NULL}, it looks at the setting of that environment variable, which should contain a colon-delimited list of directories to search to try to find \ccode{filename}. Therefore all of the following commands will work and give the same result:

\begin{userchunk}
  % ./esl_buffer_example esl_buffer.c
\end{userchunk}

\begin{userchunk}
  % cat esl_buffer.c | ./esl_buffer_example -
\end{userchunk}

\begin{userchunk}
  % cp esl_buffer.c foo
  % gzip foo
  % ./esl_buffer_example foo.gz
\end{userchunk}

\begin{userchunk}
  % cp esl_buffer.c ${HOME}/mydir2/baz
  % export TESTDIR=${HOME}/mydir1:${HOME}/mydir2
  % ./esl_buffer_example baz
\end{userchunk}

This idiomatic flexibility comes in handy when using biological data. Data are often kept in standard directories on systems (for example, we maintain a symlink \ccode{/misc/data0/databases/Uniprot} on ours), so having applications look for directory path listings in standardized environment variables can help users save a lot of typing of long paths. Data files can be big, so it's convenient to be able to compress them and not have to decompress them to use them.
It's convenient to have applications support the power of using UNIX command invocations in pipes, chaining the output of one command into the input of another, so it's nice to automatically have any Easel-based application read from standard input.

A couple of other things to notice about this example:

\begin{enumerate}
\item If \ccode{esl\_buffer\_Open()} fails, it still returns a valid \ccode{ESL\_BUFFER} structure, which contains nothing except a user-directed error message \ccode{bf->errmsg}. If you were going to continue past this error, you'd want to \ccode{esl\_buffer\_Close()} the buffer.

\item \ccode{esl\_buffer\_GetLine()} returns a pointer to the start of the next line \ccode{p}, and its length in chars \ccode{n} (exclusive of any newline character). It does \emph{not} return a string: \ccode{p[n]} is \emph{not} a \ccode{NUL} byte \verb+\0+. Standard C string functions, which expect \ccode{NUL}-terminated strings, can't be used on \ccode{p}. The reason is efficiency: the \ccode{ESL\_BUFFER} is potentially looking at a read-only exact image of the input, and \ccode{esl\_buffer\_GetLine()} is not wasting any time making a copy of it. If you need a string, with an appended \verb+\0+ in the right place, see \ccode{esl\_buffer\_FetchLineAsStr()}.
\end{enumerate}

\subsubsection{Reading tokens}

Because \ccode{ESL\_BUFFER} prefers to give you pointers into a read-only image of the input, the standard C \ccode{strtok()} function can't be used to define tokens (whitespace-delimited fields, for example), because \ccode{strtok()} tries to write a \verb+\0+ byte after each token it defines. Therefore \ccode{ESL\_BUFFER} provides its own token parsing mechanism. Depending on whether or not you include newline characters (\verb+\r\n+) in the list of separator (delimiter) characters, it either ignores newlines altogether, or it detects newlines separately and expects to find a known number of tokens per line.
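
To make the read-only constraint concrete, here is a minimal standalone sketch of defining tokens as pointer/length pairs in a memory image that is neither writable nor \ccode{NUL}-terminated. It is not Easel's implementation: the names \ccode{next\_token()}, \ccode{count\_tokens()}, \ccode{is\_sep()}, and the \ccode{TOK\_*} status codes are hypothetical, and treating a lone \verb+\r+ as a newline is an assumption made only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical status codes for this sketch only. */
enum { TOK_OK = 0, TOK_EOL = 1, TOK_EOF = 2 };

/* Return 1 if c is one of the characters in the NUL-terminated list sep. */
int is_sep(char c, const char *sep)
{
  for (; *sep != '\0'; sep++)
    if (*sep == c) return 1;
  return 0;
}

/* Find the next token in the read-only image buf[0..n-1], starting at *pos.
 * Nothing is written to buf; a token is reported as a pointer and a length.
 *   TOK_OK : *tok points at the token, *toklen is its length, *pos follows it
 *   TOK_EOL: a newline was found instead of a token; *pos follows the newline
 *   TOK_EOF: only separators (or nothing) remained
 */
int next_token(const char *buf, size_t n, size_t *pos, const char *sep,
               const char **tok, size_t *toklen)
{
  size_t i = *pos;
  size_t start;

  assert(buf != NULL && sep != NULL && i <= n);
  while (i < n && is_sep(buf[i], sep)) i++;        /* skip separators */
  if (i == n) { *pos = n; return TOK_EOF; }
  if (buf[i] == '\r' || buf[i] == '\n') {          /* newline not listed in sep */
    if (buf[i] == '\r' && i+1 < n && buf[i+1] == '\n') i++;
    *pos = i+1;
    return TOK_EOL;
  }
  start = i;
  while (i < n && !is_sep(buf[i], sep) && buf[i] != '\r' && buf[i] != '\n') i++;
  *tok    = buf + start;
  *toklen = i - start;
  *pos    = i;
  return TOK_OK;
}

/* Count tokens and TOK_EOL events in buf[0..n-1]. */
void count_tokens(const char *buf, size_t n, const char *sep,
                  size_t *ntok, size_t *nline)
{
  const char *tok;
  size_t      toklen, pos = 0;
  int         status;

  *ntok = *nline = 0;
  while ((status = next_token(buf, n, &pos, sep, &tok, &toklen)) != TOK_EOF) {
    if (status == TOK_OK) (*ntok)++; else (*nline)++;
  }
}

#ifdef TOKEN_SKETCH_MAIN
int main(void)
{
  static const char img[] = { 'a','b',' ','c','\n','d','e' };  /* no NUL byte */
  size_t ntok, nline;

  count_tokens(img, sizeof(img), " \t", &ntok, &nline);
  printf("%lu tokens, %lu newlines\n", (unsigned long) ntok, (unsigned long) nline);
  return 0;
}
#endif
```

Compiling with \ccode{-DTOKEN\_SKETCH\_MAIN} builds a small driver. Passing \verb+" \t\r\n"+ as the separator list instead of \verb+" \t"+ makes the sketch skip newlines like any other separator, so it never reports \ccode{TOK\_EOL}.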
For example, our x counting program could be implemented to parse every token instead of every line:

\input{cexcerpts/buffer_example2}

\user{gcc -I. -o esl\_buffer\_example2 -DeslBUFFER\_EXAMPLE2 esl\_buffer.c easel.c -lm}
\user{./esl\_buffer\_example2 esl\_buffer.c}
\response{Counted 181 x's in 14141 words.}

In the \ccode{esl\_buffer\_GetToken()} call, including \verb+\r\n+ with \verb+" \t"+ in the separators causes newlines to be treated as delimiters, like any space or tab character. If you omit \verb+\r\n+ newline characters from the separators, then the parser detects them specially anyway; when it sees a newline instead of a token, it returns \ccode{eslEOL} and sets the point to the next character following the newline. For example, we can count both lines and tokens:

\input{cexcerpts/buffer_example3}

\user{gcc -I. -o esl\_buffer\_example3 -DeslBUFFER\_EXAMPLE3 esl\_buffer.c easel.c -lm}
\user{./esl\_buffer\_example3 esl\_buffer.c}
\response{Counted 181 x's in 14141 words on 3080 lines.}

What happens if the last line in a text file is missing its terminal newline? In the example above, the number of lines would be one fewer; the nonterminated last line wouldn't be counted. \ccode{esl\_buffer\_GetToken()} would return \ccode{eslEOF} on the last line of the file, rather than \ccode{eslEOL} followed by \ccode{eslEOF} at its next call as it'd do if the newline were there.

\subsubsection{Reading fixed-width binary input}

You can also read fixed-width binary input directly into storage, including scalar variables, using the \ccode{esl\_buffer\_Read()} call. This is similar to C's \ccode{fread()}:

\input{cexcerpts/buffer_example4}

The \ccode{Read()} call needs to know exactly how many bytes \ccode{n} it will read. For variable-width binary input, see the \ccode{esl\_buffer\_Get()}/\ccode{esl\_buffer\_Set()} calls.

In fact all inputs are treated by \ccode{ESL\_BUFFER} as binary input.
That is, platform-dependent newlines are not converted automatically to C \verb+\n+ characters, as would happen when using the C \ccode{stdio.h} library to read an input stream in ``text mode''. You can freely mix different types of \ccode{esl\_buffer\_*} parsing calls as you see appropriate.

\subsubsection{A more complicated example, a FASTA parser}

An example of a simple FASTA parsing function:

\input{cexcerpts/buffer_example5a}

and an example of using that function in a program:

\input{cexcerpts/buffer_example5b}

One thing to note here is the use of \ccode{esl\_buffer\_Set()} to push characters back into the parser. For example, when we look for the starting '>', we do a raw \ccode{esl\_buffer\_Get()}, look at the first character, then call \ccode{esl\_buffer\_Set()} with \ccode{nused=1} to tell the parser we used 1 character of what it gave us. This is an idiomatic usage of the \ccode{esl\_buffer\_Get()}/\ccode{esl\_buffer\_Set()} pair. The \ccode{esl\_buffer\_Get()} call doesn't even move the point until the companion \ccode{esl\_buffer\_Set()} tells it where to move to.

The other idiomatic use of \ccode{esl\_buffer\_Set()} is to implement a ``peek'' at a next line or a next token, using a \ccode{esl\_buffer\_GetLine()}/\ccode{esl\_buffer\_Set()} or \ccode{esl\_buffer\_GetToken()}/\ccode{esl\_buffer\_Set()} combination. You see this in the sequence reading loop, where we get a line and want to peek at its first character. If it's a '>' we're seeing the start of the next sequence, so we want to return while leaving the point on the '>'. To do this, we use \ccode{esl\_buffer\_GetLine()} to get the line, and if the first char is a '>' we use \ccode{esl\_buffer\_Set()} to push the line pointer (with 0 used characters) back to the parser.

You can also see examples here of using \ccode{esl\_buffer\_FetchTokenAsStr()} and \ccode{esl\_buffer\_FetchLineAsStr()} to copy the name and description directly to allocated, \verb+\0+-terminated C strings.
Note how they interact: because \ccode{esl\_buffer\_FetchTokenAsStr()} moves the point past any trailing separator characters to the start of the next token, and because \ccode{esl\_buffer\_FetchLineAsStr()} doesn't need the point to be at the start of a line, the \ccode{esl\_buffer\_FetchLineAsStr()} call finds the description without leading spaces or trailing newline (but with any trailing spaces).

\subsection{Using anchors: caller-defined limits on random access}

The naive way to enable random access on a sequential stream is to slurp the whole stream into memory. If the stream is large, this may be very memory inefficient. Many parsers do not need full random access, but instead need a limited form of it -- for instance, the three-pass case of determining format and alphabet from the start of a sequence file.

\ccode{ESL\_BUFFER} allows the caller to set an \emph{anchor}: a start point in the input that is not allowed to go away until the caller says so.

Setting an anchor declares that \ccode{mem[anchor..n-1]} will not be overwritten by new input reads. A new input read may first relocate (``reoffset'') \ccode{mem[anchor..n-1]} to \ccode{mem[0..n-anchor-1]} in order to use its current allocation efficiently. Setting an anchor may therefore cause \ccode{mem} to be reoffset and/or reallocated, and \ccode{balloc} may grow, if the buffer is not large enough to hold everything starting from the \ccode{anchor} position. When no anchors are set, \ccode{mem} will not be reoffset or reallocated.

If we set an anchor at offset 0 in the input, then the entire input will be progressively slurped into a larger and larger allocation of memory as we read sequentially. We are guaranteed to be able to reposition the buffer anywhere from the anchor to \ccode{n-1}, even in a normally nonrewindable, nonpositionable stream. If we've read enough to determine what we need (format, alphabet...), we can release the anchor, and the buffer's memory usage will stop growing.
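
The reoffset-and-grow behavior described above can be illustrated with a small standalone sketch. This is a hypothetical miniature, not the \ccode{ESL\_BUFFER} implementation: the \ccode{MINIBUF} type, its fields, and the \ccode{minibuf\_*()} functions are invented for illustration, and the sketch assumes a single anchor and a fixed refill size. In particular, unlike the text's description of the no-anchor case, this sketch slides the unparsed bytes to the front on every refill.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of an anchored stream buffer (not ESL_BUFFER). */
typedef struct {
  FILE  *fp;      /* input stream                               */
  char  *mem;     /* buffer memory                              */
  size_t balloc;  /* bytes allocated for mem                    */
  size_t n;       /* mem[0..n-1] holds valid input              */
  size_t pos;     /* next byte to parse is mem[pos]             */
  long   anchor;  /* anchored offset in mem, or -1 if unset     */
  size_t chunk;   /* bytes requested from the stream per refill */
} MINIBUF;

/* Initialize bf to read from fp in pieces of <chunk> bytes. 0 on success. */
int minibuf_open(MINIBUF *bf, FILE *fp, size_t chunk)
{
  assert(bf != NULL && fp != NULL && chunk > 0);
  bf->fp     = fp;
  bf->mem    = malloc(chunk);
  bf->balloc = chunk;
  bf->n      = 0;
  bf->pos    = 0;
  bf->anchor = -1;
  bf->chunk  = chunk;
  return (bf->mem != NULL) ? 0 : -1;
}

/* Read up to bf->chunk more bytes. Everything from the anchor onward (or
 * from pos onward, if no anchor is set) is retained and slid down to mem[0];
 * mem grows only if the retained bytes leave no room for a new chunk.
 * Returns the number of new bytes, 0 at end of input, -1 if out of memory.
 */
long minibuf_refill(MINIBUF *bf)
{
  size_t keep_from = (bf->anchor >= 0) ? (size_t) bf->anchor : bf->pos;
  size_t kept, nread;

  assert(keep_from <= bf->pos && bf->pos <= bf->n);
  kept = bf->n - keep_from;
  memmove(bf->mem, bf->mem + keep_from, kept);      /* "reoffset" */
  bf->pos -= keep_from;
  bf->n    = kept;
  if (bf->anchor >= 0) bf->anchor = 0;

  if (bf->n + bf->chunk > bf->balloc) {             /* grow to fit a new chunk */
    char *tmp = realloc(bf->mem, bf->n + bf->chunk);
    if (tmp == NULL) return -1;
    bf->mem    = tmp;
    bf->balloc = bf->n + bf->chunk;
  }
  nread  = fread(bf->mem + bf->n, 1, bf->chunk, bf->fp);
  bf->n += nread;
  return (long) nread;
}

/* Pretend to parse the rest of the stream; return the largest n observed. */
size_t minibuf_drain(MINIBUF *bf)
{
  size_t maxn = bf->n;

  bf->pos = bf->n;
  while (minibuf_refill(bf) > 0) {
    if (bf->n > maxn) maxn = bf->n;
    bf->pos = bf->n;
  }
  return maxn;
}

#ifdef ANCHOR_SKETCH_MAIN
int main(void)
{
  MINIBUF bf;

  if (minibuf_open(&bf, stdin, 4096) != 0) return 1;
  if (minibuf_refill(&bf) < 0) return 1;
  bf.anchor = 0;                   /* anchor at the start of the input      */
  minibuf_drain(&bf);              /* all input since the anchor is kept    */
  bf.pos = (size_t) bf.anchor;     /* "rewind", even if stdin is a pipe     */
  printf("retained %lu bytes\n", (unsigned long) bf.n);
  free(bf.mem);
  return 0;
}
#endif
```

With the anchor set at offset 0, the sketch's allocation grows until it holds the whole input, and the caller can move \ccode{pos} back to the anchor. With no anchor set, the sketch never holds more than one chunk of already-parsed input, which is the memory trade-off that releasing an anchor is meant to restore.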
The functions that get a defined chunk of memory -- \ccode{esl\_buffer\_GetLine()}, \ccode{esl\_buffer\_GetToken()}, and \ccode{esl\_buffer\_CopyBytes()} -- set an anchor at the start of the line, token, or chunk of bytes before they go looking for its end. This takes advantage of the anchor mechanism to make sure that the buffer will contain the entire line, token, or chunk of bytes, not just a truncated part.

\subsection{Token-based parsing}

A \esldef{token} is a substring consisting of characters not in a set of caller-defined \esldef{separator} characters. Typically, separator characters might be whitespace (\verb+" \t"+).

Additionally, newlines are always considered to be separators. Tokens cannot include newlines.

In token-based parsing, we can handle newlines in two ways. Sometimes we might know exactly how many tokens we expect on the line. Sometimes we don't care.

If the caller knows exactly how many tokens are expected on each line of the input, it should not include newline characters in its separator string. Now, if the caller asks for a token but no token remains on the line, it will see a special \ccode{eslEOL} return code (and the parser will be positioned at the next character after that newline). A caller can check for this deliberately with one last call to \ccode{esl\_buffer\_GetToken()} per line, to be sure that it sees \ccode{eslEOL} rather than an unexpected token.

If the caller doesn't care how many tokens occur on each line, it should include newline characters (\verb+"\r\n"+) in the separator string. Then newlines are treated (and skipped) like any other separator.

Starting from the current buffer position, the procedure for defining a token is:

\begin{itemize}
\item Skip characters in the separator string. (If end-of-file is reached, return \ccode{eslEOF}.)

\item If the parser is on a newline, skip past it, and return \ccode{eslEOL}.
(Note that if the caller had newline characters in the separator string, the first step already skipped any newline, and no \ccode{eslEOL} return is possible.)

\item Anchor at the current buffer position, \ccode{p}.

\item From the current point, count characters \emph{not} in the separator string, \ccode{n}. (Expand/refill the buffer as needed.)

\item Define the token: \ccode{p[0..n-1]}.

\item Move the current point to the character following the token.
\end{itemize}

\subsection{Newline handling}

Easel assumes that newlines are encoded as \verb+\n+ (UNIX, Mac OS/X) or \verb+\r\n+ (MS Windows).

All streams are opened as binary data. This is necessary to guarantee a one:one correspondence between data offsets in memory and data offsets on the filesystem, which we need for file positioning purposes. It is also necessary to guarantee that we can read text files that have been produced on a system other than the system we're reading them on (that we can read Windows text files on a Linux system, for example).\footnote{That is, the usual ANSI C convention of reading/writing in ``text mode'' does not suffice, because it assumes the newlines of the system we're on, not necessarily the system that produced the file.} However, it makes us responsible for handling system-specific definitions of ``newline'' character(s) in ASCII text files.

\subsection{Implementation notes (for developers)}

\paragraph{The state guarantee.} An \ccode{ESL\_BUFFER} is exchangeable and sharable even amongst entirely different types of parsers because it is virtually always guaranteed to be in a well-defined state. Specifically:

\begin{itemize}
\item \ccode{bf->mem[bf->pos]} is ALWAYS positioned at the next byte that a parser needs to parse, unless the buffer is at EOF.

\item There are ALWAYS at least \ccode{pagesize} bytes available to parse, provided the input stream has not reached EOF.
\end{itemize}

\paragraph{State in different input type modes.} There are six types (``modes'') of inputs:

\begin{tabular}{ll}
Mode & Description \\ \hline
\ccode{eslBUFFER\_STDIN}   & Standard input. \\
\ccode{eslBUFFER\_CMDPIPE} & Output piped from a command. \\
\ccode{eslBUFFER\_FILE}    & A \ccode{FILE} being streamed. \\
\ccode{eslBUFFER\_ALLFILE} & A file entirely slurped into RAM. \\
\ccode{eslBUFFER\_MMAP}    & A file that's memory mapped (\ccode{mmap()}). \\
\ccode{eslBUFFER\_STRING}  & A string or memory. \\ \hline
\end{tabular}

The main difference between modes is whether the input is being read into the buffer's memory in chunks, or whether the buffer's memory effectively contains the entire input:

\begin{tabular}{lll}
                   & \ccode{STDIN, CMDPIPE, FILE} & \ccode{ALLFILE, MMAP, STRING} \\
\ccode{mem}        & input chunk: \ccode{mem[0..n-1]} is \ccode{input[baseoffset..baseoffset+n-1]} & entire input: \ccode{mem[0..n-1]} is \ccode{input[0..n-1]} \\
\ccode{n}          & current chunk size & entire input size (exclusive of \verb+\0+ on a \ccode{STRING}) \\
\ccode{balloc}     & $>0$; \ccode{mem} is reallocatable & 0; \ccode{mem} is not reallocated \\
\ccode{fp}         & open; \ccode{feof(fp) = TRUE} near EOF & \ccode{NULL} \\
\ccode{baseoffset} & offset of byte \ccode{mem[0]} in input & 0 \\
\end{tabular}

\paragraph{Behavior at end-of-input (``end-of-file'', EOF).} The buffer can be in three kinds of states with respect to how near to EOF it is, as follows.

During normal parsing, \ccode{bf->n - bf->pos >= bf->pagesize}:

\begin{cchunk}
mem->  {[. . . . . . . . . . . . . . . .] x x x x}
         ^ baseoffset  ^ pos            ^ n      ^ balloc
                       [~ ~ ~ ~ ~ ~ ~ ~]
                       n-pos >= pagesize
\end{cchunk}

As input is nearing EOF, and we are within the last \ccode{pagesize} bytes, \ccode{bf->n - bf->pos < bf->pagesize}:

\begin{cchunk}
mem->  {[. . . . . . . . . . . . . . . .] x x x x}
         ^ baseoffset            ^ pos  ^ n      ^ balloc
\end{cchunk}

In modes where we might be reading input in streamed chunks (\ccode{eslBUFFER\_STDIN}, \ccode{eslBUFFER\_CMDPIPE}, \ccode{eslBUFFER\_FILE}), \ccode{feof(bf->fp)} becomes \ccode{TRUE} when the buffer nears EOF.

When the input is entirely at EOF, \ccode{bf->pos == bf->n}:

\begin{cchunk}
mem->  {[. . . . . . . . . . . . . . . .] x x x x}
         ^ baseoffset                   ^ n      ^ balloc
                                        ^ pos
\end{cchunk}

\paragraph{The use of \ccode{esl\_pos\_t}.} All integer variables for a position or length in memory or in a file are of type \ccode{esl\_pos\_t}. In POSIX, memory positions are an unsigned integer type \ccode{size\_t}, and file positions are a signed integer type \ccode{off\_t}. Easel needs an integer type that we can safely cast to either \ccode{size\_t} or \ccode{off\_t}, and in which we can safely store a negative number as a status flag (such as -1 for ``currently unset''). \ccode{esl\_pos\_t} is defined as the largest signed integer type that can be safely cast to \ccode{size\_t} or \ccode{off\_t}.
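
As a rough illustration of why a single signed position type is convenient, the following standalone sketch uses a hypothetical stand-in, \ccode{sketch\_pos\_t}, which is simply assumed here to be a 64-bit signed integer; the real \ccode{esl\_pos\_t} is defined in Easel's own headers and may differ. The helper names are likewise invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>   /* off_t (POSIX) */

/* Hypothetical stand-in for esl_pos_t: assumed to be int64_t in this sketch. */
typedef int64_t sketch_pos_t;

/* A negative value can serve as a status flag because the type is signed. */
#define SKETCH_POS_UNSET ((sketch_pos_t) -1)

/* Return 1 if p is a real (nonnegative) position that survives a round trip
 * through size_t, so it is safe to use as a memory offset or length.
 */
int pos_fits_size_t(sketch_pos_t p)
{
  return p >= 0 && (sketch_pos_t) (size_t) p == p;
}

/* Return 1 if p is a real (nonnegative) position that survives a round trip
 * through off_t, so it is safe to use as a file offset. An out-of-range
 * conversion to a signed type is implementation-defined in C, which is why
 * the result is compared back against p rather than trusted.
 */
int pos_fits_off_t(sketch_pos_t p)
{
  return p >= 0 && (sketch_pos_t) (off_t) p == p;
}

#ifdef POS_SKETCH_MAIN
int main(void)
{
  sketch_pos_t anchor = SKETCH_POS_UNSET;   /* no anchor yet */
  sketch_pos_t pos    = 4096;

  printf("anchor set? %s\n", (anchor == SKETCH_POS_UNSET) ? "no" : "yes");
  printf("pos usable as size_t? %d, as off_t? %d\n",
         pos_fits_size_t(pos), pos_fits_off_t(pos));
  return 0;
}
#endif
```

The point of the sketch is only that one signed type can carry a flag value such as -1 and still be checked before being handed to interfaces that expect \ccode{size\_t} or \ccode{off\_t}.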