Benchmarking methodology used internally in HMMER testing: the "profmark" benchmark. Table of contents: 1. Overview of files in the profmark directory. 2. create-profmark: creating a new benchmark dataset 3. format of {.tbl,.msa,.fa,.pos,.neg} files in a benchmark dataset 4. finding summary statistics of a benchmark dataset 5. pmark-master.pl: running a pmark benchmark, in parallel using SGE qsub 6. x-: benchmark driver scripts 7. format of benchmark results output files 8. rocplot: displaying results as ROC graphs ================================================================ = 1. Overview of files ================================================================ create-profmark.c : Creates a new benchmark dataset. pmark-master.pl : Master script that parallelizes the running of a benchmark. x-hmmsearch : H3 hmmsearch benchmark (subsidiary to pmark-master.pl) x-phmmer-fps : phmmer family-pairwise-search benchmark x-phmmer-consensus : phmmer consensus query benchmark rocplot.c : Constructs ROC plot of results, with bootstrapped confidence intervals. rocplot.pl : quick and dirty ROC plot, no bootstrapping. ================================================================ = 2. create-profmark: creating a new benchmark dataset ================================================================ create-profmark is compiled from ANSI C source create-profmark.c; see Makefile.in for build. Usage: ./create-profmark Example: ./create-profmark pmark Pfam-A.seed uniprot-sprot.fa Creates five output files: .tbl - table summarizing the benchmark .msa - MSA queries, stockholm format .fa - sequence targets, fasta format .pos - table summarizing positive test set .neg - table summarizing negative test set The format of these files is described in the following section. Briefly: each input alignment in the is split into a query alignment and a nonredundant set of test domains by two single-linkage clustering steps, such that no test domain has >= 25% identity to any query sequence and no test domain is >= 50% identical to any other test domain. Positive test sequences are constructed by embedding two test domains in a larger sequence, with nonhomologous segments generated by selecting a random segment of the and shuffling it. Negative test sequences are constructed by sampling a positive test sequence at random (for its segment length structure) and replacing all its segments with nonhomologous shuffled segments. In more detail, including options that change the default benchmark construction protocol, the procedure is as follows: : for each MSA in : : Filter out sequence "fragments". Defined as sequences < x*average length (in residues); default x=0.70; controlled by -F option. : Try to split into a training set and test set such that no test sequence has >= x% identity to any training sequence. Uses efficient single-linkage-clustering at x threshold; defines largest cluster as training set, all other clusters as test set. Pairwise % identity as defined by on the given alignment. Default x=25% identity; controlled by -1 option. : For the remaining test sequences, remove redundancy by performing another single linkage clustering by percent identity, at another threshold x. Default x=50%; controlled by -2 option. For each cluster, one test sequence is chosen at random as a representative. : If less than two test sequences have been identified that meet these criteria, fail and move on to the next MSA. (For instance, if all the sequences in the MSA are highly identical, don't bother using this MSA for a benchmark; it's too easy.) : Permute the order of the training set alignment. (This allows single-sequence-query tests to start on a random but reproducible single query sequence, simply by taking the first seq in the alignment.) Other than this (row) permutation, the training set is preserved in its original alignment. : The training set alignment is written to .msa. : The test set is now considered to be "test domains", and we construct positive test sequences by embedding these domains in nonhomologous sequence constructed from sequences in the as follows: : choose two test domains (by default; --single option makes it one) : select a sequence length from at random that is sufficiently long to embed both test domains. : embed the domain(s) at random positions in that sequence length : construct the non-domain regions of the test sequence with "nonhomologous" sequence segments. There are several ways to construct "nonhomologous" sequence segments, all of which (except --iid) start with selecting a random subsequence of appropriate length from the : : shuffle the selected subsequence. (default; --mono) : shuffle preserving diresidue composition. (--di) : synthesize with same monoresidue comp (--markov0) : synthesize with same diresidue comp (--markov1) : reverse C-to-N direction (--reverse) : synthesize iid sequence (--iid) : The embedded-domain test sequences are written in FASTA format to the .fa file. Details of how they were constructed (where all the sequence segments came from) are written to the .pos file. : Finally a set of negative test sequences is constructed. Default N=200000; controlled by the -N option. For each negative: : choose one of the positive test seqs at random; we will use the same set of segment lengths to construct a negative. : for every segment, insert a nonhomologous sequence generated by the same method we used for nonhomologous segments in the positive sequences. : The test sequences are appended in FASTA format to the to the .fa file. Details of how they were constructed (where all the sequence segments came from) are written to the .neg file. ================================================================ = 3. Format of {.tbl,.msa,.fa,.pos,.neg} files in a benchmark dataset ================================================================ The .tbl file is used by pmark-master.pl as a list of MSA queries in the benchmark. Each line has eight whitespace-delimited fields: For example: 2OG-FeII_Oxy 27% 167 141 0 113 16 8 3-alpha 24% 48 10 0 6 4 2 3_5_exonuc 24% 219 28 0 21 5 2 4HBT 20% 99 141 0 97 37 18 In more detail, these eight fields are: : name of the MSA query, as given in original : average % id in the training set alignment, as calculated by esl_dst_XAverageId(). : length of the training set alignment in columns. : number of sequences in original alignment after fragments were removed. : number of sequences that were excluded from the original alignment because they appeared to be fragments. (+ = number of seqs in original MSA) : number of sequences in the training set alignment saved to .msa : number of sequences that passed criteria for use as a possible embedded domain in the test sequences. (+ <= ) : number of embedded-domain test sequences saved to .fa The .msa file is a Stockholm file containing all the query alignments. The .fa file is a FASTA file containing the positive and negative test sequences. The .pos file is a log of where all the segments in the positive test sequences came from. [ ]... : name of the test sequence. These are constructed as /<#>/-/-, where <#> ranges from 1 to the number of test sequences, and - and - are the coordinates in the test sequence that correspond to homologous domains. This naming structure is exploited by benchmark scripts to know that the sequence is a positive, and to know exactly where the bounds of homologous domains are. In a single-domain embedded test (--single), the names are /<#>/-. : length of the test sequence in residues : length of the first nonhomologous segment : length of the first homologous test domain : length of the 2nd nonhomologous segment : length of the second homologous test domain (or 0) : length of the 3rd nonhomologous segment (or 0) then for each segment (5 for the default two-domain embedding; 3 for --single): : name of the source sequence for the segment; in for nonhomologous segments, and in the original MSA for homologous segments. : start coord in the source : end coord in the source The .neg file is a log of where all the segments in the negative test sequences came from. It has essentially the same format as the .pos file, except that negative sequences are all named <#> (example: "decoy2012"), and the benchmark scripts rely on this to identify negative test sequences. ================================================================ = 4. Finding summary statistics of a benchmark dataset ================================================================ Number of query alignments: wc -l .tbl Number of positives: wc -l .pos Number of negatives: wc -l .neg Test sequence length dist: esl-seqstat .fa # of training seqs: avg -f6 .tbl # of test seqs: avg -f8 .tbl ================================================================ = 5. pmark-master.pl: running a pmark benchmark, in parallel using SGE qsub ================================================================ Usage: ./pmark-master.pl Example: ./pmark-master.pl ~/releases/hmmer-release/build-icc ~/releases/hmmer-release h3-results 100 pmark ./x-hmmsearch pmark-master.pl is a wrapper that coarse-grain parallelizes a benchmark run for our SGE queue. This would typically be run in a notebook directory, which would have symlinks to the pmark-master.pl script and the driver x-* scripts, and also would have the .{tbl,msa,fa,pos,neg} files either present or symlinked. For each benchmark you run that day, you'd have a different ; for example, you might run a "h3" and "h2" benchmark. The name should be short because it is used to construct other names, including the SGE job name. The pmark-master.pl creates the , splits the .tbl file into separate tbl files called /tbl., and issues "qsub" commands to run the on each of these subtables. The jobs in SGE are named .. The is passed 7 arguments: where is /tbl. (the list of queries this parallelized instance of the benchmark driver is supposed to process), and is a file named /tbl.out. When all the driver script instances are done, there will be output files named /tbl.out. These files can be analysed and turned into ROC graphs using rocplot and/or rocplot.pl. ================================================================ = 6. x-: benchmark driver scripts ================================================================ A driver script gets the 7 arguments as described above: : path to top-level build directory of the program(s) to be tested, where executables can be found. When testing non-HMMER programs, this is the installation directory where those executables are found. : path to top-level src directory of the HMMER program(s) to be tested, where data and scripts may be found. When testing non-HMMER programs, this is nonetheless still a HMMER top-level src directory, where parser scripts and data may be found. (For example, BLAST output parsers: our demotic perl modules.) : name of the directory that temp files can be placed in, unique to the instantiation of pmark-master.pl that called us, so long as we use names that don't clash with the other -1 instances of driver scripts; for example, we can safely create tmp files that start with . : name of the tbl. file in that lists the query alignments this instantiation of the driver is supposed to work on. : the benchmark's MSA file; query alignments named in will be esl-alifetch'ed from here. : the benchmark's positive and negative sequences; this will be used as the target database for searches the driver runs. As a special case (i.e. hack), iterative search benchmarks may look for related files, such as , a sequence database to iterate on first before running on . : a whitespace-delimited tabular output file, one line per target sequence, described below. Using and allows us to easily construct regression tests of different HMMER versions and/or configurations. Why the BEGIN {} clause: you'll see a BEGIN {} clause wrapping cmdline argument parsing on many of the x- scripts. That's a trick that goes with "use lib ${top_srcdir}/easel/demotic", where we need the demotic library included, but we don't know where it is until we get ${top_srcdir} from the cmdline, so we defer the 'use lib'. Error handling: why don't these scripts call die() on errors? That's deliberate. If a step fails, the script prints an error message and skips to the next query, without cleaning up any tmp files from the failed query. Thus we continue collecting data from as much of the .tbl file of queries as possible, while saving tmp files that we'll need for debugging later. If you have the x- script fail w/ a fatal error, it will stop in the middle of a tbl file, which unnecessarily compromises the completeness of a benchmark (say, if one piddly query out of 10000 fails, there's no reason to bring down the whole benchmark). ================================================================ = 7. format of benchmark results output files ================================================================ The output in files /tbl.out have four fields: for example: 6e-41 144.1 UCH/6548/39-342/368-893 UCH 1.5e-34 123.1 UCH/6546/168-490/816-1424 UCH 0.17 14.4 zf-UBP/6823/3-74/209-283 UCH 1.4 11.4 decoy31607 UCH These can be concatenated and sorted by E-value: cat *.out | sort -g Analysis scripts can easily tell the difference between a true, false, and "ignored" comparison: - if the target is named "decoy", it's a negative (nonhomologous) comparison. - if the name of the query matches the first part of the name of the target, it's a true (homologous) comparison. - if the name of the query doesn't match the first part of the target name, we will ignore the comparison. This is a match between a positive sequence and a query that doesn't correspond to the positive test domains; it might be a false hit, but it also might be that the two alignments (the query and the one that generated the test sequence) are homologous. Thus it's easy to keep track (even during a run) of the top-scoring false positive: cat *.out | sort -g | grep decoy | head and using the rocplot.pl script, it's easy to see get a glimpse of how the ROC plot is turning out: cat *.out | sort -g | ../rocplot.pl | head ================================================================ = 8. rocplot: displaying results as ROC graphs ================================================================ rocplot is compiled from ANSI C source rocplot.c; see Makefile.in for build. Usage: ./rocplot Example: cat *.out | sort -g | ../rocplot pmark - > results.xy The output is an XMGRACE xydydy file, plotting fractional coverage of the positives (on the Y-axis; range 0..1) versus errors per query (on the X-axis; ranging from 1/(# of models) to 10.0 by default; see --min and --max options). For each point, a 95% confidence interval is denoted by the dydy points, as determined by "Bayesian" bootstrap resampling of the query alignments. Output from rocplot over a set of different benchmarks are usually concatenated into one XMGRACE input .dat file, and worked up into a display following a procedure akin to: cat bench1/*.out | sort -g | ./rocplot pmark - > bench1.dat cat bench2/*.out | sort -g | ./rocplot pmark - > bench2.dat cat bench1.dat > todays.dat cat bench2.dat >> todays.dat \cp todays.dat todays.agr xmgrace -settype xydydy -param ~/src/hmmer/profmark/pmark.param todays.agr then manually making the lines pretty colors as in All: symbols circle 56 opaque white fill no riser set 0 bench1 orange set 1 bench2 black and saving (as .agr) and exporting (as .eps): Figure: todays.{dat,agr,eps}