.TH "esl\-alipid" 1 "@EASEL_DATE@" "Easel @EASEL_VERSION@" "Easel Manual"
.SH NAME
esl\-alipid \- calculate pairwise percent identities for all sequence pairs in an MSA
.SH SYNOPSIS
.B esl\-alipid
[\fIoptions\fR]
.I msafile
.SH DESCRIPTION
.PP
.B esl\-alipid
calculates the pairwise percent identity of each sequence pair
in the MSA(s) in
.IR msafile .
For each sequence pair, it outputs a line of the form
.I <seqname1> <seqname2> <%id> <nid> <denomid> <%match> <nmatch> <denommatch>
where
.I <%id>
is the percent identity,
.I <nid>
is the number of identical aligned pairs,
and
.I <denomid>
is the denominator used for the calculation: the
shorter of the two (unaligned) sequence lengths.
The %identity is defined as 100*nid/denomid.
.PP
The last three fields are the pairwise percent match calculation, in
the pair\-HMM sense of a "match state" that aligns two residues XY
(whether identical or different), as opposed to the delete \-Y and
insert X\- states, which have a residue in one sequence and a gap
character in the other. That is, the %match is the percentage of the
alignment that consists of aligned residues rather than insertions
or deletions in either sequence.
The %match is defined as 100*nmatch/denommatch.
.PP
There are many ways that one could choose a denominator for these
percentages. We always define %id using MIN(len1,len2) as the
denominator. Multiple sequence alignments often contain short
sequence fragments that have very little overlap with each other, or
even none at all. Several other ways to calculate %identity, such as
ignoring columns with gaps
(100 * n_identities / (n_identities + n_mismatches)),
or dividing by the total alignment length
(100 * n_identities / ali_len),
are not robust to partially overlapping fragments or long indels,
because they can give spuriously high or low %id's.
.PP
For both %identity and %match calculations, alignments of a gap
character in both sequences, \-\-, aren't counted.
Also, if the denominator is zero (which can happen when two sequence
fragments in the same MSA don't overlap each other), the resulting
percentage is defined to be 0.
.PP
If
.I msafile
is \- (a single dash), alignment input is read from stdin.
.PP
Only canonical residues are counted toward
.I <nid>
and
.IR <n> .
Degenerate residue codes are not counted.
.SH OPTIONS
.TP
.B \-h
Print brief help; includes version number and summary of
all options, including expert options.
.TP
.BI \-\-informat " <s>"
Assert that input
.I msafile
is in alignment format
.IR <s> ,
bypassing format autodetection.
Common choices for
.I <s>
include:
.BR stockholm ,
.BR a2m ,
.BR afa ,
.BR psiblast ,
.BR clustal ,
.BR phylip .
For more information, and for codes for some less common formats,
see the main documentation.
The string
.I <s>
is case-insensitive (\fBa2m\fR or \fBA2M\fR both work).
.TP
.B \-\-amino
Assert that the
.I msafile
contains protein sequences.
.TP
.B \-\-dna
Assert that the
.I msafile
contains DNA sequences.
.TP
.B \-\-rna
Assert that the
.I msafile
contains RNA sequences.
.SH SEE ALSO
.nf
@EASEL_URL@
.fi
.SH COPYRIGHT
.nf
@EASEL_COPYRIGHT@
@EASEL_LICENSE@
.fi
.SH AUTHOR
.nf
http://eddylab.org
.fi