.TH "easel downsample" 1 "@EASEL_DATE@" "Easel @EASEL_VERSION@" "Easel Manual"
.SH NAME
easel downsample \- select random subset of lines or sequences
.SH SYNOPSIS
.B easel downsample
[\fIoptions\fR]
.I M
.I infile
.SH DESCRIPTION
.PP
Given an
.I infile
that contains
.I N
things (lines, sequences...),
.B easel downsample
randomly selects a subset of
.I M
of those things (M <= N) and outputs them to
.I stdout.
.PP
If
.I infile
is \- (a single dash) input is read from
.I stdin.
(The
.B \-S
option can't read from stdin.)
.PP
The default is to downsample individual lines from a text
.I infile.
With the
.B \-s
or
.B \-S
option,
.I infile
is a sequence file (in any format that Easel accepts), and it
downsamples sequence records.
.PP
Uses an efficient reservoir sampling algorithm that only requires
memory proportional to the sample size
.I M,
independent of the total input size
.I N,
and usually requires only a single pass through
.I infile.
Still, if
.I M
is large, memory usage could be a concern. The default line sampler
holds
.I M
lines in memory, so it uses about
.I ML
bytes of memory, for mean line length
.I L.
The
.B \-s
sequence sampler holds
.I M
sequence objects in memory, including metadata. The
.B \-S
"big" sequence sampler is a more memory efficient version that only
needs
.I 8M
bytes, but it has some restrictions on its use, described below.
.PP
Otherwise the magnitude of
.I M
is essentially unrestricted; it is a 64-bit integer.
.B easel downsample
is designed to handle samples of billions of sequences if necessary.
.SH OPTIONS
.TP
.B \-h
Print brief help; includes version number and summary of all options.
.TP
.B \-s
Sequence sampling.
.I infile
is a sequence file, in any valid Easel sequence format (including
multiple sequence alignment files). The sample of sequences needs to
fit in memory, so
.I M
should not be outrageously large. Because the sequences pass through
the Easel sequence data parser, there can be some metadata loss.
When
.I infile
is a multi-MSA file (e.g. Pfam or Rfam),
.I N
includes all alignments, not just the first one. The output is in
FASTA format.
.TP
.B \-S
"Big" sequence sampling.
.I M
can be reasonably outrageous (a billion sequences will require about
8G RAM).
.I infile
needs to be an actual file (not a pipe or stream), because this option
keeps only disk offsets to define the sample, then uses each offset to
go and seek each sequence record in the file. Additionally,
.I infile
must be an unaligned sequence file format, not in a multiple sequence
alignment format, because the mechanics of
.B \-S
assume that each sequence record is a contiguous chunk of the file.
Each sampled sequence record is echoed to the output, so each record
is exactly as it appeared in its native format; there is no metadata
loss, and the output is in the same format that
.I infile
was in.
.SH EXPERT OPTIONS
.TP
.BI \-\-seed " "
Set the random number seed to
.I ,
an integer >= 0. The default is 0, which means to use a randomly
selected seed. A seed > 0 will result in identical samples from
different runs of the same
.B easel downsample
command.
.SH SEE ALSO
.nf
@EASEL_URL@
.fi
.SH COPYRIGHT
.nf
@EASEL_COPYRIGHT@
@EASEL_LICENSE@
.fi
.SH AUTHOR
.nf
http://eddylab.org
.fi
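.SH EXAMPLES
.PP
The invocations below are illustrative sketches that use only the
options described in this page. The file names (lines.txt, seqs.fa),
the sample sizes, and the seed value are hypothetical, and it is
assumed that the seed is given as a separate argument after
.B \-\-seed.
.\" Assumption: lines.txt and seqs.fa are placeholder input files.
.PP
Sample 1000 lines from a text file, then the same from standard input:
.nf
    easel downsample 1000 lines.txt
    cat lines.txt | easel downsample 1000 \-
.fi
.PP
Sample 100 sequence records, with FASTA output:
.nf
    easel downsample \-s 100 seqs.fa
.fi
.PP
Sample a large number of records from an unaligned sequence file on
disk (not a pipe), echoing each record in its native format:
.nf
    easel downsample \-S 1000000 seqs.fa
.fi
.PP
Request a reproducible sample by fixing a seed > 0:
.nf
    easel downsample \-\-seed 42 \-s 100 seqs.fa
.fi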