Crates.io | grepq |
lib.rs | grepq |
version | 1.1.4 |
source | src |
created_at | 2024-11-03 06:57:54.534626 |
updated_at | 2024-11-10 02:36:45.726587 |
description | quickly filter fastq files by matching sequences to a set of regex patterns |
homepage | https://github.com/Rbfinch/grepq |
repository | https://github.com/Rbfinch/grepq |
max_upload_size | |
id | 1433431 |
size | 1,888,009 |
quickly filter fastq files by matching sequences to a set of regex patterns
1. Very fast and scales to large fastq files
On a Mac Studio with 32GB RAM and an Apple M1 Max chip, grepq processed a 104GB fastq file against 30 regex patterns in 88 seconds, about 1.2GB of fastq data per second. For the same fastq file and 30 regex patterns, getting an ordered count of each matched regex using the tune subcommand took less than five seconds for 100,000 fastq records.
For an 874MB fastq file, it was around 4.8 and 450 times faster than the general-purpose regex tools ripgrep and grep, respectively, on the same hardware.
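As a quick check of the throughput figure quoted above, the arithmetic can be reproduced in a few lines of Python. The variable names are illustrative only; the two input numbers come from the text.

```python
# Figures quoted in the text: 104GB processed in 88 seconds.
size_gb = 104
seconds = 88

# Average throughput in GB per second.
throughput = size_gb / seconds
print(f"{throughput:.2f} GB/s")  # rounds to the "about 1.2GB per second" quoted above
```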
2. Does not match false positives
grepq will only match regex patterns to the sequence field of a fastq record, which is the most common use case. By contrast, ripgrep and grep match the regex patterns against the entire fastq record, which includes the record ID, sequence, separator, and quality. This can lead to false positives and slow down the filtering process.
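The difference between the two matching strategies can be illustrated with a small Python sketch. This is not grepq's implementation (grepq is written in Rust). The sample data, the function names, and the assumption of strict four-line records are all hypothetical and exist only to show how matching the whole record can produce a false positive.

```python
import re

# Hypothetical two-record FASTQ sample; each record has four lines:
# ID, sequence, separator, quality.
FASTQ = """\
@read1 sample=GATTACA
CCCCGGGGTTTT
+
IIIIIIIIIIII
@read2
AAGATTACAAAC
+
IIIIIIIIIIII
"""

def records(text):
    """Yield (record_id, sequence, separator, quality) tuples."""
    lines = text.splitlines()
    for i in range(0, len(lines) - 3, 4):
        yield tuple(lines[i:i + 4])

def match_sequence_only(text, patterns):
    """Return IDs of records whose sequence field matches any pattern."""
    compiled = [re.compile(p) for p in patterns]
    return [rec[0] for rec in records(text)
            if any(rx.search(rec[1]) for rx in compiled)]

def match_whole_record(text, patterns):
    """Return IDs of records where any of the four lines matches."""
    compiled = [re.compile(p) for p in patterns]
    return [rec[0] for rec in records(text)
            if any(rx.search(line) for rx in compiled for line in rec)]

print(match_sequence_only(FASTQ, ["GATTACA"]))  # only read2
print(match_whole_record(FASTQ, ["GATTACA"]))   # read1 (via its ID line) and read2
```

Here read1 is reported by the whole-record search only because the pattern happens to appear in its ID line, which is the kind of false positive described above.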
3. Output matched sequences to one of three formats
The output format is selected by giving no option (the default), the -I option, or the -R option.
4. Will tune your pattern file with the tune subcommand
Use the tune subcommand to analyze matched substrings and update the number and/or order of regex patterns in your pattern file according to their matched frequency. This can speed up the filtering process.
Specifying the -c option to the tune subcommand will output the matched substrings and their frequencies, ranked from highest to lowest.
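The idea behind ranking matched substrings by frequency can be sketched in Python as follows. This is an illustration under stated assumptions, not grepq's actual algorithm: the function names and sample data are hypothetical, and counting every non-overlapping occurrence (rather than, say, one hit per record) is an assumption.

```python
import re
from collections import Counter

def count_matches(sequences, patterns):
    """Count every substring matched by any pattern across the sequences."""
    counts = Counter()
    for seq in sequences:
        for pattern in patterns:
            for m in re.finditer(pattern, seq):
                counts[m.group(0)] += 1
    return counts

def rank(counts):
    """Sort (substring, count) pairs from most to least frequent."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical sequences and patterns, for illustration only.
seqs = ["ACGTACGT", "TTACGTTT", "GGGGCCCC"]
pats = ["ACGT", "G{4}"]
for substring, n in rank(count_matches(seqs, pats)):
    print(f"{substring}: {n}")
```

Placing the most frequently matched patterns first means that a search which stops at the first matching pattern tends to do less work per record, which is one plausible reason reordering can speed up filtering.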
5. Supports inverted matching with the inverted subcommand
Use the inverted subcommand to output sequences that do not match any of the regex patterns in your pattern file.
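Inverted matching as described above amounts to keeping a sequence only when none of the patterns matches it. A minimal Python sketch of that logic follows; the function name and sample data are hypothetical and this is not grepq's own code.

```python
import re

def inverted(sequences, patterns):
    """Return the sequences that match none of the given regex patterns."""
    compiled = [re.compile(p) for p in patterns]
    return [s for s in sequences
            if not any(rx.search(s) for rx in compiled)]

# Hypothetical data: only the middle sequence matches neither pattern.
kept = inverted(["ACGTAC", "TTTTTT", "GGGCCC"], ["ACGT", "G{3}"])
print(kept)
```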
6. Plays nicely with your unix workflows
For example, see tune.sh in the examples directory. This simple script will filter a fastq file using grepq, tune the pattern file on a user-specified number of fastq records, and then filter the fastq file again using the tuned pattern file for a user-specified number of the most frequent regex pattern matches.
Get instructions and examples using grepq -h, and grepq tune -h and grepq inverted -h for more information on the tune and inverted subcommands, respectively.
grepq has been tested on Linux and macOS. It might work on Windows, but it has not been tested.
Update your Rust toolchain before installing:
rustup update
From crates.io (easiest method)
cargo install grepq
From source
cd into the grepq directory
Run cargo build --release
The executable is built in ./target/release
Add it to your PATH or use the full path to the executable
grepq -h will show you the available options and subcommands, with examples of how to use them.
Checksums to verify grepq is working correctly, using the regex file regex.txt and the small fastq file small.fastq, both located in the examples directory:
(note: replace ./target/release/grepq with grepq if you installed from crates.io)
./target/release/grepq ./examples/regex.txt ./examples/small.fastq > outfile.txt
sha256sum outfile.txt # checksum of outfile.txt if no option is given
ed0527a4d03481a50b365b03f5d952afab1df259966021699484cd9d59d790fc
./target/release/grepq -I ./examples/regex.txt ./examples/small.fastq > outfile.txt
sha256sum outfile.txt # checksum of outfile.txt if -I option is given
204bec425495f606611ba20605c6fa6e6d10627fc3203126821a2df8af025fb0
./target/release/grepq -R ./examples/regex.txt ./examples/small.fastq > outfile.txt
sha256sum outfile.txt # checksum of outfile.txt if -R option is given
67ad581448b5e9f0feae96b11f7a48f101cd5da8011b8b27a706681f703c6caf
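On systems without sha256sum, the same comparison can be done with Python's standard hashlib module. The sketch below is one possible way to do it: the helper names sha256_of and verify and the EXPECTED table are hypothetical, while the three digests are copied from the checksums listed above.

```python
import hashlib

# Expected digests copied from the checksums listed above.
EXPECTED = {
    "default": "ed0527a4d03481a50b365b03f5d952afab1df259966021699484cd9d59d790fc",
    "-I": "204bec425495f606611ba20605c6fa6e6d10627fc3203126821a2df8af025fb0",
    "-R": "67ad581448b5e9f0feae96b11f7a48f101cd5da8011b8b27a706681f703c6caf",
}

def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, option="default"):
    """Return True if the file's digest equals the expected one for the option."""
    return sha256_of(path) == EXPECTED[option]
```

After running one of the commands above, calling verify("outfile.txt", "-I") (or "default" or "-R") should return True if the output matches the corresponding checksum.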
Count of the top five most frequently matched patterns found in SRX26602697.fastq using the pattern file SARS-CoV-2.txt (this pattern file contains 64 sequences of length 60 from Table II of this preprint):
time grepq SARS-CoV-2.txt SRX26602697.fastq tune -n 10000 -c | head -5
GTATGGAAAAGTTATGTGCATGTTGTAGACGGTTGTAATTCATCAACTTGTATGATGTGT: 1595
CGGAACGTTCTGAAAAGAGCTATGAATTGCAGACACCTTTTGAAATTAAATTGGCAAAGA: 693
TCCTTACTGCGCTTCGATTGTGTGCGTACTGCTGCAATATTGTTAACGTGAGTCTTGTAA: 356
GCGCTTCGATTGTGTGCGTACTGCTGCAATATTGTTAACGTGAGTCTTGTAAAACCTTCT: 332
CCGTAGCTGGTGTCTCTATCTGTAGTACTATGACCAATAGACAGTTTCATCAAAAATTAT: 209
________________________________________________________
Executed in 236.47 millis fish external
usr time 203.88 millis 0.12 millis 203.76 millis
sys time 34.74 millis 13.57 millis 21.16 millis
see CHANGELOG
MIT