primerpincer

Crates.io	primerpincer
lib.rs	primerpincer
version	0.8.0
created_at	2025-11-12 20:25:46.474801+00
updated_at	2025-11-21 16:25:03.622867+00
description	A CLI primer trimming tool for long-read sequencing data
homepage	https://github.com/mauricebarrett/primerpincer
repository	https://github.com/mauricebarrett/primerpincer
id	1929900
size	115,658

(mauricebarrett)

documentation

https://github.com/mauricebarrett/primerpincer

README

🦀 PrimerPincer 🦀

Installation

Install cargo

First, install cargo via rustup:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Install primerpincer

Now you can install primerpincer. The most straightforward way is:

cargo install primerpincer

However, to enable SIMD optimizations in Sassy, build with native CPU features enabled:

RUSTFLAGS="-C target-cpu=native" cargo install primerpincer

About

PrimerPincer is a Rust-based command-line tool designed to efficiently detect and remove pairs (forward and reverse) of primers from single-end amplicon reads in FASTQ format, with a particular focus on long-read sequencing data generated by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT).

In amplicon-based microbiome studies, such as those targeting 16S, ITS, 18S, or COI regions, primer removal is a crucial preprocessing step. The phylogenetically conserved regions where primers bind are typically removed because:

They are phylogenetically uninformative, and their removal can improve the accuracy of downstream taxonomic classification.
They are susceptible to PCR-induced mutagenesis, and therefore may not accurately represent true biological sequences.
They add sequence length without contributing information, so removing them can reduce the computational cost of subsequent analyses.

The rise of third-generation sequencing platforms from PacBio and ONT has enabled the use of much longer marker gene regions than was previously feasible—such as the full-length 16S (V1–V9), 16S–ITS–23S operon, or 18S–ITS–28S operon. Additionally, the throughput and read counts produced per run continue to increase, driving a steady growth in the total volume of sequencing data generated.

PrimerPincer is designed to scale with these demands, providing rapid and accurate primer identification and removal for long-read datasets—with performance and scalability built for the future of sequencing.

Features

⚡ Lightning Fast

Rust-based performance with zero-cost abstractions
Parallel processing using Paraseq for multi-threaded FASTQ parsing and processing
SIMD optimizations available

🔍 Multiple Search Algorithms

Choose the best algorithm for your use case:

Sassy (default) - Approximate string matching as described in Beeloo and Groot Koerkamp (2025)
Myers - Approximate pattern matching algorithm as described in Myers (1999). Implementation is very similar to Edlib’s (Šošić and Šikić, 2017).
Hamming - Hamming distance string matching with mismatch tolerance (no indels)
BNDM - Exact match only. No mismatches or indels tolerated
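As a rough illustration of what mismatch-tolerant matching means, the sketch below slides a primer along a read and accepts the first window whose Hamming distance fits a mismatch budget. It is not PrimerPincer's implementation: the function names are hypothetical, the floor-based conversion from an error rate to a mismatch count is an assumption made for this example, and IUPAC codes are ignored.

```rust
/// Count mismatching positions between two equal-length byte slices.
fn hamming(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).filter(|(x, y)| x != y).count()
}

/// Return the start of the first window in `read` whose Hamming distance to
/// `primer` is at most floor(error_rate * primer length).
/// Assumption: the real tool may round the budget differently.
fn find_with_mismatches(read: &[u8], primer: &[u8], error_rate: f64) -> Option<usize> {
    if primer.is_empty() || read.len() < primer.len() {
        return None;
    }
    let max_mismatches = (error_rate * primer.len() as f64).floor() as usize;
    read.windows(primer.len())
        .position(|w| hamming(w, primer) <= max_mismatches)
}
```

With an error rate of 0.0 this reduces to exact matching. Because windows are compared position by position, an inserted or deleted base shifts everything after it, which is why a Hamming-style search tolerates mismatches but not indels.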

📦 Compression Format Support

Automatically handles common compression formats via niffler:

Input: gzip (.gz), zstd (.zst), xz (.xz), bzip2 (.bz2), and uncompressed FASTQ (auto-detected)
Output: User-selectable via --compression flag (gzip, bzip2, xz, zstd, or uncompressed; defaults to gzip)

🧬 IUPAC Aware

Full support for IUPAC nucleotide ambiguity codes in primer sequences:

Standard codes: R, Y, M, K, S, W, B, D, H, V, N
Automatically expands degenerate primers or uses degenerate-aware matching algorithms
Proper reverse complement handling for all ambiguity codes
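One common way to make matching degenerate-aware is to encode each IUPAC code as a 4-bit set over A, C, G and T and test whether two sets intersect. The sketch below shows that idea together with an ambiguity-preserving reverse complement. The function names are hypothetical and the code is only an illustration of the technique, not a description of how PrimerPincer or its matching libraries handle IUPAC codes internally.

```rust
/// Bitmask over {A=1, C=2, G=4, T=8} for an IUPAC nucleotide code.
fn iupac_mask(code: u8) -> u8 {
    match code.to_ascii_uppercase() {
        b'A' => 0b0001,
        b'C' => 0b0010,
        b'G' => 0b0100,
        b'T' | b'U' => 0b1000,
        b'R' => 0b0101, // A or G
        b'Y' => 0b1010, // C or T
        b'M' => 0b0011, // A or C
        b'K' => 0b1100, // G or T
        b'S' => 0b0110, // C or G
        b'W' => 0b1001, // A or T
        b'B' => 0b1110, // C, G or T
        b'D' => 0b1101, // A, G or T
        b'H' => 0b1011, // A, C or T
        b'V' => 0b0111, // A, C or G
        b'N' => 0b1111, // any base
        _ => 0,
    }
}

/// True when the primer code and the read base share at least one base.
fn iupac_matches(primer_code: u8, read_base: u8) -> bool {
    (iupac_mask(primer_code) & iupac_mask(read_base)) != 0
}

/// Complement of a single IUPAC code (result is upper case).
fn complement(code: u8) -> u8 {
    match code.to_ascii_uppercase() {
        b'A' => b'T',
        b'T' => b'A',
        b'C' => b'G',
        b'G' => b'C',
        b'R' => b'Y',
        b'Y' => b'R',
        b'M' => b'K',
        b'K' => b'M',
        b'B' => b'V',
        b'V' => b'B',
        b'D' => b'H',
        b'H' => b'D',
        other => other, // S, W and N are their own complements
    }
}

/// Reverse complement that keeps ambiguity codes meaningful.
fn reverse_complement(seq: &[u8]) -> Vec<u8> {
    seq.iter().rev().map(|&b| complement(b)).collect()
}
```

For example, the degenerate forward primer AGRGTTYGATYMTGGCTCAG reverse-complements to CTGAGCCAKRATCRAACYCT: R becomes Y, Y becomes R, and M becomes K.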

🔄 Orientation normalization

The tool checks forward orientation first, followed by reverse orientation:

If both primers are found in forward orientation of the read, the read is kept as-is
If not, the reverse orientation of the read is searched for both primers
If both primers are found in reverse orientation of the read, the read is kept and the reverse complement is output
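The orientation check above can be sketched as a two-step search: try the read as given, then try its reverse complement. The example below is a simplified illustration under stated assumptions: it uses exact matching as a stand-in for the real matchers, ignores the search window, quality strings and IUPAC codes, and assumes that trimming keeps only the sequence between the two primers. All function names are hypothetical.

```rust
/// Reverse complement for plain A/C/G/T sequences (other bytes pass through).
fn revcomp(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&b| match b {
            b'A' => b'T',
            b'T' => b'A',
            b'C' => b'G',
            b'G' => b'C',
            other => other,
        })
        .collect()
}

/// First exact occurrence of `needle` in `hay` (stand-in for the real matcher).
fn find(hay: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() || hay.len() < needle.len() {
        return None;
    }
    hay.windows(needle.len()).position(|w| w == needle)
}

/// Return the sequence between the forward primer and the reverse-complemented
/// reverse primer, if both are present in this orientation.
fn trim_oriented(read: &[u8], fwd: &[u8], rev: &[u8]) -> Option<Vec<u8>> {
    let start = find(read, fwd)? + fwd.len();
    let rev_rc = revcomp(rev);
    let end = start + find(&read[start..], &rev_rc)?;
    Some(read[start..end].to_vec())
}

/// Try the read as given, then its reverse complement.
/// `None` means the primer pair was not found in either orientation.
fn normalize_and_trim(read: &[u8], fwd: &[u8], rev: &[u8]) -> Option<Vec<u8>> {
    trim_oriented(read, fwd, rev).or_else(|| trim_oriented(&revcomp(read), fwd, rev))
}
```

Both primers are given 5' to 3', so on a forward-oriented read the reverse primer appears as its reverse complement near the 3' end. A read sequenced from the opposite strand yields the same output as its forward-oriented counterpart, which is the point of normalizing orientation.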

📏 Size filtering

Optional size filtering can be applied:

Minimum accepted amplicon length after trimming (--min-length)
Maximum accepted amplicon length after trimming (--max-length)
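Both bounds are optional and, per the option descriptions in the Usage section, inclusive. A minimal sketch of such a check follows; the function name and signature are hypothetical and not taken from the PrimerPincer source.

```rust
/// True when `len` lies within the optional, inclusive bounds.
/// A bound of `None` means that side is not filtered.
fn passes_length(len: usize, min: Option<usize>, max: Option<usize>) -> bool {
    min.map_or(true, |m| len >= m) && max.map_or(true, |m| len <= m)
}
```

With a minimum of 500, a trimmed read of exactly 500 bases is kept and one of 499 bases is rejected.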

✅ Quality filtering

Reads whose average Phred quality score falls below a user-specified threshold are filtered out:

The average basecall quality score is calculated as Wouter De Coster outlines in his blog
This aims to replicate the functionality of chopper
For more advanced quality trimming options, see chopper!
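The averaging approach referenced above works in probability space rather than averaging Phred scores directly: each score Q is converted to an error probability 10^(-Q/10), the probabilities are averaged, and the mean is converted back to a Phred score. The sketch below illustrates that calculation. The function name is hypothetical, the Phred+33 offset is an assumption about the input encoding, and the code is not taken from PrimerPincer or chopper.

```rust
/// Mean read quality computed from error probabilities.
/// `quals` holds FASTQ quality bytes, assumed to use the Phred+33 offset.
/// Returns `None` for an empty quality string.
fn mean_phred(quals: &[u8]) -> Option<f64> {
    if quals.is_empty() {
        return None;
    }
    // Sum of per-base error probabilities 10^(-Q/10).
    let total_error: f64 = quals
        .iter()
        .map(|&q| 10f64.powf(-f64::from(q.saturating_sub(33)) / 10.0))
        .sum();
    // Convert the mean error probability back to a Phred score.
    Some(-10.0 * (total_error / quals.len() as f64).log10())
}
```

The difference from a plain arithmetic mean can be large: a read with one Q10 base and one Q30 base has a mean error probability of 0.0505, which corresponds to roughly Q13 rather than Q20.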

Usage

PrimerPincer - a CLI tool for the rapid identification and removal of paired primers from long read amplicons

Usage: primerpincer [OPTIONS] --input <FILE> --output <FILE> --forward <SEQUENCE> --reverse <SEQUENCE>

Options:
  -i, --input <FILE>
          Input FASTQ file

  -o, --output <FILE>
          Output FASTQ file

  -f, --forward <SEQUENCE>
          Forward primer sequence (5' to 3' orientation)

  -r, --reverse <SEQUENCE>
          Reverse primer sequence (5' to 3' orientation)

  -a, --algorithm <ALGORITHM>
          Algorithm to use for primer matching

          Possible values:
          - sassy:   Pattern matching algorithm as described in Beeloo and Koerkamp (2025)
          - myers:   Rust Bio's Myers bit-parallel algorithm, very similar to Edlib's algorithm as described in Šošić and Šikić (2017)
          - hamming: Hamming distance algorithm as described in Waterman and Eggert (1987). Can tolerate mismatches but not indels
          - bndm:    Rust Bio's BNDM exact pattern matching algorithm as described in Baeza-Yates and Gonnet (1992). Exact matching only. No mismatch or indels tolerated

          [default: sassy]

  -e, --error-rate <FLOAT>
          Maximum error rate in primer matching (e.g., 0.15 for 15% errors)

          [default: 0.15]

  -w, --window-size <INT>
          Window size to search for primer at start and end of sequence

          [default: 100]

  -O, --overlap <MINLENGTH>
          Minimum overlap length. Require MINLENGTH bases of the primer to match (default 6)

          [default: 6]

  -t, --threads <INT>
          Number of threads to use

          [default: 4]

  -c, --compression <COMPRESSION>
          Compression format for the output FASTQ (defaults to gzip)

          Possible values:
          - none:  No compression; write plain text FASTQ
          - gzip:  Standard gzip compression
          - bzip2: bzip2 compression
          - xz:    LZMA/XZ compression
          - zstd:  Zstandard compression

          [default: gzip]

  -m, --min-length <INT>
          Minimum read length after trimming (inclusive)

  -M, --max-length <INT>
          Maximum read length after trimming (inclusive)

  -q, --min-average-quality <FLOAT>
          Minimum Average Quality Score

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Examples

primerpincer \
 -i ./example_data/raw/ATCC-MSA1003-toy-example.fastq.gz \
 -o ~/primerpincer_processed/ATCC-MSA1003-toy-example.fastq.gz \
 -f "AGRGTTYGATYMTGGCTCAG" \
 -r "RGYTACCTTGTTACGACTT"  \
 -t 12 \
 -a sassy \
 -O 6 \
 -m 500

Contributing

Contributions to PrimerPincer are welcome! Here are some ways you can contribute:

Reporting Issues

Report bugs or request features by opening an issue on GitHub
Include example data and error messages when reporting bugs
Describe your use case when requesting new features

Contributing Code

Fork the repository
Create a new branch for your feature (git checkout -b feature/amazing-feature)
Make your changes
Run tests to ensure everything works
Format your code using cargo fmt and ensure it passes cargo clippy --all-targets -- -D warnings
Commit your changes using Conventional Commits format (e.g., feat: add new algorithm, fix: resolve compilation error)
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

CI Checks: All pull requests will be automatically checked by our CI workflow (.github/workflows/ci.yaml):

Commit messages must follow the Conventional Commits specification (validated by Commitizen)
Code formatting must pass cargo fmt --all -- --check
Code compilation must pass cargo check --all-targets
Code linting must pass cargo clippy --all-targets -- -D warnings

All CI checks must pass before your PR can be merged.

Citation

If you use PrimerPincer in your research, please cite:

Beeloo, R. & Groot Koerkamp, R. Sassy: Searching Short DNA Strings in the 2020s. 2025.07.22.666207 Preprint at https://doi.org/10.1101/2025.07.22.666207 (2025).

Licence

This project is licensed under the MIT License - see the LICENSE file for details.
