merkurio

Crates.io: merkurio
lib.rs: merkurio
version: 1.0.2
created_at: 2025-07-25 09:43:51.97363+00
updated_at: 2025-11-28 16:05:04.725275+00
description: Quick k-mer-based FASTA/FASTQ sequence record extraction, and SAM/BAM record filtering plus file annotation with k-mer tags.
repository: https://github.com/lschoenm/MerKurio
id: 1767350
size: 19,627,001
Lukas Schönmann (lschoenm)

documentation: https://lschoenm.github.io/MerKurio/

README

MerKurio: sequence extraction and annotation based on matching k-mers



MerKurio is a command line tool for extracting records from FASTA/FASTQ files based on k-mers, and for annotating and filtering aligned sequences in BAM/SAM format with k-mer tags.

It was developed to simplify downstream analysis of selected k-mers by tracing them back to their sequences in the original data.

MerKurio is designed to be user-friendly, flexible, fast, and memory efficient.

Table of Contents

- Documentation
- Features
- Example Workflow
- Usage
- Installation

Documentation

The full documentation is available at https://lschoenm.github.io/MerKurio/.

A quick overview of the features and usage is provided below.


Features

MerKurio provides two complementary subcommands:

  • 🔍 Extract: Search FASTA/FASTQ data for k-mers and write records with matching k-mers to the terminal or a new file.
    • Supports paired-end reads (a hit in one read extracts the whole pair).
  • 📑 Tag: Annotate BAM/SAM alignments with k-mer tags and filter them based on matching k-mers.
    • Adds a two‑letter tag (default km) with comma-separated matching k‑mers (follows the SAM format specification).
    • Optionally keeps only reads containing at least one k‑mer.
    • Multithreaded processing when working with BAM files.

Both commands share additional features:

  • Records detailed matching statistics (positions of k-mer occurrence, summary statistics, metadata).
    • Human readable output in plain text.
    • Structured JSON logs for easy machine parsing.
  • Reads compressed input files (.gz, .bz2, .xz).
  • Can search for reverse complements or only canonical forms of k-mers.
  • Case-insensitive search or conversion to lower-/uppercase.
  • Inverse matching to keep only those records without matches.
  • Query k-mers can be provided as command line arguments or in a file (FASTA or plain text).
  • File types are inferred automatically.
  • Record output can be suppressed to only record statistics.
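As a rough illustration of the reverse-complement and canonical-form options listed above, here is a minimal Python sketch. It is not MerKurio's implementation: the function names are hypothetical, and "canonical" is assumed to mean the lexicographically smaller of a k-mer and its reverse complement, which is a common convention but may differ from what MerKurio does.

```python
# Illustrative sketch only; these helpers are hypothetical and not part of MerKurio.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Assumed convention: the smaller of the k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


if __name__ == "__main__":
    for kmer in ("AAGC", "GCTT", "ACGT"):
        print(kmer, reverse_complement(kmer), canonical(kmer))
```

Under this convention a k-mer and its reverse complement map to the same canonical form, so a search on canonical forms treats both strands alike.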

Example Workflow

Two examples for Unix-like systems are provided in this repository:

A quick, minimal example with step-by-step instructions.

A more detailed practical example/tutorial, available in the repo or the docs.

Although the tool is flexible and can be used in a variety of ways, it was designed with the following workflow in mind:

  1. Have a list of k-mers of interest.
  2. Use MerKurio to extract the FASTA/FASTQ records containing these k-mers (and their reverse complements) from the original (paired-end) records.

If sequencing reads were extracted:

  1. Align the extracted reads to a reference genome.
  2. Use MerKurio to tag the aligned records in a SAM/BAM file with the significant k-mers they contain (and optionally filter the file).
  3. Analyze the matching statistics generated by MerKurio.
  4. Interpret the results, e.g., by visualizing the k-mers that are frequently tagged in the reads that align to a particular region of the genome.
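For the interpretation step, one possible starting point is to count how often each k-mer was tagged per reference sequence. The Python sketch below reads SAM text lines and assumes the tag appears as an optional field of the form km:Z:ACGT,TGCA; the Z (string) type and the function name are assumptions for illustration, and the example records are made up.

```python
from collections import Counter, defaultdict


def count_tagged_kmers(sam_lines, tag="km"):
    """Count tagged k-mers per reference name (RNAME) from SAM text lines.

    Assumes the tag is stored as '<tag>:Z:<kmer>,<kmer>,...'.
    """
    counts = defaultdict(Counter)
    prefix = f"{tag}:Z:"
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # a SAM record has 11 mandatory columns
            continue
        rname = fields[2]
        for optional in fields[11:]:
            if optional.startswith(prefix):
                for kmer in optional[len(prefix):].split(","):
                    if kmer:
                        counts[rname][kmer] += 1
    return counts


# Made-up records, for demonstration only.
example = [
    "@HD\tVN:1.6",
    "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tkm:Z:ACGT,CGTA",
    "read2\t0\tchr1\t200\t60\t8M\t*\t0\t0\tTTACGTTT\tIIIIIIII\tkm:Z:ACGT",
    "read3\t0\tchr2\t50\t60\t8M\t*\t0\t0\tGGGGGGGG\tIIIIIIII",
]
counts = count_tagged_kmers(example)
for rname, kmer_counts in counts.items():
    print(rname, dict(kmer_counts))
```

The per-reference counts could then be binned by the POS column (the fourth field) to look at specific regions.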

Usage

Run merkurio or one of its subcommands with the --help flag to see the available options and subcommands.

The extract Subcommand

Running merkurio extract will extract records from a sequence file (FASTA or FASTQ, format is inferred automatically) based on a list of query sequences (k-mers). The query sequences can be provided in a file or as a list of strings on the command line. Reverse complements can be included in the search. The extracted records are written to stdout or to a new file in the same format as the input file. The tool tries to select the most efficient algorithm for the given query sequences.

Detailed match statistics are written to stdout or to a file if specified, showing which records were hit by which query sequences, along with the zero-based position of each match. Matching statistics can also be saved in JSON format for easier parsing by other programs.
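To make "zero-based position" concrete, the Python sketch below reports where each query k-mer starts within a sequence. It uses a naive repeated substring search rather than whatever algorithm MerKurio selects, the function name is hypothetical, and counting overlapping occurrences is an assumption made here.

```python
def find_kmer_positions(sequence, kmers):
    """Return {kmer: [zero-based start positions]} for k-mers found in sequence.

    Naive scan for illustration; overlapping occurrences are reported.
    """
    hits = {}
    for kmer in kmers:
        if not kmer:  # ignore empty queries
            continue
        start = sequence.find(kmer)
        while start != -1:
            hits.setdefault(kmer, []).append(start)
            start = sequence.find(kmer, start + 1)
    return hits


if __name__ == "__main__":
    print(find_kmer_positions("ACGTACGT", ["ACGT", "CGTA", "TTTT"]))
```

A position of 0 means the k-mer starts at the first base of the record.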

An example usage of the extract subcommand to extract records from a FASTA file based on a list of k-mers and their reverse complements is shown below; logging information is written to stdout:

merkurio extract -i input.fasta -o output.fasta -f query_kmers.txt -r -l

In the next example, paired-end reads are extracted if either read contains the sequence ACGT or TGCA; the extracted records are written to stdout and logging information is written to a file called log.txt (the -i and -1 flags can be used interchangeably):

merkurio extract -1 input_R1.fastq -2 input_R2.fastq -o output -s ACGT TGCA -l log.txt
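The pair-level selection rule (a hit in one read extracts the whole pair) can be sketched in Python as follows. This is a simplified illustration, not MerKurio's code: reads are represented as hypothetical (name, sequence) tuples, and the invert parameter stands in for the inverse-matching feature.

```python
def extract_pairs(pairs, kmers, invert=False):
    """Keep a read pair if any k-mer occurs in either mate.

    pairs: iterable of ((name1, seq1), (name2, seq2)) tuples.
    invert=True keeps only the pairs without any match.
    """
    selected = []
    for read1, read2 in pairs:
        hit = any(kmer in read1[1] or kmer in read2[1] for kmer in kmers)
        if hit != invert:
            selected.append((read1, read2))
    return selected


# Made-up reads, for demonstration only.
pairs = [
    (("p1/1", "GGACGTGG"), ("p1/2", "CCCCCCCC")),  # hit in read 1
    (("p2/1", "GGGGGGGG"), ("p2/2", "AATGCAAA")),  # hit in read 2
    (("p3/1", "GGGGGGGG"), ("p3/2", "CCCCCCCC")),  # no hit
]
kept = extract_pairs(pairs, ["ACGT", "TGCA"])
print([read1[0] for read1, _ in kept])
```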

The tag Subcommand

Running merkurio tag will tag aligned sequences in a BAM/SAM file with k-mers. If a record contains one or more of the k-mers, it is annotated with a tag ("km" by default; must be exactly two characters long) and the respective k-mers. Multithreading is supported for BAM files. Optionally, only records matching at least one k-mer are kept.
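To show what such an annotation could look like in SAM text, the Python sketch below appends an optional field listing the matching k-mers. The Z (string) field type, the function name, and the example record are assumptions for illustration; the tag check follows the SAM specification's rule that a tag is a letter followed by a letter or digit.

```python
import re

# SAM optional-field tags: one letter followed by one letter or digit.
_TAG_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]")


def add_kmer_tag(sam_line, kmers, tag="km"):
    """Append '<tag>:Z:<kmer>,<kmer>,...' if the read sequence contains any k-mer."""
    if not _TAG_PATTERN.fullmatch(tag):
        raise ValueError(f"invalid SAM tag: {tag!r}")
    fields = sam_line.rstrip("\n").split("\t")
    sequence = fields[9]  # SEQ is the 10th mandatory column
    matches = [kmer for kmer in kmers if kmer in sequence]
    if matches:
        fields.append(f"{tag}:Z:{','.join(matches)}")
    return "\t".join(fields)


# Made-up record, for demonstration only.
record = "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"
print(add_kmer_tag(record, ["ACGT", "CGTA", "TTTT"]))
```

Records without a match are returned unchanged in this sketch; a filtering mode would drop them instead.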

Detailed match statistics are written to stdout or to a file if specified, showing which records were hit by which query sequences, along with the zero-based position of each match. Matching statistics can also be saved in JSON format for easier parsing. Output of the matching records can be suppressed if only the matching statistics are of interest.

An example usage of the tag subcommand to tag a BAM file with the k-mers in the file query_kmers.fasta, with SAM output:

merkurio tag -i input.bam -o output.sam -f query_kmers.fasta

In the next example, the k-mers are provided on the command line, and the search is also performed for their reverse complements. The tag is set to "MK". BAM file processing is done with 4 threads:

merkurio tag -i input.bam -o output.bam -s ACGT TGCA -r -p 4 -t MK

Installation

You can install MerKurio in several ways, depending on your system and whether you have Rust installed.

1. Precompiled Binaries (No Rust Needed)
2. Install via Cargo (Requires Rust)
3. Build Manually Without Installing (Requires Rust)

After installation, verify that it works by running:

merkurio --help

Or, if you didn't add it to your PATH:

./path/to/merkurio --help

Option 1: Precompiled Binaries (No Rust Needed)

Download a binary for Linux, Windows, or macOS from the releases page, then extract the archive:

tar -xzf path/to/release.tar.gz

On Linux/macOS, make it executable if needed:

chmod u+x path/to/merkurio

The merkurio-x86_64-unknown-linux-musl binary is compatible with a wider range of systems but can have worse performance.

Option 2: Install via Cargo (Requires Rust)

If you have a Rust toolchain that supports the 2024 edition (Rust 1.85 or later), the easiest way is:

cargo install merkurio

This pulls the latest version from crates.io.

To install a tagged release from GitHub:

cargo install --git https://github.com/lschoenm/MerKurio --tag vX.X.X

Option 3: Build Manually Without Installing (Requires Rust)

git clone https://github.com/lschoenm/MerKurio
cd MerKurio
cargo build --release

The binary will be in target/release/.

License

The code in this repository is licensed under the MIT license.

Test data and example files in the tests/ and example-minimal/ directories are licensed under the CC0 1.0 Universal license.
