| Crates.io | klassify |
| lib.rs | klassify |
| version | 0.1.6 |
| created_at | 2024-07-31 00:46:51.352755+00 |
| updated_at | 2025-08-21 22:30:39.325386+00 |
| description | Classify chimeric reads based on unique kmer contents |
| id | 1320486 |
| size | 30,819,114 |

Classify chimeric reads based on unique k-mer contents and identify the breakpoint locations.
The breakpoints can be due to:
While there are many tools that can identify structural variations, this tool is designed to compare progeny (e.g. F1) reads to the parental genome. The key idea is an extension of the trio binning approach: we use the unique k-mers from each chromosome/contig of the parental genomes to classify the reads that bridge two different chromosomes/contigs.
The following are examples of recombinant reads identified by this tool:

If you don't have Rust installed:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
With Rust installed, you can just install the software with cargo.
cargo install klassify
Typical installation time on a desktop computer is ~1 minute.
Additional dependencies include:
We have tested the latest versions (0.1.4 and above) on the following operating systems:
Suppose you have 3 input files, with a toy example available in examples:
- parents.genome.fa: the parental genomes
- f1_reads.fa: the progeny reads
- parent_reads.fa: the parental reads
The simplest way to run the tool is to use the following command:
klassify pipeline f1_reads.fa parent_reads.fa parents.genome.fa
That's it! The breakpoint locations in the parental genomes are in
f1_classify.roi.paired.regions. The output may look like this:
SoChr01B:71411-81028
SoChr01F:81751-88094
These lines indicate that the breakpoint lies between SoChr01B:71411-81028 and
SoChr01F:81751-88094. In this file, every two consecutive lines describe one pair of breakpoint regions.
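The two-lines-per-pair layout described above can be read back with a few lines of Python. This is only a minimal sketch: it assumes every non-blank line has the form contig:start-end, as in the example output, and the names Region, parse_region and read_breakpoint_pairs are hypothetical helpers, not part of klassify.

```python
import re
from typing import Iterable, List, NamedTuple, Tuple


class Region(NamedTuple):
    contig: str
    start: int
    end: int


# Assumed line format: contig:start-end (taken from the example output above).
_REGION_RE = re.compile(r"^(?P<contig>.+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(text: str) -> Region:
    """Parse one 'contig:start-end' line into a Region."""
    m = _REGION_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not a contig:start-end region: {text!r}")
    return Region(m["contig"], int(m["start"]), int(m["end"]))


def read_breakpoint_pairs(lines: Iterable[str]) -> List[Tuple[Region, Region]]:
    """Group consecutive region lines two by two, skipping blank lines."""
    regions = [parse_region(line) for line in lines if line.strip()]
    if len(regions) % 2 != 0:
        raise ValueError("expected an even number of region lines")
    return list(zip(regions[0::2], regions[1::2]))


if __name__ == "__main__":
    example = ["SoChr01B:71411-81028", "SoChr01F:81751-88094"]
    for left, right in read_breakpoint_pairs(example):
        print(left, right)
```

To use it on real output, pass an open file handle, for example `read_breakpoint_pairs(open("f1_classify.roi.paired.regions"))`.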
The pipeline command runs the entire workflow, which is sufficient for small genomes. For larger genomes,
users are encouraged to run the steps below individually for more control. Many steps
can run on multiple FASTA/FASTQ files (for example, after first splitting the input with faSplit) to achieve better parallelism on
larger datasets:
cd examples
mkdir -p ref f1_reads f1_classify parent_reads parent_classify
klassify build parents.genome.fa -o kmers.bc
This generates an index of all the unique k-mers, i.e. k-mers that are present in only a single contig/chromosome.
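To illustrate what "unique" means here, the following Python sketch keeps only the k-mers that occur in exactly one contig. It is a toy illustration under stated assumptions, not the klassify implementation: it looks at the forward strand only, does no canonicalization or handling of ambiguous bases, and the function name contig_unique_kmers is hypothetical.

```python
from typing import Dict


def contig_unique_kmers(contigs: Dict[str, str], k: int) -> Dict[str, str]:
    """Map each k-mer that occurs in exactly one contig to that contig's name.

    Simplifying assumptions: forward strand only, sequences held in memory.
    """
    owner: Dict[str, str] = {}  # k-mer -> the only contig seen so far
    shared = set()              # k-mers already seen in two or more contigs
    for name, seq in contigs.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in shared:
                continue
            previous = owner.get(kmer)
            if previous is None:
                owner[kmer] = name
            elif previous != name:
                # Seen in a second contig: no longer unique.
                del owner[kmer]
                shared.add(kmer)
    return owner


if __name__ == "__main__":
    toy = {"SoChr01A": "ACGTAC", "SoChr01B": "CGTTT"}
    for kmer, contig in sorted(contig_unique_kmers(toy, k=3).items()):
        print(kmer, contig)
```

A k-mer repeated within the same contig still counts as unique to that contig in this sketch; only k-mers found in two or more contigs are dropped.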
klassify classify kmers.bc f1_reads.fa -o f1_classify
klassify extract f1_classify.filtered.tsv f1_reads.fa -o f1_classify.fa
minimap2 -t 80 -ax map-hifi --eqx --secondary=no parents.genome.fa f1_classify.fa \
| samtools sort -@ 8 -o f1_classify.bam
klassify classify kmers.bc parent_reads.fa -o parent_classify
klassify extract parent_classify.filtered.tsv parent_reads.fa -o parent_classify.fa
minimap2 -t 80 -ax map-hifi --eqx --secondary=no parents.genome.fa parent_classify.fa \
| samtools sort -@ 8 -o parent_classify.bam
klassify regions f1_classify.bam parent_classify.bam
Note that at this stage, rough breakpoint locations (10 kb resolution) in the parental genomes are already available in
f1_classify.regions.tsv. To further refine the breakpoint locations, we run two more steps below.
klassify extract-bam f1_classify.regions.tsv f1_classify.bam
klassify breakpoint kmers.bc f1_classify.regions.fasta
minimap2 -t 80 -ax map-hifi --eqx --secondary=no parents.genome.fa f1_classify.regions.split.fasta \
| samtools sort -@ 8 -o f1_classify.roi.bam
klassify cluster-pairs f1_classify.roi.bam > f1_classify.roi.tsv
The breakpoint locations can then be visualized in IGV for read evidence in
f1_classify.bam, using parents.genome.fa as the reference.
Total expected run time on a desktop computer is ~1 minute.
The KLASSIFY pipeline identifies breakpoints using the set of F1 reads, with parent reads as a control. The breakpoints identified from the F1 reads are then mapped back to the parent reference sequences to obtain precise coordinates.
The KLASSIFY algorithm works as follows:
1. Find unique k-mers that belong to each chromosome, e.g. SoChr01A, SoChr01B, etc.
2. Identify ‘chimeric’ F1 reads that contain unique k-mers that belong to at least 2 chromosomes (default: ≧300 unique k-mers on the read, A unique + B unique ≧50% of unique k-mers on the read, and B unique ≧10%)
3. Repeat step 2 similarly with parent reads
4. Using parent reads as ‘control’, identify the ‘chimeric’ regions that show up with at least 5 F1 reads, but not with parent reads (therefore unaffected by assembly errors or repeats)
5. Collect all ‘chimeric’ reads identified so far and split them into 2 parts. The reads are split by identifying the switch from one chromosome to another based on unique k-mers
6. Map the split reads to the reference sequences to identify the parent regions to which each part of a ‘chimeric’ read maps separately
7. Pair the separate regions up to compile a candidate list of paired breakpoints
8. Use the Integrative Genomics Viewer (IGV) to proofread the paired breakpoints. Label each breakpoint as either “Type I”, “Type II”, or “bad” (see next section for definition of types)
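The default thresholds for calling a read ‘chimeric’ and the splitting of a read at the chromosome switch can be sketched in Python as follows. This is an illustrative reading of the description above, not the klassify source: it assumes that A and B are the two chromosomes with the most unique k-mer hits on the read and that the percentages are taken over all unique k-mer hits on the read. The switch-point search (choosing the cut that leaves the most hits on the "correct" side) is one simple way to do it, and the names classify_read and find_switch are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

Hit = Tuple[int, str]  # (position on the read, chromosome owning the unique k-mer)


def classify_read(hits: Sequence[Hit], min_kmers: int = 300,
                  min_top2_frac: float = 0.5,
                  min_second_frac: float = 0.1) -> Optional[Tuple[str, str]]:
    """Return (A, B) if the read looks chimeric under the assumed thresholds."""
    total = len(hits)
    if total < min_kmers:
        return None
    top = Counter(chrom for _, chrom in hits).most_common(2)
    if len(top) < 2:
        return None  # all unique k-mers come from one chromosome
    (a, a_count), (b, b_count) = top
    if a_count + b_count < min_top2_frac * total:
        return None
    if b_count < min_second_frac * total:
        return None
    return a, b


def find_switch(hits: Sequence[Hit], a: str, b: str) -> int:
    """Return the read position of the first hit after the best A/B switch."""
    labels = [(pos, chrom) for pos, chrom in sorted(hits) if chrom in (a, b)]
    total_a = sum(1 for _, chrom in labels if chrom == a)
    total_b = len(labels) - total_a
    best_score, best_cut = -1, 0
    left_a = left_b = 0
    for pos, chrom in labels:
        # Score a cut placed just before this hit, for both orientations
        # (A on the left and B on the right, or the reverse).
        score = max(left_a + (total_b - left_b), left_b + (total_a - left_a))
        if score > best_score:
            best_score, best_cut = score, pos
        if chrom == a:
            left_a += 1
        else:
            left_b += 1
    return best_cut


if __name__ == "__main__":
    read_hits = [(i, "SoChr01A") for i in range(200)]
    read_hits += [(200 + i, "SoChr01B") for i in range(150)]
    pair = classify_read(read_hits)
    if pair is not None:
        print(pair, find_switch(read_hits, *pair))
```

In this sketch the hit list would come from looking up each k-mer of a read in the unique k-mer index; the two parts of the read on either side of the returned position are what would then be mapped separately.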