Crates.io | kraken2_rs |
lib.rs | kraken2_rs |
version | 0.6.12 |
source | src |
created_at | 2024-08-23 14:17:51.516073 |
updated_at | 2024-08-23 14:21:21.308357 |
description | An ultra-fast, low-memory footprint and accurate taxonomy classifier for all |
homepage | |
repository | https://github.com/jianshu93/kraken2-rust |
max_upload_size | |
id | 1349222 |
size | 705,256 |
The original author declined to merge my pull request, which fixed a compilation error on macOS (reqwest crate) and added a benchmark against the original Kraken2 using real-world datasets (https://github.com/eric9n/Kun-peng/pull/28). I found that this implementation is less accurate than the original Kraken2 in many benchmarks, so I added benchmark results using real-world datasets. I also cleaned up much of the non-English annotation/documentation, and I use the improved HyperLogLog estimator from the Ertl 2017 paper (https://arxiv.org/abs/1702.01284) to determine the hash table size.
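For reference, the "improved raw" estimator from that paper has the following form; this is my sketch of the published formula (with m registers, C_k the number of registers with value k, and q the maximum register value), not necessarily the exact code in this crate:

\hat{n} = \frac{\alpha_\infty m^2}{m\,\sigma(C_0/m) + \sum_{k=1}^{q} C_k\, 2^{-k} + m\, 2^{-q}\,\tau(1 - C_{q+1}/m)}, \qquad \alpha_\infty = \frac{1}{2\ln 2},

where \sigma(x) = x + \sum_{k=1}^{\infty} x^{2^k}\, 2^{k-1} and \tau(x) = \frac{1}{3}\left(1 - x - \sum_{k=1}^{\infty} \left(1 - x^{2^{-k}}\right)^2 2^{-k}\right).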
Credit to the original author: https://github.com/eric9n/Kun-peng. Below is the updated README, with benchmark results at the end.
We developed Kun-peng, an accurate and highly scalable low-memory tool for classifying metagenomic sequences.
Inspired by Kraken2's k-mer-based approach, Kun-peng incorporates an advanced sliding window algorithm during sample classification and, crucially, employs an ordered-chunks method when building the reference database. This approach allows the database to be constructed as sub-databases of any desired chunk size, reducing running memory usage by orders of magnitude. These improvements enable running Kun-peng on personal computers and HPC platforms alike. In practice, even for very large indices, Kun-peng makes the taxonomic classification task executable on essentially any computing platform, without the need for traditionally expensive and scarce high-memory nodes.
Importantly, the flexible structure of the reference index also allows the construction and use of supermassive indices that were previously infeasible due to computational constraints. Supermassive indices, incorporating the growing genomic data from prokaryotes and eukaryotes as well as metagenomic assemblies, are crucial for investigating more diverse and complex environmental metagenomes, such as those studied in exposome research.
The name "Kun-peng" is a massive mythical creature capable of transforming from a giant fish in the water (Kun) to a giant bird in the sky (Peng) from Chinese mythology, reflecting the flexible nature and capacity of the software to efficiently navigate the vast and complex landscapes of metagenomic data.
Follow these steps to install Kun-peng and run the examples.
If you prefer not to build from source, you can download the pre-built binaries for your platform from the GitHub releases page.
mkdir kun_peng_v0.6.10
tar -xvf Kun-peng-v0.6.10-centos7.tar.gz -C kun_peng_v0.6.10
# Add environment variable
echo 'export PATH=$PATH:~/biosoft/kun_peng_v0.6.10' >> ~/.bashrc
source ~/.bashrc
kun_peng example
We will use a very small virus database on the GitHub homepage as an example:
git clone https://github.com/eric9n/Kun-peng.git
cd Kun-peng
kun_peng build --download-dir data/ --db test_database
merge fna start...
merge fna took: 29.998258ms
estimate start...
estimate count: 14080, required capacity: 31818.0, Estimated hash table requirement: 124.29KB
convert fna file "test_database/library.fna"
process chunk file 1/1: duration: 29.326627ms
build k2 db took: 30.847894ms
# temp_chunk is used to store intermediate files
mkdir temp_chunk
# test_out is used to store output files
mkdir test_out
kun_peng classify --db test_database --chunk-dir temp_chunk --output-dir test_out data/COVID_19.fa
hash_config HashConfig { value_mask: 31, value_bits: 5, capacity: 31818, size: 13051, hash_capacity: 1073741824 }
splitr start...
splitr took: 18.212452ms
annotate start...
chunk_file "temp_chunk/sample_1.k2"
load table took: 548.911µs
annotate took: 12.006329ms
resolve start...
resolve took: 39.571515ms
Classify took: 92.519365ms
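After classification, the Kraken-style per-read output and the summary report are written to the output directory. The exact file names depend on the run; the benchmark section below assumes an output_1_report.txt style name:
ls test_out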
First, clone this repository to your local machine:
git clone https://github.com/eric9n/Kun-peng.git
cd Kun-peng
Ensure that both projects are built. You can do this by running the following command from the root of the workspace:
cargo build --release
This will build the kr2r and ncbi projects in release mode.
kun_peng example
Next, run the example script that demonstrates how to use the kun_peng binary. Execute the following command from the root of the workspace:
cargo run --release --example build_and_classify --package kr2r
This will run the build_and_classify.rs example located in the kr2r project's examples directory.
Example Output
You should see output similar to the following:
Executing command: /path/to/workspace/target/release/kun_peng build --download-dir data/ --db test_database
kun_peng build output: [build output here]
kun_peng build error: [any build errors here]
Executing command: /path/to/workspace/target/release/kun_peng direct --db test_database data/COVID_19.fa
kun_peng direct output: [direct output here]
kun_peng direct error: [any direct errors here]
This output confirms that the kun_peng commands were executed successfully and the files were processed as expected.
ncbi example
Run the example script in the ncbi project to download the necessary files. Execute the following command from the root of the workspace:
cargo run --release --example run_download --package ncbi
This will run the run_download.rs example located in the ncbi project's examples directory. The script will download the archaea genome library and the NCBI taxonomy files.
Example Output
You should see output similar to the following:
Executing command: /path/to/workspace/target/release/ncbi -d /path/to/workspace/downloads gen -g archaea
NCBI binary output: [download output here]
Executing command: /path/to/workspace/target/release/ncbi -d /path/to/workspace/downloads tax
NCBI binary output: [download output here]
The ncbi binary is used to download resources from the NCBI website. Here is the help manual for the ncbi binary:
./target/release/ncbi -h
ncbi download resource
Usage: ncbi [OPTIONS] <COMMAND>
Commands:
taxonomy Download taxonomy files from NCBI (alias: tax)
genomes Download genomes data from NCBI (alias: gen)
help Print this message or the help of the given subcommand(s)
Options:
-d, --download-dir <DOWNLOAD_DIR> Directory to store downloaded files [default: lib]
-n, --num-threads <NUM_THREADS> Number of threads to use for downloading [default: 20]
-h, --help Print help (see more with '--help')
-V, --version Print version
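For example, to download the archaea genome library and the NCBI taxonomy files into a downloads directory, matching the example logs above:
./target/release/ncbi -d downloads gen -g archaea
./target/release/ncbi -d downloads tax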
Usage: kun_peng <COMMAND>
Commands:
estimate estimate capacity
build build `k2d` files
hashshard Convert Kraken2 database files to Kun-peng database format for efficient processing and analysis.
splitr Split fast(q/a) file into ranges
annotate annotate a set of sequences
resolve resolve taxonomy tree
classify Integrates 'splitr', 'annotate', and 'resolve' into a unified workflow for sequence classification. classify a set of sequences
direct Directly load all hash tables for classification annotation
merge-fna A tool for processing genomic files
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
-V, --version Print version
Build the kun_peng database like Kraken2, specifying the directory for the data files downloaded from NCBI, as well as the database directory.
./target/release/kun_peng build -h
build database
Usage: kun_peng build [OPTIONS] --download-dir <DOWNLOAD_DIR> --db <DATABASE>
Options:
-d, --download-dir <DOWNLOAD_DIR>
Directory to store downloaded files
--db <DATABASE>
ncbi library fna database directory
-k, --k-mer <K_MER>
Set length of k-mers, k must be positive integer, k=35, k cannot be less than l [default: 35]
-l, --l-mer <L_MER>
Set length of minimizers, 1 <= l <= 31 [default: 31]
--minimizer-spaces <MINIMIZER_SPACES>
Number of characters in minimizer that are ignored in comparisons [default: 7]
-T, --toggle-mask <TOGGLE_MASK>
Minimizer ordering toggle mask [default: 16392584516609989165]
--min-clear-hash-value <MIN_CLEAR_HASH_VALUE>
-r, --requested-bits-for-taxid <REQUESTED_BITS_FOR_TAXID>
Bit storage requested for taxid 0 <= r < 31 [default: 0]
-p, --threads <THREADS>
Number of threads [default: 10]
--cache
estimate capacity from cache if exists
--max-n <MAX_N>
Set maximum qualifying hash code [default: 4]
--load-factor <LOAD_FACTOR>
Proportion of the hash table to be populated (build task only; def: 0.7, must be between 0 and 1) [default: 0.7]
-h, --help
Print help
-V, --version
Print version
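Putting these options together, a typical invocation looks like the following (the values shown are just the documented defaults, written out for illustration):
./target/release/kun_peng build --download-dir data/ --db test_database -k 35 -l 31 --minimizer-spaces 7 -p 10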
This tool converts Kraken2 database files into Kun-peng database format for more efficient processing and analysis. By specifying the database directory and the hash file capacity, users can control the size of the resulting database index files.
./target/release/kun_peng hashshard -h
Convert Kraken2 database files to Kun-peng database format for efficient processing and analysis.
Usage: kun_peng hashshard [OPTIONS] --db <DATABASE>
Options:
--db <DATABASE> The database directory for the Kraken 2 index. contains index files(hash.k2d opts.k2d taxo.k2d)
--hash-capacity <HASH_CAPACITY> Specifies the hash file capacity.
Acceptable formats include numeric values followed by 'K', 'M', or 'G' (e.g., '1.5G', '250M', '1024K').
Note: The specified capacity affects the index size, with a factor of 4 applied.
For example, specifying '1G' results in an index size of '4G'.
Default: 1G (capacity 1G = file size 4G) [default: 1G]
-h, --help Print help
-V, --version Print version
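For example, to convert an existing Kraken 2 index (the path is a placeholder) into Kun-peng shards with the default 1G hash file capacity:
./target/release/kun_peng hashshard --db /path/to/kraken2_db --hash-capacity 1G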
The classification process is divided into three modes: direct mode (the direct subcommand, which loads all hash chunks at once), chunked mode (the classify subcommand), and a step-by-step mode in which splitr, annotate, and resolve are run individually.
To estimate the memory each mode requires for a given database, you can use the cal_memory.sh script shipped with the repository:
bash cal_memory.sh $database_dir
Command Help
./target/release/kun_peng direct -h
Directly load all hash tables for classification annotation
Usage: kun_peng direct [OPTIONS] --db <DATABASE> [INPUT_FILES]...
Arguments:
[INPUT_FILES]... A list of input file paths (FASTA/FASTQ) to be processed by the classify program. Supports fasta or fastq format files (e.g., .fasta, .fastq) and gzip compressed files (e.g., .fasta.gz, .fastq.gz)
Options:
--db <DATABASE>
database hash chunk directory and other files
-P, --paired-end-processing
Enable paired-end processing
-S, --single-file-pairs
Process pairs with mates in the same file
-Q, --minimum-quality-score <MINIMUM_QUALITY_SCORE>
Minimum quality score for FASTQ data [default: 0]
-T, --confidence-threshold <CONFIDENCE_THRESHOLD>
Confidence score threshold [default: 0]
-K, --report-kmer-data
In comb. w/ -R, provide minimizer information in report
-z, --report-zero-counts
In comb. w/ -R, report taxa w/ 0 count
-g, --minimum-hit-groups <MINIMUM_HIT_GROUPS>
The minimum number of hit groups needed for a call [default: 2]
-p, --num-threads <NUM_THREADS>
The number of threads to use [default: 10]
--output-dir <KRAKEN_OUTPUT_DIR>
File path for outputting normal Kraken output
-h, --help
Print help (see more with '--help')
-V, --version
Print version
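A paired-end run in direct mode might look like this (file names are placeholders, and I assume that with -P the two mate files are passed as consecutive arguments, as in Kraken2):
./target/release/kun_peng direct --db test_database -P --output-dir test_out sample_R1.fastq.gz sample_R2.fastq.gz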
Command Help
./target/release/kun_peng classify -h
Integrates 'splitr', 'annotate', and 'resolve' into a unified workflow for sequence classification. classify a set of sequences
Usage: kun_peng classify [OPTIONS] --db <DATABASE> --chunk-dir <CHUNK_DIR> [INPUT_FILES]...
Arguments:
[INPUT_FILES]... A list of input file paths (FASTA/FASTQ) to be processed by the classify program. Supports fasta or fastq format files (e.g., .fasta, .fastq) and gzip compressed files (e.g., .fasta.gz, .fastq.gz)
Options:
--db <DATABASE>
--chunk-dir <CHUNK_DIR>
chunk directory
--output-dir <KRAKEN_OUTPUT_DIR>
File path for outputting normal Kraken output
-P, --paired-end-processing
Enable paired-end processing
-S, --single-file-pairs
Process pairs with mates in the same file
-Q, --minimum-quality-score <MINIMUM_QUALITY_SCORE>
Minimum quality score for FASTQ data [default: 0]
-p, --num-threads <NUM_THREADS>
The number of threads to use [default: 10]
--buffer-size <BUFFER_SIZE>
[default: 16777216]
--batch-size <BATCH_SIZE>
The size of each batch for processing taxid match results, used to control memory usage
[default: 16]
-T, --confidence-threshold <CONFIDENCE_THRESHOLD>
Confidence score threshold [default: 0]
-g, --minimum-hit-groups <MINIMUM_HIT_GROUPS>
The minimum number of hit groups needed for a call [default: 2]
--kraken-db-type
Enables use of a Kraken 2 compatible shared database
-K, --report-kmer-data
In comb. w/ -R, provide minimizer information in report
-z, --report-zero-counts
In comb. w/ -R, report taxa w/ 0 count
-h, --help
Print help (see more with '--help')
-V, --version
Print version
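For a stricter classification of the example data, the documented thresholds can be combined; the values here are illustrative, not recommendations:
./target/release/kun_peng classify --db test_database --chunk-dir temp_chunk --output-dir test_out -T 0.1 -g 3 -p 10 data/COVID_19.fa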
Standard Kraken Output Format:
Each classified sequence produces a single tab-delimited line with the classification status (C/U), the sequence ID, the assigned taxonomy ID, the sequence length, and a space-delimited list giving the LCA mapping of each k-mer. Paired read data will contain a "|:|" token in this list to indicate the end of one read and the beginning of another.
Sample Report Output Format:
The report below has six tab-delimited columns: percentage of fragments in the clade rooted at this taxon, number of fragments in that clade, number of fragments assigned directly to this taxon, rank code, NCBI taxonomy ID, and scientific name.
100.00 1 0 R 1 root
100.00 1 0 D 10239 Viruses
100.00 1 0 D1 2559587 Riboviria
100.00 1 0 O 76804 Nidovirales
100.00 1 0 O1 2499399 Cornidovirineae
100.00 1 0 F 11118 Coronaviridae
100.00 1 0 F1 2501931 Orthocoronavirinae
100.00 1 0 G 694002 Betacoronavirus
100.00 1 0 G1 2509511 Sarbecovirus
100.00 1 0 S 694009 Severe acute respiratory syndrome-related coronavirus
100.00 1 1 S1 2697049 Severe acute respiratory syndrome coronavirus 2
We compare results from Kun-peng with Kraken2 using the same database. Two datasets were used: 1. PacBio CCS long metagenomic reads from a human gut sample (1); 2. Illumina shotgun metagenomic reads from an oxygen minimum zone (OMZ) sample (depth 302 m) in the ocean (NCBI BioProject PRJNA1124864), a less studied system. The following scripts can be used to reproduce the plots below.
### use scripts from KrakenTools
git clone https://github.com/jenniferlu717/KrakenTools.git
python ./KrakenTools/kreport2krona.py -r output_1_report.txt -o output_1_report.krona
### install Krona software first: https://github.com/marbl/Krona
ktImportText output_1_report.krona -o output_1_report.krona.html
Results for the human gut sample from Kun-peng and from Kraken2 (Krona plots):
Results for the OMZ sample from Kun-peng and from Kraken2, classified reads only (Krona plots):
Interactive results can be found in the benchmark folder (html files can be viewed in a browser).