# Large-scale Sequence Search with BItsliced Genomic Signature Index (BIGSIG) This is a port of crate [colorid](https://github.com/hcdenbakker/colorid) with several updates for real-world application; 1. Use [xxh3](https://crates.io/crates/xxh3) to suport aarch64 and x86-64 platforms; 2. Use [needletail](https://crates.io/crates/needletail) for fast and compressed fasta/fastq file processing; 3. 2-bit nucleitide sequence representation via [kmerutils](https://crates.io/crates/kmerutils) to improve memory efficiency; 4. Recreate the command line interface using recent [clap](https://crates.io/crates/clap) v4.3. Credit for orginal implementation to original authors. ## Install ```bash git clone https://gitlab.com/Jianshu_Zhao/bigsig cd bigsig cargo build --release ``` ## Usage ```bash ************** initializing logger ***************** bigsig 0.1.0 Large-scale Sequence Search with BItsliced Genomic Signature Index (BIGSIG) USAGE: bigsig [SUBCOMMAND] FLAGS: -h, --help Prints help information -V, --version Prints version information SUBCOMMANDS: batch_identify Identify batch of samples reads construct Construct a BIGSIG filter filters reads help Prints this message or the help of the given subcommand(s) identify identify reads based on probability query query a bigsig on one or more fasta/fastq.gz files show show index parameters ``` An example to build and query BigSig database ```bash bigsig construct -r ref_file_example.txt -b test -k 31 -mv 21 -s 10000000 -n 4 -t 24 bigsig query -b ./test.mxi -q ./test_data/test.fastq.gz bigsig identify -b test.mxi -q ./test_data/test.fastq.gz -n output -t 24 --high_mem_load ``` ## Results With the default settings BigSiq will report reference sequences that share >35% of their k-mers with the query. Here is the output of a query with SRA accession SRR4098796 (L. monocytogenes lineage I) as query: ``` SRR4098796_1.fastq.gz 3076072 Listeria_monocytogenes_F2365 0.87 134.25 126 475266 SRR4098796_1.fastq.gz 3076072 Listeria_monocytogenes_SRR2167842 0.40 128.25 122 7831 ``` In the first column we find the query, the second column shows the number of k-mers in the query, the third column displays the reference sequence, the fourth column the proportion of kmers in the reference shared with the query, the fifth column displays the average coverage based on k-mers that were uniquely matched with this reference, the sixth the modus of the coverage based on uniquely matched k-mers and the last column the number of uniquely matched k-mers. ## Reference 1. Bradley, Phelim, et al. "Ultrafast search of all deposited bacterial and viral genomic data." Nature biotechnology 37.2 (2019): 152-159. 2. Bingmann, Timo, et al. "COBS: a compact bit-sliced signature index." String Processing and Information Retrieval: 26th International Symposium, SPIRE 2019, Segovia, Spain, October 7–9, 2019, Proceedings 26. Springer International Publishing, 2019.