# Data
Many testing and benchmark programs require large files of sequence data
that should be placed in this directory.

Below are instructions for how to download the necessary data. Make sure
you are in this directory (`cd data`).

## 25kbp Nanopore data
This data is from the difference recurrence [paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2014-8)
by Suzuki and Kasahara.

1. `curl -OL https://github.com/Daniel-Liu-c0deb0t/diff-bench-paper/releases/download/v1.0/sequences.txt.gz`
2. `gunzip sequences.txt.gz`

Since these reads are filtered to only have gaps smaller than 20bp, it is not representative of typical reads. Therefore,
this dataset will be rarely used.

## \<10kbp and \<50kbp Nanopore data
This data is from the BiWFA [repository](https://github.com/smarco/BiWFA-paper/tree/main/evaluation/data)
and reformatted.

1. `curl -OL https://github.com/Daniel-Liu-c0deb0t/block-aligner/releases/download/v0.0.0/seq_pairs.10kbps.5000.txt.gz`
2. `curl -OL https://github.com/Daniel-Liu-c0deb0t/block-aligner/releases/download/v0.0.0/seq_pairs.50kbps.10000.txt.gz`
3. `gunzip seq_pairs.10kbps.5000.txt.gz`
4. `gunzip seq_pairs.50kbps.10000.txt.gz`

These files contain pairs of reads that are alignable.

## Illumina and 1kbp Nanopore data
This data is from the Wavefront Aligner [paper](https://academic.oup.com/bioinformatics/article/37/4/456/5904262).

1. `curl -OL https://github.com/Daniel-Liu-c0deb0t/block-aligner/releases/download/v0.0.0/real.illumina.b10M.txt.gz`
2. `curl -OL https://github.com/Daniel-Liu-c0deb0t/block-aligner/releases/download/v0.0.0/real.ont.b10M.txt.gz`
3. `gunzip real.illumina.b10M.txt.gz`
4. `gunzip real.ont.b10M.txt.gz`

The Illumina, 1kbp Nanopore, and 25kbp Nanopore datasets are just a list of reads, where every two reads
form a pair that is alignable.

## Uniclust30 data
This data is generated with [mmseqs2](https://github.com/soedinglab/MMseqs2)
and the [Uniclust30](https://uniclust.mmseqs.com/) dataset.
Two datasets with two different coverages percentages are used: `0.8`
(default in `mmseqs2`) and `0.95`. Using a higher coverage helps gather
sequences that are "globally alignable", as `mmseqs2` uses local alignment.
The dataset with the lower coverage percent is expected to be more challenging.

Scripts for generating the data: [`0.8` coverage](uc30_pairwise_aln.sh)
and [`0.95` coverage](uc30_0.95_pairwise_aln.sh).

1. `curl -OL https://github.com/Daniel-Liu-c0deb0t/block-aligner/releases/download/v0.0.0/uc30.tar.gz`
2. `curl -OL https://github.com/Daniel-Liu-c0deb0t/block-aligner/releases/download/v0.0.0/uc30_0.95.tar.gz`
3. `tar -xvf uc30.tar.gz`
4. `tar -xvf uc30_0.95.tar.gz`

## SCOP PSSM data
This data is generated with `mmseqs2` and the [SCOPe](https://scop.berkeley.edu/astral/ver=2.01) dataset.
This data is used for aligning sequences to profiles (position-specific scoring matrices) of protein domains.

1. `mkdir scop && cd scop`
2. `curl -OL https://github.com/Daniel-Liu-c0deb0t/block-aligner/releases/download/v0.0.0/scop.tar.gz`
3. `tar -xvf scop.tar.gz`