# Test dataset

This subdirectory contains a small dataset for testing the UMGAP pipeline. The original data comes from [An evaluation of the accuracy and speed of metagenome analysis tools][metabenchmark] by Lindgreen, Adair & Gardner (2016). This article is also commonly referred to as the Metabenchmark.

The data in this directory was randomly sampled (with respect to the different categories of organisms and their frequencies). The 2 data files represent paired-end reads. As such, the n'th record in `A1.fq` originates from the same organism as the n'th record in `A2.fq`.

## Assigning the taxon ID

Since the original dataset does not contain the taxon IDs of the originally sampled organisms, we prepended these to each record header. This was done with the help of the `acc2taxonid.py` script. This script takes 2 arguments: the input FASTQ file and a file which maps accession IDs to taxon IDs. This mapping file can be [downloaded][accession2taxid-file] from the appropriate NCBI [FTP directory][accession2taxid-dir].

If a record contains a randomly shuffled nucleotide sequence, the script assigns it the taxon ID `0`. If the script does not recognize the header in any way, it assigns the taxon ID `-1`.

[metabenchmark]: http://www.ucbioinformatics.org/metabenchmark.html
[accession2taxid-file]: ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
[accession2taxid-dir]: ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/
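
To make the taxon ID assignment described above more concrete, here is a minimal Python sketch of the idea. It is not the actual `acc2taxonid.py` script, and several details are assumptions made for illustration: that the first whitespace-delimited token of a header is an `accession.version` string, that shuffled records can be recognized by the word `shuffled` in the header, and that the taxon ID is prepended with a `|` separator. The function names and the sample data are hypothetical. Only the `0` and `-1` conventions come from this README, and the four tab-separated columns (`accession`, `accession.version`, `taxid`, `gi`) are the layout of the NCBI mapping file.

```python
import io

SHUFFLED_TAXON_ID = 0   # README convention: randomly shuffled sequence
UNKNOWN_TAXON_ID = -1   # README convention: header not recognized


def load_accession_map(lines):
    """Build {accession.version: taxid} from accession2taxid-style lines."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] == "accession":
            continue  # skip the column header and malformed lines
        mapping[fields[1]] = int(fields[2])
    return mapping


def strip_at(header):
    """Drop the leading '@' of a FASTQ header, if present."""
    return header[1:] if header.startswith("@") else header


def taxon_id_for(header, mapping):
    """Pick a taxon ID for one header (header layout is an ASSUMPTION)."""
    name = strip_at(header)
    if "shuffled" in name.lower():  # hypothetical marker for shuffled reads
        return SHUFFLED_TAXON_ID
    tokens = name.split()
    if not tokens:
        return UNKNOWN_TAXON_ID
    return mapping.get(tokens[0], UNKNOWN_TAXON_ID)


def annotate_fastq(lines, mapping):
    """Yield FASTQ lines with the taxon ID prepended to each record header.

    Assumes strict 4-line FASTQ records; the '|' separator is hypothetical.
    """
    for index, line in enumerate(lines):
        if index % 4 == 0:  # first line of each 4-line record is the header
            header = line.rstrip("\n")
            taxon_id = taxon_id_for(header, mapping)
            yield f"@{taxon_id}|{strip_at(header)}\n"
        else:
            yield line


# Invented example data, held in memory instead of real files.
mapping = load_accession_map(io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "FAKE0001\tFAKE0001.1\t42\t1\n"
))
fastq = io.StringIO(
    "@FAKE0001.1 read1\nACGT\n+\nIIII\n"
    "@shuffled_7 read2\nTTGA\n+\nIIII\n"
    "@??? read3\nGGCC\n+\nIIII\n"
)
annotated = list(annotate_fastq(fastq, mapping))
headers = annotated[0::4]
# headers == ["@42|FAKE0001.1 read1\n", "@0|shuffled_7 read2\n", "@-1|??? read3\n"]
```

The real NCBI mapping file is large, so loading it entirely into a dictionary as done here may need a lot of memory; restricting the map to the accessions that actually occur in the FASTQ file is one way around that.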