8/25/19
- Fixed stats and added stats to groups. Still a little different than how umi_tools does things, but I think it's reasonable
- ToDo's
  - Benchmark on some large datasets
  - Clean up the code a bit
  - Set it free

8/23/19
- First pass at adding stats, very much broken on the reads_unmapped

8/22/19
- Added first pass at support for paired end reads.
- Next, add tests for paired end reads? Maybe not worth it
- Investigate diffs with umi_tools, especially around default settings for skipping / tlen
- Collect some stats similar to umi_tools

8/21/19
- Added test to make sure that the determine_umi step couldn't assign a umi to multiple masters, which results in the read showing up twice in the group_only option.
- Restored rayon for run_dedup and run_group
- Check that the groups are still numbered correctly

8/20/19
- Working on finding why rumi gets more reads on the example.bam file than umi_tools does. It seems like umi_tools is in some cases pulling reads that are dist 3 away into groups where they may not belong.
The following is the output of compare_reads.pl:

```bash
$ perl ./scripts/compare_reads.pl (samtools view /mnt/d/dev/UMI-tools/tests_out/example_umitools.bam | psub) (samtools view /mnt/d/dev/UMI-tools/tests_out/example_rumi_deduped.bam | psub) (samtools view /mnt/d/dev/UMI-tools/tests_out/example_group.bam|psub) (samtools view /mnt/d/dev/UMI-tools/tests_out/example_rumi.bam | psub)
FOUND: SRR2057595.3345647_TTTGGTTTA 16 chr8 82003435 255 21M * 0 0 * * XA:i:2 MD:Z:1G2T16 NM:i:2 BX:Z:TTTGGTTTA UG:i:15127
EXPECTED: SRR2057595.3345647_TTTGGTTTA 16 chr8 82003435 255 21M * 0 0 * * XA:i:2 MD:Z:1G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.7255940_TGTGGTTAC 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:TGTGGTTAC UG:i:15122
EXPECTED: SRR2057595.7255940_TGTGGTTAC 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.7609280_GCCGGTTTT 16 chr6 128748879 255 48M * 0 0 * * XA:i:1 MD:Z:0A47 NM:i:1 BX:Z:GCCGGTTTT UG:i:13709
EXPECTED: SRR2057595.7609280_GCCGGTTTT 16 chr6 128748879 255 48M * 0 0 * * XA:i:1 MD:Z:0A47 NM:i:1 UG:i:13685 BX:Z:GTAGGTTTC
FOUND: SRR2057595.5016607_GCAGGTTTA 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:GCAGGTTTA UG:i:15129
EXPECTED: SRR2057595.5016607_GCAGGTTTA 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.1514218_AAGGGTTAT 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:AAGGGTTAT UG:i:15125
EXPECTED: SRR2057595.1514218_AAGGGTTAT 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15060 BX:Z:ATGGGTTGA
FOUND: SRR2057595.897659_ATAGGTTTC 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:ATAGGTTTC UG:i:15128
EXPECTED: SRR2057595.897659_ATAGGTTTC 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15060 BX:Z:ATGGGTTGA
FOUND: SRR2057595.3245577_GGAGGTTCT 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:GGAGGTTCT UG:i:15130
EXPECTED: SRR2057595.3245577_GGAGGTTCT 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.13317470_TTGGGTTAA 16 chr3 96263947 255 38M * 0 0 * * XA:i:1 MD:Z:11G26 NM:i:1 BX:Z:TTGGGTTAA UG:i:10683
EXPECTED: SRR2057595.13317470_TTGGGTTAA 16 chr3 96263947 255 38M * 0 0 * * XA:i:1 MD:Z:11G26 NM:i:1 UG:i:10632 BX:Z:TCAGGTTCA
FOUND: SRR2057595.8903949_GCTGGTTCT 16 chr11 101507078 255 67M * 0 0 * * XA:i:2 MD:Z:56T1C8 NM:i:2 BX:Z:GCTGGTTAT UG:i:3777
EXPECTED: SRR2057595.8903949_GCTGGTTCT 16 chr11 101507078 255 67M * 0 0 * * XA:i:2 MD:Z:56T1C8 NM:i:2 UG:i:3735 BX:Z:ATGGGTTAT
FOUND: SRR2057595.6107476_TCGGGTTAC 0 chr11 83085100 255 67M * 0 0 * * XA:i:1 MD:Z:58T8 NM:i:1 BX:Z:TCGGGTTAC UG:i:2555
EXPECTED: SRR2057595.6107476_TCGGGTTAC 0 chr11 83085100 255 67M * 0 0 * * XA:i:1 MD:Z:58T8 NM:i:1 UG:i:2527 BX:Z:TTCGGTTGC
FOUND: SRR2057595.5405752_AACGGTTGG 0 chr1 72283620 255 67M * 0 0 * * XA:i:1 MD:Z:56G10 NM:i:1 BX:Z:AACGGTTGG UG:i:412
EXPECTED: SRR2057595.5405752_AACGGTTGG 0 chr1 72283620 255 67M * 0 0 * * XA:i:1 MD:Z:56G10 NM:i:1 UG:i:376 BX:Z:ATTGGTTCG
FOUND: SRR2057595.2806735_AAAGGTTCC 0 chr11 87275932 255 67M * 0 0 * * XA:i:2 MD:Z:28C5T32 NM:i:2 BX:Z:AAAGGTTCC UG:i:3102
EXPECTED: SRR2057595.2806735_AAAGGTTCC 0 chr11 87275932 255 67M * 0 0 * * XA:i:2 MD:Z:28C5T32 NM:i:2 UG:i:3097 BX:Z:GTAGGTTAC
FOUND: SRR2057595.8391205_GGGGGTTGT 16 chr3 96263946 255 39M * 0 0 * * XA:i:2 MD:Z:0T8C29 NM:i:2 BX:Z:GGGGGTTGT UG:i:10686
EXPECTED: SRR2057595.8391205_GGGGGTTGT 16 chr3 96263946 255 39M * 0 0 * * XA:i:2 MD:Z:0T8C29 NM:i:2 UG:i:10625 BX:Z:CTGGGTTGA
FOUND: SRR2057595.482451_TCGGGTTGG 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:TCGGGTTGG UG:i:15110
EXPECTED: SRR2057595.482451_TCGGGTTGG 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.2938337_AATGGTTAC 16 chr9 65044379 255 27M * 0 0 * * XA:i:0 MD:Z:27 NM:i:0 BX:Z:AATGGTTAC UG:i:15687
EXPECTED: SRR2057595.2938337_AATGGTTAC 16 chr9 65044379 255 27M * 0 0 * * XA:i:0 MD:Z:27 NM:i:0 UG:i:15646 BX:Z:TCTGGTTTC
FOUND: SRR2057595.12752032_TTTGGTTGA 16 chr11 101507078 255 67M * 0 0 * * XA:i:2 MD:Z:48C9C8 NM:i:2 BX:Z:TTTGGTTGA UG:i:3776
EXPECTED: SRR2057595.12752032_TTTGGTTGA 16 chr11 101507078 255 67M * 0 0 * * XA:i:2 MD:Z:48C9C8 NM:i:2 UG:i:3749 BX:Z:ATTGGTTCG
FOUND: SRR2057595.502927_CAAGGTTAA 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:CAAGGTTAA UG:i:15120
EXPECTED: SRR2057595.502927_CAAGGTTAA 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.9402462_TATGGTTGG 16 chr3 96263947 255 38M * 0 0 * * XA:i:1 MD:Z:11G26 NM:i:1 BX:Z:TATGGTTGG UG:i:10684
EXPECTED: SRR2057595.9402462_TATGGTTGG 16 chr3 96263947 255 38M * 0 0 * * XA:i:1 MD:Z:11G26 NM:i:1 UG:i:10631 BX:Z:CATGGTTCT
FOUND: SRR2057595.4041993_CAAGGTTGA 0 chr3 96132476 255 38M * 0 0 * * XA:i:0 MD:Z:38 NM:i:0 BX:Z:CAAGGTTGA UG:i:10268
EXPECTED: SRR2057595.4041993_CAAGGTTGA 0 chr3 96132476 255 38M * 0 0 * * XA:i:0 MD:Z:38 NM:i:0 UG:i:10172 BX:Z:GGAGGTTAA
FOUND: SRR2057595.4828384_TCCGGTTCA 0 chr1 72290887 255 67M * 0 0 * * XA:i:1 MD:Z:37C29 NM:i:1 BX:Z:TCCGGTTCA UG:i:558
EXPECTED: SRR2057595.4828384_TCCGGTTCA 0 chr1 72290887 255 67M * 0 0 * * XA:i:1 MD:Z:37C29 NM:i:1 UG:i:517 BX:Z:CACGGTTTA
FOUND: SRR2057595.2554282_AGTGGTTCT 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:AGTGGTTCT UG:i:15134
EXPECTED: SRR2057595.2554282_AGTGGTTCT 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.4109672_TTGGGTTAC 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:TTGGGTTAC UG:i:15137
EXPECTED: SRR2057595.4109672_TTGGGTTAC 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.11515838_ATGGTTCTT 0 chr3 96132477 255 37M * 0 0 * * XA:i:0 MD:Z:37 NM:i:0 BX:Z:ATGGTTCTT UG:i:10287
EXPECTED: SRR2057595.11515838_ATGGTTCTT 0 chr3 96132477 255 37M * 0 0 * * XA:i:0 MD:Z:37 NM:i:0 UG:i:10267 BX:Z:ACGGTTACT
FOUND: SRR2057595.3345647_TTTGGTTTA 16 chr8 82003435 255 21M * 0 0 * * XA:i:2 MD:Z:1G2T16 NM:i:2 BX:Z:TTTGGTTTA UG:i:15127
EXPECTED: SRR2057595.3345647_TTTGGTTTA 16 chr8 82003435 255 21M * 0 0 * * XA:i:2 MD:Z:1G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.7255940_TGTGGTTAC 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:TGTGGTTAC UG:i:15122
EXPECTED: SRR2057595.7255940_TGTGGTTAC 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.7609280_GCCGGTTTT 16 chr6 128748879 255 48M * 0 0 * * XA:i:1 MD:Z:0A47 NM:i:1 BX:Z:GCCGGTTTT UG:i:13709
EXPECTED: SRR2057595.7609280_GCCGGTTTT 16 chr6 128748879 255 48M * 0 0 * * XA:i:1 MD:Z:0A47 NM:i:1 UG:i:13685 BX:Z:GTAGGTTTC
FOUND: SRR2057595.5016607_GCAGGTTTA 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:GCAGGTTTA UG:i:15129
EXPECTED: SRR2057595.5016607_GCAGGTTTA 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.1514218_AAGGGTTAT 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:AAGGGTTAT UG:i:15125
EXPECTED: SRR2057595.1514218_AAGGGTTAT 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15060 BX:Z:ATGGGTTGA
FOUND: SRR2057595.897659_ATAGGTTTC 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:ATAGGTTTC UG:i:15128
EXPECTED: SRR2057595.897659_ATAGGTTTC 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15060 BX:Z:ATGGGTTGA
FOUND: SRR2057595.3245577_GGAGGTTCT 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:GGAGGTTCT UG:i:15130
EXPECTED: SRR2057595.3245577_GGAGGTTCT 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.13317470_TTGGGTTAA 16 chr3 96263947 255 38M * 0 0 * * XA:i:1 MD:Z:11G26 NM:i:1 BX:Z:TTGGGTTAA UG:i:10683
EXPECTED: SRR2057595.13317470_TTGGGTTAA 16 chr3 96263947 255 38M * 0 0 * * XA:i:1 MD:Z:11G26 NM:i:1 UG:i:10632 BX:Z:TCAGGTTCA
FOUND: SRR2057595.8903949_GCTGGTTCT 16 chr11 101507078 255 67M * 0 0 * * XA:i:2 MD:Z:56T1C8 NM:i:2 BX:Z:GCTGGTTAT UG:i:3777
EXPECTED: SRR2057595.8903949_GCTGGTTCT 16 chr11 101507078 255 67M * 0 0 * * XA:i:2 MD:Z:56T1C8 NM:i:2 UG:i:3735 BX:Z:ATGGGTTAT
FOUND: SRR2057595.6107476_TCGGGTTAC 0 chr11 83085100 255 67M * 0 0 * * XA:i:1 MD:Z:58T8 NM:i:1 BX:Z:TCGGGTTAC UG:i:2555
EXPECTED: SRR2057595.6107476_TCGGGTTAC 0 chr11 83085100 255 67M * 0 0 * * XA:i:1 MD:Z:58T8 NM:i:1 UG:i:2527 BX:Z:TTCGGTTGC
FOUND: SRR2057595.5405752_AACGGTTGG 0 chr1 72283620 255 67M * 0 0 * * XA:i:1 MD:Z:56G10 NM:i:1 BX:Z:AACGGTTGG UG:i:412
EXPECTED: SRR2057595.5405752_AACGGTTGG 0 chr1 72283620 255 67M * 0 0 * * XA:i:1 MD:Z:56G10 NM:i:1 UG:i:376 BX:Z:ATTGGTTCG
FOUND: SRR2057595.2806735_AAAGGTTCC 0 chr11 87275932 255 67M * 0 0 * * XA:i:2 MD:Z:28C5T32 NM:i:2 BX:Z:AAAGGTTCC UG:i:3102
EXPECTED: SRR2057595.2806735_AAAGGTTCC 0 chr11 87275932 255 67M * 0 0 * * XA:i:2 MD:Z:28C5T32 NM:i:2 UG:i:3097 BX:Z:GTAGGTTAC
FOUND: SRR2057595.8391205_GGGGGTTGT 16 chr3 96263946 255 39M * 0 0 * * XA:i:2 MD:Z:0T8C29 NM:i:2 BX:Z:GGGGGTTGT UG:i:10686
EXPECTED: SRR2057595.8391205_GGGGGTTGT 16 chr3 96263946 255 39M * 0 0 * * XA:i:2 MD:Z:0T8C29 NM:i:2 UG:i:10625 BX:Z:CTGGGTTGA
FOUND: SRR2057595.482451_TCGGGTTGG 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:TCGGGTTGG UG:i:15110
EXPECTED: SRR2057595.482451_TCGGGTTGG 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.2938337_AATGGTTAC 16 chr9 65044379 255 27M * 0 0 * * XA:i:0 MD:Z:27 NM:i:0 BX:Z:AATGGTTAC UG:i:15687
EXPECTED: SRR2057595.2938337_AATGGTTAC 16 chr9 65044379 255 27M * 0 0 * * XA:i:0 MD:Z:27 NM:i:0 UG:i:15646 BX:Z:TCTGGTTTC
FOUND: SRR2057595.12752032_TTTGGTTGA 16 chr11 101507078 255 67M * 0 0 * * XA:i:2 MD:Z:48C9C8 NM:i:2 BX:Z:TTTGGTTGA UG:i:3776
EXPECTED: SRR2057595.12752032_TTTGGTTGA 16 chr11 101507078 255 67M * 0 0 * * XA:i:2 MD:Z:48C9C8 NM:i:2 UG:i:3749 BX:Z:ATTGGTTCG
FOUND: SRR2057595.502927_CAAGGTTAA 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:CAAGGTTAA UG:i:15120
EXPECTED: SRR2057595.502927_CAAGGTTAA 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.9402462_TATGGTTGG 16 chr3 96263947 255 38M * 0 0 * * XA:i:1 MD:Z:11G26 NM:i:1 BX:Z:TATGGTTGG UG:i:10684
EXPECTED: SRR2057595.9402462_TATGGTTGG 16 chr3 96263947 255 38M * 0 0 * * XA:i:1 MD:Z:11G26 NM:i:1 UG:i:10631 BX:Z:CATGGTTCT
FOUND: SRR2057595.4041993_CAAGGTTGA 0 chr3 96132476 255 38M * 0 0 * * XA:i:0 MD:Z:38 NM:i:0 BX:Z:CAAGGTTGA UG:i:10268
EXPECTED: SRR2057595.4041993_CAAGGTTGA 0 chr3 96132476 255 38M * 0 0 * * XA:i:0 MD:Z:38 NM:i:0 UG:i:10172 BX:Z:GGAGGTTAA
FOUND: SRR2057595.4828384_TCCGGTTCA 0 chr1 72290887 255 67M * 0 0 * * XA:i:1 MD:Z:37C29 NM:i:1 BX:Z:TCCGGTTCA UG:i:558
EXPECTED: SRR2057595.4828384_TCCGGTTCA 0 chr1 72290887 255 67M * 0 0 * * XA:i:1 MD:Z:37C29 NM:i:1 UG:i:517 BX:Z:CACGGTTTA
FOUND: SRR2057595.2554282_AGTGGTTCT 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:AGTGGTTCT UG:i:15134
EXPECTED: SRR2057595.2554282_AGTGGTTCT 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.4109672_TTGGGTTAC 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:TTGGGTTAC UG:i:15137
EXPECTED: SRR2057595.4109672_TTGGGTTAC 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.11515838_ATGGTTCTT 0 chr3 96132477 255 37M * 0 0 * * XA:i:0 MD:Z:37 NM:i:0 BX:Z:ATGGTTCTT UG:i:10287
EXPECTED: SRR2057595.11515838_ATGGTTCTT 0 chr3 96132477 255 37M * 0 0 * * XA:i:0 MD:Z:37 NM:i:0 UG:i:10267 BX:Z:ACGGTTACT
```

This accounts for 23 of the 26 extra reads from rumi. I don't know why umi_tools does this.
I'm betting the other missing reads are in a similar vein:
  - SRR2057595.3354975_CGGGTTGGT: rumi correctly uses the umi starting with C since there are two reads with that umi. umi_tools uses the umi with only a freq of 1.
  - SRR2057595.4915638_TTGGTTAAA: rumi correctly chooses the read with the decided-upon umi as the best read.
  - SRR2057595.5405752_AACGGTTGG: rumi correctly leaves it as its own group. umi_tools corrects it dist 3 away to ATTGGTTCG.

I expect this to be the end source of the 30 extra reads in rumi's output. What causes this in umi_tools?
- Next up is to add a test for making sure that reads aren't doubled up on in the determine_umi step. Then make that step better and faster.
- At this point I feel pretty confident in the calls that rumi makes.

8/18/19
- Updated tests to work with new BTreeMap structure and read_groups types
- Took second pass at a group_only option. Currently all reads are being collapsed in the group_reads function. Need to figure out how to keep them around, ideally without compromising the performance of the dedup procedure itself. group_only can be slow; dedup must be fast.

Later the same day
- I have an extra 3000 reads for no reason I can figure out. Also chrY is being ordered weirdly ...
- The determine_umi step was double-adding some reads. That has been fixed, but it's an inefficient fix.
- group_only now outputs the right number of reads. The diffs between the two should help figure out the remaining diffs in the dedup only.

8/17/19
- Update tests to work with new read_groups types and BTreeMap
- Then I need to add the UG and BX tags to them for the group_only option
- I think that my missing reads stem from positional grouping, not from the directional adjacency. Find a way to test this?
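
One way to sanity-check the "dist 3" observations in these notes, and to rule directional adjacency in or out for a given pair of UMIs, is a small standalone check. The sketch below is not rumi's code: `hamming` and `is_directional_neighbor` are hypothetical names, the counts in `main` are made up for illustration, and the count rule is the directional criterion as I understand it from the UMI-tools documentation (a parent absorbs a child when they are within the distance threshold and count(parent) >= 2 * count(child) - 1), with a threshold of 1 assumed.

```rust
/// Hamming distance between two equal-length UMIs; `None` if the lengths differ.
fn hamming(a: &str, b: &str) -> Option<usize> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.bytes().zip(b.bytes()).filter(|(x, y)| x != y).count())
}

/// Hypothetical helper: directional adjacency check for a (umi, count) pair.
/// `child` may be merged into `parent` when the UMIs are within `max_dist`
/// mismatches and count(parent) >= 2 * count(child) - 1.
/// The inequality is written as parent + 1 >= 2 * child to avoid unsigned underflow.
fn is_directional_neighbor(parent: (&str, u32), child: (&str, u32), max_dist: usize) -> bool {
    match hamming(parent.0, child.0) {
        Some(d) => d <= max_dist && parent.1 + 1 >= 2 * child.1,
        None => false,
    }
}

fn main() {
    // UMI pairs taken from the compare_reads.pl output in these notes.
    println!("{:?}", hamming("AACGGTTGG", "ATTGGTTCG")); // Some(3)
    println!("{:?}", hamming("TTTGGTTTA", "TGAGGTTGA")); // Some(3)

    // Counts here are invented for illustration only.
    // With a threshold of 1, a pair that is 3 mismatches apart is never adjacent.
    println!(
        "{}",
        is_directional_neighbor(("ATTGGTTCG", 10), ("AACGGTTGG", 1), 1)
    ); // false
}
```

If a pair like this fails the check at a threshold of 1 but the two reads still end up in the same group, the merge would have to come from somewhere other than a single directional edge (for example a chain of intermediate UMIs, or the positional grouping).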