# The Screed

bioform is a bioinformatics DSL for the everyday researcher. This document contains the rant that inspired the creation of this tool, as well as a few development notes.

## The Problem(s)

Bioinformatics can be a challenging discipline to move into.

1. Things are constantly improving, so it's hard to know what the right tool for the job is.
2. Lots of domain-specific vocabulary (alignment, contig, unitig, variant, etc.).
3. Lots of reliance on web-based tools... this makes scripting harder!
4. Fragmented, diverse ecosystem:
   a) many tools in many languages (C, C++, Java, Python)
   b) many dependencies; it's hell trying to install anything
   c) Docker kills the composability of all of these tools
5. Non-GNU-style command line parsers:
   a) proper style is `[binary] [subcommand] -flags --options`
   b) sometimes options are not double-dashed
   c) poorly formatted or excessively long help pages (blastn's is 400 lines, but not super well formatted)
6. An affront to Postel's Law:
   - last night, I lost an hour to a poorly documented format
   - a lot of this friction is a secondary effect of using so many interlocking tools
7. Implementation is too exposed ("Smith-Waterman", "BLAST"):
   a) When I sort a list in Python, I don't say "please timsort this" or "please mergesort this". When I find my way on Google Maps, I don't say "please run Dijkstra's algorithm". We need an opinionated framework that makes the best choice by default (but allows advanced users a way to get under the hood).
   b) excessive parameters on many algorithms: too much surface area
8. File formats are a form of implementation:
   a) file extensions are brittle (gb, gbff, fa, fasta, FASTA, FAA) and useless (no command-line program ever puts them to use, that I can see)
   b) I shouldn't have to specify what each file is (it should be sniffed in <200ms)
      i) ok seriously, bioinformatic formats are so distinct
      ii) inspiration here comes from tidyverse CSV readers
   c) any file format should have a grammar

Consider, as a motivating task, how long this takes. There is at least one protein containing the peptide "[fMIIINA](https://pubmed.ncbi.nlm.nih.gov/29358051/)" in one strain of Staph epi. Starting with a collection of 20 FASTA files (assemblies of the entire genome): How many occurrences of this peptide are there in the entire proteome? What does the multiple sequence alignment of these proteins look like?

Here's what I had to do.

With SnapGene:

1. open SnapGene
2. turn on ORFs
3. ctrl+alt+f "MIIINA"

Or with annotation software:

1. run bakta (this took me the better part of a day due to some issues with CRISPR detection). (Bakta itself is very well packaged for the most part and an admirable attempt to join several disparate pieces of software together. I mention this only to highlight how even the best tools can have hangnails.)
2. take a tsv-like file containing the predicted proteins, grep MIIINA

Now, there are many ways to continue (this itself is part of the problem).

Option 1: manual stuff with SnapGene

4. repeat 1-3 for each genome, copy the protein sequence
5. make a separate protein FASTA file
6. run Clustal or similar using SnapGene

Option 2: command line

4. repeat 1-2 with bakta for each genome
5. grep the MIIINA out of the tabular form of each annotated proteome
6. take that tabular form and convert it to a FASTA, because bioinformatics software is stupid

I could go on a diatribe about how stupid the FASTA format is, but I won't. OK, I will.
The FASTQ format makes slightly more sense, since it serves a more specialized purpose, but the FASTA format, for all its simplicity, completely betrays any attempt at command-line parsing. This is not to mention the many variations in what the record name means. Does it need to be distinct? Should you include a space between the ">" and the record name? Can the record name contain whitespace? Use 5 different programs and you will find 5 different treatments of the record name.

Option 3: BLAST db

4. make a BLAST db (nucleotide or protein) from your genomes (another thing to install)
5. query that BLAST db with one version of the fMIIINA-containing analog

It took me 3 or 4 custom shell scripts, BLAST, and SnapGene to figure this out? What's wrong with this ecosystem?

Some software does a heroic job of making pretty good unix-y interfaces:

1) htslib / bcftools are masterpieces
2) bedtools is also fantastic

We want stuff to be tidy, but not in a way that compromises on optimizations. However, so much is done for optimization's sake at the expense of other aspects of the user experience, and we should never cross that line. (I'm specifically thinking of the experience I had with... I think it was progressiveCactus or something? It has an overblown CLI and required me to make a dummy file in my home directory to work.)

## Arguments against doing this

> Python is good enough.

No, it isn't. Biopython is designed in a very OOPy way that imposes a lot of overhead on users for relatively simple tasks.

> Python is what people already know.

Your average researcher doing bioinformatics isn't writing particularly good Python code to begin with. Part of the reason why is that there are a million ways of doing things.

> People don't want to learn another language.

This is the real argument, and I obviously can't force people to. But my mission is to make something good enough that people will WANT to learn it!

> It will be slow.

It will be faster than Python.
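To make the FASTA complaints above concrete: once the header-handling quirks are settled in one place, step 5 of "Option 2" in the motivating task is just a filter over records. Here is a minimal, deliberately tolerant sketch in Rust (illustrative only, not part of any existing tool; the header conventions chosen here — trim whitespace, accept `>` with or without a following space, take the name as everything up to the first whitespace — are my own assumptions, which is exactly the problem):

```rust
use std::io::{BufRead, BufReader, Read};

/// One FASTA record: header name plus concatenated sequence lines.
struct Record {
    name: String,
    seq: String,
}

/// A deliberately tolerant FASTA reader: skips blank lines, trims
/// surrounding whitespace, accepts ">name" and "> name", and takes
/// the record name to be everything up to the first whitespace.
fn read_fasta<R: Read>(reader: R) -> Vec<Record> {
    let mut records: Vec<Record> = Vec::new();
    for line in BufReader::new(reader).lines() {
        let line = line.expect("read error");
        let line = line.trim();
        if line.is_empty() {
            continue;
        }
        if let Some(header) = line.strip_prefix('>') {
            let name = header
                .trim_start()
                .split_whitespace()
                .next()
                .unwrap_or("")
                .to_string();
            records.push(Record { name, seq: String::new() });
        } else if let Some(rec) = records.last_mut() {
            rec.seq.push_str(line);
        } // sequence data before any header is silently dropped
    }
    records
}

fn main() {
    // Two hypothetical predicted proteins, with inconsistent header styles.
    let fasta = ">prot_1 hypothetical protein\nMKT\nMIIINAGG\n\n> prot_2\nMAAA\n";
    let hits: Vec<String> = read_fasta(fasta.as_bytes())
        .into_iter()
        .filter(|r| r.seq.contains("MIIINA"))
        .map(|r| r.name)
        .collect();
    println!("{:?}", hits); // prot_1 only
}
```

The point is not that this is hard to write; it's that every user of every tool ends up writing some variant of it, each with slightly different answers to the header questions above.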
Routine tasks can be handled using strong Rust code. Once you set aside the boilerplate of dealing with different formats and quirks, I think most bioinformatic objectives are just compositions of routine tasks.

## Features

- static typing, but with good type inference
- can we make a language with exactly the abstractions needed, and no more? (data-oriented design)
  - we're talking: strings, DNA, RNA, and protein sequences, floats, ints, and Lists / Maps collecting those
- composability, composability, composability
  - pipe operator
  - mapping? I want a way to represent workflows like:

    ```
         |---------|
         |---------|
    -----|---------|----
         |---------|
         |---------|
    ```

- integrates well with bash scripts
- manifests "tidy data"-style ergonomics
- mullet language design: imperative in the front, functional in the back
- builtins contain the most common bioinformatic tasks, even ones that are more computationally heavy
- social, scrapscript-style functions (not packages?)

## Development road map

1) file sniffer
2) implementation of some basic algorithms (alignment, BLAST, assembly?, translation, motif finding)

## Future optimizations to note

- idiosyncratic hashing algorithms for nucleotides
- fast k-mer implementation: https://doi.org/10.1093/bioinformatics/btr011
- thoughts on concurrency:
  - concurrency over multiple records is embarrassingly parallel
  - concurrency within records gets hard very quickly
    - handling junctions is harder
    - with k-mers: junction overlaps are defined
    - with something that involves unlimited search: impossible
    - conceptually feels impossible with a traditional approach
- fast UTF-8 checking:
  - https://www.reddit.com/r/rust/comments/mvc6o5/incredibly_fast_utf8_validation/?rdt=59336
- regex improvements:
  - for something like a FASTA, might be worth seeing if you can just fill the regex with optional whitespace between each character to cut down on parsing load and allocations
  - streaming: https://github.com/rust-lang/regex/issues/425
  - see:
    - https://docs.rs/regex-cursor/ (eventually going to be merged with regex main?)
    - https://github.com/deadpixi/ergex (finished implementation, seems to fit my use case, not super well maintained)
- memory-mapped files?
  - I've played around with this and it didn't seem to make a tremendous difference (I'm already reading the whole file into a String); maybe I'm missing something

## Test files

FASTQ: https://github.com/hartwigmedical/testdata/blob/master/100k_reads_hiseq/TESTX/TESTX_H7YRLADXX_S1_L001_R1_001.fastq.gz
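On the "idiosyncratic hashing algorithms for nucleotides" note above: since DNA has a 4-letter alphabet, any k-mer with k ≤ 32 packs into a single `u64` at 2 bits per base, and a rolling update avoids re-encoding each window from scratch. A minimal sketch of the idea (my own illustration, not taken from the linked paper; the ambiguity-code handling is an assumption):

```rust
/// Map a nucleotide to 2 bits (A=0, C=1, G=2, T=3).
/// Returns None for N and other IUPAC ambiguity codes.
fn base2bits(b: u8) -> Option<u64> {
    match b {
        b'A' | b'a' => Some(0),
        b'C' | b'c' => Some(1),
        b'G' | b'g' => Some(2),
        b'T' | b't' => Some(3),
        _ => None,
    }
}

/// Yield the 2-bit packed encoding of every k-mer in `seq`,
/// rolling in one base at a time instead of re-encoding windows.
fn packed_kmers(seq: &[u8], k: usize) -> Vec<u64> {
    assert!(k >= 1 && k <= 32, "k-mers beyond 32 bases overflow u64");
    let mask = if k == 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
    let mut out = Vec::new();
    let (mut window, mut filled) = (0u64, 0usize);
    for &b in seq {
        match base2bits(b) {
            Some(bits) => {
                // Shift in the new base; the mask drops the oldest one.
                window = ((window << 2) | bits) & mask;
                filled += 1;
                if filled >= k {
                    out.push(window);
                }
            }
            // An ambiguous base invalidates the current window.
            None => filled = 0,
        }
    }
    out
}

fn main() {
    // "ACGT" with k=2 -> AC, CG, GT -> 0b0001, 0b0110, 0b1011
    assert_eq!(packed_kmers(b"ACGT", 2), vec![0b0001, 0b0110, 0b1011]);
    println!("{:?}", packed_kmers(b"ACGT", 2));
}
```

The packed value doubles as a perfect hash for small k, which is what makes nucleotide-specific hashing cheaper than hashing the bytes of the string.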