# Import bioinformatic formats

- Protein FASTA
  - List (string, AA_Seq)
- DNA FASTA
  - List (string, DNA_Seq)
- FASTQ
  - List (string, record, quality)
- GB
  - (string, )
- BED or BED-like: BED, BED graph, BigBed
  - List (Coordinates, [ other info ])
    - Coords: (chrom, start, stop)
  - VCF probably falls into this category
  - not sure how bedgraph fits in there
- GFF3 probably falls into the above category
- how to think about SAM?

General principles:

- ~~Design a grammar for each spec~~
  - There seem to be two cases here
    1) The file format has a spec (BAM, VCF)
    2) The file format does not have a spec
  - A grammar for #1 is thus unnecessary, and a grammar for #2 violates Postel's Law (seemingly)
  - The other issue I'm concerned about is whether or not the error recovery will be good enough
    - A grammar that fails in some obscure way is not useful
  - Finally, these formats need to be streamed in most cases
- If something goes off-spec, try to flag it but proceed
  - Postel's Law, Postel's Law
- Stream it if you can
  - Read the first 1000 lines (or X number of bytes, whichever comes first) or so, and then go from there
- Trait-based polymorphism
  - Reference
  - Features (iterable)
  - Records
  - Reads? (How is this different from a sequence?)

FileData is a `&str` for now, but it can be something different in the future (e.g. it can be iterable with the first N lines or bytes cached).

FileData -> peek -> FileData, Format -> Box

Box: is there a way to try and convert this at run time?

fn align()

PEEK functions are used for _inference_ purposes. For instance, just because `peek_gb(file)` is true doesn't mean the file is completely syntactically valid (it only checks the LOCUS line). Similarly, `peek_bed(file)` only returns true if the file is indisputably a BED file. (Blank files and comment-only files trivially return true for BED.)
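
The peek-based inference described above could look roughly like the sketch below. It is only an illustration under assumptions: `peek_gb` and `peek_bed` are named in these notes, but the `Format` enum, the `PEEK_LINES` constant, the `infer_format` helper, and the exact BED heuristics (skipping `#`, `track`, and `browser` lines, requiring numeric start and stop) are hypothetical choices, not settled design. `FileData` is taken to be a plain `&str`, as stated above.

```rust
/// Hypothetical result of the peek step; the variant names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Format {
    GenBank,
    Bed,
    Unknown,
}

/// Assumed inspection limit (the notes suggest roughly the first 1000 lines).
const PEEK_LINES: usize = 1000;

/// Inference only: checks that the first non-blank line starts with "LOCUS".
/// A `true` result does not mean the rest of the file is valid GenBank.
fn peek_gb(file: &str) -> bool {
    file.lines()
        .find(|l| !l.trim().is_empty())
        .map_or(false, |l| l.starts_with("LOCUS"))
}

/// Inference only: every inspected data line must look like
/// `chrom start stop [other info]` with numeric coordinates.
/// Blank and comment-only input has no data lines, so it is trivially `true`.
fn peek_bed(file: &str) -> bool {
    file.lines()
        .take(PEEK_LINES)
        .map(str::trim)
        // Skip blank lines, comments, and UCSC-style header lines.
        .filter(|l| {
            !l.is_empty()
                && !l.starts_with('#')
                && !l.starts_with("track")
                && !l.starts_with("browser")
        })
        .all(|l| {
            // Be liberal in what we accept: split on any whitespace, not only tabs.
            let mut cols = l.split_whitespace();
            let chrom = cols.next();
            let start = cols.next().and_then(|s| s.parse::<u64>().ok());
            let stop = cols.next().and_then(|s| s.parse::<u64>().ok());
            matches!((chrom, start, stop), (Some(_), Some(s), Some(e)) if s <= e)
        })
}

/// Hypothetical dispatcher. The more specific check runs first because
/// `peek_bed` is trivially true for empty input.
fn infer_format(file: &str) -> Format {
    if peek_gb(file) {
        Format::GenBank
    } else if peek_bed(file) {
        Format::Bed
    } else {
        Format::Unknown
    }
}
```

For example, `infer_format("chr1\t100\t200\n")` would report `Format::Bed`, while a FASTA header such as `>seq1` falls through to `Format::Unknown` in this sketch because no FASTA peek is included. The ordering inside `infer_format` matters: since empty input satisfies `peek_bed`, any permissive check like it probably belongs last.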