Annotating sequence variants (SNV/indel/MNV). ## Sequence Variants And Their VCF Representation Sequence variants are small variants that are typically represented by the substitution of one or more bases in the reference genome. In a VCF file, this will typically look as follows: ```text #CHROM POS ID REF ALT QUAL FILTER INFO 17 41197700 . A C . . . 17 41197708 . TG T . . . ``` In the literature, there generally is made a distinction between sequence (or: small) variants and structural variants. There, small variants are those affecting up to 50bp while structural variants are those that are larger. Mehari does not enforce such a restriction but attempt to annotate all variants based on their VCF representation (chromosome, 1-based position, reference base and alternative bases). You can find the VCF variant specification [here on Github](https://samtools.github.io/hts-specs/). ## Sequence Variant Annotation Currently, Mehari will annotate variants using: - The predicted impact on gene transcripts and the corresponding protein sequence (in the case of coding genes). - Their frequency in the gnomAD exomes and genomes databases as well as the HelixMtDb database in the case of mitochondrial databases. - Variant information from ClinVar, if any ## Command Line Invocation You can invoke Mehari to annotate a VCF file `IN.vcf` creating an output file `OUT.vcf` using the built (or downloaded) databases – for example the transcript database – as follows: ```text $ mehari annotate seqvars \ --transcripts path/to/transcripts-db \ --path-input-vcf IN.vcf \ --path-output-vcf OUT.vcf ``` Note that the input and output files can optionally be gzip/bgzip compressed VCF files with suffixes (`.gz` or `.bgz`) or BCF files with suffix `.bcf`. The database genome build should match the one in the input VCF file (e.g., both should either be GRCh37/hg19 or GRCh38/hg38). ## Interpreting Annotation Output The variant effect/consequence will be formatted similar to the one in, following the `ANN` field standard [documented here](https://pcingola.github.io/SnpEff/se_inputoutput/#ann-field-vcf-output-files). The population frequency will be written to the VCF INFO fields as follows: - gnomAD exomes - `gnomad_exomes_an` -- number of observed alleles in gnomAD exomes - `gnomad_exomes_hom` -- number of homozygous carriers in gnomAD exomes - `gnomad_exomes_het` -- number of heterozygous carriers in gnomAD exomes - `gnomad_exomes_hemi` -- number of hemizygous carriers in gnomAD exomes - gnomAD genomes - `gnomad_genomes_an` -- number of observed alleles in gnomAD genomes - `gnomad_genomes_hom` -- number of homozygous carriers in gnomAD genomes - `gnomad_genomes_het` -- number of heterozygous carriers in gnomAD genomes - `gnomad_genomes_hemi` -- number of hemizygous carriers in gnomAD genomes - gnomAD mtDNA - `gnomad_mtdna_an` -- number of individuals with sufficient coverage in gnomAD mtDNA database - `gnomad_mtdna_hom` -- number of homoplasmic carriers in gnomAD mtDNA database - `gnomad_mtdna_het` -- number of heteroplasmic carriers in gnomAD mtDNA database - HelixMtDb - `helix_an` -- number of individuals in HelixMtDb - `helix_hom` -- number of homoplasmic carriers in HelixMtDb - `helix_het` -- number of heteroplasmic carriers in HelixMtDb From these integer numbers, the relative allele frequencies can be derived as needed.