[![Crates.io](https://img.shields.io/crates/d/varfish-server-worker.svg)](https://crates.io/crates/varfish-server-worker) [![CI](https://github.com/bihealth/varfish-server-worker/actions/workflows/rust.yml/badge.svg)](https://github.com/bihealth/varfish-server-worker/actions/workflows/rust.yml) [![codecov](https://codecov.io/gh/bihealth/varfish-server-worker/branch/main/graph/badge.svg?token=t5oheMdukf)](https://codecov.io/gh/bihealth/varfish-server-worker) [![DOI](https://zenodo.org/badge/590461457.svg)](https://zenodo.org/badge/latestdoi/590461457)

# VarFish Server Worker

> [!NOTE]
> This repository contains code that runs inside a VarFish Server.
> If you are looking to run your own VarFish Server, look [here at bihealth/varfish-server](https://github.com/bihealth/varfish-server).

This repository contains the worker used by VarFish Server to execute certain background tasks. The worker is written in the Rust programming language to speed up the execution of these tasks. At the moment, the following sub commands exist:

- `db` -- subcommands to build binary (protobuf) database files
- `seqvars` -- subcommands for processing sequence (aka small/SNV/indel) variants
  - `seqvars ingest` -- convert a single VCF file into the internal format for use with `seqvars query`
  - `seqvars query` -- perform sequence variant filtration and on-the-fly annotation
  - `seqvars prefilter` -- limit the output of `seqvars ingest` by population frequency and/or distance to exon
  - `seqvars aggregate` -- read through multiple VCF files written by `seqvars ingest` and compute a carrier counts table
- `strucvars` -- subcommands for processing structural variants (aka large variants, CNVs, etc.)
  - `strucvars ingest` -- convert one or more structural variant files for use with `strucvars query`
  - `strucvars aggregate` -- compile per-case structural variants into an in-house database, to be converted to `.bin` with `strucvars txt-to-bin`
  - `strucvars txt-to-bin` -- convert text files downloaded by [varfish-db-downloader](https://github.com/bihealth/varfish-db-downloader/) to binary for fast use in `strucvars query` commands
  - `strucvars query` -- perform structural variant filtration and on-the-fly annotation

## Overall Design

For running queries, the worker tool is installed into the VarFish Server image and run as an executable. Internally, VarFish Server works on VCF files stored in an S3 storage.

For import, the user gives the server access to the VCF files to import. The server then uses the worker executable to ingest the data into the internal format using `{seqvars,strucvars} ingest`. These files are then stored in the internal S3 storage.

For queries, the server creates a query JSON file and passes it together with the internal file to the worker executable. The worker creates a result file that can be directly imported by the server and displayed to the user.

Future versions may provide persistently running HTTP/REST servers that provide the same functionality without startup cost.

## The `seqvars ingest` Command

This command takes as input a single VCF file from a (supported) variant caller and converts it into a file for further querying. The command interprets the following fields, which are written out by commonly used variant callers such as GATK UnifiedGenotyper, GATK HaplotypeCaller, and Illumina Dragen.
- `FORMAT/GT` -- genotype
  - the following `GT` values are written out: `0/0`, `0/1`, `1/0`, `1/1`, `0|0`, `0|1`, `1|0`, `1|1`, `./.`, `.|.`, `.`
  - no combination of a no-call (`.`) and a called allele is written out
- `FORMAT/GQ` -- genotype quality
- `FORMAT/DP` -- total read coverage
- `FORMAT/AD` -- allelic depth, one value per allele (including the reference)
- `FORMAT/PS` -- physical phasing information as written out by GATK HaplotypeCaller in the GVCF workflow and the Dragen variant caller
- `FORMAT/SQ` -- "somatic quality" for each alternate allele, as written out by the Illumina Dragen variant caller
  - this field will be written out as `FORMAT/GQ`

The `seqvars ingest` command will annotate the variants with the following information:

- gnomAD genomes and exomes allele frequencies
- gnomAD-mtDNA and HelixMtDb allele frequencies
- functional annotation following the [VCF ANN field standard](https://pcingola.github.io/SnpEff/adds/VCFannotationformat_v1.0.pdf)
  - `Gene_Name` is written as the HGNC symbol
  - `Gene_ID` is written as the HGNC ID

The command will emit one output line for each variant allele from the input and each affected gene. That is, if two variant alleles affect two genes, four records will be written to the output file. Only the annotation for the single highest impact is written out.

Overall, the command will emit the following header rows in addition to the `##contig=` lines.

```
##fileformat=VCFv4.2
##FILTER=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##x-varfish-case-uuid=d2bad2ec-a75d-44b9-bd0a-83a3f1331b7c
##x-varfish-version=
##x-varfish-version=
##x-varfish-version=
```

> [!NOTE]
> The gnomAD-mtDNA information is written to the `INFO/gnomad_genome_*` fields.

> [!NOTE]
> Future versions of the worker will annotate the worst effect on a MANE Select or MANE Plus Clinical transcript.
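The genotype-writing rules above (only fully called or fully no-call genotypes, phasing preserved) can be sketched in Rust. The helper `normalize_gt` is hypothetical and not part of the worker's actual code:

```rust
/// Sketch of the `FORMAT/GT` rules above (hypothetical helper, not the
/// worker's implementation): returns `None` for genotypes that mix a
/// no-call (`.`) with a called allele, which are not written out.
fn normalize_gt(gt: &str) -> Option<String> {
    // Keep the separator: `|` means phased, `/` means unphased.
    let sep = if gt.contains('|') { '|' } else { '/' };
    let alleles: Vec<&str> = gt.split(sep).collect();
    // A bare haploid no-call `.` is written out as-is.
    if alleles == ["."] {
        return Some(".".to_string());
    }
    let no_calls = alleles.iter().filter(|a| **a == ".").count();
    // Mixed no-call and called allele: not written out.
    if no_calls != 0 && no_calls != alleles.len() {
        return None;
    }
    Some(alleles.join(&sep.to_string()))
}

fn main() {
    assert_eq!(normalize_gt("0/1"), Some("0/1".to_string()));
    assert_eq!(normalize_gt(".|."), Some(".|.".to_string()));
    assert_eq!(normalize_gt("./1"), None); // mixed no-call: dropped
    println!("ok");
}
```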
## The `seqvars prefilter` Command

This command takes as input a file created by `seqvars ingest` and filters the variants by population frequency and/or distance to exon. You can pass the prefilter criteria as JSON on the command line corresponding to the following Rust struct:

```rust
struct PrefilterParams {
    /// Path to output file.
    pub path_out: String,
    /// Maximal allele population frequency.
    pub max_freq: f64,
    /// Maximal distance to exon.
    pub max_dist: i32,
}
```

You can either specify the parameters on the command line directly or pass a path to a JSONL file, prefixed with `@`. You can mix both ways.

```
$ varfish-server-worker seqvars prefilter \
    --path-input INPUT.vcf \
    --params '{"path_out": "out.vcf", "max_freq": 0.01, "max_dist": 100}' \
    [--params ...]

# OR

$ varfish-server-worker seqvars prefilter \
    --path-input INPUT.vcf \
    --params @path/to/params.json \
    [--params ...]
```

## The `seqvars aggregate` Command

This command reads through multiple files written by `seqvars ingest` and computes an in-house carrier counts table. You can specify the VCF files individually on the command line or pass in files that list the paths to the VCF files line by line. The resulting table is stored in a RocksDB database folder.

```shell
varfish-server-worker seqvars aggregate \
    --genome-build {grch37,grch38} \
    --path-out-rocksdb rocksdb/folder \
    --path-in-vcf path/to/vcf.gz \
    --path-in-vcf @path/to/file/list.txt
```

## The `seqvars query` Command

This command performs the querying of sequence variants and further annotation using annonars databases.

## The `strucvars ingest` Command

This command takes as input one or more VCF files from structural variant callers and converts them into a file for further querying. The command supports the following variant callers and can guess the caller from the VCF header and first record.
- Delly2
- Dragen-SV (equivalent to Manta)
- Dragen-CNV
- GATK gCNV
- Manta
- MELT
- PopDel
- Sniffles2

One record will be written out for each variant, each with a single alternate allele. The following symbolic `ALT` alleles are used:

- `<DEL>`
- `<DUP>`
- `<INS>`
- `<INV>`
- VCF break-end syntax, e.g., `T[chr1:5[`

The following `INFO` fields are written:

- `IMPRECISE` -- flag that specifies that this is an imprecise variant
- `END` -- end position of the variant
- `SVTYPE` -- type of the variant, one of `DEL`, `DUP`, `INS`, `INV`, `BND`
- `SVLEN` -- absolute length of the SV for linear variants, `.` for non-linear variants
- `SVCLAIM` -- specification of `D` (change in abundance), `J` (novel junction), or `DJ` (both change in abundance and novel junction)
- `callers` -- (non-standard field) list of callers that called the variant
- `chr2` -- (non-standard field) second chromosome for BND variants
- `annsv` -- (non-standard field) annotation of the variant effect on each affected gene

The `annsv` field is a pipe-character (`|`) separated list of the following fields:

1. symbolic alternate allele, e.g., `<DEL>`
2. effects on the gene's transcript, separated by `&`
   - `transcript_variant` -- variant affects the whole transcript
   - `exon_variant` -- variant affects an exon
   - `splice_region_variant` -- variant affects a splice region
   - `intron_variant` -- variant affects only introns
   - `upstream_variant` -- variant is upstream of the gene
   - `downstream_variant` -- variant is downstream of the gene
   - `intergenic_variant` -- default for "no gene affected", but never written out
3. HGNC gene symbol, e.g., `BRCA1`
4. HGNC gene ID, e.g., `HGNC:1100`

The following `FORMAT` fields are written:

- `GT` -- (standard field) genotype, if applicable
- `GQ` -- (standard field) genotype quality, if applicable
- `pec` -- total coverage with paired-end reads
- `pev` -- paired-end reads supporting the variant
- `src` -- total coverage with split reads
- `srv` -- split reads supporting the variant
- `amq` -- average mapping quality over the variant
- `cn` -- copy number of the variant in the sample
- `anc` -- average normalized coverage over the variant in the sample
- `pc` -- point count (windows/targets/probes)

Overall, the command will emit the following header rows in addition to the `##contig=` lines.

```
##fileformat=VCFv4.4
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##FILTER=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##ALT=
##ALT=
##ALT=
##ALT=
##ALT=
##fileDate=20230421
##x-varfish-genome-build=GRCh37
##SAMPLE=
##SAMPLE=
##SAMPLE=
##PEDIGREE=
##PEDIGREE=
##PEDIGREE=
##x-varfish-case-uuid=d2bad2ec-a75d-44b9-bd0a-83a3f1331b7c
##x-varfish-version=
##x-varfish-version=
```

> [!NOTE]
> The `strucvars ingest` step does not perform any annotation.
> It only merges the input VCF files from multiple callers (all files must have the same samples) and converts them into the internal format.
> The `INFO/annsv` field is filled by `strucvars query`.

## The `strucvars aggregate` Command

Import multiple files created by `strucvars ingest` into a database that can be converted to `.bin` with `strucvars txt-to-bin` and then used by `strucvars query`. You can specify the files individually. Paths starting with an at (`@`) character are interpreted as files with lists of paths. You can mix paths with and without `@`.
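The `@`-list convention can be sketched in Rust as follows; `expand_path_args` is a hypothetical helper for illustration, not the worker's actual code:

```rust
use std::fs;

/// Sketch of the `@`-list convention (hypothetical helper, not the
/// worker's implementation): an argument with a leading `@` names a
/// file containing one path per line; plain arguments are taken as-is.
fn expand_path_args(args: &[String]) -> std::io::Result<Vec<String>> {
    let mut paths = Vec::new();
    for arg in args {
        if let Some(list_path) = arg.strip_prefix('@') {
            // Read the list file and add each non-empty line as a path.
            for line in fs::read_to_string(list_path)?.lines() {
                let line = line.trim();
                if !line.is_empty() {
                    paths.push(line.to_string());
                }
            }
        } else {
            paths.push(arg.clone());
        }
    }
    Ok(paths)
}

fn main() -> std::io::Result<()> {
    // Mixing plain paths and `@` list files is allowed.
    fs::write("/tmp/path-list.txt", "IN/a.vcf.gz\nIN/b.vcf.gz\n")?;
    let args = vec!["IN/c.vcf.gz".to_string(), "@/tmp/path-list.txt".to_string()];
    assert_eq!(
        expand_path_args(&args)?,
        vec!["IN/c.vcf.gz", "IN/a.vcf.gz", "IN/b.vcf.gz"]
    );
    println!("ok");
    Ok(())
}
```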
```
$ varfish-server-worker strucvars aggregate \
    --genome-release {Grch37,Grch38} \
    --path-output OUT.tsv \
    --path-input IN/file1.vcf.gz \
    [--path-input IN/file2.vcf.gz]

# OR:

$ varfish-server-worker strucvars aggregate \
    --genome-release {Grch37,Grch38} \
    --path-output OUT.tsv \
    --path-input @IN/path-list.txt \
    [--path-input @IN/path-list2.txt]
```

## The `strucvars txt-to-bin` Command

Convert output of [varfish-db-downloader](https://github.com/bihealth/varfish-db-downloader/) to a directory with databases to be used by query commands such as `strucvars query`.

```
$ varfish-server-worker strucvars txt-to-bin \
    --input-type {ClinvarSv,StrucvarInhouse,...} \
    --path-input IN.txt \
    --path-output DST.bin
```

## The `strucvars query` Command

Run a query on a VCF file with structural variants as created by `strucvars ingest`, using a varfish worker database.

```
$ varfish-server-worker strucvars query \
    --genome-release grch37 \
    --path-db path/to/worker-db \
    --path-input IN.vcf.gz \
    --path-output OUT.jsonl
```

The worker database has the following structure. Note that mehari transcripts are also read, thus the `mehari/` directory is included.
```
mehari/
  {genome_release}/
    txs.bin.zst
worker/
  noref/
    genes/
      acmg.tsv              -- ACMG SF list genes
      mim2gene.tsv          -- OMIM to NCBI mapping from clingen
      xlink.bin             -- gene crosslinks
  {genome_release}/         -- one per genome release
    features/               -- features important for annotation
      masked_repeat.bin     -- masked repeats
      masked_seqdup.bin     -- masked segmental duplications
    strucvars/              -- structural variant specific
      bgdbs/                -- background databases
        dbvar.bin           -- dbVar
        dgv.bin             -- DGV
        dgv_gs.bin          -- DGV gold standard
        g1k.bin             -- 1000 Genomes CNVs
        gnomad_exomes.bin   -- gnomAD-exomes/ExAC SVs
        gnomad_genomes.bin  -- gnomAD-genomes SVs
      clinvar.bin           -- ClinVar SVs
      inhouse.bin           -- in-house SV database
      patho_mms.bed         -- well-known pathogenic DELs/DUPs
      tads/
        hesc.bed            -- hESC TAD definitions
```

# Developer Information

This section is only relevant for developers of `varfish-server-worker`.

## Development Setup

You will need to have git LFS installed to get the test databases.

You will need a recent version of the protocolbuffers compiler `protoc`, e.g.:

```
# bash utils/install-protoc.sh
# export PATH=$PATH:$HOME/.local/share/protoc/bin
```

For running protolint, install it as the Python package `protolint-bin`:

```
# virtualenv /tmp/varfish-server-worker
# source /tmp/varfish-server-worker/bin/activate
# pip install protolint-bin
```

### Building from scratch

To reduce compile times, we recommend using a pre-built version of `rocksdb`, either from the system package manager or e.g.
via `conda`:

```bash
# Ubuntu
sudo apt-get install librocksdb-dev
# Conda
conda install -c conda-forge rocksdb
```

In either case, either add

```toml
[env]
# Paths for the system package manager; adjust accordingly for conda.
ROCKSDB_LIB_DIR = "/usr/lib/"
SNAPPY_LIB_DIR = "/usr/lib/"
```

to `.cargo/config.toml` or set the environment variables `ROCKSDB_LIB_DIR` and `SNAPPY_LIB_DIR` to the appropriate paths:

```bash
export ROCKSDB_LIB_DIR=/usr/lib/
export SNAPPY_LIB_DIR=/usr/lib/
```

By default, these environment variables are defined in `.cargo/config.toml` as described above, so they may need adjustment if you are not using the system package manager.

To build the project, run:

```bash
cargo build --release
```

To install the project locally, run:

```bash
cargo install --path .
```