Crates.io | ppgg |
lib.rs | ppgg |
version | 0.1.4 |
source | src |
created_at | 2021-08-11 19:48:23.600953 |
updated_at | 2021-08-11 19:48:23.600953 |
description | A library and an associated executable: the library provides tools for building software that parses and works with VCF and FASTA files, while the executable is a command-line tool for generating protein sequences from a reference FASTA file and a VCF file |
homepage | https://github.com/ikmb/ppg |
repository | https://github.com/ikmb/ppg |
size | 435,747 |
Accelerate the generation of personalized proteomes from a Variant Call Format (VCF) file and a reference proteome using graphics processing units (GPUs).
>TRANS_ID
TRANS_SEQ_LINE1
TRANS_SEQ_LINE2
>TRANS_ID
TRANS_SEQ_LINE1
.
.
.
That is, the parser expects every character between '>' and the newline ('\n') to be the transcript name. Also, please make sure that the IDs used in the FASTA file are the same as those used in the VCF file; otherwise, the program will not function properly.
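For illustration, the sketch below shows how such headers can be extracted in Rust; this is a generic example with placeholder transcript IDs and a hypothetical helper function, not PPGG's actual parser:
// Hypothetical helper: collect transcript IDs from a FASTA-formatted string,
// taking every character between '>' and the end of the line as the ID.
fn transcript_ids(fasta: &str) -> Vec<String> {
    fasta
        .lines()
        .filter(|line| line.starts_with('>'))
        .map(|line| line[1..].trim_end().to_string())
        .collect()
}

fn main() {
    let fasta = ">TRANS_1\nMTTAV\n>TRANS_2\nMPQLN\n";
    assert_eq!(transcript_ids(fasta), vec!["TRANS_1".to_string(), "TRANS_2".to_string()]);
}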
When it comes to the VCF input, the only exception is the Python wrapper, which works directly with tabix-indexed BCF files.
You can decode a BCF file into a VCF using the following command:
bcftools view PATH_TO_BCF -O v -o PATH_TO_VCF
The GPU version of PPGG expects an NVIDIA GPU to be accessible on the system; during development we used a Tesla V100 SXM2 32 GB.
The CPU version expects a modern multi-core CPU with enough RAM to hold the whole file in memory; during development, a compute node with 512 GB of RAM and twin Intel Xeon CPUs was used.
The GPU version of the code can be compiled on a Linux system with an available NVCC compiler and an NVIDIA GPU.
The CPU version of the code can be compiled on Linux and macOS systems with Cargo.
PPGG's execution logic can be separated into the following main steps (a schematic Rust sketch of the core data flow follows the list):
Reading and parsing the file: the file is read as a UTF-8 encoded string, patient names are extracted, and records are filtered so that only records with a supported protein-coding effect are passed to the next step. The list of alterations supported by the current version is available in the file list_supported_alterations.tsv.
Once the VCF records have been filtered, bit-masks are decoded and combined with the consequence mutations to generate a hash map linking each patient to the collection of mutations observed in both of the patient's haplotypes.
For each patient, mutations are grouped by transcript ID, i.e. all mutations occurring on a specific transcript are combined together.
For each collection of mutations, the mutations are translated into instructions. At that stage, mutations are checked for logical errors, e.g. mutational engulfment, where one mutation is a subset of another, or multiple annotations, where the same position is annotated with more than one mutation. Semantic equivalence, where two mutations differ at the genetic level but are equivalent at the protein level, is also resolved, leading to a smaller and more consistent definition of alterations at the protein level. In case any logical error is encountered, a warning message is printed to the standard output descriptor and the transcript is filtered out. Finally, the instructions are interpreted and a simple representation of the transcript sequence is generated; internally, this is represented as a vector of Tasks.
After encoding each transcript into tasks, all transcripts are concatenated end-to-end to generate a vector of tasks describing the generation of all sequences in the haplotype.
Next, a backend engine is used to execute the tasks and generate the sequences; for example, this engine can be a collection of CPU threads or an execution stream on the GPU.
Finally, the results are written to disk using a pool of writer threads.
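To make the data flow above concrete, the following is a minimal, simplified Rust sketch; the type and function names (Mutation, Task, group_by_transcript, to_tasks) are illustrative assumptions and do not mirror PPGG's actual internal API:
// Illustrative types only; PPGG's real types and names may differ.
use std::collections::HashMap;

// A protein-level alteration on one haplotype of one transcript.
struct Mutation {
    transcript_id: String,
    position: usize,     // position in the protein backbone
    reference: String,   // reference amino acids at that position
    alternative: String, // alternative amino acids
}

// A Task either copies a stretch of the reference or inserts altered residues.
enum Task {
    CopyFromReference { transcript_id: String, start: usize, len: usize },
    InsertAltered { residues: String },
}

// Steps 2-3: regroup each patient's mutations by transcript ID.
fn group_by_transcript(
    per_patient: HashMap<String, Vec<Mutation>>,
) -> HashMap<String, HashMap<String, Vec<Mutation>>> {
    per_patient
        .into_iter()
        .map(|(patient, muts)| {
            let mut by_txp: HashMap<String, Vec<Mutation>> = HashMap::new();
            for m in muts {
                by_txp.entry(m.transcript_id.clone()).or_default().push(m);
            }
            (patient, by_txp)
        })
        .collect()
}

// Steps 4-5: translate one transcript's mutations into a vector of Tasks.
// Assumes the mutations are sorted by position and free of logical errors.
fn to_tasks(transcript_id: &str, ref_len: usize, muts: &[Mutation]) -> Vec<Task> {
    let mut tasks = Vec::new();
    let mut cursor = 0;
    for m in muts {
        if m.position > cursor {
            tasks.push(Task::CopyFromReference {
                transcript_id: transcript_id.to_string(),
                start: cursor,
                len: m.position - cursor,
            });
        }
        tasks.push(Task::InsertAltered { residues: m.alternative.clone() });
        cursor = m.position + m.reference.len();
    }
    if cursor < ref_len {
        tasks.push(Task::CopyFromReference {
            transcript_id: transcript_id.to_string(),
            start: cursor,
            len: ref_len - cursor,
        });
    }
    tasks
}

fn main() {
    // One patient with a single missense mutation on a 6-residue transcript.
    let mut per_patient = HashMap::new();
    per_patient.insert(
        "patient_1".to_string(),
        vec![Mutation {
            transcript_id: "TRANS_1".to_string(),
            position: 2,
            reference: "A".to_string(),
            alternative: "V".to_string(),
        }],
    );
    let grouped = group_by_transcript(per_patient);
    let tasks = to_tasks("TRANS_1", 6, &grouped["patient_1"]["TRANS_1"]);
    println!("{} tasks generated for TRANS_1", tasks.len()); // copy, insert, copy => 3
}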
Two mandatory inputs are needed by the tool: the first is the VCF file containing the consequence calls and the second is a FASTA file containing the reference sequences.
git clone https://github.com/ikmb/ppg
Please note that git usually comes pre-installed on most macOS and Linux systems. If git is not available on your system, you can install it from the official git website.
cd ppg
Please note that after calling git, a directory named ppg will be created in the directory from which git was called.
To follow along, make sure the executable ppg has been installed on your system and is available on your PATH. In case it is not installed, check the installation guidelines below.
Let's inspect the GPU arrays, the instruction generation, and the Task arrays:
export DEBUG_GPU=TRUE
export INSPECT_TXP=TRUE
export INSPECT_INS_GEN=TRUE
For more details about the meaning of the exported variables, check the Environment Variables section below.
mkdir results
ppg -f examples/example_file.vcf -r examples/References_sequences.fasta -vs -g st -o results
Here, the -o flag determines the path where the output FASTA files are written, the -s flag instructs the program to write stats, and the -v flag prints log statements.
PPGG also makes heavy use of environment variables to customize its behavior. The list of environment variables used by PPGG is shown below, followed by a short sketch of how such switches can typically be read:
DEBUG_GPU => The input arrays to the GPU are inspected for indexing errors; in case of an indexing error, the full input table is printed and the index of the row with the first indexing error is also printed to the standard output descriptor.
DEBUG_CPU_EXEC => The vector of tasks provided to the CPU execution engine is inspected for indexing errors; in case of an indexing error, the full input table is printed and the index of the row with the first indexing error is also printed to the standard output descriptor.
DEBUG_TXP="Transcript_ID" => This variable exports a transcript ID that will be used for debugging; while the sequence for that transcript is being created, different pieces of information will be logged to the output descriptor.
INSPECT_TXP => If set, after each transcript is translated into instructions, an inspection function is called to check the correctness of the translation; if the translation fails, the code panics and an error is printed to the output descriptor.
INSPECT_INS_GEN => Inspects the translation process from mutations to instructions. As of version 0.1.3, two logical errors are inspected: first, multiple annotations, where more than one mutation is observed at the same position in the protein backbone; and second, mutational overlap and engulfment, where two mutations overlap in length, for example an insertion of 7 amino acids at position 60 followed by a missense mutation at position 64.
PANIC_INSPECT_ERR => If set, the code panics if the inspection of the translation from mutations to instructions fails. This overrides the default behavior, where an error message is generated and printed to the output stream.
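For orientation, switches like these can be read from the environment with std::env; the following is a minimal, generic Rust sketch and does not reproduce PPGG's actual internals:
// Generic sketch: reading opt-in debug switches from the environment.
use std::env;

fn main() {
    // Presence of the variable (with any value) enables the corresponding check.
    let debug_gpu = env::var("DEBUG_GPU").is_ok();

    // A variable that carries a value, e.g. a transcript ID to trace.
    let debug_txp: Option<String> = env::var("DEBUG_TXP").ok();

    if debug_gpu {
        println!("DEBUG_GPU is set: GPU input arrays would be inspected for indexing errors.");
    }
    if let Some(txp) = debug_txp {
        println!("DEBUG_TXP is set: tracing transcript {}", txp);
    }
}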
Compiling the code as follows will produce a CPU-only version; this means that the program will panic if the GPU is specified as the engine, i.e. if the parameter -g is set to gpu.
Install Rust from the official website
Clone the current repository
git clone https://github.com/ikmb/ppg
cd ppg
git checkout cpu-only
cargo build --release
cd target/release
./ppg -h # This prints the help statement
The following GPU code is only compatible with CUDA and NVIDIA GPUs
Install Rust from the official website
Clone the current repository or Download the source code using the project Github page
git clone https://github.com/ikmb/ppg
cd ppg
Please make sure that the environment variables CUDA_HOME and LD_LIBRARY_PATH are set; set their values according to your system.
Using any text editor, update the following line in the build script build.rs, which is located at the root of the repository (the 8th line in the current version):
println!("cargo:rustc-link-search=native=/opt/cuda/11.0/lib64/"); // 8th line in the current version
println!("cargo:rustc-link-search=native=/path two cuda lib64 directory"); // 8th line in the updated version
cargo build --release
cd target/release
./ppg -h # This prints the help statement
error while loading shared libraries: libcudart.so.11.0: cannot open shared object file: No such file or directory
This problem will be encountered in case either of the two environment variables, CUDA_HOME and LD_LIBRARY_PATH, is not defined or set. For a permanent solution, please update your .bashrc to have these two variables exported.
For further questions, please feel free to open an issue on the project's GitHub page, send an email to the developers at h.elabd@ikmb.uni-kiel.de, or reach out on Twitter @HeshamElAbd16.
The project was funded by the German Research Foundation (DFG) (Research Training Group 1743, ‘Genes, Environment and Inflammation’)