Crates.io | diploid-contam-estimator |
lib.rs | diploid-contam-estimator |
version | 0.3.4 |
source | src |
created_at | 2022-08-18 18:23:30.531727 |
updated_at | 2022-10-29 19:29:03.303135 |
description | Estimating contamination level in diploid DNA seuqencing libraries |
homepage | |
repository | https://github.com/wckdouglas/diploid-contam |
max_upload_size | |
id | 648270 |
size | 11,296,782 |
A module to estimate contamination level from diploid variant calls. This is heavily inspired by Dcon.
We are hypothesizing the contamination level would be between $0-0.4$, and at each hypothesized contamination level, we'll calculate the likelihood observing the observed variant-allele-frequency assuming the given contamination level is true for each variant in the VCF file using a binomial model. We then sum the log likelihood for all variants and pick the maximum likelihood contamination level as the final call.
Pseudo code:
n = total_count
x = alt_allele_count
contam = 0
max_log_likelihood = -inf
for contam_level in all_contamination_level:
p = expected_alt_fraction_for_the_given_contamination_level
log_likelihood = sum(binom_loglik(n, x, p) for all_variants)
if log_likelihood > max_log_likelihood:
contam = contam_level
A simulated study at here.
For a homozygous variant, the probability of observing the expected variant-allele-count ($x$) with a read depth $n$ at a given contamination level $c \in [0,0.4]$ is :
$$ P(X=x,c) = \binom{n}{x}p^x(1 - p)^{n-x} $$
where $p = (1-c)$ in all homozygous variants
For a heterozygous variant, the probablity of observing the expected variant-allele-count ($x$) with a read depth $n$ at a given contamination level $c \in [0,0.4]$ will follow the above binomial distribution but $p$ can either be:
After evaluating these cases, we will pick the highest probability event when summing the log likelihoods for the given contamination level.
We also wrote the code in rust.
To run the code:
$ cargo run -- -i data/test.vcf -d debug_json
or:
$ cargo install --path .
$ target/release/diploid-contam-estimator
Douglas Wu <wckdouglas@gmail.com>
Estimating contamination level from a diploid VCF file
The program assume we are dealing with a diploid genome, and using the
deviation of allelic balance from the expected allelic frequence for homozygous
or heterozygous variant calls to compute a contamination value.
For homozygous variants, we deviation from allelic frequency of 1 is all introduced by
contaminaion.
For heterozygous variants, it is a little more complex, because it could be due to:
1. contamination that doesn't look like the HET ALT allele: we expect lower HET alt allele
frequency
2. contamination that doesn't look like the HOM ALT allele: we expect High HET alt allele
frequency
3. contamination that looks like the ALT allele: we expect higher alt allele frequency
4. contamination that looks like the REF allele: we expect lower alt allele frequency
5. contamination being called as ALT
USAGE:
diploid-contam-estimator [OPTIONS] --in-vcf <in_vcf>
OPTIONS:
-d, --debug-json <debug_json>
A json output file for storing all intermediate log prob
-h, --help
Print help information
-i, --in-vcf <in_vcf>
A diploid vcf file for estimating contamination
-m, --min-depth <depth_threshold>
Minimum depth for a variant to be considered (i.e. DP tag) [default: 0]
-o, --out-json <out_json>
A json output file for storing the maximum likelihood contam level for the vcf file
--snv-only
Only use SNV (ignore indel) for contamination estimations
-v, --debug-variant-json <debug_variant_json>
A json output file for storing all input variants used for calculation
-V, --version
Print version information