# `bamcalib` utility `bamcalib` is a small rust program to normalize calibrated chip-seq data using a normalization procedure build upon the original method of [Hu et al.](https://doi.org/10.1093/nar/gkv670). ## Installation You need `cargo` installed - `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh` - `sudo apt install cargo` - `sudo pacman -S rust` - `brew install rust` Then run the following command: ```sh cargo install bamcalib --version 0.1.5 ``` You can also use the `lbmc/bamcalib:0.1.5` Docker image ```sh docker run -it lbmc/bamcalib:0.1.5 bamcalib ``` ## Usage ### Example ```sh bamcalib \ -i IP_11_sorted.bam \ -w T_11_sorted.bam \ -bigwig-ip IP_11.bigwig \ -bigwig-WCE T_11.bigwig \ -t 8 ``` The `bamcalib` expect the following parameters: - `-i`, `--ip-bam ` sorted bam file for the IP data - `-w`, `--wce-bam ` sorted bam file for the WCE data - `-bigwig-ip`, `--bigwig-ip ` normalized bigwig file name for the output - `-bigwig-wce`, `--bigwig-wce ` normalized bigwig file name for the output ### Optional Arguments - `-c`, `--cal-prefix ` calibration prefix added to the chromosome name of the calibration genome (default: `calib_`) - `-h`, `--help` Print help information - `-s`, `--scaling ` multiply the OR by this value (must be the same across sample) [default: 1000] - `-n`, `--no_bits ` remove reads which bit code match this code. You can use [this site](https://broadinstitute.github.io/picard/explain-flags.html) to get an appropriate bit code. The default is `1540` for - segment unmapped - not passing quality controls - PCR or optical duplicate - `-t`, `--thread ` number of threads to parse bam files and write the bigwig file - `--no-fragment-estimation` compute coverage from read information instead of fragment estimation - `--intermediate-bigwig` output normIP and normWCE bigwig files (not just the ratio of the two) - `-V`, `--version` Print version information ## Method ### Normalization [Hu et al.](https://doi.org/10.1093/nar/gkv670) proposed the following normalization procedure to get the normalized coverage in the IP sample ($\text{norm}IP\left(t\right)$) from: - $IP_{x}\left(t\right)$ the coverage at position $t$ in the IP sample on the reference genome - $IP_{c}\left(t\right)$ the coverage at position $t$ in the IP sample on the calibration genome - $WCE_{x}\left(t\right)$ the coverage at position $t$ in the WCE sample on the reference genome - $WCE_{c}\left(t\right)$ the coverage at position $t$ in the WCE sample on the calibration genome In a reference genome of size $`T_x`$ (ignoring the chromosome segmentation), and in a calibration genome of size $`T_c`$, they propose the compute the following $\text{OR}$ factor : ```math \text{OR} = \frac{10^6 \sum_{t^{'}=1}^{T_c}WCE_{c}\left(t^{'}\right)}{\sum_{t^{'}=1}^{T_c}IP_{c}\left(t^{'}\right)\sum_{t^{'}=1}^{T_x}WCE_{x}\left(t^{'}\right)} ``` Instead, we propose the following formula with $\beta$ (default to $10^3$) an arbitrary scaling factor. ```math \text{norm}IP\left(t\right) = IP_{x}\left(t\right) \frac{\beta \frac{1}{T_c}\sum_{t^{'}=1}^{T_c}WCE_{c}\left(t^{'}\right)}{\frac{1}{T_c}\sum_{t^{'}=1}^{T_c}IP_{c}\left(t^{'}\right)\frac{1}{T_x}\sum_{t^{'}=1}^{T_x}WCE_{x}\left(t^{'}\right)} ``` We want to remove the **technical effect** of the Chip-Seq experiment on the $IP_{x}\left(t\right)$ by scaling by the calibration genome coverage: ```math \frac{1}{\frac{1}{T_c}\sum_{t^{'}=1}^{T_c}IP_{c}\left(t^{'}\right)} ``` We want to remove the **differences in cell proportion effect** of the Chip-Seq experiment on the $IP_{x}\left(t\right)$ by scaling by the scaled WCE coverage: ```math \frac{\frac{1}{T_c}\sum_{t^{'}=1}^{T_c}WCE_{c}\left(t^{'}\right)}{\frac{1}{T_x}\sum_{t^{'}=1}^{T_x}WCE_{x}\left(t^{'}\right)} ``` To be able to better analyze the coverage information at repetitive regions of the genome, we propose to extend the previous normalization to normalize the signal nucleotide by nucleotide and work with the following **OR ratio**: ```math \text{ratio}IP\left(t\right) = \frac{\text{norm}IP\left(t\right)}{\text{norm}WCE\left(t\right)} ``` with: ```math \text{norm}WCE\left(t\right) = WCE_{x}\left(t\right) \frac{1}{\frac{1}{T_c}\sum_{t^{'}=1}^{T_c}WCE_{c}\left(t^{'}\right)} \alpha ``` We then find $`\alpha`$ such that: ```math E\left(\text{norm}IP\left(t\right)\right)= E\left(\frac{\text{norm}IP\left(t\right)}{\text{norm}WCE\left(t\right)}\right) ``` which gives ```math \text{norm}WCE\left(t\right) = WCE_{x}\left(t\right) \frac{1}{\sum_{t^{'}=1}^{T_x}IP_{x}\left(t^{'}\right)} \sum_{t^{'}=1}^{T_x}\frac{IP_{x}\left(t^{'}\right)}{WCE_{x}\left(t^{'}\right)} ``` Not that all the computation is scaled by the genome size and not at the reads number as in [Hu et al.](https://doi.org/10.1093/nar/gkv670), this is also why we add a scaling factor $\beta$ (default to $10^3$). With this method, we retain the interesting properties of [Hu et al.](https://doi.org/10.1093/nar/gkv670) normalization on the average read density between samples (i.e., we can compare two different samples in a quantitative way) and we account for the local bias of read density observed in the WCE samples (differential chromatin accessibility, repetition, low mappability region, etc.). ### Genome coverage density To compute the coverage density $X_y(t)$ with $X \in \left[IP, WCE\right]$ and $y \in \left[c, x\right]$ we count the number of read $r(t)$ overlapping with position $t$. For properly paired reads (with a mate read on the same chromosome and with a starting position ending after the end of the read) we also count a density of 1 between the end of the first reads and the start of his mate read $g(t)$. $X_y(t) = r(t) + g(t)$. Some fragment can be artificially long, therefore, we compute a robust mean $\mu$ of the gap size, between two reads of a pair, by removing the 0.1 upper and lower value of fragment length. Fragment that has a size higher than $\phi^{-1}(0.95, \mu, 1.0)$ are set to end at the $\phi^{-1}(0.95, \mu, 1.0)$ value, with $\phi()$ the Normal CDF function. Some fragment can be shorter than the read length in this case we don't count the overlapping reads region as a coverage of 2 fragments but as the coverage of 1 fragment.