# Gamma distributions

|                      |            |                           |
|----------------------|------------|---------------------------|
| Variate              |  $x$       |  $\mu < x < \infty$       |
| Lower bound          |  $\mu$     |  $-\infty < \mu < \infty$ |
| Rate (Inverse scale) |  $\lambda$ |  $\lambda > 0$            |
| Shape                |  $\tau$    |  $\tau > 0$               |


|             |                               |
|-------------|-------------------------------|
| Mean        |  $\frac{\tau}{\lambda} + \mu$ |
| Variance    |  $\frac{\tau}{\lambda^2}$     |

The distribution starts at $\mu$ (support is $x > \mu$).  We typically
know $\mu$. It is often 0; or in the case of fitting a gamma density
to a right tail, we know the threshold $\mu$ at which we truncated the
tail.

Probability density function (PDF):

$$
   P(x) =  \frac{\lambda^{\tau}}{\Gamma(\tau)}  (x-\mu)^{\tau-1}  e^{-\lambda (x - \mu)}
$$

The cumulative distribution function (CDF) does not have an analytical
expression. It is calculated numerically using the incomplete Gamma
function (`esl_stats_IncompleteGamma()`).

Some sources will say $x \geq \mu$, but $P(x=\mu) = 0$. Easel code
makes sure $x > \mu$ to avoid $\log P$ terms of $-\infty$.

### Maximum likelihood parameter estimation

#### Complete data; known $\mu$


Given a complete dataset of $n$ observed samples $x_i$ ($i=1..n$) and
known $\mu$, `esl_gam_FitComplete()` estimates maximum likelihood
parameters $\hat{\tau}$ and $\hat{\lambda}$ using a generalized Newton
optimization [Minka00,Minka02].

The optimization only needs two sufficient statistics, the means
$\bar{x} = \frac{1}{n} \sum_i (x_i - \mu)$ and 
$\overline{\log x} = \frac{1}{n} \sum_i \log (x_i - \mu)$,
which we precalculate.

This method chooses initial starting guesses:

$$
   \tau_0  =  \frac{0.5}{\log \bar{x} - \overline{\log x}}
$$
$$
   \lambda_0  =  \frac{\tau_0}{\bar{x}}
$$

and then iterates to convergence:

$$
   \frac{1}{\tau_{t+1}} = \frac{1}{\tau_t} + \frac{\log \tau - \log \bar{x} + \overline{\log x} - \Psi(\tau) } { \tau_t - \tau_t^2 \Psi'(\tau_t)}
$$
$$
   \lambda_{t+1}  =  \frac{\tau_{t+1}}{\bar{x}}
$$

The first and second derivatives of $\log \Gamma(x)$ are called
$\Psi(x)$ and $\Psi'(x)$, the "digamma" and "trigamma" functions.
These are obtained numerically by `esl_stats_Psi()` and
`esl_stats_Trigamma()`.

See the appendix at the end for the derivation of Minka's method.


## Appendix 1: Minka's generalized Newton method


Minka's generalized Newton method works as follows. We aim to maximize
the log likelihood of the data:

$$
  \log P(\mathbf{x} \mid \tau, \lambda) =  n \tau \log \lambda - n \log \Gamma(\tau) + (\tau-1) \sum_i \log(x_i - \mu) - \lambda \sum_i (x_i - \mu)
$$

It's equivalent to maximize the average log likelihood, which we can
write in terms of two sufficient statistics, the means $\bar{x} = \frac{1}{n} \sum_i (x_i - \mu)$ and 
$\overline{\log x} = \frac{1}{n} \sum_i \log (x_i - \mu)$:

$$
  \frac{1}{n} \log P(\mathbf{x} \mid \tau, \lambda) = \tau \log \lambda - \log \Gamma(\tau) +  (\tau-1) \overline{\log x} - \lambda \bar{x}
$$

Take derivative w.r.t. $\lambda$, set to zero, solve for $\hat{\lambda}$:

$$
  \frac{\partial}{\partial \lambda}  = \frac{\tau}{\lambda} - \bar{x} = 0 
$$  
$$
  \hat{\lambda} = \frac{\tau}{\bar{x}}
$$

This makes sense because the mean of a Gamma is $\frac{\tau}{\lambda} + \mu$.

Substitute $\hat{\lambda} = \frac{\tau}{\bar{x}}$ back into the
average log likelihood to get an objective function $f(\tau)$ in
terms of a single variable $\tau$:

$$
  f(\tau) = \frac{1}{n} \log P(\mathbf{x} \mid \tau, \hat{\lambda}) = \tau \log \tau - \tau \log \bar{x} - \log \Gamma(\tau) + (\tau-1) \overline{\log x} - \tau
$$  

which is differentiable:

$$
 f'(\tau) = \log \tau - \log \bar{x} - \Psi(\tau) + \overline{\log x} 
$$
$$
 f''(\tau) = \frac{1}{\tau} - \Psi'(\tau)
$$ 

Newton's method for finding the optimum of $f(x)$ works iteratively,
by approximating $f(x)$ locally around a current point $x_t$ by fitting a Gaussian distribution $g(x)$ 
to match $f$ and its first and second derivatives (i.e. $f(x_t) = g(x_t)$, $f'(x_t) = g'(x_t)$, $f''(x_t) = g''(x_t)$),
then analytically locating the optimum of $g(x)$ to propose the next point $x_{t+1}$. 

Minka generalizes Newton's method by observing that we may be able to
find a much better approximation $g(x)$ for our particular
$f(x)$. Here, he proposes the approximation:

$$
   g(\tau) =  c_0 + c_1 \tau + c_2 \log \tau \approx \frac{1}{n} \log P(\mathbf{x} \mid \tau, \hat{\lambda})
$$

Constructing $g(\tau)$ to locally approximate our objective function $f(\tau)$ at a point $\tau_t$ means:

$$ 
  g(\tau_t) = f(\tau_t) = c_0 + c_1 \tau_t + c_2 \log \tau_t
$$
$$
  g'(\tau_t) = f'(\tau_t) = c_1 + \frac{c_2}{\tau_t} 
$$
$$
 g''(\tau_t) = f''(\tau_t) = -\frac{c_2}{\tau_t^2}
$$ 

which we can solve for estimates of the three parameters of the local approximation $g(\tau)$:

$$
  c_2 = - \tau_t^2 f''(\tau_t)
$$
$$
  c_1 = f'(\tau_t) + \tau_t f''(\tau_t)
$$
$$
  c_0 = f(\tau_t) - \tau_t f'(\tau_t) + (\log(\tau_t) + 1) \tau_t^2 f''(\tau_t)
$$  

The optimum $\hat{\tau}$ of $g(\tau)$ is where $g'(\tau) = 0$ so we didn't even
need $c_0$, we only need $c_1$ and $c_2$:

$$
  g'(\tau) = c_1 + \frac{c_2}{\tau} = 0
$$

Solving for $\hat{\tau}$ gives $-\frac{c_2}{c_1}$, which is:

$$
  \hat{\tau} = \frac{\tau_t^2 f''(\tau_t)}{f'(\tau_t) + \tau_t f''(\tau_t)}
$$  

which is more compactly written as:

$$
  \frac{1}{\hat{\tau}} = \frac{1}{\tau_t} + \frac{f'(\tau_t)}{\tau_t^2 f''(\tau_t)}
$$  

Finally, substituting $f'$ and $f''$ gives us our iterative reestimation equation for our
next point $\tau_{t+1}$:

$$
 \boxed{
   \frac{1}{\tau_{t+1}} = \frac{1}{\tau_t} + \frac{\log \tau - \log \bar{x} + \overline{\log x} - \Psi(\tau) } { \tau_t - \tau_t^2 \Psi'(\tau_t)}
 }
$$

That's our iterative reestimation, but we still need to choose an initial starting point $\tau_0$. Minka observes
that $\Psi(\tau) \approx \log(\tau) - \frac{1}{2\tau}$ (from a Stirling approximation
$\log \Gamma(\tau) \approx \tau \log \tau - \tau - \frac{1}{2} \tau + \mathrm{const.}$), which
we can substitute into $f'(\tau)$, set to zero, and solve:

$$
\boxed{   \tau_0 = \frac{0.5}{\log \bar{x} - \overline{\log x}} \approx \hat{\tau} }
$$  


## References

[Minka00] TP Minka, ["Beyond Newton's
Method"](https://tminka.github.io/papers/minka-newton.pdf),
unpublished note (2000).

[Minka02] TP Minka, ["Estimating a Gamma
distribution"](https://tminka.github.io/papers/minka-gamma.pdf),
unpublished note (2002).