# The partial sieve function 

The partial sieve function phi(x, a) counts the numbers ≤ x that are not divisible by any of the
first a primes. This function is sometimes also named "Legendre's sum" after the French mathematician
[Adrien-Marie Legendre](https://en.wikipedia.org/wiki/Adrien-Marie_Legendre) who first studied it in
the 19th century [[1]](#References). The partial sieve function is at the heart of all combinatorial
prime counting algorithms. In fact Legendre's prime counting function
pi(x) = pi(√x) + phi(x, pi(√x)) - 1 can be computed using solely the partial sieve function, whereas
the other more advanced combinatorial prime counting algorithms require evaluating the partial sieve
function as well as other functions.

In the advanced combinatorial prime counting algorithms such as the Lagarias-Miller-Odlyzko algorithm
[[3]](#References) and the Deléglise-Rivat algorithm [[4]](#References) the partial sieve function is
mainly used as an auxiliary function during initialization. Therefore the overall performance of these
algorithms does not meaningfully depend upon the execution speed of the partial sieve function.

However, with the advent of multi-core CPUs, the partial sieve function has found another important
use case within the combinatorial prime counting algorithms. For software programs to scale well on
multi-core CPUs it is crucial that the individual worker threads are independent from each other, so
that the threads have exclusive access to the CPU's resources and in order to prevent that a thread
has to wait idle for data from another thread. In 2002 Xavier Gourdon [[5]](#References)
devised a modification to the hard special leaves algorithm so that the computation can be slit up into
independent chunks. This modification relies on the partial sieve function for generating a lookup
table of phi(x, i) results for i ∈ [0, a]. The
[Generate phi(x, i) lookup table](#generate-phix-i-lookup-table) paragraph contains more information.

Hence now the partial sieve function's performance has become critical for parallel implementations
of the combinatorial prime counting algorithms. This document describes the many
[known optimizations](#optimizations) that can be used to speed up the phi(x, a) computation and
it describes a [new optimization](#new-optimization) that I have devised and that has first been
implemented in primecount. It's a compressed cache of phi(x, a) results that speeds up most computations
by more than 10x.

# phi(x, a) in primecount

In primecount the partial sieve function is implemented in the file
[phi.cpp](https://github.com/kimwalisch/primecount/blob/master/src/phi.cpp) (and in
[PhiTiny.hpp](https://github.com/kimwalisch/primecount/blob/master/include/PhiTiny.hpp) &
[PhiTiny.cpp](https://github.com/kimwalisch/primecount/blob/master/src/PhiTiny.cpp)).
The partial sieve function phi(x, a) is also part of
[primecount's C/C++ API](https://github.com/kimwalisch/primecount/blob/master/doc/libprimecount.md#c-api-reference)
and it is available in the [primecount command-line application](https://github.com/kimwalisch/primecount#installation)
via the ```--phi``` option, e.g. phi(1000, 10) can be computed using: ```primecount 1000 10 --phi```

# Recursive formula

### phi(x, a) = phi(x, a - 1) - phi(x / prime[a], a - 1)

This is the main formula for the computation of the partial sieve function. As mentioned in the
introduction this formula was first described by Legendre in his book "Théorie des nombres"
[[1]](#References). When implemented in a computer program the above recursive
phi(x, a) formula with a = pi(√x) allows computing Legendre's prime counting function
pi(x) = pi(√x) + phi(x, pi(√x)) - 1 in O(x / log(x)) operations and using O(√x / log(x)) space.
Tomás Oliveira e Silva's paper [[6]](#References) contains a simple C implementation of this formula:

```C
static void phi(int x, int a, int sign)
{
loop:
    if(a == 0)
        sum += sign * x;
    else if(x < p[a])
        sum += sign;
    else
    {
        --a;
        phi(x / p[a], a, -sign);
        goto loop;
    }
}
```

# Optimizations

There are a large number of known optimizations, that can be used in conjunction with the recursive
phi(x, a) formula and which, when combined, can speed up the computation by many orders for magnitude.
I will now briefly describe all known optimizations, and then further down I will present a
[new optimization](#new-optimization) that I have devised and that has first been implemented in
primecount.

### phi(x, a) = (x / pp) * φ(pp) + phi(x % pp, a)

This formula allows computing phi(x, a) in O(1) for small values of a e.g. for a ≤ 7.
[φ(n)](https://en.wikipedia.org/wiki/Euler%27s_totient_function) is Euler's totient function and pp
denotes the product of the first a primes: pp = 2 * 3 * ... * prime[a]. The use of this formula
requires initializing a lookup table of phi(i, a) results for i ∈ [0, pp[, hence the lookup table has
a size of pp. The German astronomer [Ernst Meissel](https://de.wikipedia.org/wiki/Ernst_Meissel) was
the first who used this formula for the computation of the number of primes below 1 billion at the end
of the 19th century. This formula is also present in Lehmer's paper from 1959 [[2]](#References)
and is described in more detail in most of the other combinatorial prime counting papers. In
primecount this formula is implemented in
[PhiTiny.hpp](https://github.com/kimwalisch/primecount/blob/master/include/PhiTiny.hpp) and
the initialization of the lookup table is implemented in
[PhiTiny.cpp](https://github.com/kimwalisch/primecount/blob/master/src/PhiTiny.cpp).

### if (x%pp ≤ pp/2) phi(x, a) = x/pp * φ(pp) + phi(x%pp, a)<br/>if (x%pp > pp/2) phi(x, a) = x/pp * φ(pp) + φ(pp) - phi(pp - 1 - x%pp, a)

In the formulas above pp corresponds to the product of the first a primes: pp = 2 * 3 * ... * prime[a]
and [φ(n)](https://en.wikipedia.org/wiki/Euler%27s_totient_function) is Euler's totient function.
When it is not possible to compute phi(x, a) in O(1) using the formula from the first paragraph, these
formulas can be used to avoid computing phi(x, a) where x may be large, and instead compute
phi(x%pp, a) or phi(pp - 1 - x%pp, a) where x%pp and pp - 1 - x%pp may be orders of magnitude smaller
than x. I have tested these formulas in primecount, however they did not provide a general speedup.
The main issue with these formulas is that they are only useful for relatively large values of x and
they are limited to small values of a because they involve the product of the first a primes which
grows rather quickly. In computer programs that use 64-bit integers these formulas can be used for
a ≤ 16. These formulas are partially described in R.P. Leopold's paper [[7]](#References).

### Stop recursion at c instead of 1

Using the formula phi(x, a) = (x / pp) * φ(pp) + phi(x % pp, a) it is possible to compute phi(x, c)
in O(1) for small values of c e.g. c ≤ 7. Using this formula we can stop recursion at c instead of 1 in
the main [recursive formula](#phix-a--phix-a---1---phix--primea-a---1) and simply increase the sum
by phi(x, c).

### Calculate all phi(x / prime[i], i - 1) = 1 upfront in O(1)

Once phi(x / prime[i], i - 1) = 1 occurs in the main
[recursive formula](#phix-a--phix-a---1---phix--primea-a---1) all subsequent phi(x / prime[j], j - 1)
computations with j ∈ ]i, a] will also be 1. Generally phi(x / prime[i], i - 1) = 1 if
(x / prime[i] ≤ prime[i-1]). Hence instead of computing phi(x / prime[j], j - 1) individually for all
j ∈ ]i, a] we can simply increase the sum by a - i.

### if (a ≥ pi(√x)) phi(x, a) = pi(x) - a + 1

This formula also allows computing phi(x, a) in O(1) provided that a is relatively large and x is
relatively small. If (a ≥ pi(√x)) then phi(x, a) counts the number of primes ≤ x, minus the first
a primes, plus the number 1. The use of this formula requires using a pi(x) lookup table of size x.
In order to reduce the memory usage it is best to use a compressed pi(x) lookup table such as
primecount's [PiTable.hpp](https://github.com/kimwalisch/primecount/blob/master/include/PiTable.hpp).
The use of a lookup table makes this formula unsuitable for computing phi(x, a) for large values
of x due to its excessive memory requirement. However, for large values of x we can compute the
pi(x) part of this formula using a prime counting function implementation in O(x^(2/3)) or less
instead of a lookup table which uses much less memory.

### if (a ≥ pi(3√x) && a < pi(√x)) phi(x, a) = pi(x) + P2(x, a) - a + 1

In the formula above 3√x denotes the 3rd root of x and P2(x, a) is the 2nd partial sieve function
which counts the numbers ≤ x that have exactly 2 prime factors each exceeding the a-th prime.
If (a ≥ pi(4√x) && a < pi(3√x)) then one needs to add the P3(x, a) term i.e.
phi(x, a) = pi(x) + P2(x, a) + P3(x, a) - a + 1. The formulas from this paragraph are not yet
being used in primecount, even though I expect that their use could significantly speed up some
phi(x, a) computations.

# New optimization

Due to the recursive nature of the [main phi(x, a) formula](#phix-a--phix-a---1---phix--primea-a---1)
the same values of phi(i, j) are calculated over and over again, this is especially true for small to
medium values of i and j. The formula phi(x, a) = (x / pp) * φ(pp) + phi(x % pp, a) can be used to
avoid recursion, however it is limited to small values of a ≤ c with c being a small constant e.g.
c = 7. The formula phi(x, a) = pi(x) - a + 1 can also be used to compute phi(x, a) in O(1), however
it is limited to large values of a ≥ pi(√x). Hence there is currently no known optimization for
computing phi(x, a) for medium values of a ∈ ]c, pi(√x)[.

The new optimization that I have devised is a **phi(i, j) cache** for small to medium
values of i and j e.g. i ≤ √x and j ≤ 100. The more phi(i, j) results are cached, the fewer recursive
calls occur in the  [main phi(x, a) formula](#phix-a--phix-a---1---phix--primea-a---1) and the faster
it runs. However, on the other hand we are memory constrained, we cannot cache everything and
ideally our phi(i, j) cache should fit into the CPU's fast cache memory. Hence the main goal for our
cache is to store as many phi(i, j) results as possible using as little memory as possible.

The densest data structure for storing the count of numbers ≤ n that are not divisible by any of the
first a primes that I am aware of is a bit array. If a bit is set, this means the corresponding
number is not divisible by any of the first a primes. The 8 bits of each byte correspond to the offsets
[ 1, 7, 11, 13, 17, 19, 23, 29 ], hence each byte represents an interval of size 30. However to determine
the count of numbers ≤ n that are not divisible by any of the first a primes, we need to iterate over
the bit array and count all set bits that correspond to numbers ≤ n. This means the access time of
our cache would be O(n) which is not great. Therefore we introduce a second array which contains the
count of set bits in the first array below the current index. Using these two arrays we can now count
the numbers ≤ n that are not divisible by any of the first a primes in O(1) operations. Note that the
two arrays may be interleaved which makes our cache slightly more memory efficient. Below is the
corresponding code from primecount:

```C++
int64_t phi_cache(uint64_t x, uint64_t a)
{
  // We devide by 240 instead of 30 here because our arrays
  // use the uint64_t type instead of std::byte.
  uint64_t count = sieve_[a][x / 240].count;
  uint64_t bits = sieve_[a][x / 240].bits;
  uint64_t bitmask = unset_larger_[x % 240];
  return count + popcnt64(bits & bitmask);
}
```

Before being able to use the phi(x, a) cache it needs to be initialized. The cache can be initialized
using a modified version of the [sieve of Eratosthenes](https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes).
In primecount the cache is lazily initialized during the execution of the
[main phi(x, a) formula](#phix-a--phix-a---1---phix--primea-a---1). Whenever a new phi(x, i) computation
is started we first check whether that result is not yet present in the cache and if x & i meet the caching
criteria. If these conditions apply the cache will be filled up to x & i. In the first part of this algorithm
we unset the bits that correspond to numbers that are divisible by the i-th prime. When sieving has
finished, we proceed to the second part of the algorithm where we count all set bits (below each index)
in the first array and store that count in the second array. Below is the corresponding code from primecount:

```C++
/// Cache phi(x, i) results with: x <= max_x && i <= min(a, max_a).
/// Eratosthenes-like sieving algorithm that removes the first a primes
/// and their multiples from the sieve array. Additionally this
/// algorithm counts the numbers that are not divisible by any of the
/// first a primes after sieving has completed. After sieving and
/// counting has finished phi(x, a) results can be retrieved from the
/// cache in O(1) using the phi_cache(x, a) method.
///
void init_cache(uint64_t x, uint64_t a)
{
  // Each bit in the sieve array corresponds to an integer that
  // is not divisible by 2, 3 and 5. The 8 bits of each byte
  // correspond to the offsets { 1, 7, 11, 13, 17, 19, 23, 29 }.
  sieve_[3].resize(max_x_size_);

  for (uint64_t i = 4; i <= a; i++)
  {
    // Initalize phi(x, i) with phi(x, i - 1)
    sieve_[i] = sieve_[i - 1];

    // Remove prime[i] and its multiples
    for (uint64_t n = primes_[i]; n <= max_x_; n += primes_[i] * 2)
      sieve_[i][n / 240].bits &= unset_bit_[n % 240];

    // Fill an array with the cumulative 1 bit counts.
    // sieve[i][j] contains the count of numbers < j * 240 that
    // are not divisible by any of the first i primes.
    uint64_t count = 0;
    for (auto& sieve : sieve_[i])
    {
      sieve.count = count;
      count += popcnt64(sieve.bits);
    }
  }
}
```

According to my own benchmarks the cache, as implemented above, speeds up primecount's phi(x, a)
implementation by more than 10x. Based on my empirical tests caching phi(x, a) results for
a ≤ 100 provides the best performance, as mentioned earlier smaller values of x & a are
accessed much more frequently than larger values. I also limit the size of the cache to about 16
megabytes in primecount which is slightly larger than my CPU's L3 cache size. Using an even
larger cache size deteriorates performance especially when using multi-threading.

# Generate phi(x, i) lookup table

In 2002 Xavier Gourdon [[5]](#References) devised a modification to the hard special leaves
algorithm so that the computation can be slit up into independent chunks. This modification is
particularly useful for parallelizing the hard special leaves algorithm, since it avoids the
need for frequent thread synchronization. This modification relies on the partial sieve
function for generating a lookup table of phi(x, i) results for i ∈ [0, a]. The idea of the
algorithm is described very shortly in Gourdon's paper [[5]](#References) and it is also
described in some more detail in Douglas Staple's paper [[8]](#References), however no pseudocode
is provided in both papers.

Computing phi(x, i) individually for all i ∈ [0, a] would be far too slow. However, by taking
advantage of the recursive nature of the main formula phi(x, a) = phi(x, a - 1) - phi(x / prime[a], a - 1),
we can actually generate a lookup table of phi(x, i) results for i ∈ [0, a] in the same
amount of time it takes to compute phi(x, a)! We first compute phi(x, 0), next we compute
phi(x, 1) and reuse the phi(x, 0) result we have computed previously. Then we compute
phi(x, 2) and reuse our previous phi(x, 1) result and so forth. The code below shows how
this algorithm can be implemented using a simple for loop. The ```phi_recursive(x / primes[i], i - 1)```
part needs to be computed using the recursive phi(x, a) formula in conjunction with the
optimizations described in this document.

```C++
int* phi = new int[a + 1];
phi[0] = x;

// Fill an array with phi(x, i) results
for (int i = 1; i <= a; i++)
  phi[i] = phi[i-1] - phi_recursive(x / primes[i], i - 1);
```

# References

1. A. M. Legendre, Théorie des nombres, Third edition, Paris, 1830. Vol. 2, p. 65.
2. D. H. Lehmer, On the exact number of primes less than a given limit, Illinois J. Math. 3 (1959), pp. 381–388.
3. J. C. Lagarias, V. S. Miller, and A. M. Odlyzko, Computing pi(x): The Meissel-Lehmer method, Mathematics of Computation, 44 (1985), pp. 537–560.
4. M. Deleglise and J. Rivat, "Computing pi(x): The Meissel, Lehmer, Lagarias, Miller, Odlyzko Method", Mathematics of Computation, Volume 65, Number 213, 1996, pp 235–245.
5. Xavier Gourdon, Computation of pi(x) : improvements to the Meissel, Lehmer, Lagarias, Miller, Odllyzko, Deléglise and Rivat method, February 15, 2001.
6. Tomás Oliveira e Silva, Computing pi(x): the combinatorial method, Revista do DETUA, vol. 4, no. 6, March 2006, pp. 759-768.
7. R.P. Leopold, On Methods for Computing pi(x), 2007.
8. Douglas B. Staple, The combinatorial algorithm for computing pi(x), Master of Science Thesis, Dalhousie University Halifax, Nova Scotia, August 2015.