# Envoy Stats System Envoy statistics track numeric metrics on an Envoy instance, optionally spanning binary program restarts. The metrics are tracked as: * Counters: strictly increasing 64-bit integers. * Gauges: 64-bit integers that can rise and fall. * Histograms: mapping ranges of values to frequency. The ranges are auto-adjusted as data accumulates. Unlike counters and gauges, histogram data is not retained across binary program restarts. * TextReadouts: Unicode strings. Unlike counters and gauges, text readout data is not retained across binary program restarts. In order to support restarting the Envoy binary program without losing counter and gauge values, they are passed from parent to child in an RPC protocol. They were previously held in shared memory, which imposed various restrictions. Unlike the shared memory implementation, the RPC passing *requires a mode-bit specified when constructing gauges indicating whether it should be accumulated across hot-restarts*. ## Performance and Thread Local Storage A key tenant of the Envoy architecture is high performance on machines with large numbers of cores. See https://blog.envoyproxy.io/envoy-threading-model-a8d44b922310 for details. This requires lock-free access to stats on the fast path -- when proxying requests. For stats, this is implemented in [ThreadLocalStore](https://github.com/envoyproxy/envoy/blob/master/source/common/stats/thread_local_store.h), supporting the following features: * Thread local per scope stat caching. * Overlapping scopes with proper reference counting (2 scopes with the same name will point to the same backing stats). * Scope deletion. * Lockless in the fast path. This implementation is complicated so here is a rough overview of the threading model. * The store can be used before threading is initialized. This is needed during server init. * Scopes can be created from any thread, though in practice they are only created from the main thread. * Scopes can be deleted from any thread, and they are in practice as scopes are likely to be shared across all worker threads. * Per thread caches are checked, and if empty, they are populated from the central cache. * Scopes are entirely owned by the caller. The store only keeps weak pointers. * When a scope is destroyed, a cache flush operation is posted on all threads to flush any cached data owned by the destroyed scope. * Scopes use a unique incrementing ID for the cache key. This ensures that if a new scope is created at the same address as a recently deleted scope, cache references will not accidentally reference the old scope which may be about to be cache flushed. * Since it's possible to have overlapping scopes, we de-dup stats when counters() or gauges() is called since these are very uncommon operations. * Overlapping scopes will not share the same backing store. This is to keep things simple, it could be done in the future if needed. ### Histogram threading model Each Histogram implementation will have 2 parts. * *main* thread parent which is called `ParentHistogram`. * *per-thread* collector which is called `ThreadLocalHistogram`. Worker threads will write to ParentHistogram which checks whether a TLS histogram is available. If there is one it will write to it, otherwise creates new one and writes to it. During the flush process the following sequence is followed. * The main thread starts the flush process by posting a message to every worker which tells the worker to swap its *active* histogram with its *backup* histogram. This is achieved via a call to the `beginMerge` method. * Each TLS histogram has 2 histograms it makes use of, swapping back and forth. It manages a current_active index via which it writes to the correct histogram. * When all workers have done, the main thread continues with the flush process where the *actual* merging happens. * As the active histograms are swapped in TLS histograms, on the main thread, we can be sure that no worker is writing into the *backup* histogram. * The main thread now goes through all histograms, collect them across each worker and accumulates in to *interval* histograms. * Finally the main *interval* histogram is merged to *cumulative* histogram. `ParentHistogram`s are held weakly a set in ThreadLocalStore. Like other stats, they keep an embedded reference count and are removed from the set and destroyed when the last strong reference disappears. Consequently, we must hold a lock for the set when decrementing histogram reference counts. A similar process occurs for other types of stats, but in those cases it is taken care of in `AllocatorImpl`. There are strong references to `ParentHistograms` in TlsCacheEntry::parent_histograms_. Thread-local `TlsHistogram`s are created on behalf of a `ParentHistogram` whenever accessed from a worker thread. They are strongly referenced in the `ParentHistogram` as well as in a cache in the `ThreadLocalStore`, to help maintain data continuity as scopes are re-created during operation. ## Stat naming infrastructure and memory consumption Stat names are replicated in several places in various forms. * Held with the stat values, in `CounterImpl`, `GaugeImpl` and `TextReadoutImpl`, which are defined in [allocator_impl.cc](https://github.com/envoyproxy/envoy/blob/master/source/common/stats/allocator_impl.cc) * In [MetricImpl](https://github.com/envoyproxy/envoy/blob/master/source/common/stats/metric_impl.h) in a transformed state, with tags extracted into vectors of name/value strings. * In static strings across the codebase where stats are referenced * In a [set of regexes](https://github.com/envoyproxy/envoy/blob/master/source/common/config/well_known_names.cc) used to perform tag extraction. There are stat maps in `ThreadLocalStore` for capturing all stats in a scope, and each per-thread caches. However, they don't duplicate the stat names. Instead, they reference the `StatName` held in the `CounterImpl` or `GaugeImpl`, and thus are relatively cheap; effectively those maps are all pointer-to-pointer. For this to be safe, cache lookups from locally scoped strings must use `.find` rather than `operator[]`, as the latter would insert a pointer to a temporary as the key. If the `.find` fails, the actual stat must be constructed first, and then inserted into the map using its key storage. This strategy saves duplication of the keys, but costs an extra map lookup on each miss. ### Naming Representation When stored as flat strings, stat names can dominate Envoy memory usage when there are a large number of clusters. Stat names typically combine a small number of keywords, cluster names, host names, and response codes, separated by `.`. For example `CLUSTER.upstream_cx_connect_attempts_exceeded`. There may be thousands of clusters, and roughly 100 stats per cluster. Thus, the number of combinations can be large. It is significantly more efficient to symbolize each `.`-delimited token and represent stats as arrays of symbols. The transformation between flattened string and symbolized form is CPU-intensive at scale. It requires parsing, encoding, and lookups in a shared map, which must be mutex-protected. To avoid adding latency and CPU overhead while serving requests, the tokens can be symbolized and saved in context classes, such as [Http::CodeStatsImpl](https://github.com/envoyproxy/envoy/blob/master/source/common/http/codes.h). Symbolization can occur on startup or when new hosts or clusters are configured dynamically. Users of stats that are allocated dynamically per cluster, host, etc, must explicitly store partial stat-names their class instances, which later can be composed dynamically at runtime in order to fully elaborate counters, gauges, etc, without taking symbol-table locks, via `SymbolTable::join()`. ### `StatNamePool` and `StatNameSet` These two helper classes evolved to make it easy to deploy the symbol table API across the codebase. `StatNamePool` provides pooled allocation for any number of `StatName` objects, and is intended to be held in a data structure alongside the `const StatName` member variables. Most names should be established during process initializion or in response to xDS updates. `StatNameSet` provides some associative lookups at runtime. The associations should be created before the set is used for requests, via `StatNameSet::rememberBuiltin`. This is useful in scenarios where stat-names are derived from data in a request, but there are limited set of known tokens, such as SSL ciphers or Redis commands. ### Dynamic stat tokens While stats are usually composed of tokens that are known at compile-time, there are scenarios where the names are newly discovered from data in requests. To avoid taking locks in this case, tokens can be formed dynamically using `StatNameDynamicStorage` or `StatNameDynamicPool`. In this case we lose substring sharing but we avoid taking locks. Dynamically generated tokens can be combined with symbolized tokens from `StatNameSet` or `StatNamePool` using `SymbolTable::join()`. Relative to using symbolized tokens, The cost of using dynamic tokens is: * the StatName must be allocated and populated from the string data every time `StatNameDynamicPool::add()` is called or `StatNameDynamicStorage` is constructed. * the resulting `StatName`s are as long as the string, rather than benefiting from a symbolized representation, which is typically 4 bytes or less per token. However, the cost of using dynamic tokens is on par with the cost of not using a StatName system at all, only adding one re-encoding. And it is hard to quantify the benefit of avoiding mutex contention when there are large numbers of threads. ### Symbol Table Memory Layout Below is a diagram [(source)](https://docs.google.com/drawings/d/1eG6CHSUFQ5zkk-j-kcFCUay2-D_ktF39Tbzql5ypUDc/edit) showing the memory layout for a few scenarios of constructing and joining symbolized `StatName` and dynamic `StatName`. ![Symbol Table Memory Diagram](symtab.png) ### Symbol Contention Risk There are several ways to create hot-path contention looking up stats by name, and there is no bulletproof way to prevent it from occurring. * The [stats macros](https://github.com/envoyproxy/envoy/blob/master/include/envoy/stats/stats_macros.h) may be used in a data structure which is constructed in response to requests. * An explicit symbol-table lookup, via `StatNamePool` or `StatNameSet` can be made in the hot path. It is difficult to search for those scenarios in the source code or prevent them with a format-check, but we can determine whether symbol-table lookups are occurring during via an admin endpoint that shows 20 recent lookups by name, at `ENVOY_HOST:ADMIN_PORT/stats?recentlookups`. ### Symbol Table Class Overview Class | Superclass | Description -----| ---------- | --------- SymbolTable | | Abstract class providing an interface for symbol tables SymbolTableImpl | SymbolTable | Implementation of SymbolTable API where StatName share symbols held in a table SymbolTableImpl::Encoding | | Helper class for incrementally encoding strings into symbols StatName | | Provides an API and a view into a StatName (dynamic or symbolized). Like absl::string_view, the backing store must be separately maintained. StatNameStorageBase | | Holds storage (an array of bytes) for a dynamic or symbolized StatName StatNameStorage | StatNameStorageBase | Holds storage for a symbolized StatName. Must be explicitly freed (not just destructed). StatNameManagedStorage | StatNameStorage | Like StatNameStorage, but is 8 bytes larger, and can be destructed without free(). StatNameDynamicStorage | StatNameStorageBase | Holds StatName storage for a dynamic (not symbolized) StatName. StatNamePool | | Holds backing store for any number of symbolized StatNames. StatNameDynamicPool | | Holds backing store for any number of dynamic StatNames. StatNameList | | Provides packed backing store for an ordered collection of StatNames, that are only accessed sequentially. Used for MetricImpl. StatNameStorageSet | | Implements a set of StatName with lookup via StatName. Used for rejected stats. StatNameSet | | Implements a set of StatName with lookup via string_view. Used to remember well-known names during startup, e.g. Redis commands. ### Hot Restart Continuity of stat counters and gauges over hot-restart is supported. This occurs via a sequence of RPCs from parent to child, issued while child is in lame-duck. These RPCs contain a map of stat-name strings to values. One implementation complexity is that when decoding these names in the child, we must know which segments of the stat names were encoded dynamically. This is implemented by sending an auxiliary map of stat-name strings to lists of spans, where the spans identify dynamic segments. Dynamic segments are rare, used only by Dynamo, Mongo, IP Tagging Filter, Fault Filter, and `x-envoy-upstream-alt-stat-name` as of this writing. So in most cases this dynamic-segment map is empty. ## Tags and Tag Extraction TBD ## Disabling statistics by substring or regex TBD ## Stats Memory Tests Regardless of the underlying data structures used to implement statistics, memory usage will grow with the number of hosts and clusters. When a PR is issued that adds new per-host or per-cluster stats, this will have a multiplicative effect on consumed memory. This can become significant for deployments with O(10k) clusters or hosts. To improve visibility for this memory growth, there are [memory-usage integration tests](https://github.com/envoyproxy/envoy/blob/master/test/integration/stats_integration_test.cc). If a PR fails the tests in that file due to unexpected memory consumption, it gives the author and reviewer an opportunity to consider the cost/value of the new stats. If the test fails because the new byte-count is lower, then all that's needed is to lock in the improvement by updating the expected values. If the new per-cluster or per-host memory consumption is higher, then we must decide whether the value from the added stats justify the overhead for all Envoy deployments. In either case, we must update the golden values and add a comment to the table in the test indicating the memory impact of each PR. Developers trying to can iterate through changes in these tests locally with: ```bash bazel test -c opt --test_env=ENVOY_MEMORY_TEST_EXACT=true \ test/integration:stats_integration_test ```