Overview, Common Concepts and Options
=====================================

When a system is under resource contention, various operating system components and applications interact in complex ways. The interactions can't easily be captured in synthetic per-component benchmarks, making it difficult to evaluate how the hardware and operating system would perform under such conditions. resctl-bench solves this problem by exercising the whole system with realistic workloads and analyzing system and workload behaviors.

Many of the benchmarks implemented in resctl-bench have detailed explanations in `resctl-demo`. Give it a try: https://github.com/facebookexperimental/resctl-demo

Common Concepts
===============

rd-hashd
--------

`rd-hashd` is a simulated latency-sensitive request-servicing workload with a realistic system resource usage profile and contention responses. Its page cache and heap access patterns follow normal distributions and the load level is regulated by both the target RPS and the maximum response latency. The default parameters are tuned so that, resource-wise, the behavior is a rough approximation of a popular FB production workload. While one workload can't possibly capture the many ways that systems are used, `rd-hashd`'s behaviors and requirements fall where many human-interactive and machine-saturating workloads' would.

`rd-hashd` has its own sizing benchmark mode where it tries to figure out the parameters to saturate all of CPU, memory and IO. It finds the maximum RPS that the CPUs can churn out and then figures out the maximum page cache and heap footprints that the memory and IO device can service. `resctl-bench` often uses this benchmark mode, usually with the CPU part faked, to evaluate IO devices.

For more details: `rd-hashd --help`

Memory Offloading and Profile
-----------------------------

Not all memory areas are equally hot. If the IO device is performant enough, the tail end of the access distribution can be offloaded without violating latency requirements. Modern SSDs, even mainstream ones, can serve this role in the [memory hierarchy](https://en.wikipedia.org/wiki/Memory_hierarchy) quite effectively by offloading page cache to the filesystem and heap to swap.

This memory-offloading usage is critical not only because it makes much more efficient use of RAM but also because this is what happens when the system is contended for memory and IO. If the system can't effectively handle memory offloading, the application's quality-of-service will be severely impacted under resource pressure, which lowers service reliability and forces under-utilization of the systems.

`resctl-bench` uses the amount of `rd-hashd`'s memory footprint that can be offloaded to the underlying IO device as the primary IO performance metric. It's usually reported as MOF (Memory Offloading Factor), whose definition is:

```
SUPPORTABLE_MEMORY_FOOTPRINT / MEMORY_SIZE
```

For example, a MOF of 1.2 means that the IO device can offload 20% of the available memory without violating service requirements. Note that both bandwidth and latency contribute to MOF - bandwidth is meaningful only when quick enough to meet the latency requirements.

The IO usage of memory offloading is influenced by the amount of available memory. To ensure that bench results including MOFs are comparable across different setups, `resctl-bench` uses a memory balloon to constrain the amount of available memory to a common value. This is called `mem_profile` (memory profile), which is in gigabytes and always a power of two.
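As a worked example, with a 16G memory profile (the default, described next), a MOF of 1.2 breaks down as follows - the measured footprint here is hypothetical:

```
mem_profile                  = 16G
supportable memory footprint = 19.2G   (hypothetical measurement)
MOF                          = 19.2G / 16G = 1.2
                             -> 3.2G (20% of 16G) offloaded to the IO device
```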
The default `mem_profile` is 16 and can be changed with the `--mem-profile` option. The `mem_profile` of 16 tries to emulate a machine with 16G of memory. As not all memory would be available for the workload, the net amount available to the workload is called `mem_share`. The `rd-hashd` sizing benchmark is run with a bit less memory to account for the raised memory requirement of longer non-bench runs. This amount is called `mem_target`.

`resctl-bench` needs to know how much memory is actually available to implement `mem_profile` and automatically tries to estimate it on demand. If the amount of available memory is already known (e.g. from a previous invocation), `--mem-avail` can be used to skip this step. Some benchmarks (`storage` and its super benchmarks such as `iocost-qos` and `iocost-tune`) can detect an incorrect `mem_avail` and retry automatically. Those benchmarks may fail if the amount of available memory keeps fluctuating.

Running Benchmarks
==================

The Result File and Incremental Completion
------------------------------------------

A benchmark run may take a long time and it is often useful to string together a series of benchmarks - e.g. run `iocost-params` and `hashd-params` to determine the basic parameters and then `iocost-qos`. While `resctl-bench` strives for reliability, it is a set of whole-system benchmarks which keep pushing the system to its limits for extended periods of time. Something, even if not the benchmark itself, can fail once in a while. `resctl-bench` ensures forward progress by incrementally updating benchmark results as they complete.

The following command specifies the three-benchmark sequence described above. Note that `iocost-qos` will automatically schedule the two prerequisite benchmarks if the needed parameters are missing. Here, they're specified explicitly for demonstration purposes.

```
$ resctl-bench -r result.json run iocost-params hashd-params iocost-qos
```

Let's say the first two benchmarks completed without a hitch but the system crashed when it was halfway through the `iocost-qos` benchmark. If you re-run the same command after the system comes back, the following will happen:

* `resctl-bench` recognizes that `result.json` already contains the results from `iocost-params` and `hashd-params`, outputs the summaries and applies the result parameters without running the benchmarks again.

* Because the `iocost-qos` benchmark can easily take multiple hours, it implements incremental completion and keeps the result file updated as the benchmark progresses. `iocost-qos` will fast-forward to the last checkpoint saved in `result.json` and continue from there.

The incremental operation means that existing result files have significant effects on how `resctl-bench` behaves. If `resctl-bench` is behaving in an unexpected way or you want to restart a benchmark sequence with a clean slate, specify a different result file or delete the existing one.

The result file is in json. The `summary` and `format` subcommands format the content into human-readable outputs. On the completion of each benchmark, the result summary is printed out, which can be reproduced with the following:

```
$ resctl-bench -r result.json summary
```

For more detailed output:

```
$ resctl-bench -r result.json format
```

By default, all benchmark results in the result file are printed out. You can select the target benchmarks using the same syntax as the `run` subcommand.
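For example, assuming `summary` accepts the same selection syntax as `format`, the following would print only the `hashd-params` summary from the series above:

```
$ resctl-bench -r result.json summary hashd-params
```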
To only view the result of the `iocost-qos` benchmark:

```
$ resctl-bench -r result.json format iocost-qos
```

The `run`, `study`, `solve` and `format` Stages
-----------------------------------------------

A benchmark is executed in the following four stages, each of which can be triggered by the matching subcommand. When a stage is triggered, all the subsequent stages are triggered together.

#### `run`

The actual execution of the benchmark. The system is configured and the system requirements are verified and recorded. During and after the benchmark, information is collected and put into the `record` section of the result file. A `record` is supposed to contain the minimum amount of information needed to analyze the benchmark. e.g. it may just contain the relevant time ranges so that the following `study` stage can analyze the agent report files in `/var/lib/resctl-demo/report.d`.

#### `study`

This optional stage analyzes what happened during the `run` stage and produces the `result` from the `record` and the agent report files. It does not change system configurations or care about system requirements. The separation of the `run` and `study` stages is useful for debugging and development as it allows the bulk of data processing to be repeated without re-running the entire benchmark, which may take multiple hours.

`study` is often used with the `pack` subcommand, which creates a tarball containing the result file and the relevant report files:

```
$ resctl-bench -r result.json pack
```

The resulting tarball can be extracted on any machine and studied:

```
$ tar xvf output.tar.gz
$ resctl-bench -r result.json study
```

The above usage is recommended as the report files in their original location expire after some time. If you want to study the report files in place:

```
$ resctl-bench -r result.json \
      --reports /var/lib/resctl-demo/report.d study
```

#### `solve`

This optional stage post-processes the existing `record` and `result` and updates the latter. Note that this stage can only access what's inside the result file and isn't allowed to access the reports or any system information. For example, `iocost-tune` uses the `solve` stage to calculate the QoS solutions from the compiled experiment results so that users can calculate custom solutions using only the result file.

#### `format` / `summary`

This stage formats the benchmark result into a human-readable form. The output is usually plain text but some benchmarks support different output formats (e.g. pdf). The `summary` subcommand is a flavor of the `format` stage which generates an abbreviated output. This is what gets printed after each benchmark completion.

`run` and `format` Subcommand Properties
----------------------------------------

The `run` and `format` subcommands may take zero, one or multiple property groups. Here's a `run` example:

```
$ resctl-bench -r result.json run \
    iocost-qos:id=qos-0,storage-base-loops=1:min=100,max=100:min=75,max=75:min=50,max=50
```

We're running an `iocost-qos` benchmark and it has four property groups delineated with colons. The properties in the first group apply to the whole run.

#### `id=qos-0`

Specifies the identifier of the run. This is useful when there are multiple runs of the same benchmark type. Here, we're naming the benchmark `qos-0`. This is one of several properties which are available for all bench types.

#### `storage-base-loops=1`

This is an `iocost-qos` specific property configuring the repetition count of the `storage` sub-bench base runs. The default is 3 but we want a quick run and are setting it to 1. See the doc page of each bench type for details on the supported properties.
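Schematically, the example above breaks down as follows (spacing added here purely for illustration; the actual syntax uses no spaces around the colons):

```
iocost-qos : id=qos-0,storage-base-loops=1 : min=100,max=100 : min=75,max=75 : min=50,max=50
bench type   group 1 (applies to whole run)  group 2           group 3         group 4
```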
While there is no strict rule on how the extra property groups should be used, they usually specify a stage in multi-stage benchmarks. Here, we're telling `iocost-qos` to probe three different QoS settings - vrate at 100, 75 and lastly 50.

An empty property group can be specified with two consecutive colons:

```
$ resctl-bench -r result.json run iocost-qos:::min=75,max=75:min=50,max=50
```

The triple colons indicate that the first two property groups are empty and the command will run an `iocost-qos` benchmark with the default parameters to probe three QoS settings:

1. Default without any overrides
2. vrate at 75%
3. vrate at 50%

Similarly, the `format` subcommand may accept properties:

```
$ resctl-bench -r result.json format iocost-tune:pdf=output.pdf
```

The above command tells `iocost-tune` to generate an output pdf file instead of producing text output on stdout.

Common Command Options and Bench Properties
===========================================

Common Command Options
----------------------

Here are explanations of select common command options:

#### `--dir` and `--dev`

By default, `resctl-bench` uses `/var/lib/resctl-demo` for its operation and expects swap to be on the same IO device, which it auto-probes. `--dir` can be used to put the operation directory somewhere else and `--dev` overrides the underlying IO device detection.

#### `--mem-profile` and `--mem-avail`

For memory-size dependent benchmarks, `--mem-profile` can be used to select a custom memory profile other than the default of 16. The memory profiles must be identical for the results to be comparable. You can also turn off the memory profile and run the benchmarks at the full machine size.

`resctl-bench` needs to probe how much memory is available when setting up memory profiles, which can be time consuming. If the available memory size is already known from previous runs, `--mem-avail` can be used to bypass this step.

#### `--iocost-from-sys` and `--iocost-qos`

Unless overridden, `resctl-bench` uses the `iocost` parameters from `/var/lib/resctl-demo/bench.json`, which can be updated by `resctl-demo` or by running the `iocost-params` benchmark with the `commit` property. If you want to use the currently configured parameters instead, use `--iocost-from-sys`. Note that this won't update `/var/lib/resctl-demo/bench.json`.

You can also manually override the iocost QoS parameters with `--iocost-qos`. For example, `--iocost-qos min=75,max=75` will confine vrate to 75%.

#### `--swappiness`

`resctl-bench` configures the default swappiness of 60 while running benchmarks unless overridden by this option.

#### `--force`

When the system can't be configured correctly or some dependencies are missing, `resctl-bench` prints out error messages and exits. This option forces `resctl-bench` to continue.

Common Bench Properties
-----------------------

All common properties are for the first property group.

#### `id`

This gives the benchmark an optional identifier which helps with identification if there are multiple instances of the same bench type in the series. `resctl-bench` doesn't mind multiple instances of the same bench type without IDs:

```
$ resctl-bench -r result.json run \
    iocost-qos \
    iocost-qos::min=50,max=50:min=75,max=75
```

However, if IDs are specified, they must be unique for the bench type. In addition to helping differentiate bench instances, IDs are used to group source results when merging with `--by-id` specified.
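For instance, a hypothetical variant of the series above with explicit, unique IDs in the first property group of each instance:

```
$ resctl-bench -r result.json run \
    iocost-qos:id=qos-a \
    iocost-qos:id=qos-b:min=50,max=50:min=75,max=75
```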
#### `passive`

`resctl-bench` verifies and changes system configurations so that the benchmarks can measure the system behavior in a controlled and expected manner. The configurations that `resctl-bench` controls include but are not limited to the cgroup hierarchy and controllers, IO device elevator and wbt, sysctl knobs, and btrfs mount options. While the configuration enforcement helps running benchmarks reliably and conveniently, it gets in the way when trying to test custom configurations. The `passive` property can be used to tell `resctl-bench` to accept the system configurations as they are. The following values are accepted:

* `ALL`: `resctl-bench` won't change any system configurations.
* `all`: Only memory protection for `hostcritical.slice` is enforced.
* `cpu`: Don't touch CPU controller configurations.
* `mem`: Don't touch memory controller and other memory related configurations.
* `fs`: Don't touch filesystem related configurations.
* `io`: Don't touch IO controller and other IO related configurations.
* `oomd`: Don't touch an existing `oomd` or `earlyoom` instance and don't start one either.
* `none`: Clear the passive settings.

Multiple values can be specified by delineating them with `/`:

```
$ resctl-bench -r result.json run iocost-qos:passive=mem/io
```

#### `apply` and `commit`

These two boolean properties are available in benchmarks that produce either iocost or hashd parameters. `apply`, when true, makes the benchmark apply the result parameters to the subsequent benchmarks in the series. `commit`, when true, makes the benchmark update `/var/lib/resctl-demo/bench.json` with the result parameters. `commit` implies `apply`.

The properties can be specified without a value to indicate `true`. In other words, the following two commands are equivalent:

```
$ resctl-bench -r result.json run storage:apply
$ resctl-bench -r result.json run storage:apply=true
```

Note that the properties default to `true` for some benchmarks (`iocost-params` and `hashd-params`).

Reading Benchmark Results
=========================

Header
------

When formatted, each benchmark result starts with a header which looks like the following:

```
[iocost-tune result] 2021-05-08 11:06:38 - 04:16:25
System info: kernel="5.12.0-work+"
             nr_cpus=16 memory=32.0G swap=16.4G swappiness=60
             mem_profile=16 (avail=30.1G share=12.0G target=11.0G)
IO info: dev=nvme0n1(259:5) model="Samsung SSD 970 PRO 512GB" firmware="1B2QEXP7" size=477G
         iosched=none wbt=off iocost=on other=off
         iocost model: rbps=2992129542 rseqiops=337745 rrandiops=370705
                       wbps=2232405244 wseqiops=260917 wrandiops=256225
         iocost QoS: rpct=95.00 rlat=11649 wpct=95.00 wlat=12681 min=8.83 max=8.83
```

The first line shows the bench type, the ID if available, and the time duration of the run. The system info block shows the basic system configuration - kernel version, hardware configuration and the memory profile parameters. The IO info block shows information on the IO device and IO related kernel configurations - device model, IO scheduler, wbt status, IO controller status and iocost parameters.

Note that the iocost parameters are captured at the beginning of the benchmark. For benchmarks which produce their own parameters, the parameters in the header are not meaningful. Additionally, if the benchmark was `--force`'d to run, the unmet system requirements will be printed as well.
Nested IO Latency Distribution
------------------------------

For benchmarks which care about IO completion latencies, `resctl-bench` reports them in a table which looks like the following:

```
READ        min    p25    p50    p75    p90    p95    p99  p99.9    max    cum   mean  stdev
min        5.0u   5.0u   5.0u  35.0u  45.0u  55.0u  75.0u   155u   165u   5.0u  21.1u  20.4u
p01        5.0u  45.0u  75.0u  85.0u  95.0u  95.0u   115u   185u   205u   5.0u  66.1u  30.3u
p05        5.0u  85.0u  85.0u  95.0u   105u   115u   185u   595u   725u   5.0u  90.3u  45.8u
p10        5.0u  95.0u  95.0u   105u   125u   145u   315u   705u   975u   5.0u   106u  62.9u
p25        5.0u   115u   125u   145u   205u   245u   795u   955u   985u   105u   146u  95.3u
p50        5.0u   145u   195u   275u   425u   585u   1.5m   2.5m   2.5m   205u   256u   217u
p75        5.0u   255u   395u   615u   995u   1.5m   2.5m   5.5m   7.5m   575u   554u   569u
p90        5.0u   485u   815u   1.5m   2.5m   3.5m   5.5m  12.5m  14.5m   2.5m   1.1m   1.2m
p95        5.0u   715u   1.5m   2.5m   3.5m   4.5m   8.5m  19.5m  21.5m   4.5m   1.8m   1.8m
p99        5.0u   1.5m   2.5m   4.5m   7.5m   9.5m  19.5m  56.5m  72.5m  10.5m   3.7m   4.2m
p99.9     95.0u   2.5m   4.5m   6.5m   9.5m  13.5m  40.5m  71.5m  93.5m  28.5m   5.6m   6.9m
p99.99     105u   3.5m   5.5m   8.5m  11.5m  15.5m  43.5m  74.5m   250m  59.5m   6.9m   8.9m
p99.999    115u   4.5m   6.5m   9.5m  12.5m  17.5m  45.5m   150m   350m  84.5m   7.9m  10.5m
max        125u   5.5m   7.5m  10.5m  14.5m  18.5m  49.5m   250m   450m   250m   9.1m  13.1m
```

The `cum`ulative column shows the usual overall latency percentiles. For example, in the above table, `p99-cum` (the `p99` row, `cum` column) is 10.5m, indicating that the 99th percentile of read completion latencies for the whole benchmark was 10.5 milliseconds.

While this already gives some insight, it can't distinguish, for example, devices which stall out most requests in short bursts from the usual spread-out long-tail high-latency events, even though the former is a lot more disruptive. `resctl-bench` calculates the IO completion latency percentiles every second and then the distribution of those values over the whole run. In the above, `p50-p99` - the `p50` row, `p99` column - is 1.5m, indicating that in one out of 100 one-second periods, the median latency is as high as 1.5 milliseconds. Similarly, `pNN-mean` and `pNN-stdev` indicate the geometric average and standard deviation of the 1s NN'th percentile completion latencies over the duration of the benchmark.
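Conceptually, the table is built in two passes; a rough sketch of the construction (not actual tool output):

```
every 1s:   take the completion latencies observed in that second
            -> compute its percentiles (min, p01 .. p99.999, max) = one value per row
whole run:  for each row, take the distribution of its per-second values
            -> min, p25 .. p99.9, max, mean and stdev columns
overall:    percentiles over all completions in the entire run -> cum column
```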