Crates.io | minmon |
lib.rs | minmon |
version | 0.9.1 |
source | src |
created_at | 2022-12-18 06:37:31.30027 |
updated_at | 2024-08-25 19:41:02.053775 |
description | An opinionated minimal monitoring and alarming tool |
repository | https://github.com/flo-at/minmon |
id | 740263 |
size | 287,612 |
This tool is just a single binary and a config file. No database, no GUI, no graphs. Just monitoring and alarms. I wrote this because the existing alternatives I could find were too heavy, mainly focused on nice GUIs with graphs (not on alarming), too complex to set up, or targeted at cloud/multi-instance setups.
The checks read the measurement values that will be monitored by MinMon.
An action is triggered when a check's alarm changes its state or a report event is triggered.
The absence of alarms can mean two things: everything is okay or the monitoring/alarming failed altogether. That's why MinMon can trigger regular report events to let you know that it's up and running.
Alarm timings are given in cycles (i.e. multiples of the `interval` of the check) instead of seconds. It's not very user-friendly but helps to keep the internal processing and the code simple and efficient.
The config file uses the TOML format and has the following sections:
graph TD
A(Config file) --> B(Main loop)
B -->|interval| C(Check 1)
B -.-> D(Check 2..n)
C -->|data| E(Alarm 1)
C -.-> F(Alarm 2..m)
E -->|cycles, repeat_cycles| G(Action)
E -->|recover_cycles| H(Recover action)
E -->|error_repeat_cycles| I(Error action)
E --> J(Error recover action)
style C fill:green;
style D fill:green;
style E fill:red;
style F fill:red;
style G fill:blue;
style H fill:blue;
style I fill:blue;
style J fill:blue;
Each alarm has 3 possible states: "Good", "Bad" and "Error".
It takes `cycles` consecutive bad data points to trigger the transition from "Good" to "Bad" and `recover_cycles` good ones to go back. These transitions trigger the `action` and `recover_action` actions.
During the "Bad" state, `action` will be triggered again every `repeat_cycles` cycles (if `repeat_cycles` is not 0).
The "Error" state is a bit special as it only "shadows" the other states.
An error means that there is no data available at all, e.g. the filesystem usage for `/home` could not be determined.
Since this should rarely ever happen, the transition to the error state always triggers the `error_action` on the first cycle. If there is valid data on the next cycle, the state machine continues as if the error state did not exist and the `error_recover_action` is triggered.
stateDiagram-v2
direction LR
[*] --> Good
Good --> Good
Good --> Bad: action/cycles
Good --> Error: error_action
Bad --> Good: recover_action/recover_cycles
Bad --> Bad: action/repeat_cycles
Bad --> Error: error_action
Error --> Good: error_recover_action
Error --> Bad: error_recover_action
Error --> Error: error_action/error_repeat_cycles
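
The transitions described above can be sketched as a small state machine. This is only an illustration and not MinMon's actual implementation: the type and method names (`Alarm`, `Event`, `update`) are made up, and a few details are assumptions, namely that every cycle spent in the "Bad" state counts towards `repeat_cycles`, that the counters are paused while the error state shadows the alarm, and that an `error_repeat_cycles` of 0 disables the repetition of the error action.

```rust
/// Observable alarm state; `Error` only shadows the underlying Good/Bad state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Good,
    Bad,
    Error,
}

/// Actions a single cycle can trigger (named after the config fields).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    Action,
    RecoverAction,
    ErrorAction,
    ErrorRecoverAction,
}

/// Hypothetical alarm state machine for illustration only.
pub struct Alarm {
    cycles: u32,
    repeat_cycles: u32,
    recover_cycles: u32,
    error_repeat_cycles: u32,
    bad: bool,         // underlying Good/Bad state
    error: bool,       // true while the error state shadows it
    streak: u32,       // consecutive data points contradicting the current state
    since_action: u32, // cycles spent in "Bad" since the transition
    since_error: u32,  // cycles spent in "Error" since the transition
}

impl Alarm {
    pub fn new(cycles: u32, repeat_cycles: u32, recover_cycles: u32, error_repeat_cycles: u32) -> Self {
        Alarm {
            cycles,
            repeat_cycles,
            recover_cycles,
            error_repeat_cycles,
            bad: false,
            error: false,
            streak: 0,
            since_action: 0,
            since_error: 0,
        }
    }

    pub fn state(&self) -> State {
        if self.error {
            State::Error
        } else if self.bad {
            State::Bad
        } else {
            State::Good
        }
    }

    /// Feeds one cycle: `Some(true)` is a bad data point, `Some(false)` a good
    /// one, and `None` means that no data was available (error).
    pub fn update(&mut self, data: Option<bool>) -> Vec<Event> {
        let mut events = Vec::new();
        let is_bad = match data {
            Some(is_bad) => is_bad,
            None => {
                if !self.error {
                    // The first cycle without data always triggers the error action.
                    self.error = true;
                    self.since_error = 0;
                    events.push(Event::ErrorAction);
                } else {
                    self.since_error += 1;
                    if self.error_repeat_cycles != 0 && self.since_error % self.error_repeat_cycles == 0 {
                        events.push(Event::ErrorAction);
                    }
                }
                return events;
            }
        };
        if self.error {
            // Valid data again: continue as if the error state did not exist.
            self.error = false;
            events.push(Event::ErrorRecoverAction);
        }
        if is_bad != self.bad {
            self.streak += 1;
        } else {
            self.streak = 0;
        }
        let threshold = if self.bad { self.recover_cycles } else { self.cycles };
        if is_bad != self.bad && self.streak >= threshold {
            self.bad = is_bad;
            self.streak = 0;
            self.since_action = 0;
            events.push(if is_bad { Event::Action } else { Event::RecoverAction });
        } else if self.bad {
            self.since_action += 1;
            if self.repeat_cycles != 0 && self.since_action % self.repeat_cycles == 0 {
                events.push(Event::Action);
            }
        }
        events
    }
}
```

Feeding three bad data points into `Alarm::new(3, 0, 2, 0)` yields the `Action` event on the third call. A following `None` switches the visible state to `Error`, and the next valid data point emits `ErrorRecoverAction` while the alarm is still "Bad" underneath.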
Check the mountpoint at `/home` every minute. If the usage level exceeds 70% for 3 consecutive cycles (i.e. 3 minutes), the "Warning" alarm triggers the "Webhook 1" action. The action repeats every 100 cycles until the "Warning" alarm recovers. This happens after 5 consecutive cycles below 70%, which also triggers the "Webhook 1" action. If there is an error while checking the filesystem usage, the "Log error" action is triggered. This is repeated every 200 cycles.
[[checks]]
interval = 60
name = "Filesystem usage"
type = "FilesystemUsage"
mountpoints = ["/home"]
[[checks.alarms]]
name = "Warning"
level = 70
cycles = 3
repeat_cycles = 100
action = "Webhook 1"
recover_cycles = 5
recover_action = "Webhook 1"
error_repeat_cycles = 200
error_action = "Log error"
[[actions]]
name = "Webhook 1"
type = "Webhook"
url = "https://example.com/hook1"
body = """{"text": "{{check_name}}: Alarm '{{alarm_name}}' for mountpoint '{{check_id}}' changed state to *{{alarm_state}}* at {{level}}."}"""
headers = {"Content-Type" = "application/json"}
[[actions]]
name = "Log error"
type = "Log"
level = "Error"
template = """{{check_name}} check didn't have valid data for alarm '{{alarm_name}}' and id '{{alarm_id}}': {{check_error}}."""
# The commented-out lines below demonstrate how to add another check and alarm.
# [[checks]]
# name = "System pressure"
# type = "PressureAverage"
# cpu = true
# avg60 = true
#
# [[checks.alarms]]
# name = "Warning"
# level = 80
# action = "Another action"
The webhook text will be rendered into something like "Filesystem usage: Alarm 'Warning' for mountpoint '/home' changed state to *Bad* at 75."
graph TD
A(example.toml) --> B(Main loop)
B -->|every 60 seconds| C(FilesystemUsage 1: '/home')
C -->|level '/home': 60%| D(LevelAlarm 1: 70%)
D -->|cycles: 3, repeat_cycles: 100| E(Action: Webhook 1)
D -->|recover_cycles: 5| F(Recover action: Webhook 1)
D -->|error_repeat_cycles: 200| G(Error action: Log error)
style C fill:green;
style D fill:red;
style E fill:blue;
style F fill:blue;
style G fill:blue;
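
Because the alarm parameters count cycles rather than seconds, it can help to convert them back to wall-clock time. The following is a minimal sketch. The helper name `cycles_to_duration` is made up for illustration and is not part of MinMon. With `interval = 60` from the example, `cycles = 3` corresponds to 3 minutes, `recover_cycles = 5` to 5 minutes, `repeat_cycles = 100` to 100 minutes and `error_repeat_cycles = 200` to 200 minutes.

```rust
use std::time::Duration;

/// Converts a number of cycles into wall-clock time for a check that runs
/// every `interval_secs` seconds (the `interval` field of the check).
pub fn cycles_to_duration(interval_secs: u64, cycles: u64) -> Duration {
    Duration::from_secs(interval_secs * cycles)
}
```

For example, `cycles_to_duration(60, 100)` gives a duration of 6000 seconds, which is how long the "Warning" alarm stays "Bad" before "Webhook 1" is triggered again.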
Just to give some ideas of what's possible: for example, call `notify-send` when the filesystem fills up.
To improve the reusability of the actions, it's possible to define custom placeholders for the report, events, checks, alarms and actions.
When an action is triggered, the placeholders (generic and custom) are merged into the final placeholder map.
Inside the action (depending on the type of the action) the placeholders can be used in one or more config fields using the `{{placeholder_name}}` syntax.
There are also some generic placeholders that are always available.
Placeholders that don't have a value available when the action is triggered will be replaced by an empty string.
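
The merge-and-substitute behaviour described above can be sketched as follows. This is not MinMon's template engine: the function names `merge` and `render` are hypothetical, the precedence (later maps override earlier ones) is an assumption, and only plain `{{name}}` substitution is handled.

```rust
use std::collections::HashMap;

/// Merges several placeholder maps into one. Later maps override earlier
/// ones (assumed precedence; the real merge order may differ).
pub fn merge(maps: &[&HashMap<String, String>]) -> HashMap<String, String> {
    let mut merged = HashMap::new();
    for map in maps {
        for (key, value) in map.iter() {
            merged.insert(key.clone(), value.clone());
        }
    }
    merged
}

/// Replaces every `{{name}}` in `template` with its value. Placeholders
/// without a value are replaced by an empty string.
pub fn render(template: &str, placeholders: &HashMap<String, String>) -> String {
    let mut out = String::new();
    let mut rest = template;
    while let Some(start) = rest.find("{{") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        match after.find("}}") {
            Some(end) => {
                let name = after[..end].trim();
                if let Some(value) = placeholders.get(name) {
                    out.push_str(value);
                }
                rest = &after[end + 2..];
            }
            None => {
                // Unterminated placeholder: keep the remaining text verbatim.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

With a generic map containing `check_name` and a custom map containing a made-up `team` placeholder, rendering `{{check_name}} [{{team}}]{{missing}}` drops the unknown `missing` placeholder and keeps the rest.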
Filters can be applied to transform the measurement data, which serves several use cases.
They can be configured for checks, in which case they affect all alarms that belong to the check, or for individual alarms. Having both options reduces duplication in the config file in some cases. The check is the preferred place for filtering because the filter then runs only once for all alarms, which reduces memory and CPU usage.
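
To illustrate why filtering at the check is cheaper, here is a minimal sketch with a hypothetical averaging filter. The names `AverageFilter` and `evaluate` are made up and do not reflect MinMon's filter types or config. The point is only that a check-level filter keeps one history and is computed once per cycle, no matter how many alarms consume the result.

```rust
use std::collections::VecDeque;

/// Hypothetical filter that averages the last `window` data points.
pub struct AverageFilter {
    window: usize,
    values: VecDeque<f64>,
}

impl AverageFilter {
    pub fn new(window: usize) -> Self {
        // A window of at least 1 avoids dividing by zero below.
        AverageFilter { window: window.max(1), values: VecDeque::new() }
    }

    /// Adds a raw data point and returns the filtered value.
    pub fn filter(&mut self, value: f64) -> f64 {
        self.values.push_back(value);
        if self.values.len() > self.window {
            self.values.pop_front();
        }
        self.values.iter().sum::<f64>() / self.values.len() as f64
    }
}

/// Check-level filtering: the raw value is filtered once, then every alarm
/// level is compared against the same filtered value.
pub fn evaluate(check_filter: &mut AverageFilter, raw: f64, alarm_levels: &[f64]) -> Vec<bool> {
    let filtered = check_filter.filter(raw);
    alarm_levels.iter().map(|level| filtered > *level).collect()
}
```

An alarm-level filter would instead need its own `AverageFilter` instance per alarm, each with its own history.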
To pull the Docker image, use
docker pull ghcr.io/flo-at/minmon:latest
or use the example `docker-compose.yml` file.
In both cases, mount your config file read-only at `/etc/minmon.toml`.
Make sure cargo and OpenSSL are correctly installed on your local machine.
You can either install MinMon from crates.io using
cargo install --all-features minmon
Or, if you have already checked out the repository, you can build and install your local copy like this:
cargo install --all-features --path .
Copy the `systemd.minmon.service` file to `/etc/systemd/system/minmon.service` and place your config file at `/etc/minmon.toml`.
You can enable and start the service with `systemctl daemon-reload && systemctl enable --now minmon.service`.
Use your package manager of choice to install the minmon package from the AUR.
Place your config file at `/etc/minmon.toml`.
You can enable and start the service with `systemctl daemon-reload && systemctl enable --now minmon.service`.
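Build with `--features systemd` to enable support for systemd:
- Notify systemd about start-up completion (`Type=notify`).
- Periodically reset the systemd watchdog timer (`WatchdogSec=x`).
Build with `--features sensors` to enable support for lm_sensors.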
For the Docker image, optionally mount your lm_sensors config file(s) to `/etc/sensors.d/`.
Note: libsensors is not cooperative and might theoretically block the event loop.
See CONTRIBUTING.md