ogre-watchdog

Crates.io: ogre-watchdog
lib.rs: ogre-watchdog
version: 0.1.1
created_at: 2025-09-18 02:09:33.189627+00
updated_at: 2025-09-20 02:33:05.406835+00
description: Portable watchdog for Rust services & jobs to avoid hangups even without a supervisor
homepage: https://github.com/zertyz/ogre-watchdog
repository: https://github.com/zertyz/ogre-watchdog
max_upload_size:
id: 1844148
size: 47,781
author: Luiz Silveira (zertyz)

documentation

https://docs.rs/ogre-watchdog/

README

ogre-watchdog

Portable watchdog for Rust applications / services that run a lot of tiny units of work and must fail fast if code wedges.

Design target: event-driven systems (kHz workloads, sub-ms units) where subprocess isolation is too costly but rare hangs must not wedge the whole service or scheduled job.

Activation sequence

The watchdog is activated in the following sequence (a usage sketch follows the list):

  • The program registers the watchdog with its timeout and callback configuration
  • Each event executor informs the watchdog that it is "starting work on X"
  • The watchdog tracks the progress and enforces the specified healthy time threshold for the work to complete
  • If exceeded, the executor is marked as "compromised" and a specified callback is activated with the context. Optionally, the thread on which that executor is running is demoted to "IDLE Priority" -- useful if hangups tend to drain the CPU.
  • (upon later recovery, another callback is activated to report the executor's new situation and the thread is promoted back to its original priority)
  • If enough executors are compromised -- exceeding the given threshold -- the watchdog proceeds with the "meltdown procedure", activating a specified callback
  • Executors are informed not to start any new work -- "starting work on X" will return an error (meaning "ACCEPTING=false")
  • Time is given to drain the events that were already being processed before the meltdown started
  • A final "meltdown completed" callback can be specified
  • After the meltdown, that watchdog instance stops and ceases to perform its functions.
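
A minimal sketch of the sequence above -- assuming illustrative names (Watchdog, register, starting_work) and an arbitrary exit code, none of which are necessarily the crate's real API:

    use std::time::Duration;

    // Illustrative stand-ins: the type & method names below are hypothetical and
    // only mirror the activation sequence described above -- check the crate docs
    // for the real API.
    struct Watchdog;
    struct WorkGuard;

    impl Watchdog {
        /// Registers the watchdog with its timeout & callback configuration.
        fn register(
            _stall_threshold: Duration,  // healthy time threshold per unit of work
            _max_compromised: usize,     // how many compromised executors trigger meltdown
            _on_stalled: impl Fn(&str) + Send + 'static,
            _on_meltdown: impl Fn(&[&str]) + Send + 'static,
        ) -> Self { Watchdog }

        /// "starting work on X" -- an Err means ACCEPTING=false (meltdown under way).
        fn starting_work(&self, _event: &str) -> Result<WorkGuard, ()> { Ok(WorkGuard) }
    }

    fn main() {
        let watchdog = Watchdog::register(
            Duration::from_millis(5),
            2,
            |event| eprintln!("stalled on {event}"),  // executor marked "compromised"
            |stalled| { eprintln!("meltdown: {stalled:?}"); std::process::exit(86); },
        );

        for event in ["payload-1", "payload-2"] {
            match watchdog.starting_work(event) {
                Ok(_guard) => { /* process the event; dropping the guard marks completion */ }
                Err(())    => break,  // the watchdog is no longer accepting new work
            }
        }
    }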

Callbacks (and what they may do)

All callbacks must be non-blocking and avoid app locks, heavy allocation, and blocking I/O. Treat them as telemetry hooks (a sketch of such minimal callback bodies follows the list below).

  • on_started(executor + event) Observability-only. Do not lock/allocate/do I/O.

  • on_stalled(executor + event) Fires exactly once per unit when it first crosses stall_threshold. Under the demotion feature, ogre-watchdog may demote the OS thread priority of that executor to reduce CPU impact.

  • on_recovered(executor + event) Fires if an executor that was previously stalled is able to recover (finish the job) before meltdown. Use this to undo any dead-letter pre-marking you might have done in on_stalled. Under the demotion feature, ogre-watchdog will also restore the OS thread priority to its original value before calling this hook.

  • on_meltdown([array of stalled executors & events]) Final best-effort callback executed right before termination. You must terminate the process from this hook by calling std::process::exit(exit_code) after any minimal, non-blocking bookkeeping. Do not take app locks or perform heavy allocations or heavy I/O here.
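
As a sketch of how lean these hooks can be -- the parameter types and the exit code below are assumptions, not the crate's actual signatures -- the callback bodies can touch nothing but pre-allocated atomics, leaving a single stderr write and std::process::exit() for meltdown time:

    use std::sync::atomic::{AtomicU64, Ordering};

    // Pre-allocated counters: incrementing them never locks, allocates or blocks.
    static STARTED: AtomicU64 = AtomicU64::new(0);
    static STALLED: AtomicU64 = AtomicU64::new(0);

    // NOTE: parameter types and the exit code are illustrative assumptions.
    fn on_started(_executor: usize, _event: u64) {
        STARTED.fetch_add(1, Ordering::Relaxed);  // observability only
    }

    fn on_stalled(_executor: usize, _event: u64) {
        STALLED.fetch_add(1, Ordering::Relaxed);  // fires once per stalled unit
    }

    fn on_meltdown(stalled: &[(usize, u64)]) {
        // minimal, non-blocking bookkeeping, then terminate with a known exit code
        // so a supervisor / cron wrapper may respawn the process
        eprintln!("meltdown: {} unit(s) stalled", stalled.len());
        std::process::exit(86);
    }

    fn main() {
        // these functions would be handed to the watchdog's registration; they are
        // exercised standalone here only to keep the sketch runnable
        on_started(0, 1);
        on_stalled(0, 1);
        println!("started={} stalled={}",
                 STARTED.load(Ordering::Relaxed), STALLED.load(Ordering::Relaxed));
        on_meltdown(&[(0, 1)]);
    }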

Dead-letter enabler

Combine ogre-watchdog with a persistent dead-letter repository to ensure the service won't attempt to reprocess troubled payloads.

On process restart, a startup reaper makes sure the tagged events won't be offered to the executors again.

Pedantic approach (a sketch follows the list):

  • On on_work_start(), reserve a slot in the dead-letter queue's durable storage
  • On on_stalled(), write the event payload to the reserved slot and publish the entry
  • On on_recovered & on_work_finish(), undo any DLQ pre-marking
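
A minimal in-memory sketch of this flow is given below; a real dead-letter repository would use durable storage so the startup reaper can still see the published entries after a restart, and all names here (DeadLetterStore, reserve, publish, retract) are illustrative, not part of ogre-watchdog:

    use std::collections::HashMap;

    // In-memory stand-in for a *durable* dead-letter store -- illustrative only.
    #[derive(Default)]
    struct DeadLetterStore {
        reserved: HashMap<u64, Option<Vec<u8>>>,  // event id -> payload, once published
    }

    impl DeadLetterStore {
        fn reserve(&mut self, event_id: u64) {                  // on_work_start()
            self.reserved.insert(event_id, None);
        }
        fn publish(&mut self, event_id: u64, payload: &[u8]) {  // on_stalled()
            self.reserved.insert(event_id, Some(payload.to_vec()));
        }
        fn retract(&mut self, event_id: u64) {                  // on_recovered() / on_work_finish()
            self.reserved.remove(&event_id);
        }
        /// startup reaper: published entries must not be offered to the executors again
        fn poisoned_events(&self) -> impl Iterator<Item = u64> + '_ {
            self.reserved.iter().filter(|(_, p)| p.is_some()).map(|(id, _)| *id)
        }
    }

    fn main() {
        let mut dlq = DeadLetterStore::default();
        dlq.reserve(42);                  // work starts on event 42
        dlq.publish(42, b"bad payload");  // event 42 stalled
        dlq.reserve(43);
        dlq.retract(43);                  // event 43 finished fine: undo the pre-marking
        // the process melts down before event 42 recovers; on restart, the reaper runs:
        for id in dlq.poisoned_events() {
            println!("skipping poisoned event {id}");
        }
    }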

Simpler approach -- use with caution:

A simpler, but slightly less safe, approach is to publish the dead-letter events only during on_meltdown(). When the meltdown procedure kicks in, your service might already be degraded to the point where it cannot allocate or do heavy I/O, so use this approach with care: an overly complex on_meltdown() routine may delay the program's exit more than necessary, or prevent it from ever exiting at all.

scheduled_job_deadletter example

The above approach is implemented, simplistically, in the scheduled_job_deadletter example. Please refer to that example's code for more details.

Async support

This crate can be used in async programs, but observe the following:

  • It may be started straight from async contexts -- internally, a dedicated "sync" thread for monitoring will be spawned
  • All callbacks run in sync contexts, as almost all of them are driven by the dedicated thread
  • You can attempt to gracefully shut down the application when on_meltdown() kicks in, but remember to spawn another dedicated thread to do so and to call std::process::exit(exit_code) in the callback's body (see the sketch after this list). Be aware that:
    • the async runtime & code may be compromised and may never return from a graceful shutdown trigger

    • since the process is degraded, spawning new threads may not succeed -- e.g., if RAM has been exhausted.
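
A sketch of such an on_meltdown() body is shown below; the graceful_shutdown closure and the exit code are stand-ins for whatever shutdown mechanism your runtime provides, but the pattern is the same: trigger the shutdown from a dedicated thread, wait a bounded time, then exit unconditionally:

    use std::{sync::mpsc, thread, time::Duration};

    // Illustrative on_meltdown() body for an async service. `graceful_shutdown`
    // stands in for triggering your runtime's shutdown and waiting on it.
    fn on_meltdown(graceful_shutdown: impl FnOnce() + Send + 'static) {
        let (done_tx, done_rx) = mpsc::channel::<()>();

        // spawning may itself fail in a degraded process (e.g., RAM exhausted),
        // hence the non-panicking Builder API
        let spawned = thread::Builder::new()
            .name("meltdown-shutdown".into())
            .spawn(move || {
                graceful_shutdown();       // may never return if the runtime is wedged
                let _ = done_tx.send(());  // reached only if the shutdown completed
            });

        if spawned.is_ok() {
            // time-boxed wait: exit whether or not the graceful shutdown completed
            let _ = done_rx.recv_timeout(Duration::from_secs(2));
        }
        std::process::exit(86);  // illustrative exit code
    }

    fn main() {
        on_meltdown(|| println!("graceful shutdown attempted"));  // terminates the process
    }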

Why this crate exists

This is akin to Kubernetes' "Liveness Probes" and to systemd's "Watchdogs", but implemented in-process -- hence "portable" -- to bring similar functionality to other execution environments.

When working in event-based systems, with a fixed set of executors and an unbounded list of work to do, some event's payload will, sooner or later, trigger an undetected bug -- or a known bug in a dependency you do not control -- such as hitting a deadlock, a never-ending loop, or simply panicking. When this happens, your processing pipeline can first degrade, then stall, and the whole service silently dies.

It happens that forcibly killing a thread within the same process is dangerous. It might be killed while changing a shared state, corrupting it or leaving a mutex locked -- this applies, for instance, to allocations/deallocations in the heap: if the thread is killed inside any of these operations, the whole heap of the process can be corrupted or a global lock may never be released.

Usually, the alternative is to implement the operations that might need to be forcibly cancelled in separate processes.

But, on the other hand, if you run thousands of sub-millisecond tasks per second, you can't afford to run each one in a subprocess. When the likelihood of failures is also nearly zero, this crate may fill the gap.

There are crates for systemd watchdog pings and crates for spawning processes, but there wasn’t a single, portable, in-process watchdog that:

  • observes per-work progress cheaply
  • enforces a minimum healthy executors policy
  • drains briefly and exits hard with a known exit code -- so that the process may be respawned
  • optionally demotes the CPU hog thread to IDLE
  • and runs a best-effort, time-boxed final callback (e.g., to mark dead-letter / "do not work on this again") without risking a hang.

Edge Cases

  • The callback code should be carefully crafted: if a callback hangs, the watchdog will also hang and fail to fulfill its function.
  • Workers can hang during the draining phase of the meltdown procedure; we don't try to detect mid-drain stalls because we are already exiting. This is realistically impossible, but still a logically possible & uncovered scenario.
  • If your runtime uses pooled worker threads for many tasks (e.g., async executors), demotion is thread-level, not task-level. Prefer dedicated threads for untrusted work.