| Crates.io | ogre-watchdog |
| lib.rs | ogre-watchdog |
| version | 0.1.1 |
| created_at | 2025-09-18 02:09:33.189627+00 |
| updated_at | 2025-09-20 02:33:05.406835+00 |
| description | Portable watchdog for Rust services & jobs to avoid hangups even without a supervisor |
| homepage | https://github.com/zertyz/ogre-watchdog |
| repository | https://github.com/zertyz/ogre-watchdog |
| max_upload_size | |
| id | 1844148 |
| size | 47,781 |
Portable watchdog for Rust applications / services that run a lot of tiny units of work and must fail fast if code wedges.
Design target: event-driven systems (kHz workloads, sub-ms units) where subprocess isolation is too costly but rare hangs must not wedge the whole service or scheduled job.
The watchdog activates the following callbacks, in this sequence:
All callbacks must be non-blocking and avoid app locks, heavy allocation, and blocking I/O. Treat them as telemetry hooks.
on_started(executor + event)
Observability-only. Do not lock/allocate/do I/O.
on_stalled(executor + event)
Fires exactly once per unit when it first crosses stall_threshold. Under the demotion feature, ogre-watchdog may demote the OS thread priority of that executor to reduce CPU impact.
on_recovered(executor + event)
Fires if an executor that was previously stalled is able to recover (finish the job) before meltdown.
Use this to undo any dead-letter pre-marking you might have done in on_stalled().
Under the demotion feature, ogre-watchdog will also restore the OS thread priority to its original value before calling this hook.
on_meltdown([array of stalled executors & events])
Final best-effort callback executed right before termination. You must terminate the process from this hook by calling std::process::exit(exit_code)
after any minimal, non-blocking bookkeeping. Do not take app locks or perform heavy allocations or heavy I/O here.
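To make the non-blocking requirement concrete, below is a minimal sketch of what the four hooks might look like. Everything in it is illustrative: the ExecutorId/EventId aliases, the counter names, the slice-of-pairs argument and the exit code are assumptions, not crate types, and how the hooks are actually registered is defined by the crate's own API (see its bundled examples), which is not reproduced here.

```rust
use std::process;
use std::sync::atomic::{AtomicU64, Ordering};

// Lock-free telemetry counters: no locks, no allocation, no I/O -- safe to
// touch from any of the watchdog hooks.
static STARTED:   AtomicU64 = AtomicU64::new(0);
static STALLED:   AtomicU64 = AtomicU64::new(0);
static RECOVERED: AtomicU64 = AtomicU64::new(0);

// Illustrative identifiers for executors and events -- not crate types.
type ExecutorId = usize;
type EventId    = u64;

fn on_started(_executor: ExecutorId, _event: EventId) {
    // observability only
    STARTED.fetch_add(1, Ordering::Relaxed);
}

fn on_stalled(_executor: ExecutorId, _event: EventId) {
    // fires once per unit when it first crosses `stall_threshold`; a good
    // place to pre-mark the event in a dead-letter slot (see further below)
    STALLED.fetch_add(1, Ordering::Relaxed);
}

fn on_recovered(_executor: ExecutorId, _event: EventId) {
    // the previously stalled unit finished before meltdown: undo any pre-marking
    RECOVERED.fetch_add(1, Ordering::Relaxed);
}

fn on_meltdown(stalled: &[(ExecutorId, EventId)]) -> ! {
    // minimal, non-blocking bookkeeping, then the mandatory termination
    eprintln!("meltdown: {} executor(s) wedged -- exiting", stalled.len());
    process::exit(70)
}
```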
Combine ogre-watchdog with a persistent dead-letter repository to ensure the service won't attempt to reprocess troubled payloads.
On process restart, a startup reaper makes sure the tagged events won't be offered to the executors again.
In on_work_start(), reserve a slot in the dead-letter queue's durable storage.
In on_stalled(), write the event payload to the reserved slot and publish the entry.
In on_recovered() & on_work_finish(), undo any DLQ pre-marking (a rough sketch of a slot store follows below).
A simpler, but slightly less safe, approach is to only publish the dead-letter events during on_meltdown().
When the meltdown procedure kicks in, your service might already be degraded to the point where it cannot allocate or do heavy I/O, so use this approach with care: an overly complex on_meltdown() routine risks delaying the program's exit more than necessary, or preventing it from ever exiting at all.
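To make the safer, pre-marking variant concrete, here is a rough sketch of a slot store. The DeadLetterStore type and its methods are purely illustrative: a real implementation would persist the slots to disk (or a database) so the startup reaper mentioned above can skip the tagged events after a restart, and the lock used here is a small dedicated one, never shared with application-critical paths.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Illustrative stand-in for a durable dead-letter slot store. A real
/// implementation would persist the slots so a startup reaper can skip the
/// tagged events after a process restart.
#[derive(Default)]
struct DeadLetterStore {
    // keyed by event id; `None` means "slot reserved, nothing published yet"
    slots: Mutex<HashMap<u64, Option<Vec<u8>>>>,
}

impl DeadLetterStore {
    /// on_work_start(): reserve a slot before the unit of work runs.
    fn reserve(&self, event_id: u64) {
        self.slots.lock().unwrap().insert(event_id, None);
    }

    /// on_stalled(): write the event payload to the reserved slot and publish it.
    fn publish(&self, event_id: u64, payload: Vec<u8>) {
        self.slots.lock().unwrap().insert(event_id, Some(payload));
    }

    /// on_recovered() / on_work_finish(): the unit completed after all -- undo the pre-marking.
    fn retract(&self, event_id: u64) {
        self.slots.lock().unwrap().remove(&event_id);
    }

    /// Startup reaper: events still present here wedged a previous run and
    /// must not be offered to the executors again.
    fn poisoned_events(&self) -> Vec<u64> {
        self.slots.lock().unwrap().keys().copied().collect()
    }
}
```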
scheduled_job_deadletter example
The above approach is implemented, simplistically, in the scheduled_job_deadletter example.
Please refer to the code example for more details.
This crate can be used in async programs, but observe the following:
You may want to trigger a graceful shutdown of your async runtime when on_meltdown() kicks in, but don't forget to spawn another
dedicated thread to do so, still calling std::process::exit(exit_code) in the callback's body (see the sketch after the caveats below).
Be aware that:
the async runtime & code may be compromised and may never return from a graceful shutdown trigger
since the process is degraded, spawning new threads may not succeed -- e.g., RAM got exhausted.
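One possible shape for the hook body in an async program is sketched below. It assumes tokio and a Notify-based shutdown signal; the grace window, the thread name and the exit code are arbitrary illustrative choices, not requirements of the crate.

```rust
use std::process;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

use tokio::sync::Notify;

/// One possible shape for an `on_meltdown` body in an async program.
fn meltdown_hook(shutdown: Arc<Notify>) {
    // 1) Guarantee termination first: a dedicated OS thread that exits the
    //    process after a short, bounded grace period, independently of any
    //    (possibly wedged) async runtime thread.
    let exit_thread = thread::Builder::new()
        .name("meltdown-exit".into())
        .spawn(|| {
            thread::sleep(Duration::from_secs(2));
            process::exit(70);
        });

    // 2) Best-effort graceful shutdown trigger for the async side. This call
    //    is non-blocking; whether tasks actually honour it is out of our hands.
    shutdown.notify_waiters();

    // 3) If the process is so degraded that even spawning a thread failed
    //    (e.g., RAM exhausted), exit right away -- termination is mandatory here.
    if exit_thread.is_err() {
        process::exit(70);
    }
}
```

Under this shape it is the dedicated thread (or the fallback branch) that actually calls std::process::exit, so a graceful shutdown that never completes cannot postpone termination beyond the grace window.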
This is akin to Kubernetes' "Liveness Probes" and to systemd's "Watchdogs", but implemented in-process -- hence "portable" -- to bring similar functionality to other execution environments.
When working in event-based systems, with a fixed set of executors and an unbounded list of work to do, some event's payload will, sooner or later, trigger an undetected bug -- or a known bug in a dependency you do not control -- like hitting a deadlock, entering a never-ending loop, or simply panicking. When this happens, your processing pipeline can first degrade, then stall, and the whole service silently dies.
The catch is that forcibly killing a thread within the same process is dangerous: the thread might be killed while changing shared state, corrupting it or leaving a mutex locked. This applies, for instance, to allocations and deallocations in the heap: if the thread is killed inside one of these operations, the whole heap of the process can be corrupted, or a global lock may never be released.
Usually, the alternative is to run the operations that might need to be forcibly cancelled in separate processes.
But, on the other hand, if you run thousands of sub-millisecond tasks per second, you can't afford to run each one in a subprocess. And when the likelihood of failures is nearly zero, this crate may fill in the gap.
There are crates for systemd watchdog pings and crates for spawning processes, but there wasn’t a single, portable, in-process watchdog that: