Hermes Agent Down? Monitoring, Logs, systemd & Production Recovery (2026)
Ampliflow
Advanced AI frontier lab and business growth agency. Helping UK businesses deploy agentic AI systems.

The Hermes Agent install is a one-line script. Keeping it reliably running through the months that follow is the actual work. Most Hermes deployments fail not because the software is buggy but because nobody set up the operational scaffolding before the first thing went wrong. This guide covers the full monitoring stack we run for our own production Hermes deployment + UK client deployments — systemd auto-restart with the right semantics, healthcheck pings, log routing, the auto-update script that survived a 98-commit upgrade, and the recovery playbook that exists because of the 30 April 2026 outage that taught us what to write down.
Last updated: May 2026 · Covers Hermes Agent v0.13 ops patterns · Based on 40+ days of live production data + the post-mortem from the 30 April outage — see 40-day teardown
TL;DR:
- Without `Restart=always` (not `on-failure`) in systemd, your gateway will silently die on the first transient error
- Healthchecks.io free tier is enough for offsite uptime monitoring + alerting
- Logrotate with 14-day retention + journald is the right log layer for most deployments
- An automatic update script with rollback is essential — Hermes ships fast and a bad update costs hours of recovery
- The recovery playbook fits on one A4 page; print it and tape it next to the laptop you SSH from
The 30 April 2026 outage — what it taught us
Our Hermes deployment went down at 02:14 UTC and stayed down for 62 hours over the bank holiday weekend. Root cause: the gateway hit an unhandled exception on a rate-limit (429) response from the model provider and exited with status 0. `Restart=on-failure` (which we were using at the time) treats status 0 as a clean exit, so systemd did not restart it. With no alerting configured, nobody noticed until a message to the agent went unanswered.
The fix was a series of changes that became the patterns documented in this guide. We publish them because most Hermes guides skip this part — and the patterns are universal across long-lived Linux services, not just Hermes.
The full incident timeline:
| Time (UTC) | Event |
|---|---|
| 02:14, Apr 30 | Gateway receives 429 from model provider, raises unhandled exception, exits status 0 |
| 02:14, Apr 30 | systemd's `Restart=on-failure` semantics: status 0 = clean exit, no restart |
| Apr 30 - May 2 | Bank holiday weekend, no monitoring alert configured, nobody notices |
| 16:32, May 2 | Founder messages the agent, no response. Suspects WhatsApp issue. |
| 17:08, May 2 | Founder SSHs in, finds gateway stopped. `journalctl --user -u hermes-gateway --since "5 days ago"` shows the original crash. |
| 17:15, May 2 | Manual restart. Agent recovers. |
| 17:15-19:00 | Post-mortem + design of the patterns below |
Total downtime: 62 hours 1 minute. Total customer impact: zero (this was our internal ops agent, not customer-facing). But the lesson was free — and the patterns below ensure it never repeats.
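The failure mode is easiest to see as a decision table. The sketch below is a simplified model of how systemd's `Restart=` setting treats a process that exits on its own. It covers exit codes only (not signals, timeouts or watchdog events), and `should_restart` is a hypothetical helper for illustration, not a systemd API.

```python
def should_restart(policy: str, exit_code: int) -> bool:
    """Simplified model of systemd's Restart= for a process that exits by itself.

    Only exit codes are modelled; signals, timeouts and watchdog events are ignored.
    """
    clean = exit_code == 0
    if policy == "always":
        return True
    if policy == "on-failure":
        return not clean
    if policy == "on-success":
        return clean
    if policy == "no":
        return False
    raise ValueError(f"unmodelled policy: {policy}")


# The 30 April crash: the gateway exited with status 0.
print(should_restart("on-failure", 0))  # False: systemd leaves it stopped
print(should_restart("always", 0))      # True: systemd schedules a restart
```

The only row that differs between the two policies is the clean exit, which is exactly the row the outage landed on.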
Layer 1 — systemd unit configuration
The single most important file in the whole monitoring stack.
The unit file at ~/.config/systemd/user/hermes-gateway.service:
```ini
[Unit]
Description=Hermes Agent Gateway
After=network-online.target
Wants=network-online.target
# Start-rate limiting is a [Unit] setting in current systemd
StartLimitIntervalSec=0

[Service]
Type=simple
ExecStart=%h/.hermes/venv/bin/hermes gateway --foreground
Restart=always
RestartSec=60
RestartSteps=5
RestartMaxDelaySec=300
RestartForceExitStatus=75
TimeoutStartSec=300
TimeoutStopSec=60

[Install]
WantedBy=default.target
```
The non-obvious settings:
- `Restart=always`: restart on any exit, clean or otherwise. This single change would have prevented the 30 April outage. `on-failure` is wrong for this workload.
- `RestartSec=60` + `RestartSteps=5` + `RestartMaxDelaySec=300`: escalating backoff. The first restart waits 60s; the delay then increases over five steps until it reaches the 300s cap. Stops tight restart loops when the model provider is having a sustained outage.
- `StartLimitIntervalSec=0`: disable systemd's start-rate limit. We want to retry forever, not give up after a few attempts.
- `TimeoutStartSec=300`: first start can take a few minutes if Hermes needs to download model context.
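To see what the three backoff settings produce together, the sketch below computes the wait before each automatic restart. It assumes systemd spaces the steps geometrically between `RestartSec=` and `RestartMaxDelaySec=`; treat that curve as an assumption and confirm it against `systemd.service(5)` for your systemd version. `restart_delays` is a hypothetical helper, not part of systemd.

```python
def restart_delays(restart_sec: float, max_delay_sec: float, steps: int, count: int) -> list[float]:
    """Delay in seconds before each of the first `count` automatic restarts.

    Assumption: delays grow geometrically from restart_sec to max_delay_sec
    over `steps` steps, then stay at max_delay_sec.
    """
    ratio = max_delay_sec / restart_sec
    delays = []
    for i in range(count):
        if i >= steps:
            delays.append(float(max_delay_sec))
        else:
            delays.append(restart_sec * ratio ** (i / steps))
    return delays


for n, delay in enumerate(restart_delays(60, 300, 5, 7), start=1):
    print(f"restart {n}: wait {delay:.0f}s")
```

Whatever the exact curve, the property that matters is the cap: during a sustained provider outage the gateway settles into one retry every five minutes instead of spinning.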
Enable with lingering so the service survives logout:
```bash
systemctl --user daemon-reload
systemctl --user enable hermes-gateway.service
sudo loginctl enable-linger "$USER"
systemctl --user start hermes-gateway.service
```
Verify with systemctl --user status hermes-gateway.service. You want active (running) with a positive uptime.
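The same configuration goes for hermes-dashboard.service if you run the dashboard. Apply `Restart=always` to every long-running service. Timer units (auto-update timer, daily-pulse timer, etc.) have no `Restart=` setting of their own, and systemd does not allow `Restart=always` on `Type=oneshot` services, so leave the jobs they trigger to run again on the next schedule.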
Layer 2 — Healthchecks.io for offsite alerting
The 30 April outage happened because our local logging caught the failure but no human was watching the local logs. Offsite monitoring with notification to a channel you actually check (WhatsApp, in our case) is the second-most-important pattern.
Healthchecks.io free tier gives you:
- 20 checks per account
- Email + WhatsApp + Slack + custom webhook alerts
- Cron-style scheduling ("alert me if no ping in 10 minutes")
- Free forever, no credit card
The pattern: Hermes pings a Healthchecks URL every 2 minutes. If Healthchecks doesn't see a ping for 10 minutes, it alerts the on-call channel.
In ~/.hermes/skills/heartbeat/SKILL.md:
```yaml
---
description: Pings Healthchecks.io heartbeat URL. Runs every 2 minutes via cron.
disable-model-invocation: true
---
!curl -fsS -m 10 --retry 3 -o /dev/null https://hc-ping.com/<your-uuid>
```
Then schedule it via the Hermes cron config:
```yaml
crons:
  - skill: heartbeat
    schedule: "*/2 * * * *"
```
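On the receiving side, the alerting rule is a comparison between the last ping time and a grace window. The sketch below models that rule with the 10-minute threshold used in this guide; `is_overdue` is a hypothetical helper for illustration, not part of the Healthchecks.io API.

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(minutes=10)  # alert threshold used in this guide


def is_overdue(last_ping: datetime, now: datetime, grace: timedelta = GRACE) -> bool:
    """True when no heartbeat has arrived within the grace window."""
    return now - last_ping > grace


now = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
print(is_overdue(now - timedelta(minutes=2), now))   # False: a normal 2-minute gap
print(is_overdue(now - timedelta(minutes=11), now))  # True: the alert fires
```

Pinging every 2 minutes against a 10-minute threshold leaves room for several missed pings, so one slow `curl` or a brief network blip does not page anyone.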
Configure Healthchecks to alert your WhatsApp via the integrations panel. Cost: £0.
For deployments where WhatsApp alerts are insufficient (you need PagerDuty integration, multi-tier escalation, etc.), upgrade to Healthchecks paid (~£3/month) or move to Better Uptime / PagerDuty / OpsGenie.
Layer 3 — Log routing + retention
Hermes generates four log streams. All four need rotation + retention.
The streams:
| Log | Source | Useful for |
|---|---|---|
| `~/.hermes/logs/agent.log` | Application logic, skill execution | Debugging skill behaviour |
| `~/.hermes/logs/auto-update.log` | The auto-update script | Verifying nightly updates worked |
| `~/.hermes/whatsapp/bridge.log` | Baileys WhatsApp bridge | WhatsApp connection issues |
| `journalctl --user -u hermes-gateway` | systemd journal for the gateway | Crash diagnosis, restart counting |
Configure logrotate at /etc/logrotate.d/hermes:
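```
# Absolute paths: /etc/logrotate.d is processed as root, where ~ would not
# point at the Hermes user's home. Assumes the service user is "ubuntu".
/home/ubuntu/.hermes/logs/*.log {
    # Rotate as the owning user; the directory is not root-owned
    su ubuntu ubuntu
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}

/home/ubuntu/.hermes/whatsapp/bridge.log {
    su ubuntu ubuntu
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
```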
The copytruncate directive matters — without it, logrotate's default rename-then-create breaks long-lived processes that have file handles open on the original log.
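The failure mode is easy to reproduce. The sketch below stands in for a long-lived process that holds its log open while a rename-then-create rotation happens underneath it. The file names are hypothetical, and it assumes a POSIX filesystem, where an open handle follows the renamed file.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "agent.log")
    rotated_path = log_path + ".1"

    # The "service" opens its log once and keeps the handle for its whole life.
    handle = open(log_path, "a")
    handle.write("before rotation\n")
    handle.flush()

    # Rename-then-create rotation: move the file away, create a fresh empty one.
    os.rename(log_path, rotated_path)
    open(log_path, "w").close()

    # The service keeps writing through the old handle...
    handle.write("after rotation\n")
    handle.flush()
    handle.close()

    # ...so the new line lands in the rotated file, not the fresh one.
    with open(rotated_path) as f:
        rotated_text = f.read()
    fresh_size = os.path.getsize(log_path)

print(repr(rotated_text))
print(fresh_size)
```

`copytruncate` avoids this by copying the contents out and truncating the original in place, at the cost of possibly losing lines written between the copy and the truncate.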
For audit-log retention (7 years for FCA-regulated, longer for some healthcare scenarios), ship the audit logs to immutable storage (S3 with object-lock, your SIEM) — covered in our Hermes Agent security and GDPR guide.
Layer 4 — Auto-update without breaking things
Hermes ships updates roughly weekly. Auto-updating without a verification step has burned us. Auto-updating with a verification step is fine.
The script at ~/.hermes/bin/auto-update.sh (~100 lines of bash):
```bash
#!/bin/bash
set -euo pipefail

LOG=/home/ubuntu/.hermes/logs/auto-update.log
STATUS=/home/ubuntu/.hermes/.auto-update-status.json
BACKUP_DIR=/home/ubuntu/.hermes/backups
HERMES=~/.hermes/hermes-agent/venv/bin/hermes
TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ)

log() { echo "[$TIMESTAMP] $*" | tee -a "$LOG"; }

# Restore code and data to the pre-update state, restart, and record the outcome.
rollback() {
  cd ~/.hermes/hermes-agent && git reset --hard "$PRE_REV"
  "$HERMES" import --force "$SNAPSHOT"
  systemctl --user restart hermes-gateway.service
  echo "{\"status\":\"$1\",\"timestamp\":\"$TIMESTAMP\"}" > "$STATUS"
}

# 1. Snapshot the current revision and data before touching anything.
log "Pre-update snapshot..."
cd ~/.hermes/hermes-agent && PRE_REV=$(git rev-parse HEAD)
SNAPSHOT="$BACKUP_DIR/pre-update-$(date -u +%Y-%m-%d-%H%M%S).zip"
"$HERMES" export "$SNAPSHOT"

# 2. Refuse to update with less than 512 MB free (-Pk: one line per filesystem, 1K blocks).
AVAIL=$(df -Pk ~/.hermes | awk 'NR==2 {print $4}')
if [ "$AVAIL" -lt 524288 ]; then
  log "ERROR: less than 512 MB free, skipping update"
  exit 1
fi

# 3. Run the update; on failure, roll back and notify.
log "Running hermes update..."
if ! "$HERMES" update --yes 2>&1 | tee -a "$LOG"; then
  log "Update failed, rolling back to $PRE_REV"
  rollback failed
  curl -fsS http://localhost:9119/send -d '{"message":"Hermes auto-update FAILED, rolled back."}'
  exit 1
fi

# 4. Re-apply the local patch, if present, and verify it took.
if [ -f ~/.hermes/bin/auto-update-patch3.py ]; then
  python3 ~/.hermes/bin/auto-update-patch3.py
  # -F matches the literal string; read as a regex, "**kwargs" also matches plain "kwargs".
  grep -qF -- '**kwargs' ~/.hermes/hermes-agent/path/to/patched/file.py || {
    log "ERROR: patch3 reapplication failed"
    exit 1
  }
fi

# 5. Restart and verify the gateway came back; otherwise roll back.
log "Restarting gateway..."
systemctl --user restart hermes-gateway.service
sleep 30
if ! systemctl --user is-active hermes-gateway.service > /dev/null; then
  log "Gateway did not start, rolling back"
  rollback rollback
  exit 1
fi

# 6. Keep the seven most recent snapshots.
ls -t "$BACKUP_DIR"/pre-update-*.zip | tail -n +8 | xargs -r rm

echo "{\"status\":\"success\",\"timestamp\":\"$TIMESTAMP\",\"pre_rev\":\"$PRE_REV\"}" > "$STATUS"
log "Update complete."
```
Triggered by a systemd timer at ~/.config/systemd/user/hermes-auto-update.timer:
```ini
[Unit]
Description=Daily Hermes auto-update

[Timer]
OnCalendar=*-*-* 03:00:00 Europe/London
Persistent=true
RandomizedDelaySec=600

[Install]
WantedBy=timers.target
```
The Persistent=true is critical — if the server was offline at 03:00, the update runs as soon as it comes back online. Without it, an overnight reboot means a full day of skipped updates.
The script has been live-tested on a 98-commit upgrade in early May 2026 (Hermes v0.12.0 → v0.13.0 with config v22 → v23 migration). It worked.
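Stripped of paths and logging, the script's control flow is small enough to test in isolation. The sketch below is a minimal model of it, assuming the three steps are passed in as callables; `update_with_rollback` and its parameter names are hypothetical, and the returned strings mirror the values the script writes to its status file.

```python
from typing import Callable


def update_with_rollback(
    update: Callable[[], bool],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> str:
    """Run an update, verify the service, and roll back on either failure.

    Returns the status string the script records: "success", "failed" or "rollback".
    """
    if not update():
        rollback()
        return "failed"
    if not is_healthy():
        rollback()
        return "rollback"
    return "success"


# Example: the update applies cleanly but the gateway does not come back up.
calls = []
status = update_with_rollback(
    update=lambda: True,
    is_healthy=lambda: False,
    rollback=lambda: calls.append("rolled back"),
)
print(status, calls)
```

The point of the shape is that there are two distinct failure exits, and both pass through the same rollback before reporting.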
Layer 5 — The recovery playbook
Print this. Tape it next to the laptop you SSH from. When something goes wrong, you don't want to be reading documentation.
```
HERMES RECOVERY PLAYBOOK - v1.0 (May 2026)
==========================================

1. Confirm the agent is actually down
   ssh ubuntu@<server>
   systemctl --user status hermes-gateway.service

2. Try the simple restart
   systemctl --user restart hermes-gateway.service
   sleep 30
   systemctl --user is-active hermes-gateway.service

3. If restart fails, check the model provider
   curl -fsS https://status.anthropic.com/api/v2/status.json

4. If the provider is healthy and Hermes still won't start, check logs
   journalctl --user -u hermes-gateway.service --since "30 minutes ago"

5. Rollback to the most recent backup
   ls -t ~/.hermes/backups/pre-update-*.zip | head -1
   ~/.hermes/hermes-agent/venv/bin/hermes import --force <filename>
   systemctl --user restart hermes-gateway.service

6. If rollback also fails, file an issue
   https://github.com/NousResearch/hermes-agent/issues

7. While you wait for fix
   systemctl --user disable --now hermes-auto-update.timer

KEY LOCATIONS
- Hermes home: ~/.hermes/
- Backups: ~/.hermes/backups/
- Logs: ~/.hermes/logs/, journalctl --user -u hermes-gateway
- Config: ~/.hermes/config.yaml
- Auto-update script: ~/.hermes/bin/auto-update.sh
```
The playbook has been tested in anger twice in the first 40 days (the 30 April outage + one minor model-provider outage in May). Both times it recovered the agent in under 10 minutes.
What to monitor — the metrics that matter
Five metrics, no more. Over-monitoring creates alert fatigue and distracts from the signals that matter.
| Metric | How to check | Alert threshold |
|---|---|---|
| Gateway uptime | Healthchecks.io heartbeat (above) | No ping for 10 min |
| Disk space | `df ~/.hermes` daily | <10% free |
| Memory usage | `ps aux \| grep hermes-gateway \| awk '{print $6}'` every 5 min | >4 GB on a 6 GB server |
| Model-provider error rate | Parse agent.log for HTTP 4xx/5xx, last hour | >10% errors |
| Auto-update status | `cat ~/.hermes/.auto-update-status.json` daily | status != "success" yesterday |
For most UK SME deployments running a single Hermes instance, the five above are enough. Add more only when a real incident teaches you the gap.
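Two of these checks reduce to a few lines each. The sketch below evaluates the disk-space threshold and the auto-update status file. The JSON shape matches what the auto-update script writes; the function names are hypothetical, and checking that the timestamp is recent is left out for brevity.

```python
import json
import shutil


def disk_alert(path: str, min_free_fraction: float = 0.10) -> bool:
    """True when free space on the filesystem holding `path` is below the threshold."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total < min_free_fraction


def update_alert(status_json: str) -> bool:
    """True when the last auto-update did not record success (or the file is unreadable)."""
    try:
        return json.loads(status_json).get("status") != "success"
    except (json.JSONDecodeError, AttributeError):
        return True


print(update_alert('{"status":"success","timestamp":"2026-05-10T02:00:00Z","pre_rev":"abc123"}'))
print(update_alert('{"status":"rollback","timestamp":"2026-05-11T02:00:00Z"}'))
print(disk_alert("."))
```

Treating an unreadable status file as an alert is deliberate: a missing or corrupt file usually means the script died before it could report.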
Frequently asked questions
Why Restart=always instead of Restart=on-failure?
on-failure doesn't restart on exit status 0 — it interprets a clean exit as "the process meant to stop." But Hermes can exit cleanly after an unhandled exception (the Python process catches the exception, logs it, calls sys.exit(0)). With on-failure, that's a permanent stop. With always, it's a transient stop that exponential backoff handles correctly.
How often does Hermes ship breaking updates?
Roughly monthly for major versions, weekly for patch versions. Patch versions are typically safe; major versions occasionally need config migrations. The auto-update script's verification step catches breaking changes before they take down production.
What about Kubernetes / Docker Swarm / Nomad?
For a single Hermes instance on a single server, systemd is the right scheduler — simpler, more reliable, no orchestrator overhead. For multi-instance deployments (multiple specialist harnesses, geographic redundancy), Docker Compose or Kubernetes makes sense. The systemd patterns translate directly — Restart=always becomes restart: always in compose, or restartPolicy: Always in K8s.
How do I know if my Healthchecks alert is reaching me?
Schedule a manual test: pause the heartbeat skill for 11 minutes (1 minute past the 10-minute threshold) and confirm you get the alert. Do this on day one and every six months. Untested alerts are Schrödinger's monitoring system: you don't know if they work until you need them.
Can I run the auto-update script less often than daily?
Yes. Change the timer's `OnCalendar=` line to `OnCalendar=weekly` (which fires at midnight on Monday) or `OnCalendar=Sun *-*-* 03:00:00 Europe/London`. We recommend daily because Hermes patch versions occasionally fix critical bugs and the verification + rollback semantics make daily updates safe.
What happens if the auto-update script itself has a bug?
The worst case: script runs, fails to update, rolls back, agent stays on previous version. You get an alert, you investigate, you patch the script. The agent doesn't go down because of a script bug — the rollback path is the safety net.
Is there a hosted Hermes monitoring product?
Not as of May 2026. The patterns above are universal Linux service monitoring; any standard monitoring tool (Prometheus + Grafana, Datadog, New Relic) works on top. Healthchecks.io is the cheapest path; Better Uptime is the next step up.
How do I monitor model-provider costs?
Parse the agent.log for token counts per call. Total daily/weekly. Compare against your subscription's usage allowance. Anthropic and OpenAI both expose monthly usage via their dashboards — pull those into a daily report skill (one of the use cases in our Hermes use cases guide).
Related reading
- ↑ How to Deploy Hermes Agent — UK Business Complete Guide — the foundational deployment pillar
- ↔ Hermes Agent on Oracle Cloud Free Tier — UK Guide — the underlying server platform
- ↔ Hermes Agent Security & GDPR for UK Business — the compliance posture that complements operational reliability
- ↔ Hermes Agent — Real Business Use Cases — the use cases that depend on this reliability layer
What should you do next?
The patterns above are the difference between a Hermes deployment that survives a year and one that has a 62-hour outage in month two. Most of the work is one-time setup — once configured, monitoring runs itself.
See how Ampliflow runs Hermes-powered automations for clients →
Or to scope your specific monitoring + recovery setup: Book a free Hermes deployment review →