AI Agents · 22 April 2026 · Updated 8 June 2026 · 11 min read

Hermes Agent Down? Monitoring, Logs, systemd & Production Recovery (2026)

Sajad Saleem

Co-founder of Ampliflow. Builds AI automation, websites, SEO/AEO, and growth systems for UK SMEs.

The Hermes Agent install is a one-line script. Keeping it reliably running through the months that follow is the actual work. Most Hermes deployments fail not because the software is buggy but because nobody set up the operational scaffolding before the first thing went wrong. This guide covers the full monitoring stack we run for our own production Hermes deployment + UK client deployments — systemd auto-restart with the right semantics, healthcheck pings, log routing, the auto-update script that survived a 98-commit upgrade, and the recovery playbook that exists because of the 30 April 2026 outage that taught us what to write down.

Last updated: May 2026 · Covers Hermes Agent v0.13 ops patterns · Based on 40+ days of live production data + the post-mortem from the 30 April outage — see 40-day teardown

TL;DR:

With Restart=on-failure, systemd does not restart the gateway after a clean exit (status 0), so one caught error can leave it down until someone notices. Use Restart=always.
Healthchecks.io free tier is enough for offsite uptime monitoring + alerting
Logrotate with 14-day retention + journald is the right log layer for most deployments
An automatic update script with rollback is essential — Hermes ships fast and a bad update costs hours of recovery
The recovery playbook fits on one A4 page; print it and tape it next to the laptop you SSH from

The 30 April 2026 outage — what it taught us

Our Hermes deployment went down at 02:14 UTC on 30 April and stayed down for roughly 63 hours. Root cause: a rate-limit response (HTTP 429) from a model provider raised an exception that the request path did not handle; the gateway logged it and exited with status 0, and Restart=on-failure (which we were using at the time) treats status 0 as a clean stop and does not restart. With no alerting configured, the agent stayed offline through the bank holiday weekend until someone messaged it and got no reply.

The fix was a series of changes that became the patterns documented in this guide. We publish them because most Hermes guides skip this part — and the patterns are universal across long-lived Linux services, not just Hermes.

The full incident timeline:

Time (UTC)	Event
02:14, Apr 30	Gateway receives 429 from model provider, raises unhandled exception, exits status 0
02:14, Apr 30	systemd's `Restart=on-failure` semantics: status 0 = clean exit, no restart
Apr 30 - May 2	Bank holiday weekend, no monitoring alert configured, nobody notices
16:32, May 2	Founder messages the agent, no response. Suspects WhatsApp issue.
17:08, May 2	Founder SSHs in, finds gateway stopped. `journalctl --user -u hermes-gateway --since "5 days ago"` shows the original crash.
17:15, May 2	Manual restart. Agent recovers.
17:15-19:00	Post-mortem + design of the patterns below

Total downtime: 63 hours 1 minute (02:14 on 30 April to 17:15 on 2 May). Total customer impact: zero (this was our internal ops agent, not customer-facing). The lesson was free, and the patterns below are there to stop it repeating.

Layer 1 — systemd unit configuration

The single most important file in the whole monitoring stack.

The unit file at ~/.config/systemd/user/hermes-gateway.service:

```ini
[Unit]
Description=Hermes Agent Gateway
After=network-online.target
Wants=network-online.target
# Start-rate limiting is a [Unit] setting; 0 disables it
StartLimitIntervalSec=0

[Service]
Type=simple
ExecStart=%h/.hermes/venv/bin/hermes gateway --foreground
Restart=always
RestartSec=60
RestartSteps=5
RestartMaxDelaySec=300
RestartForceExitStatus=75
TimeoutStartSec=300
TimeoutStopSec=60

[Install]
WantedBy=default.target
```

The non-obvious settings:

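`Restart=always`: restart on any exit, clean or otherwise. This single change would have prevented the 30 April outage. on-failure is wrong for this workload.
`RestartSec=60` + `RestartSteps=5` + `RestartMaxDelaySec=300`: graduated backoff. The first restart waits 60s, then the delay grows over five steps until it reaches the 300s cap and stays there. systemd computes the intermediate delays itself, and they are not even 60s increments. This stops tight restart loops when the model provider is having a sustained outage. Both step settings need systemd 254 or newer; older versions log a warning, ignore them, and retry every 60s.
`StartLimitIntervalSec=0`: disable systemd's start-rate limit. We want to retry forever, not give up after a few attempts. systemd documents this as a [Unit] setting, and current versions may ignore it if it is placed under [Service].
`TimeoutStartSec=300`: first start can take a few minutes if Hermes needs to download model context.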
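To see what delays the three backoff settings produce, the sketch below computes the schedule. It assumes systemd interpolates geometrically between RestartSec= and RestartMaxDelaySec=, that is delay_i = RestartSec × (RestartMaxDelaySec / RestartSec)^(i / RestartSteps). That formula is our reading of the systemd source, not something the man page spells out, and `backoff_schedule` is a hypothetical helper name.

```shell
# backoff_schedule RESTART_SEC MAX_DELAY_SEC STEPS
# Prints the assumed delay (whole seconds) before each successive restart,
# ending with the capped value that applies from then on.
# ASSUMPTION: geometric interpolation, delay_i = r0 * (rmax / r0)^(i / steps).
backoff_schedule() {
    awk -v r0="$1" -v rmax="$2" -v steps="$3" 'BEGIN {
        for (i = 0; i <= steps; i++) {
            delay = (i < steps) ? r0 * (rmax / r0) ^ (i / steps) : rmax
            printf "%s%d", (i ? " " : ""), delay
        }
        printf "\n"
    }'
}

# Values from the unit file: RestartSec=60, RestartMaxDelaySec=300, RestartSteps=5
# backoff_schedule 60 300 5
```

With the values from the unit file this prints `60 82 114 157 217 300`, so under that assumption the sixth and every later restart waits the full five minutes.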

Enable with lingering so the service survives logout:

```bash
systemctl --user daemon-reload
systemctl --user enable hermes-gateway.service
sudo loginctl enable-linger "$USER"
systemctl --user start hermes-gateway.service
```

Verify with systemctl --user status hermes-gateway.service. You want active (running) with a positive uptime.
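A unit can be active while still carrying the old restart policy, for example if daemon-reload was skipped. A small check like the following, offered as a sketch, reads the properties systemd has actually loaded. `check_gateway_unit` is a hypothetical helper; it expects the `KEY=value` lines that `systemctl show` prints on standard input.

```shell
# check_gateway_unit: reads `systemctl show` output (KEY=value lines) on stdin.
# Succeeds only if the unit is active AND its loaded restart policy is "always".
check_gateway_unit() {
    props=$(cat)
    state=$(printf '%s\n' "$props" | sed -n 's/^ActiveState=//p')
    policy=$(printf '%s\n' "$props" | sed -n 's/^Restart=//p')
    if [ "$state" = "active" ] && [ "$policy" = "always" ]; then
        echo "ok: active, Restart=always"
    else
        echo "problem: ActiveState=${state:-unknown} Restart=${policy:-unknown}"
        return 1
    fi
}

# Intended use on the server:
# systemctl --user show hermes-gateway.service -p ActiveState -p Restart | check_gateway_unit
```

A non-zero exit makes it easy to drop into a cron job or a post-deploy step.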

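The same restart settings apply to hermes-dashboard.service if you run the dashboard. Timers and the short-lived jobs they trigger (auto-update, daily-pulse, etc.) are different: a .timer unit has no Restart= setting, and a job that is meant to run once and exit should not be restarted after a clean exit. Apply Restart=always to the long-running services only.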

Layer 2 — Healthchecks.io for offsite alerting

The 30 April outage lasted as long as it did because the failure was written to local logs that no human was watching. Offsite monitoring with notification to a channel you actually check (WhatsApp, in our case) is the second-most-important pattern.

Healthchecks.io free tier gives you:

20 checks per account
Email + WhatsApp + Slack + custom webhook alerts
Cron-style scheduling ("alert me if no ping in 10 minutes")
Free forever, no credit card

The pattern: Hermes pings a Healthchecks URL every 2 minutes. If Healthchecks doesn't see a ping for 10 minutes, it alerts the on-call channel.

In ~/.hermes/skills/heartbeat/SKILL.md:

```yaml
---
description: Pings Healthchecks.io heartbeat URL. Runs every 2 minutes via cron.
disable-model-invocation: true
---

!curl -fsS -m 10 --retry 3 -o /dev/null https://hc-ping.com/<your-uuid>
```

Then schedule it via the Hermes cron config:

```yaml
crons:
  - skill: heartbeat
    schedule: "*/2 * * * *"
```

Configure Healthchecks to alert your WhatsApp via the integrations panel. Cost: £0.
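The heartbeat skill only proves that Hermes' own scheduler is alive. As an optional complement, and purely as a sketch, a script run from outside Hermes (a systemd user timer or plain cron) can report the gateway's state explicitly. It relies on Healthchecks' convention that appending `/fail` to a ping URL signals a failure straight away. `hc_ping_url` and `report_gateway` are hypothetical helper names, and the UUID is the same placeholder as in the skill.

```shell
# hc_ping_url BASE_URL GATEWAY_STATE
# Prints the Healthchecks URL to hit: the plain URL signals success,
# the same URL with /fail appended signals an explicit failure.
hc_ping_url() {
    if [ "$2" = "active" ]; then
        printf '%s\n' "$1"
    else
        printf '%s/fail\n' "$1"
    fi
}

# report_gateway BASE_URL
# Asks systemd for the gateway state and pings the matching URL.
report_gateway() {
    # is-active exits non-zero when the unit is not running; keep going anyway
    state=$(systemctl --user is-active hermes-gateway.service 2>/dev/null) || true
    curl -fsS -m 10 --retry 3 -o /dev/null "$(hc_ping_url "$1" "$state")"
}

# Example (placeholder UUID):
# report_gateway "https://hc-ping.com/<your-uuid>"
```

Hitting the `/fail` URL by hand is also a quick way to confirm that an alert actually reaches your phone.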

For deployments where WhatsApp alerts are insufficient (you need PagerDuty integration, multi-tier escalation, etc.), upgrade to Healthchecks paid (~£3/month) or move to Better Uptime / PagerDuty / OpsGenie.

Layer 3 — Log routing + retention

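Hermes generates four log streams. The three log files need logrotate; the systemd journal does its own rotation, so there you only set a retention limit (SystemMaxUse= or MaxRetentionSec= in journald.conf).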

The streams:

Log	Source	Useful for
`~/.hermes/logs/agent.log`	Application logic, skill execution	Debugging skill behaviour
`~/.hermes/logs/auto-update.log`	The auto-update script	Verifying nightly updates worked
`~/.hermes/whatsapp/bridge.log`	Baileys WhatsApp bridge	WhatsApp connection issues
`journalctl --user -u hermes-gateway`	systemd journal for the gateway	Crash diagnosis, restart counting

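Configure logrotate at /etc/logrotate.d/hermes. logrotate does not expand ~, so use absolute paths (shown here for the ubuntu user, as in the auto-update script). The su line is typically needed because the log directories belong to a non-root user:

```
/home/ubuntu/.hermes/logs/*.log /home/ubuntu/.hermes/whatsapp/bridge.log {
    su ubuntu ubuntu
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
```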

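The copytruncate directive matters: without it, logrotate renames the log and creates a new file, but a long-lived process keeps writing to the renamed file through its open handle. The trade-off is that a few lines written between the copy and the truncate can be lost.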
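Before trusting the rotation, it is worth checking that logrotate parses the file and, a few days later, that rotated copies are accumulating. A minimal sketch follows; the function names are illustrative, and the path is the config file named earlier in this section.

```shell
# Dry-run the config: -d makes logrotate print its decisions without changing files.
logrotate_dry_run() {
    sudo logrotate -d /etc/logrotate.d/hermes
}

# rotated_count DIR: number of rotated copies (agent.log.1, agent.log.2.gz, ...)
# currently kept in DIR. With "rotate 14" there should be at most 14 per log file.
rotated_count() {
    find "$1" -maxdepth 1 -type f -name '*.log.*' | wc -l | tr -d ' '
}

# On the server:
# logrotate_dry_run
# rotated_count /home/ubuntu/.hermes/logs
```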

For audit-log retention (7 years for FCA-regulated, longer for some healthcare scenarios), ship the audit logs to immutable storage (S3 with object-lock, your SIEM) — covered in our Hermes Agent security and GDPR guide.

Layer 4 — Auto-update without breaking things

Hermes ships updates roughly weekly. Auto-updating without a verification step has burned us. Auto-updating with a verification step is fine.

The script at ~/.hermes/bin/auto-update.sh (~100 lines of bash):

```bash
#!/bin/bash
set -euo pipefail

HERMES_HOME=/home/ubuntu/.hermes
HERMES="$HERMES_HOME/hermes-agent/venv/bin/hermes"
LOG="$HERMES_HOME/logs/auto-update.log"
STATUS="$HERMES_HOME/.auto-update-status.json"
BACKUP_DIR="$HERMES_HOME/backups"
TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ)

log() { echo "[$TIMESTAMP] $*" | tee -a "$LOG"; }

# Restore code and data to the pre-update state, restart, and record the outcome.
rollback() {
    cd "$HERMES_HOME/hermes-agent" && git reset --hard "$PRE_REV"
    "$HERMES" import --force "$SNAPSHOT"
    systemctl --user restart hermes-gateway.service
    echo "{\"status\":\"$1\",\"timestamp\":\"$TIMESTAMP\"}" > "$STATUS"
}

log "Pre-update snapshot..."
mkdir -p "$BACKUP_DIR"
cd "$HERMES_HOME/hermes-agent"
PRE_REV=$(git rev-parse HEAD)
SNAPSHOT="$BACKUP_DIR/pre-update-$(date -u +%Y-%m-%d-%H%M%S).zip"
"$HERMES" export "$SNAPSHOT"

# df -Pk reports 1 KB blocks on a single line per filesystem; 524288 KB = 512 MB.
AVAIL=$(df -Pk "$HERMES_HOME" | awk 'NR==2 {print $4}')
if [ "$AVAIL" -lt 524288 ]; then
    log "ERROR: less than 512 MB free, skipping update"
    exit 1
fi

log "Running hermes update..."
if ! "$HERMES" update --yes 2>&1 | tee -a "$LOG"; then
    log "Update failed, rolling back to $PRE_REV"
    rollback failed
    # Notify via the local gateway; a failed notification must not hide the failure.
    curl -fsS http://localhost:9119/send \
        -d '{"message":"Hermes auto-update FAILED, rolled back."}' || true
    exit 1
fi

# Re-apply the local patch, if present, and confirm it took.
if [ -f "$HERMES_HOME/bin/auto-update-patch3.py" ]; then
    python3 "$HERMES_HOME/bin/auto-update-patch3.py"
    # -F matches the literal string; as a regex, "**kwargs" would match plain "kwargs".
    grep -qF -- '**kwargs' "$HERMES_HOME/hermes-agent/path/to/patched/file.py" || {
        log "ERROR: patch3 reapplication failed"
        exit 1
    }
fi

log "Restarting gateway..."
systemctl --user restart hermes-gateway.service
sleep 30
if ! systemctl --user is-active --quiet hermes-gateway.service; then
    log "Gateway did not start, rolling back"
    rollback rollback
    exit 1
fi

# Keep the seven most recent snapshots.
ls -t "$BACKUP_DIR"/pre-update-*.zip | tail -n +8 | xargs -r rm

echo "{\"status\":\"success\",\"timestamp\":\"$TIMESTAMP\",\"pre_rev\":\"$PRE_REV\"}" > "$STATUS"
log "Update complete."
```
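The snapshot housekeeping near the end of the script is easy to get wrong by one, so here it is as a standalone function you can try in a scratch directory. This is a sketch of the same `ls | tail | xargs` idea, not the script's own code; `prune_snapshots` is a hypothetical name.

```shell
# prune_snapshots DIR KEEP
# Deletes all but the KEEP newest pre-update-*.zip files in DIR (by mtime).
# Assumes paths contain no spaces or newlines, which holds for the
# timestamped snapshot names used in the script.
prune_snapshots() {
    ls -t "$1"/pre-update-*.zip 2>/dev/null | tail -n +"$(( $2 + 1 ))" | xargs -r rm --
}

# Keep seven, as the script does:
# prune_snapshots /home/ubuntu/.hermes/backups 7
```

`tail -n +8` prints from the eighth line onward, so with the list sorted newest first, everything after the seventh snapshot is removed.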

Triggered by a systemd timer at ~/.config/systemd/user/hermes-auto-update.timer:

```ini
[Unit]
Description=Daily Hermes auto-update

[Timer]
OnCalendar=*-*-* 03:00:00 Europe/London
Persistent=true
RandomizedDelaySec=600

[Install]
WantedBy=timers.target
```

The Persistent=true is critical — if the server was offline at 03:00, the update runs as soon as it comes back online. Without it, an overnight reboot means a full day of skipped updates.
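A timer needs a service to start, and by default a timer named hermes-auto-update.timer triggers hermes-auto-update.service. The article does not show that unit, so the following is a minimal sketch under stated assumptions: a oneshot service that runs the script from its documented path. `write_update_service` is a hypothetical helper and the unit contents are illustrative.

```shell
# write_update_service DEST
# Writes a minimal oneshot service for the timer to trigger.
# ASSUMPTION: unit contents are illustrative; %h expands to the user's home.
write_update_service() {
    cat > "$1" <<'EOF'
[Unit]
Description=Hermes auto-update run

[Service]
Type=oneshot
ExecStart=%h/.hermes/bin/auto-update.sh
EOF
}

# On the server:
# write_update_service ~/.config/systemd/user/hermes-auto-update.service
# systemctl --user daemon-reload
# systemctl --user enable --now hermes-auto-update.timer
# systemctl --user list-timers hermes-auto-update.timer
```

`list-timers` shows the next scheduled run, which is the quickest confirmation that the calendar expression parsed as intended.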

The script has been live-tested on a 98-commit upgrade in early May 2026 (Hermes v0.12.0 → v0.13.0 with config v22 → v23 migration). It worked.

Layer 5 — The recovery playbook

Print this. Tape it next to the laptop you SSH from. When something goes wrong, you don't want to be reading documentation.

```
HERMES RECOVERY PLAYBOOK - v1.0 (May 2026)
==========================================

1. Confirm the agent is actually down

   ssh ubuntu@<server>
   systemctl --user status hermes-gateway.service

2. Try the simple restart

   systemctl --user restart hermes-gateway.service
   sleep 30
   systemctl --user is-active hermes-gateway.service

3. If restart fails, check the model provider

   curl -fsS https://status.anthropic.com/api/v2/status.json

4. If the provider is healthy and Hermes still won't start, check logs

   journalctl --user -u hermes-gateway.service --since "30 minutes ago"

5. Rollback to the most recent backup

   ls -t ~/.hermes/backups/pre-update-*.zip | head -1
   ~/.hermes/hermes-agent/venv/bin/hermes import --force <filename>
   systemctl --user restart hermes-gateway.service

6. If rollback also fails, file an issue

   https://github.com/NousResearch/hermes-agent/issues

7. While you wait for a fix

   systemctl --user disable --now hermes-auto-update.timer

KEY LOCATIONS

Hermes home:        ~/.hermes/
Backups:            ~/.hermes/backups/
Logs:               ~/.hermes/logs/, journalctl --user -u hermes-gateway
Config:             ~/.hermes/config.yaml
Auto-update script: ~/.hermes/bin/auto-update.sh
```

The playbook has been used in anger twice in the first 40 days (the 30 April outage + one minor model-provider outage in May). Both times the agent was back in under 10 minutes from the moment someone started working through it.
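The playbook's provider check prints raw JSON, which is awkward to read under pressure. A small filter can reduce it to one word. This is a sketch that assumes the status page follows the usual Statuspage v2 layout, with a single `indicator` field set to none, minor, major or critical; `provider_indicator` is a hypothetical helper name.

```shell
# provider_indicator: reads a Statuspage-style status.json on stdin and prints
# the value of "indicator" (none, minor, major or critical).
# ASSUMPTION: the response carries a single "indicator" field, as Statuspage v2 does.
provider_indicator() {
    tr -d '\n' | sed -n 's/.*"indicator"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# The provider check, condensed:
# curl -fsS https://status.anthropic.com/api/v2/status.json | provider_indicator
```

A result of `none` means the provider reports no incident, so the problem is more likely local and the journal is the next place to look.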

What to monitor — the metrics that matter

Five metrics, no more. Over-monitoring creates alert fatigue and distracts from the signals that matter.

Metric	How to check	Alert threshold
Gateway uptime	Healthchecks.io heartbeat (above)	No ping for 10 min
Disk space	`df ~/.hermes` daily	<10% free
Memory usage	`ps aux | grep '[h]ermes-gateway' | awk '{print $6}'` (RSS in KB) every 5 min	>4 GB on a 6 GB server
Model-provider error rate	Parse agent.log for HTTP 4xx/5xx, last hour	>10% errors
Auto-update status	`cat ~/.hermes/.auto-update-status.json` daily	status != "success" yesterday

For most UK SME deployments running a single Hermes instance, the five above are enough. Add more only when a real incident teaches you the gap.
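Three of the five checks reduce to a few lines of shell. The sketch below shows one way to turn the thresholds in the table into pass/fail functions; the function names are hypothetical, and the status check matches the JSON that the auto-update script writes.

```shell
# disk_free_pct PATH: percentage of free space on the filesystem holding PATH.
# Column 5 of `df -P` is the used percentage (e.g. "42%"); awk reads it as 42.
disk_free_pct() {
    df -P "$1" | awk 'NR==2 {print 100 - $5}'
}

# rss_over_limit RSS_KB LIMIT_GB: succeeds if the resident set size exceeds the limit.
rss_over_limit() {
    [ "$1" -gt $(( $2 * 1024 * 1024 )) ]
}

# update_succeeded STATUS_FILE: succeeds if the last auto-update recorded "success".
update_succeeded() {
    grep -q '"status":"success"' "$1"
}

# Example wiring, using the thresholds from the table:
# [ "$(disk_free_pct ~/.hermes)" -lt 10 ] && echo "ALERT: disk below 10% free"
# update_succeeded ~/.hermes/.auto-update-status.json || echo "ALERT: last update not successful"
```

`update_succeeded` only looks at the recorded status; a full check would also compare the timestamp field against yesterday's date so that a stale success does not pass.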

Frequently asked questions

Why `Restart=always` instead of `Restart=on-failure`?

on-failure doesn't restart on exit status 0: it treats a clean exit as "the process meant to stop." But Hermes can exit with status 0 after an error, when a top-level handler catches the exception, logs it, and calls sys.exit(0). With on-failure, that's a permanent stop. With always, it's a transient stop that the restart backoff handles.
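The difference fits in a small truth table. The function below is a simplified model for illustration, not systemd's own logic: it covers plain exit codes only and ignores signals, timeouts, the watchdog, and the RestartPreventExitStatus= and RestartForceExitStatus= overrides. `would_restart` is a hypothetical name.

```shell
# would_restart POLICY EXIT_CODE
# Succeeds if systemd would restart a service that exited with EXIT_CODE.
# Simplified: plain exit codes only (no signals, timeouts or watchdog).
would_restart() {
    case "$1" in
        always)     return 0 ;;
        on-failure) [ "$2" -ne 0 ] ;;
        on-success) [ "$2" -eq 0 ] ;;
        *)          return 1 ;;   # "no" and policies not modelled here
    esac
}

# The 30 April case: exit status 0 under each policy
# would_restart on-failure 0 || echo "stays down"
# would_restart always 0 && echo "restarts"
```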

How often does Hermes ship breaking updates?

Roughly monthly for major versions, weekly for patch versions. Patch versions are typically safe; major versions occasionally need config migrations. The auto-update script's verification step catches updates that leave the gateway unable to start and rolls them back; it does not detect subtler behaviour changes.

What about Kubernetes / Docker Swarm / Nomad?

For a single Hermes instance on a single server, systemd is the right scheduler — simpler, more reliable, no orchestrator overhead. For multi-instance deployments (multiple specialist harnesses, geographic redundancy), Docker Compose or Kubernetes makes sense. The systemd patterns translate directly — Restart=always becomes restart: always in compose, or restartPolicy: Always in K8s.

How do I know if my Healthchecks alert is reaching me?

Schedule a manual test: pause the heartbeat skill for 11 minutes (1 minute past the 10-minute threshold) and confirm you get the alert. Do this on day one + every six months. Untested alerts are Schrödinger's monitoring: you don't know if they work until you need them.

Can I run the auto-update script less often than daily?

Yes: change the OnCalendar= line in the timer to OnCalendar=weekly (Mondays at 00:00) or OnCalendar=Sun *-*-* 03:00:00 Europe/London. We recommend daily because Hermes patch versions occasionally fix critical bugs and the verification + rollback semantics make daily updates safe.

What happens if the auto-update script itself has a bug?

The intended worst case: the script runs, fails to update, rolls back, and the agent stays on the previous version. You get an alert, you investigate, you patch the script. The rollback path is the safety net; if the rollback itself is what breaks, the manual rollback in the recovery playbook restores the last snapshot by hand.

Is there a hosted Hermes monitoring product?

Not as of May 2026. The patterns above are universal Linux service monitoring; any standard monitoring tool (Prometheus + Grafana, Datadog, New Relic) works on top. Healthchecks.io is the cheapest path; Better Uptime is the next step up.

How do I monitor model-provider costs?

Parse the agent.log for token counts per call. Total daily/weekly. Compare against your subscription's usage allowance. Anthropic and OpenAI both expose monthly usage via their dashboards — pull those into a daily report skill (one of the use cases in our Hermes use cases guide).

Related reading

↑ How to Deploy Hermes Agent — UK Business Complete Guide — the foundational deployment pillar
↔ Hermes Agent on Oracle Cloud Free Tier — UK Guide — the underlying server platform
↔ Hermes Agent Security & GDPR for UK Business — the compliance posture that complements operational reliability
↔ Hermes Agent — Real Business Use Cases — the use cases that depend on this reliability layer

What should you do next?

The patterns above are the difference between a Hermes deployment that survives a year and one that has a 63-hour outage in month two. Most of the work is one-time setup: once configured, monitoring runs itself.

See how Ampliflow runs Hermes-powered automations for clients →

Or to scope your specific monitoring + recovery setup: Book a free Hermes deployment review →
