monitoring is the single point of execution for the monitoring scripts on
the VPS. If it goes down, the scheduled checks do not run and alerts go silent.
This runbook covers the operations on-call needs to recover from common
failures.
All command examples assume you are SSH'd into the VPS as the deploy user (the
same account the systemd unit runs as), with the repository checked out at
/srv/monitoring.
No Docker, no compose, no Caddy, no reverse proxy. One systemd unit runs
supercronic, which executes python -m automation run <profile> on each cron
tick per automation/jobs.yaml.
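To make the scheduling model concrete, here is a minimal sketch of how one profile could be turned into a crontab line. It is an illustration only: the function name render_crontab_line, the lock path under /tmp, and the hourly and daily cron expressions are assumptions, not the repository's actual rendering code. Only the flock -n wrapper, the python -m automation run <profile> command, and the 10-minute multisig cadence come from this runbook.

```python
import shlex


def render_crontab_line(profile: str, cron: str, lock_dir: str = "/tmp") -> str:
    """Return one crontab line that runs a profile under a non-blocking lock."""
    # Hypothetical lock location; the real path is whatever render-crontab emits.
    lock_path = f"{lock_dir}/{profile}.lock"
    command = f"python -m automation run {shlex.quote(profile)}"
    # flock -n exits immediately if a previous run still holds the lock,
    # so an overlapping tick is skipped rather than queued.
    return f"{cron} flock -n {shlex.quote(lock_path)} {command}"


# Assumed cadences: only the 10-minute multisig tick is stated in this runbook.
EXAMPLE_PROFILES = {
    "multisig": "*/10 * * * *",
    "hourly": "0 * * * *",
    "daily": "0 6 * * *",
}

if __name__ == "__main__":
    for name, cron in EXAMPLE_PROFILES.items():
        print(render_crontab_line(name, cron))
```

The authoritative version is always the output of uv run python -m automation render-crontab on the box; treat the sketch as a reading aid for that output.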
systemctl status monitoring
See the schedule supercronic is actually running (rendered from jobs.yaml):
cd /srv/monitoring
uv run python -m automation list # profiles + cron + task counts
uv run python -m automation render-crontab # the exact crontab supercronic runs
supercronic logs every job start and exit to journald (see below); that is the record of what ran and when.
journalctl -u monitoring -f --since '15 min ago'
supercronic prints a structured line per job (channel=stdout, job.position,
exit status). Filter to a single profile run by grepping for the flock-wrapped command:
journalctl -u monitoring --since '1 hour ago' | grep 'automation run hourly'
The Python tasks themselves log at LOG_LEVEL (default INFO). A profile with
failing tasks also posts one Telegram digest (see Failure handling below), so
you usually learn about a failure there before reading journald.
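When grep is too coarse, the key=value shape of those lines can be parsed directly. The sketch below is a rough illustration under stated assumptions: job.command is assumed to be the field that carries the command line, the helper names are hypothetical, and the sample lines are hand-written in the described shape rather than captured from a real host.

```python
import shlex


def parse_fields(line: str) -> dict:
    """Parse key=value pairs (values may be double-quoted) from one log line."""
    try:
        tokens = shlex.split(line)
    except ValueError:
        # Unbalanced quotes, e.g. free-form task output: treat as unparseable.
        return {}
    fields = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep and key:
            fields[key] = value
    return fields


def lines_for_profile(lines, profile: str):
    """Yield the parsed fields of lines whose job.command runs the profile."""
    needle = f"automation run {profile}"
    for line in lines:
        fields = parse_fields(line)
        if needle in fields.get("job.command", ""):
            yield fields


# Hand-written samples in the key=value shape described above, not real output.
SAMPLE = [
    'level=info msg="job succeeded" '
    'job.command="flock -n /tmp/hourly.lock python -m automation run hourly" job.position=1',
    'level=info msg="checking" channel=stdout '
    'job.command="flock -n /tmp/daily.lock python -m automation run daily" job.position=2',
]

if __name__ == "__main__":
    for fields in lines_for_profile(SAMPLE, "hourly"):
        print(fields.get("msg"), fields.get("job.position"))
```

In practice the input would be the lines of journalctl -u monitoring -o cat, which drops the journald prefix so that only the message text is parsed.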
Running a profile by hand is useful when a tick was skipped, or to validate a
hand-edited config. It replaces the workflow_dispatch trigger from the old
GitHub Actions setup.
cd /srv/monitoring
# Print what would run without executing:
uv run python -m automation run hourly --dry-run
# Actually run it (sends real alerts / Telegram on failure):
uv run python -m automation run hourly
Available profiles: hourly, daily, multisig (see automation/jobs.yaml).
Contributors merge to main through PRs; nobody normally SSHes in to ship code.
Two things move main onto the box, split on "does the change need the
scheduler restarted to take effect?"
Auto-sync (no restart, the common case). The multisig profile runs
git pull --ff-only before its tasks every 10 minutes (sync_before_run: true
in jobs.yaml). Since supercronic re-spawns every profile fresh against the
on-disk tree, that one pull keeps the whole checkout current for all profiles
within ~10 minutes — every other profile rides along for free. The pull is
best-effort: a failure is logged (journalctl -u monitoring | grep 'pre-run git sync') and the multisig checks run against the existing checkout anyway, so a
git hiccup never silences an alert. This covers scripts, config modules, data,
and jobs.yaml task bodies — anything read fresh by a subprocess.
So for the common case — a script tweak, a new task in a profile — merge the PR and the next ~10-min multisig tick pulls it in; no SSH needed.
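The best-effort behaviour described above can be sketched as follows. This is not the repository's implementation: the function names, the 60-second timeout, and the exact log wording are assumptions chosen for illustration. The point is only that a failed pull is logged and swallowed, and the tasks run regardless.

```python
import logging
import subprocess

log = logging.getLogger("automation.sync")


def best_effort_sync(repo_dir: str, run=subprocess.run) -> bool:
    """Try a fast-forward pull; report failure instead of raising."""
    try:
        run(
            ["git", "pull", "--ff-only"],
            cwd=repo_dir,
            check=True,           # a non-zero exit raises CalledProcessError
            capture_output=True,
            timeout=60,           # assumed bound so a hung remote cannot stall checks
        )
        return True
    except (subprocess.SubprocessError, OSError) as exc:
        # Covers a failed pull, a timeout, and a missing git binary.
        log.warning("pre-run git sync failed: %s", exc)
        return False


def run_profile(tasks, repo_dir: str, sync_before_run: bool, run=subprocess.run):
    """Run every task; a failed sync never prevents the tasks from running."""
    if sync_before_run:
        best_effort_sync(repo_dir, run=run)
    return [task() for task in tasks]
```

The run parameter exists only so the behaviour can be exercised without a real repository; passing a function that raises shows that the checks still execute.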
Manual restart (cadence + deps). Two kinds of change land on disk via the auto-sync but stay inert until a restart, because they're read once at scheduler boot, not per-tick:
- a cron: cadence change in jobs.yaml (the crontab is rendered at unit start by ExecStartPre), and
- a pyproject.toml / uv.lock change (the venv).
For those, after the PR merges:
cd /srv/monitoring
git pull --ff-only # or just let the next multisig tick land it
uv sync --frozen --extra ai # only if pyproject.toml / uv.lock changed (--extra ai: openai client for the AI explainer)
sudo systemctl restart monitoring # re-renders the crontab and re-points supercronic at the tree
The restart re-renders the crontab and points supercronic at the freshly pulled tree. Total downtime is a few seconds and only affects the scheduler, not an in-flight job.
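The "does this merge need a restart?" decision can be expressed as a small check. The sketch below assumes you have the list of changed paths (for example from git diff --name-only) and the render-crontab output from before and after the pull. The function name and the reason strings are hypothetical.

```python
# Files that are read once (when the venv is built), not per tick.
RESTART_FILES = frozenset({"pyproject.toml", "uv.lock"})


def restart_reasons(changed_paths, old_crontab: str, new_crontab: str) -> list:
    """Return why a restart is needed; an empty list means auto-sync suffices."""
    reasons = []
    # A cadence change only shows up as a difference in the rendered crontab.
    if old_crontab.strip() != new_crontab.strip():
        reasons.append("crontab changed")
    deps = sorted(RESTART_FILES.intersection(changed_paths))
    if deps:
        reasons.append("dependencies changed: " + ", ".join(deps))
    return reasons


if __name__ == "__main__":
    print(restart_reasons(["uv.lock"], "*/10 * * * * job", "*/10 * * * * job"))
```

Comparing rendered crontabs rather than jobs.yaml itself avoids flagging task-body edits, which take effect on the next tick without a restart.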
Secrets live in /etc/monitoring/.env (mode 0640, root:),
loaded by the unit via EnvironmentFile. They are not in git. Edit in place
and restart:
sudo $EDITOR /etc/monitoring/.env
sudo systemctl restart monitoring
The restart re-reads the env and re-renders the crontab. Anyone who could read the file has seen the old values, so a leaked credential should be rotated at the provider, not just edited here.
To validate the whole fleet without spamming production chats — e.g. comparing
this VPS's output against the old GitHub Actions runs — set
TELEGRAM_TEST_CHAT_ID in the env. While it is set, every alert from every
protocol is sent to that single chat via the default bot, prefixed with a
[protocol] label and with no topic threading, so production routing
(TELEGRAM_TOPIC_ID_* / per-protocol chats) is bypassed entirely:
sudo $EDITOR /etc/monitoring/.env
# TELEGRAM_TEST_CHAT_ID=-1001234567890 # the dummy group
# (the default bot, TELEGRAM_BOT_TOKEN_DEFAULT, must be a member of it)
sudo systemctl restart monitoring
Keep LOG_LEVEL=INFO (the default): LOG_LEVEL=DEBUG skips all Telegram sends,
so nothing would arrive. Comment the line out and restart to restore normal
per-protocol routing.
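The routing rules above amount to a short precedence order, sketched here as a pure function. It is an illustration, not the project's code: route_alert and the TELEGRAM_CHAT_ID_<PROTOCOL> variable are hypothetical names, and "demo" is a made-up protocol. TELEGRAM_TEST_CHAT_ID, TELEGRAM_TOPIC_ID_* and the LOG_LEVEL behaviour are taken from this runbook.

```python
from typing import Optional


def route_alert(protocol: str, text: str, env: dict) -> Optional[dict]:
    """Return the send parameters for one alert, or None when sends are skipped."""
    if env.get("LOG_LEVEL", "INFO").upper() == "DEBUG":
        return None  # DEBUG skips all Telegram sends
    test_chat = env.get("TELEGRAM_TEST_CHAT_ID")
    if test_chat:
        # Test mode: one chat, protocol label in the text, no topic threading.
        return {
            "chat_id": test_chat,
            "text": f"[{protocol}] {text}",
            "message_thread_id": None,
        }
    # Production routing. TELEGRAM_CHAT_ID_<PROTOCOL> is a hypothetical name.
    key = protocol.upper()
    return {
        "chat_id": env.get(f"TELEGRAM_CHAT_ID_{key}"),
        "text": text,
        "message_thread_id": env.get(f"TELEGRAM_TOPIC_ID_{key}"),
    }


if __name__ == "__main__":
    env = {"TELEGRAM_TEST_CHAT_ID": "-1001234567890", "TELEGRAM_TOPIC_ID_DEMO": "7"}
    print(route_alert("demo", "threshold crossed", env))
```

Checking LOG_LEVEL first mirrors the warning above: with DEBUG set, the test chat stays empty even though TELEGRAM_TEST_CHAT_ID is configured.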
Single-node failover (cattle, not pets):
- Provision a fresh VPS:
    sudo bash deploy/install.sh
  This installs uv/Python/supercronic, clones the repo, creates the venv, and installs the systemd unit. It can also be curled; see the header of install.sh.
- Drop the production env at /etc/monitoring/.env (mode 0640, root:), copied from the old host or recreated from .env.example.
- Start it:
    sudo systemctl enable --now monitoring
    systemctl status monitoring
- Confirm the schedule and the first ticks:
    cd /srv/monitoring && uv run python -m automation render-crontab
    journalctl -u monitoring -f
- Stop the old host so alerts aren't sent twice:
    sudo systemctl disable --now monitoring
systemd restarts the unit on crash (Restart=on-failure, capped at 5
restarts/60s). There is no HTTP healthcheck (and no daemon to wedge): supercronic
is a foreground process whose death systemd observes directly. If supercronic is
running, the crontab is being ticked. If a single job hangs it blocks only its
own profile (each is wrapped in flock -n, so the next tick of that profile is
skipped, not queued) — the others keep ticking.
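The "skipped, not queued" behaviour of flock -n can be reproduced with the same system call from Python on Linux. This is a minimal sketch of the locking semantics only; try_lock, demo, and the lock file name are hypothetical and are not how the unit itself takes the lock.

```python
import fcntl
import os
import tempfile


def try_lock(path: str):
    """Take an exclusive non-blocking lock; return the fd, or None if held."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None  # another run holds the lock: skip this tick
    return fd


def demo() -> list:
    """Show that a second attempt is refused until the first holder releases."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "hourly.lock")
        first = try_lock(path)
        second = try_lock(path)   # refused while the first fd is still open
        os.close(first)           # closing the fd releases the lock
        third = try_lock(path)
        results = [first is not None, second is not None, third is not None]
        os.close(third)
        return results


if __name__ == "__main__":
    print(demo())
```

Because the lock belongs to the open file rather than to a queue, a refused attempt simply returns; nothing waits for the hung run, which is why a stuck profile delays only itself.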
| Symptom | First thing to check | Likely cause |
|---|---|---|
| Active: failed on start | journalctl -u monitoring -n 50 | Malformed automation/jobs.yaml (render-crontab aborts the start), or /etc/monitoring/.env missing (the unit refuses to start without it). |
| Telegram suddenly silent | Is TELEGRAM_BOT_TOKEN_DEFAULT valid? LOG_LEVEL=DEBUG skips sends. | Bot revoked, chat removed bot, or LOG_LEVEL left at DEBUG. |
| One profile never runs | uv run python -m automation render-crontab (is its line present?) | Profile/task enabled: false in jobs.yaml, or its flock lock is stuck held by a hung run (restart clears it). |
| ModuleNotFoundError after a deploy | journalctl -u monitoring -n 50 | Forgot uv sync --frozen after a pyproject.toml/uv.lock change. |
| Cache/dedupe acting up | ls -l /srv/cache | Wrong perms (must be writable by the runner user) or a corrupt cache file (safe to delete; it re-seeds). |
- Source tree: /srv/monitoring (owned by the deploy user).
- Python venv: /srv/monitoring/.venv (created by uv sync).
- Cache / dedupe state: /srv/cache (owned by the deploy user; the unit grants it via ReadWritePaths and sets CACHE_DIR=/srv/cache, which utils.cache resolves every cache file against). A profile only overrides a cache basename in automation/jobs.yaml when it needs an isolated file (e.g. daily).
- Env file: /etc/monitoring/.env (mode 0640, root:; operator-supplied, not in git).
- systemd unit: /etc/systemd/system/monitoring.service.
- Rendered crontab: /tmp/crontab (per-service PrivateTmp; regenerated on every start).
- Code sync: no separate unit. The multisig profile's pre-run git pull --ff-only (sync_before_run in jobs.yaml) runs every 10 min. The unit grants the repo write access via ReadWritePaths=/srv/cache /srv/monitoring so the pull can update .git/ under ProtectSystem=strict.