feat: temperature / event alerts (webhook · command · log) #2

Merged
harryisfish merged 2 commits into main from feat/alerts on Jun 9, 2026

@harryisfish

Contributor

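What this does

Alerts hang off the daemon's existing once-per-second temperature monitoring loop (the same loop the safety guard relies on), so collecting the data costs nothing extra. The first version has three trigger sources, all using data already in hand:

  • temp: a sensor stays above its threshold
  • guard: the fan safety guard trips (the fan is forced back to auto)
  • write-error: an SMC write fails its read-back verification

Three actions: webhook (HTTP POST) / exec (run a command) / log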

[[alert]]
name = "cpu-hot"
on = "temp"
sensor = "Tp09"     # or "any" to use the hottest sensor
above = 85
for = 30            # must hold for 30 s before firing (debounce)
cooldown = 300      # stay silent for 300 s after firing
resolve = true      # send another notification on recovery
action = "webhook"
url = "http://gotify.lan/message?token=..."

$ smctl alert status        # per-rule state (armed/pending/cooling/firing) + recent events
$ smctl alert test cpu-hot  # fire the action immediately to verify the webhook/script

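Architecture notes

| Concern | Handling |
| --- | --- |
| Debounce | AlertEngine (PolicyEngine; pure-logic, edge-triggered state machine): `for` duration + cooldown + edge triggering; `now` is injected so it can be tested |
| Action isolation | AlertActionRunner runs on its own queue and never blocks the 1 Hz loop; exec and webhook both have timeouts; when the concurrency gate is full, actions are dropped rather than queued; failures are only logged, never propagated |
| Config not lost | writeConfig serializes [[alert]]; otherwise a single battery/fan write would wipe the user's alerts (regression test added) |
| Fanless Macs | tick is restructured as "read temperatures once → fan subsystem → alerts", so alerts also run on fanless models |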
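
The debounce behaviour can be sketched as a small per-rule state machine with the clock passed in as an argument. This is a Python illustration under stated assumptions, not the AlertEngine code: the phase names come from `smctl alert status`, but the exact transition rules (in particular when a rule is `cooling`), the `RuleState` type, and the `step` function are guesses made for this example.

```python
from dataclasses import dataclass

@dataclass
class RuleState:                 # hypothetical per-rule state
    phase: str = "armed"         # armed | pending | firing | cooling
    since: float = 0.0           # when the condition first became true
    fired_at: float = 0.0        # when the rule last fired

def step(state, over, now, hold_s, cooldown_s, resolve=False):
    """Advance one rule by one sample and return the events to emit."""
    events = []
    if state.phase == "cooling" and now - state.fired_at >= cooldown_s:
        state.phase = "armed"                     # cooldown elapsed
    if state.phase == "armed" and over:
        state.phase, state.since = "pending", now
    if state.phase == "pending":
        if not over:
            state.phase = "armed"                 # debounce reset
        elif now - state.since >= hold_s:
            state.phase, state.fired_at = "firing", now
            events.append("fired")                # fires once, on the edge
    elif state.phase == "firing" and not over:
        if resolve:
            events.append("resolved")
        in_cooldown = now - state.fired_at < cooldown_s
        state.phase = "cooling" if in_cooldown else "armed"
    return events

# `now` is injected, so a test can replay a timeline without sleeping.
timeline = [(0, True), (1, True), (2, True), (3, True), (4, True),
            (5, False), (6, True), (13, True), (16, True)]
state = RuleState()
log = [step(state, over, t, hold_s=3, cooldown_s=10, resolve=True)
       for t, over in timeline]
print(log)
```

In this replay the rule fires at t=3 after holding for 3 s, resolves at t=5, ignores the spike at t=6 because the cooldown is still running, and fires again at t=16 after a fresh 3 s hold.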

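Security / privacy

  • exec runs as root (the daemon is root). The trust boundary is whoever can edit the root-owned /etc/smctl/config.toml, exactly the same as for allow_below_minimum. Commands are passed only as argv arrays (never through a shell), so there is no shell-injection surface.
  • webhook is a second kind of outbound request (opt-in). The privacy promise in README / zh-CN has been updated accordingly (it previously said "exactly one outbound request").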
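
Why an argv array closes off shell injection can be shown with a short Python sketch. It is not the daemon's exec path: the `{sensor}` placeholder syntax, the `expand` and `run_action` helpers, and the 5 s default timeout are assumptions for illustration.

```python
import subprocess
import sys

def expand(argv_template, values):
    """Substitute {placeholders} per argv element; never join into one string."""
    return [arg.format(**values) for arg in argv_template]

def run_action(argv_template, values, timeout_s=5.0):
    argv = expand(argv_template, values)
    try:
        # No shell is involved (shell=False is the default): argv is handed
        # straight to the program, so metacharacters in a value stay inert.
        done = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout_s)
        return done.returncode, done.stdout
    except subprocess.TimeoutExpired:
        return None, ""          # bounded: a hung command cannot stall the caller

# A value that would be dangerous if it were pasted into a shell command line.
hostile = "Tp09; rm -rf /tmp/x $(id)"
child = "import sys; print(sys.argv[1])"
code, out = run_action([sys.executable, "-c", child, "{sensor}"],
                       {"sensor": hostile})
print(code, out.strip() == hostile)
```

The child process receives the hostile string as a single argument and echoes it back unchanged, which is the property the argv-only rule relies on.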

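Tests

  • Engine, 12 cases: above threshold / debounce / debounce reset / cooldown / resolve / any sensor / guard / write-error / state lifecycle / state cleanup after a rule is removed
  • daemon: [[alert]] TOML decoding + a regression test that writeConfig preserves alerts
  • Real-subprocess exec integration test (spawns /usr/bin/touch to verify argv placeholder substitution)
  • Full suite of 65 tests is green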

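Not done (honest disclosure)

To avoid disturbing the production daemon the user already has installed, no live E2E of "install the new daemon + real webhook" was run. The XPC plumbing mechanically mirrors existing, working methods; together with the unit tests for the engine, config, and exec, that serves as the safety net. After merging, a one-time webhook acceptance test on a real machine in a clean environment is recommended.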

Follow-ups

Step 3 (enhancements such as alert list) and step 4 (throttling-to-alert, power policy writes) are on the roadmap. Throttling data comes from #1 power status.

Alert rules hang off the daemon's existing 1 Hz temperature loop, so the data
cost is zero. Three triggers for the first cut, all from data already in hand:
temp thresholds, the fan safety guard tripping, and SMC write failures.

- AlertEngine (PolicyEngine): pure, edge-triggered state machine per rule —
  for-duration debounce, cooldown, optional resolve. now injected for tests.
- AlertActionRunner (daemon): runs actions OFF the state queue, every exec and
  webhook bounded by a timeout, a concurrency cap drops rather than piles up,
  failures are logged not propagated. exec uses argv arrays only (no shell) and
  also exposes SMCTL_ALERT_* env vars.
- config.toml gains [[alert]] tables; writeConfig round-trips them so a battery/
  fan write never silently drops the user's alerts (regression-tested).
- XPC: getAlertStatus + testAlert; CLI: `smctl alert status` / `alert test`.
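
The "drop rather than pile up" behaviour of the action runner can be sketched with a non-blocking semaphore in front of a worker pool. This Python version only illustrates the idea and is not the AlertActionRunner: the class name, the cap of 2, and the in-memory `failures` list standing in for a log are assumptions, and per-action timeouts are left out.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class ActionRunner:                         # hypothetical, for illustration
    def __init__(self, max_in_flight=2):
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._pool = ThreadPoolExecutor(max_workers=max_in_flight)
        self.dropped = 0
        self.failures = []                  # stands in for the daemon log

    def submit(self, action):
        """Called from the monitoring loop; returns immediately in every case."""
        if not self._slots.acquire(blocking=False):
            self.dropped += 1               # cap reached: drop, do not queue
            return False
        self._pool.submit(self._run, action)
        return True

    def _run(self, action):
        try:
            action()
        except Exception as exc:            # logged, never raised into the loop
            self.failures.append(repr(exc))
        finally:
            self._slots.release()

    def close(self):
        self._pool.shutdown(wait=True)

gate = threading.Event()
runner = ActionRunner(max_in_flight=2)
accepted = [runner.submit(lambda: gate.wait(timeout=5)) for _ in range(3)]
gate.set()                                  # let the two running actions finish
runner.close()
print(accepted, runner.dropped)
```

With two slots occupied by slow actions, the third submission is rejected at once, so the caller's loop never waits on a webhook or a script.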

Security: exec runs as root; the trust boundary is whoever can edit the
root-owned config — same as allow_below_minimum. Privacy: webhooks are a second
(opt-in) outbound case; README/zh-CN updated to say so.
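
For reference, an opt-in webhook with a bounded timeout might look like the Python sketch below. The JSON payload fields (`rule`, `event`, `value`) are invented for this example; the PR does not specify the body the daemon sends, and a given receiver may expect a different shape.

```python
import json
import urllib.request

def build_request(url, rule, event, value):
    # Payload shape is hypothetical; adapt it to whatever the receiver expects.
    body = json.dumps({"rule": rule, "event": event, "value": value}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )

def post(url, rule, event, value, timeout_s=5.0):
    """Return True on a 2xx reply; report and continue on any failure."""
    if not url:                       # opt-in: no url configured, no request made
        return False
    try:
        req = build_request(url, rule, event, value)
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError) as exc:   # network errors, HTTP errors, bad URLs
        print(f"webhook failed: {exc}")
        return False

req = build_request("http://gotify.lan/message?token=...", "cpu-hot", "fired", 91.5)
print(req.get_method(), json.loads(req.data)["rule"])
```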

Tests: 12 engine cases (debounce/cooldown/resolve/triggers/state purge), TOML
decode + writeConfig preservation, and a real-subprocess exec integration test.
Live install/webhook E2E intentionally not run to avoid disrupting the user's
installed daemon; unit coverage is the safety net.

Keep daemon status errors separate from alert write-error state so general
runtime issues do not fire SMC write failure alerts. Track SMC write failures
from battery and fan write paths, clear the signal after a successful write,
and cover both false-positive and recovery behavior in daemon tests.
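
The separation described in this commit can be pictured as two independent signals. The Python sketch below is a simplified model rather than the daemon's code; `DaemonSignals` and its method names are hypothetical.

```python
class DaemonSignals:                       # hypothetical, for illustration
    def __init__(self):
        self.status_error = None           # general runtime problems, status only
        self.write_failed = False          # the only input to write-error alerts

    def record_status_error(self, message):
        self.status_error = message        # deliberately leaves write_failed alone

    def record_write(self, ok):
        """Called from both the battery and the fan write paths."""
        self.write_failed = not ok         # a successful write clears the signal

signals = DaemonSignals()
signals.record_status_error("sensor read timed out")
false_positive = signals.write_failed      # stays False: no alert for this
signals.record_write(False)
after_failure = signals.write_failed
signals.record_write(True)
after_recovery = signals.write_failed
print(false_positive, after_failure, after_recovery)
```

The three printed values correspond to the false-positive and recovery cases the commit says the daemon tests cover.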
@harryisfish merged commit 4fd2755 into main on Jun 9, 2026
1 check passed
@leaperone-bot deleted the feat/alerts branch on June 9, 2026 13:28