Skip to content

feat(agent): detect availability via session/new probe and assistant-first identity#500

Merged
kaizhou-lab merged 148 commits into
mainfrom
feat/agent-connection-testing-phase2
Jun 25, 2026
Merged

feat(agent): detect availability via session/new probe and assistant-first identity#500
kaizhou-lab merged 148 commits into
mainfrom
feat/agent-connection-testing-phase2

Conversation

@kaizhou-lab

@kaizhou-lab kaizhou-lab commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR started as agent connection testing phase 2, but it now also completes the backend side of the assistant-first identity migration. The combined scope is: accurate agent availability detection, assistant/agent identity unification, cron assistant-first persistence, and a clearer /api/assistants contract.

Agent availability detection

  • Probe real session/new startup instead of inferring availability from static metadata.
  • Report online/offline status with explicit auth-required handling and clear stale needs_auth state after successful startup.
  • Gate aionrs availability on resolved provider configuration instead of backend labels alone.
  • Reap probe process trees / process groups to avoid orphaned wrapper children.
  • Expose management and logo metadata needed by AionUi for repair/status surfaces.

Assistant-first identity and schema cleanup

  • Make assistant identity the public write boundary for conversation, team, channel, cron, guide, and ACP-session flows.
  • Use agent_metadata.id as the canonical concrete agent binding behind assistants.
  • Normalize assistant definitions, snapshots, and runtime resolution around assistant id plus agent_metadata.id instead of ambiguous backend/preset fields.
  • Rename response/runtime backend metadata for ACP agents from generic backend to explicit acp_backend.
  • Remove duplicated nested agent.id from /api/assistants; the top-level agent_id is the only concrete agent binding in the response.
  • Keep nested /api/assistants.agent as runtime metadata only: type, source, and optional acp_backend.

Cron jobs

  • Move cron persistence to assistant-first agent_config.assistant_id.
  • Derive runtime backend, ACP backend, provider/model behavior, name, and avatar from the assistant at execution time.
  • Remove legacy cron_jobs.agent_type / agent_config.backend dependence from current write/read paths.
  • Migrate existing branch data in the existing phase-2 migration path instead of adding a new migration.
  • Preserve model/provider data where it is actual execution configuration, while deriving agent backend identity from assistant_id.

Team, conversation, channel, and ACP session flows

  • Reject legacy backend/custom-agent/preset fields in public assistant-first write DTOs while canonicalizing legacy read paths where needed.
  • Carry assistant identity through team creation, add-agent, snapshots, guide handoff, and MCP tool contracts.
  • Persist assistant snapshots for conversations and use them to resolve runtime backend safely.
  • Use assistant-owned defaults and runtime seeds when creating ACP/aionrs sessions.
  • Keep channel defaults and settings bound to assistants rather than backend labels.

Tests

  • Added/updated API type, assistant service, cron, team, conversation, channel, migration, and e2e coverage for the new identity contract.
  • Added assertions that /api/assistants exposes agent_id plus agent.acp_backend, and no longer exposes nested agent.id or agent.backend.

Testing

  • just push passed after merging latest origin/main.
  • AionCore: migration immutability check, cargo check, clippy/fmt, and workspace nextest passed.
  • Nextest result: 6506 passed, 18 skipped.

Closes #499

zk added 30 commits June 15, 2026 17:57
The availability scheduler runs `try_connect_custom_agent` every 5
minutes for every agent, spawning a CLI subprocess and tearing it
down once the ACP handshake completes (or fails). For wrapper CLIs
that fork a long-lived grandchild — `npm exec openclaw --acp` is the
production case — cleanup was leaking the grandchild because:

* `kill_on_drop(true)` on the tokio Command only signals the direct
  child (the npm exec wrapper), not its grandchild.
* The probe relied on `drop(protocol)` for the success path and on
  no explicit cleanup for the handshake-fail path, so `proc.kill`
  was never called.
* `CliAgentProcess::kill` itself short-circuited and returned Ok the
  moment the leader exited within the grace period — so even when
  callers did invoke it, no group-wide SIGKILL was sent.

Result: dozens of zombie `openclaw-acp` processes accumulated per
day under the 5-minute scheduler.

Fix:

1. `CliAgentProcess::kill` now always issues a group-wide SIGKILL
   after the grace period, even when the leader has already exited.
   `force_kill` already maps ESRCH to success, so the sweep is
   idempotent for already-reaped trees.
2. `try_connect_custom_agent` calls `proc.kill` on every outcome
   (success, ACP failure, handshake timeout) by hoisting the spawn
   out of the inner future and running cleanup unconditionally
   after the timeout race resolves.
3. New regression test `probe_kills_grandchild_left_behind_by_wrapper`
   exercises the exact wrapper-grandchild shape from production and
   asserts the grandchild is reaped before the probe returns.
zk added 25 commits June 22, 2026 16:32
…ficeAI/AionCore into feat/agent-connection-testing-phase2

* 'feat/agent-connection-testing-phase2' of github.com:iOfficeAI/AionCore:
…ackend

An assistant's agent_status was matched to its agent row by `backend`
only. aionrs (the built-in Rust agent) has a NULL `backend` and is keyed
by `agent_type` ("aionrs"), so every aionrs-backed assistant failed to
resolve a row and was mislabelled Missing/unavailable.

Match the agent row on `backend == effective_backend` OR
`agent_type.serde_name() == effective_backend`, so aionrs assistants
resolve to the real aionrs row and reflect its actual status.

Add a regression test covering an aionrs assistant (row with NULL backend,
agent_type Aionrs, Online) resolving to Online instead of Missing.
…-testing-phase2

# Conflicts:
#	crates/aionui-conversation/src/service.rs
…-testing-phase2

# Conflicts:
#	crates/aionui-conversation/src/service.rs
#	crates/aionui-conversation/src/service_test.rs
#	crates/aionui-conversation/src/turn_orchestrator.rs
#	crates/aionui-cron/tests/service_integration.rs
#	crates/aionui-team/src/test_utils.rs
#	crates/aionui-team/tests/session_service_integration.rs
@kaizhou-lab kaizhou-lab enabled auto-merge (squash) June 25, 2026 09:14
@kaizhou-lab kaizhou-lab merged commit 6c9a721 into main Jun 25, 2026
6 checks passed
@kaizhou-lab kaizhou-lab deleted the feat/agent-connection-testing-phase2 branch June 25, 2026 09:27
kaizhou-lab pushed a commit that referenced this pull request Jun 25, 2026
🤖 I have created a release *beep* *boop*
---


##
[0.1.37](v0.1.36...v0.1.37)
(2026-06-25)


### Features

* **agent:** detect availability via session/new probe and
assistant-first identity
([#500](#500))
([6c9a721](6c9a721))
* **conversation:** add cursor pagination for messages
([#515](#515))
([ba76273](ba76273))


### Bug Fixes

* **agent:** classify ACP and provider errors
([#518](#518))
([ef573d0](ef573d0))
* **aionrs:** adapt runtime guard config
([#510](#510))
([464f453](464f453))
* **conversation:** recover dead ACP turns after agent process loss
([#514](#514))
([e0ce4f4](e0ce4f4))
* **db:** repair legacy handoff schema drift
([#516](#516))
([292e5f2](292e5f2))
* validate skill frontmatter as yaml
([#512](#512))
([6b46055](6b46055))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Detect agent availability: probe session/new, recognize auth_required, gate aionrs by provider

1 participant