Add pluggable execution backends, HPC MCP servers, and Academy multi-agent campaigns #120

Open
tdpham2 wants to merge 123 commits into dev from dev-globus

tdpham2 (Collaborator) commented May 4, 2026
What this does

This PR makes ChemGraph's compute layer portable across HPC systems and adds a multi-agent campaign runtime on top of it. It has three parts:

  1. Pluggable execution backends — decouple MCP servers from any single workflow manager.
  2. Backend-agnostic HPC MCP servers — MACE / gRASPA / XANES that submit work through the backend abstraction, with async job tracking and Globus file staging.
  3. Academy campaigns + dashboard — persistent multi-agent screening across nodes, with live event observability.

The original *_parsl.py servers and the ase_core tool layer are untouched.

1. Execution backends (src/chemgraph/execution/)

A clean ExecutionBackend interface with four implementations:

  • Local: ProcessPoolExecutor, zero external deps, good for dev/CI
  • Parsl — system-aware configs (local, Polaris, Aurora, Crux)
  • EnsembleLauncher — cluster-mode API for dynamic task submission
  • Globus Compute — submit to a persistent remote endpoint with no active allocation
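To make the shape of the abstraction concrete, here is a minimal sketch of what an ExecutionBackend / TaskSpec pair could look like. The field names, the method set, and the ThreadBackend class are assumptions made for illustration, not the PR's actual definitions; a thread pool stands in for the process pool so the sketch runs anywhere without pickling concerns.

```python
from abc import ABC, abstractmethod
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class TaskSpec:
    """Hypothetical unified task description (field names are assumptions)."""

    task_id: str
    callable: Callable[..., Any]
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)
    num_nodes: int = 1  # example resource hint


class ExecutionBackend(ABC):
    # Capability flags that callers can branch on without knowing the backend type.
    is_async_remote: bool = False
    shares_filesystem: bool = True

    @abstractmethod
    def submit(self, task: TaskSpec) -> Future:
        """Schedule the task and return a future for its result."""

    @abstractmethod
    def shutdown(self) -> None:
        """Release any resources held by the backend."""


class ThreadBackend(ExecutionBackend):
    """Illustrative local backend; the PR describes a ProcessPoolExecutor instead."""

    def __init__(self, max_workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def submit(self, task: TaskSpec) -> Future:
        return self._pool.submit(task.callable, *task.args, **task.kwargs)

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)
```

A caller would build a TaskSpec, call submit(), and wait on the returned future; swapping the backend leaves the caller's code unchanged.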

Key modules:

  • base.py: ExecutionBackend (ABC) + TaskSpec (unified Python-callable / shell task with resource hints). Backends expose is_async_remote and shares_filesystem so callers adapt without knowing the implementation.
  • config.py: get_backend() / get_transfer_manager() factories. Selection hierarchy: explicit arg > env var (CHEMGRAPH_EXECUTION_BACKEND, COMPUTE_SYSTEM) > config.toml [execution] > "local".
  • local_backend.py, parsl_backend.py, ensemble_launcher_backend.py, globus_compute_backend.py
  • utils.py — structure-file resolution, async future gathering (bridges concurrent.futures.Future to asyncio), JSONL result writing
  • job_tracker.py — thread-safe tracking of async-remote batches, persisted to JSON so results survive across sessions
  • globus_transfer.py: GlobusTransferManager stages files between collections instead of embedding large structures in payloads
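The selection hierarchy listed for config.py can be sketched as a small resolver. This is an illustration only: resolve_backend_name is a hypothetical helper, the "backend" key under [execution] is an assumed config layout, and COMPUTE_SYSTEM is left out because the description does not say how a system name maps to a backend.

```python
import os
from typing import Mapping, Optional


def resolve_backend_name(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    config: Optional[Mapping[str, dict]] = None,
) -> str:
    """Pick a backend name: explicit arg > env var > config.toml > "local"."""
    env = os.environ if env is None else env
    config = config or {}

    # 1. An explicit argument always wins.
    if explicit:
        return explicit

    # 2. Environment variable (empty or whitespace-only counts as unset).
    from_env = env.get("CHEMGRAPH_EXECUTION_BACKEND", "").strip()
    if from_env:
        return from_env

    # 3. Parsed config.toml; the "backend" key name is an assumption.
    from_config = config.get("execution", {}).get("backend", "")
    if from_config:
        return from_config

    # 4. Zero-dependency default.
    return "local"
```

Passing env and config as arguments keeps the resolver easy to test without touching the real process environment.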

hpc_configs/loader.py replaces the duplicated load_parsl_config() helpers, dispatching to per-system factories (local_parsl.py, aurora_parsl.py, polaris_parsl.py, crux_parsl.py) and resolving worker_init from env > submitting env > system fallback.

2. HPC MCP servers (src/chemgraph/mcp/)

  • cg_fastmcp.py: CGFastMCP (FastMCP subclass) submits tool calls to the configured backend as TaskSpec objects; centralizes transport concerns (inline embedding vs. remote paths) and adds a fanout decorator for ensemble tools.
  • mace_mcp_hpc.py, graspa_mcp_hpc.py, xanes_mcp_hpc.py — build TaskSpec and submit via get_backend() instead of calling Parsl directly.
  • job_tools.py — registers check_job_status / get_job_results / list_jobs / cancel_job / check_endpoint_status on any server.
  • transfer_tools.py — registers transfer_files / check_transfer_status / list_remote_files (Globus staging).
  • hpc_misc_mcp.py — generic JSON artifact inspection.

Schemas now support pre-staged HPC inputs: remote_structure_directory on the MACE/gRASPA ensemble schemas lets workers read directly from a remote path instead of embedding structures inline.

3. Academy multi-agent campaigns (src/chemgraph/academy/)

A runtime for persistent multi-agent screening campaigns over MPI with Redis-backed messaging:

  • core/ — logical agent wrapping the ChemGraph turn primitive, JSONC campaign specs with per-agent allowed_tools whitelists, peer-messaging action tools.
  • runtime/ — MPI daemon, multi-node compute launcher, and a local dashboard launcher that mirrors remote run state and relays the Argo endpoint.
  • observability/ — append-only JSONL event log + run-status artifacts (tolerant parsing for live polling).
  • dashboard/ — stateless HTTP server + web UI reading from a run directory.
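As an illustration of the append-only JSONL log with tolerant parsing mentioned under observability/, the sketch below writes one JSON object per line and skips any line that does not parse, which is what a reader polling a live file would typically see as a half-written last line. The function names and event fields are hypothetical, not the PR's API.

```python
import json
from pathlib import Path


def append_event(log_path, event: dict) -> None:
    """Append one event as a single JSON line."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")


def read_events(log_path) -> list[dict]:
    """Read all complete events, ignoring lines that are not valid JSON yet."""
    path = Path(log_path)
    if not path.exists():
        return []
    events = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                events.append(json.loads(line))
            except json.JSONDecodeError:
                # Probably a partially written line from a concurrent writer.
                continue
    return events
```

Because the reader never fails on a bad line, a dashboard can poll the same file repeatedly while agents are still appending to it.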

For single-agent CLI runs, --trace-dir (via cli/trace.py) emits the same event schema so runs are viewable in the dashboard. New CLI subcommands wire up dashboard, academy run-compute, academy mpi-daemon, academy dashboard, and academy campaigns.

Agent event plumbing was refactored: event translation lives in agent/events.py, the turn loop (with on_event / terminal_tool_names hooks) in agent/turn.py, keeping llm_agent.py clean for CLI users.

Supporting changes

  • Models — new models/settings.py (LLMSettings) centralizes endpoint config; openai.py normalizes Argo model names by endpoint type (local shim vs. hosted wire format, overridable via CHEMGRAPH_ARGO_MODEL_FORMAT).
  • Tools — thread-safe MACE calculator loading (avoids torch symbolic_trace crashes under parallel workers); output parent directories created before writing in ase_core.py and cheminformatics_core.py.
  • Bug fixes: aurora_parsl.py unreachable code + misleading error removed; EnsembleLauncherBackend.shutdown() no longer leaves _initialized=True on partial failure.

New dependencies (all optional extras)

ensemble_launcher = ["ensemble-launcher"]
globus_compute   = ["globus-compute-sdk"]
academy          = ["academy-py", "httpx", "redis"]

The local backend works with zero additional installs.

Testing

New suites cover execution backends, job tracking, model normalization, tool-adapter validation, MCP discovery, and the Academy campaign/runtime/dashboard paths:

  • test_execution.py, test_job_tracker.py
  • test_openai_model_normalization.py, test_tool_adapter_validation.py, test_mcp.py
  • test_academy_*.py (campaign, mcp_supervisor, reasoning, dashboard, launcher, exchange, payloads)

Live Globus Compute integration tests are opt-in via --run-globus-compute. Backend-specific machine tests (Polaris/Aurora Parsl, EnsembleLauncher multi-node) are skipped when their dependencies/allocations are unavailable.

Test plan

  • pytest tests/ — execution, job-tracker, model, MCP, and Academy suites pass; existing tests green
  • Import chain (from chemgraph.execution import get_backend, TaskSpec)
  • Live Globus Compute endpoint (--run-globus-compute)
  • Parsl backend on Polaris/Aurora
  • EnsembleLauncher on a multi-node allocation
  • End-to-end Academy campaign (example-002-mace-ensemble-screening) + dashboard

@tdpham2 tdpham2 mentioned this pull request May 7, 2026
tdpham2 added 2 commits May 7, 2026 16:20
…us Compute

Introduce a unified execution module with an abstract ExecutionBackend
interface and TaskSpec model, supporting four backends: local
(ProcessPoolExecutor), Parsl, EnsembleLauncher, and Globus Compute.

Includes config factory with resolution order (args > env > config.toml),
HPC configs loader, comprehensive tests, and pytest --run-globus-compute
option for live endpoint tests.

Remove dead num_nodes=1 after raise in aurora_parsl.py and fix
misleading error message. Set _initialized=False at start of
EnsembleLauncherBackend.shutdown() to prevent submitting to a
partially torn-down backend.
tdpham2 and others added 15 commits May 14, 2026 13:42
When backend=globus_compute, MCP tools now return immediately after
submitting jobs to the remote HPC endpoint instead of blocking until
completion. A new JobTracker tracks submitted futures across tool calls,
and new MCP tools (check_job_status, get_job_results, list_jobs,
cancel_job) let the LLM agent poll for progress and retrieve results.

Non-Globus backends (local, Parsl, EnsembleLauncher) are unchanged and
continue to block until results are ready.
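The branch between "return immediately" and "block until done" can be sketched as follows. submit_or_gather is named in this commit, but the signature, the return shape, and the InMemoryTracker stand-in used here are assumptions for illustration; the real JobTracker also handles persistence and cancellation.

```python
import uuid
from concurrent.futures import Future
from typing import Any


class InMemoryTracker:
    """Hypothetical stand-in for JobTracker (no persistence, no cancellation)."""

    def __init__(self) -> None:
        self._batches: dict[str, list[Future]] = {}

    def register(self, futures: list[Future]) -> str:
        batch_id = uuid.uuid4().hex[:8]
        self._batches[batch_id] = list(futures)
        return batch_id

    def status(self, batch_id: str) -> dict[str, Any]:
        futures = self._batches[batch_id]
        return {"total": len(futures), "done": sum(f.done() for f in futures)}


def submit_or_gather(
    futures: list[Future], is_async_remote: bool, tracker: InMemoryTracker
) -> dict[str, Any]:
    """Return a batch handle for async-remote backends, results otherwise."""
    if is_async_remote:
        # Hand back an ID the agent can poll with a later tool call.
        return {"status": "submitted", "batch_id": tracker.register(futures)}
    # Blocking path: wait for every future in submission order.
    return {"status": "completed", "results": [f.result() for f in futures]}
```

In the async case the agent would pass the returned batch_id to a status tool such as check_job_status on a later turn.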

Key changes:
- Add is_async_remote property to ExecutionBackend (True for Globus)
- Add check_endpoint_status() health check to GlobusComputeBackend
- Add JobTracker with batch registration, status, results, cleanup
- Add submit_or_gather() utility that branches on backend type
- Add optional timeout parameter to gather_futures()
- Add register_job_tools() to wire job tools into any MCP server
- Integrate tracker into MACE, XANES, and gRASPA MCP servers
- Add CGFastMCP: FastMCP subclass with integrated execution backend,
  lazy init, built-in job tools, @tool() and @ensemble_tool() decorators
- Refactor EnsembleLauncherBackend with client-only mode (shared
  orchestrator via checkpoint_dir) and managed mode
- Update get_backend() to route client_only vs managed EL initialization
- Rewrite mace_mcp_hpc.py to use CGFastMCP decorators
- Clean up parsl_tools.py: remove dead code, use stdlib logging
- Fix __main__ pickle issue via _fix_module_for_pickle + sys.modules alias
- Add client-only mode demo cell to notebook 3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Modified the EL backend implementations, and added an EL backend test
…mport

- parsl_tools.run_mace_core: stop swallowing exceptions and returning None.
  run_ase_core already returns a structured failure dict on simulation
  errors, and programmer errors should propagate.
- cg_fastmcp.ensemble_tool: raise TypeError with a clear message when the
  decorated function does not have exactly one parameter, instead of
  crashing with IndexError at decoration time.
- ensemble_launcher_backend: soft-import ensemble_launcher and defer the
  failure to construction / call time. SYSTEM_CONFIG_REGISTRY is now a
  lazy view backed by builder functions so the module loads cleanly
  without EL installed, restoring the deferred-error behaviour callers
  of chemgraph.execution.config expected.
- persist_file parameter: when set, batch metadata and Globus Compute
  task UUIDs are written to JSON after registration and after results
  are cached, and loaded on init. Allows MCP servers to recover job
  state across restarts.
- TrackedTask.globus_task_id and TrackedTask.future are both optional;
  loaded-from-disk batches have no in-memory Future and are queried
  via the Globus Compute Client directly in get_status.
- Lazy Globus Compute Client with a separate gc_lock for thread safety.
- _wait_for_globus_task_ids polls each ComputeFuture briefly after
  submission to capture the Globus task_id assigned asynchronously
  by the Executor background thread.
- cancel_batch / cleanup_old_batches handle the no-future case.
- init_backend now accepts tracker_kwargs= and forwards it to
  JobTracker(...) in _ensure_backend. Callers can pass
  persist_file= so MCP servers recover job state across restarts.
- set_pre_submit_hook(hook): hook receives each TaskSpec before
  backend.submit() and returns a (possibly mutated) one. Lets a
  server centralise transport concerns -- inline-structure embedding
  for local-submit-to-remote-worker, remote-path rewriting -- instead
  of repeating that logic in every tool body. Wired into the
  @tool, @ensemble_tool, and @schema_fanout_tool submit paths.
- @schema_fanout_tool(worker=...): the decorated function is an
  expander (ensemble schema -> list of per-item args). The framework
  calls worker(item) on the backend for each item and gathers
  results. Preserves the ensemble schema as the agent-facing API
  (one tool call, server-side fanout), complementing @ensemble_tool
  which exposes list[Schema] for callers that want client-side
  enumeration.
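The expander-plus-worker pattern described for @schema_fanout_tool can be sketched with a plain decorator. This is not the CGFastMCP implementation: fanout_tool_sketch, its explicit executor argument, and the example schema are hypothetical, and a thread pool stands in for the execution backend.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from functools import wraps
from typing import Any, Callable, Iterable


def fanout_tool_sketch(worker: Callable[[Any], Any], executor: Executor):
    """Turn an expander (schema -> items) into a tool that runs worker per item."""

    def decorator(expander: Callable[[Any], Iterable[Any]]):
        @wraps(expander)
        def tool(schema: Any) -> list:
            items = list(expander(schema))
            # One submission per item; the caller still made a single tool call.
            futures = [executor.submit(worker, item) for item in items]
            # Gather in submission order so results line up with items.
            return [f.result() for f in futures]

        return tool

    return decorator


pool = ThreadPoolExecutor(max_workers=2)


def count_atoms(item: dict) -> dict:
    # Placeholder worker; a real one would run a simulation.
    return {"name": item["name"], "n_atoms": len(item["symbols"])}


@fanout_tool_sketch(worker=count_atoms, executor=pool)
def run_ensemble(schema: dict) -> list:
    # Expander: one work item per structure in the ensemble schema.
    return [{"name": n, "symbols": s} for n, s in schema["structures"].items()]
```

The agent-facing function keeps the ensemble schema as its only argument, while the per-item work is dispatched server-side.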
…ersistence

- mace_input_schema_ensemble / graspa_input_schema_ensemble: new
  remote_structure_directory field for pre-staged HPC files (paired
  with the upcoming transfer_files tool). input_structure_directory
  now defaults to empty string so callers can pass either.
- mace_input_schema/_ensemble model description spells out that
  'mace_mp' is the calculator type, not a model name -- LLMs were
  confusing the two.
- Nullable schema fields (driver, model, wall_time) typed as
  str|None / float|None for correct OpenAPI schema generation.
- GlobusComputeBackend._ensure_executor re-creates the Executor when
  it has been shut down (e.g. after a remote task failure). Uses
  getattr() so we don't depend on the SDK's private _stopped attr
  existing.
- check_endpoint_status logs exc_info on failure for easier debugging.
- xanes_mcp_hpc: JobTracker(persist_file=~/.chemgraph/xanes_jobs.json)
  so XANES job state survives MCP server restarts. Instructions
  updated to tell the LLM to surface batch_ids to the user.
- execution/globus_transfer.py: GlobusTransferManager wraps the
  globus_sdk TransferClient with token caching, batched
  transfer_files / wait_for_transfer / check_transfer_status /
  list_remote_directory. Lazy globus_sdk import, lazy auth.
- execution/config.get_transfer_manager(): builds a manager from
  [execution.globus_transfer] in config.toml with env var overrides
  (GLOBUS_TRANSFER_SOURCE_ENDPOINT_ID, _DESTINATION_ENDPOINT_ID,
  _DESTINATION_BASE_PATH). Returns None when not configured so MCP
  servers can skip registration silently.
- mcp/transfer_tools.register_transfer_tools(): registers
  transfer_files, check_transfer_status, list_remote_files on a
  FastMCP/CGFastMCP server. Uses mcp.add_tool() (not the
  backend-submitting @tool() decorator) because these are
  orchestration tools, not compute tasks -- they call the Globus
  Transfer API directly from the MCP server process.
- get_backend() globus_compute endpoint_id fallback now treats
  empty-string endpoint_id as unset, matching the GLOBUS_COMPUTE_
  ENDPOINT_ID env-var override behaviour.
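The empty-string handling in the last item can be illustrated with a short resolver. resolve_endpoint_id is a hypothetical helper, and the order shown (configured value first, then the environment variable) is an assumption; the point is only that "" and None are treated the same way.

```python
import os
from typing import Mapping, Optional


def resolve_endpoint_id(
    configured: Optional[str], env: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Return a usable endpoint ID, treating "" and None as unset."""
    env = os.environ if env is None else env
    # `or` falls through on both None and the empty string.
    return configured or env.get("GLOBUS_COMPUTE_ENDPOINT_ID") or None
```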
…GFastMCP

run_mace_single and run_mace_ensemble were collapsed to bare
run_mace_core(params) calls in PR #127, dropping inline-structure
embedding, remote-path support, JobTracker persistence, and the
Globus Transfer registration that 51ba171 had built. This restores
all of that on top of the new CGFastMCP framework.

- Worker is now a separate function _mace_worker(job: dict) that
  handles two transport keys on the worker FS: remote_structure_file
  (use the path directly) and inline_structure (materialise an
  AtomsData dict to a temp XYZ). Embeds full_output back into the
  result for inline calls so callers do not need remote FS access.
- Pre-submit hook _mace_transport_hook centralises the schema ->
  job-dict conversion, mace_mp -> medium-mpa-0 model normalisation,
  and inline embedding (when the input file exists on the
  submitting host). Hook rewrites task.callable from run_mace_single
  to _mace_worker so the LLM still sees a clean schema-shaped tool.
- run_mace_ensemble switches to @schema_fanout_tool with a
  server-side expander, preserving the directory-driven UX
  (single LLM call instead of N). Local mode enumerates files via
  resolve_structure_files; remote mode submits a backend probe to
  ls remote_structure_directory and builds remote_structure_file
  per item.
- extract_output_json registered via mcp.add_tool() (orchestration,
  no backend wrap). transfer_files/check_transfer_status/
  list_remote_files registered conditionally when
  get_transfer_manager() finds [execution.globus_transfer] config.
- __main__ now wires tracker_kwargs={persist_file: ~/.chemgraph/
  mace_jobs.json} so MACE batches survive MCP server restarts.
- Drop `from __future__ import annotations`: forward refs break
  FastMCP's signature introspection because the wrapper's __globals__
  is cg_fastmcp's, not the tool module's.
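The forward-reference problem in the last item can be reproduced in isolation. The sketch below simulates a tool module whose annotations are stored as strings (the effect of `from __future__ import annotations`) and a wrapper defined in a different module. naive_wrap and JobSchema are hypothetical names, and this shows the general Python behaviour rather than FastMCP's exact introspection path.

```python
import typing

# Simulate a separate "tool module": JobSchema lives only in this namespace,
# and the annotation is a string, as it would be with postponed evaluation.
tool_module_ns: dict = {}
exec(
    "class JobSchema:\n"
    "    pass\n"
    "\n"
    "def run_job(params: 'JobSchema') -> dict:\n"
    "    return {}\n",
    tool_module_ns,
)
run_job = tool_module_ns["run_job"]


def naive_wrap(fn):
    """Wrapper defined here, so its __globals__ belong to this module."""

    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)

    # Copy the (string) annotations, but do not set __wrapped__.
    wrapper.__annotations__ = dict(fn.__annotations__)
    return wrapper


wrapped = naive_wrap(run_job)

# Resolving against the original function works: JobSchema is in its globals.
hints_ok = typing.get_type_hints(run_job)

# Resolving against the wrapper fails: 'JobSchema' is unknown in this module.
try:
    typing.get_type_hints(wrapped)
    resolution_error = None
except NameError as exc:
    resolution_error = exc
```

With real (non-string) annotations the wrapper would carry the class object itself, so no name lookup is needed, which is consistent with the fix of dropping the `__future__` import.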
Wraps the Academy distributed agent framework with ChemGraph LLM
agents for federated HPC screening workflows. Decoupled from the
existing pipeline -- no chemgraph.cli / chemgraph.agent / chemgraph.eval
references; only the lazily imported chemgraph.agent.llm_agent.ChemGraph.

- ChemGraphAgent: Academy Agent wrapping a single ChemGraph instance,
  exposes run_query / get_info actions.
- ScreeningAgent: iterates a molecule batch, writes per-result JSONs
  for fault-tolerant aggregation. Failed-molecule records now store
  str(exc) so the actual exception message survives.
- CoordinatorAgent: polls a results dir, optionally analyses results
  via an LLM, suggests follow-up molecules.
- AcademyConfig + build_manager: bridge config.toml to Academy
  Manager / Exchange / Launcher (local, Redis, Parsl, Globus Compute).
- RateLimiter: stdlib async token-bucket for shared per-provider LLM
  quotas across agents.
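For readers unfamiliar with the token-bucket idea behind RateLimiter, here is a minimal stdlib-only async sketch. The class name TokenBucket and its parameters are assumptions and do not reflect the PR's actual interface.

```python
import asyncio
import time


class TokenBucket:
    """Allow bursts of up to `capacity` calls, refilled at `rate` tokens/second."""

    def __init__(self, rate: float, capacity: int) -> None:
        self.rate = rate
        self.capacity = capacity
        self._tokens = float(capacity)
        self._updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        """Wait until one token is available, then consume it."""
        async with self._lock:
            while True:
                now = time.monotonic()
                # Refill according to elapsed time, capped at capacity.
                elapsed = now - self._updated
                self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
                self._updated = now
                if self._tokens >= 1:
                    self._tokens -= 1
                    return
                # Sleep roughly until the next token is due, then re-check.
                await asyncio.sleep((1 - self._tokens) / self.rate)


async def demo() -> float:
    # Create the bucket inside the running loop so the lock binds to it.
    bucket = TokenBucket(rate=50.0, capacity=2)
    start = time.monotonic()
    for _ in range(4):
        await bucket.acquire()  # each agent would call this before an LLM request
    return time.monotonic() - start
```

Sharing one bucket per provider among several agents caps their combined request rate, since every agent waits on the same token supply.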

Lazy imports in __init__.py let the package load without the optional
academy-py dependency; ChemGraphAgent / ScreeningAgent / CoordinatorAgent
raise ModuleNotFoundError on access if academy-py is missing, while
AcademyConfig and RateLimiter remain usable.

pyproject's academy optional-dep + pytest marker are already in HEAD
(commit 04bcc8a). tests/test_academy.py and scripts/academy_example/
remain untracked and will land in follow-ups.

- globus_transfer.py: disambiguate same-basename inputs with a numeric
  suffix so two files that share a name (e.g. /a/in.cif and /b/in.cif)
  don't silently overwrite each other on the remote collection.
- job_tracker.py: promote the "no Globus task_id within timeout"
  message to a warning at submit time, and emit a per-task warning at
  reload time for batches restored without a task_id (those tasks
  cannot be queried via the Globus Compute API and would otherwise be
  silently orphaned across server restarts).
- globus_compute_backend.py: catch "executor stopped" exceptions in
  submit(), rebuild the Executor, and retry once. The previous
  _ensure_executor relied on the SDK's private _stopped attribute,
  which fails silently if the SDK exposes the shutdown state
  differently.
- cg_fastmcp.py: wrap _apply_pre_submit_hook in try/except and re-raise
  hook failures as a ValueError naming the hook and task_id so they
  surface as a structured tool error instead of an opaque traceback.
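The same-basename disambiguation described for globus_transfer.py could look roughly like this. unique_remote_names is a hypothetical helper, and the exact suffix format ("_1" before the extension) is an assumption; the commit only says that a numeric suffix is used.

```python
from pathlib import PurePosixPath


def unique_remote_names(local_paths: list[str]) -> dict[str, str]:
    """Map each local path to a remote file name that is unique in the batch."""
    used: set[str] = set()
    mapping: dict[str, str] = {}
    for path in local_paths:
        p = PurePosixPath(path)
        candidate = p.name
        counter = 1
        # Bump a numeric suffix until the name no longer collides.
        while candidate in used:
            candidate = f"{p.stem}_{counter}{p.suffix}"
            counter += 1
        used.add(candidate)
        mapping[path] = candidate
    return mapping
```

With inputs /a/in.cif and /b/in.cif, the first keeps in.cif and the second becomes in_1.cif, so neither overwrites the other on the destination collection.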
Both servers now mirror the mace_mcp_hpc.py pattern:
- CGFastMCP with lazy backend initialisation via init_backend(); the
  worker subprocesses re-importing the module no longer instantiate a
  backend at import time.
- Job-management tools (check_job_status, get_job_results, list_jobs,
  cancel_job, check_endpoint_status) are auto-registered by
  CGFastMCP._register_job_tools; the external register_job_tools call
  is dropped.
- __main__ wires init_backend(tracker_kwargs={"persist_file": ...})
  and pairs run_mcp_server with shutdown_backend in finally. This also
  closes a real bug in graspa_mcp_hpc.py, which was instantiating
  JobTracker() with no persist_file and silently losing job state
  across restarts despite the server's instructions promising
  persistence.
- Globus Transfer tools (transfer_files, check_transfer_status,
  list_remote_files) are registered on both servers when the transfer
  manager is configured, matching the existing MACE behaviour.
- gRASPA expander now supports remote_structure_directory the same way
  MACE does: a one-shot probe task lists CIFs on the remote endpoint
  and the worker reads them directly from the staged path.
- Ensemble flows use the schema_fanout_tool decorator; per-job structure
  metadata is propagated through the worker output (since the framework
  meta is only the index).

Legacy *_mcp_parsl.py modules now raise a DeprecationWarning at import
pointing to the *_hpc.py replacement; they remain functional because
scripts/mcp_xanes_example/ still imports xanes_mcp_parsl.
keceli added a commit that referenced this pull request Jun 2, 2026
- Update ChemGraph container and add Kubernetes deployment (#123)
- Add execution layer to ChemGraph, including EnsembleLauncher, Parsl, Globus Compute, and Globus Transfer (#120, #127)
- Add initial Academy integration (#120)
- Updates package metadata to version 0.5.0 and fixes source-checkout version reporting so UI deployments do not show unknown.
- Modernizes the Streamlit UI with modular pages, improved chat/session behavior, available-calculator display, better math/report rendering, HTML report downloads, and build/host metadata.
- Adds calculator availability detection during agent initialization and improves calculator selection, including xTB/TBLite alias handling.
- Expands agent workflows with human-in-the-loop support, single-agent routing tests, retry/session fixes, and safer state serialization.
- Adds and improves CLI, memory/session persistence, model routing, MCP client support, RAG, XANES, and evaluation tooling.
- Adds Kubernetes, GHCR, Streamlit Cloud, HPC, and MCP deployment documentation and examples.
- Improves CI reliability with dependency pins, Windows serializer fixes, Ruff cleanup, and expanded tests.
tdpham2 and others added 10 commits June 2, 2026 09:56
JinchuLi2002 and others added 29 commits June 15, 2026 09:51
…tent

The file has always contained JSONC-style // comments and is loaded via
_load_jsonc in chemgraph.academy.core.campaign. The .json extension was
making IDEs flag the comments as parse errors. Rename to .jsonc so the
extension matches the content; the package-data glob in pyproject.toml
already includes *.jsonc, so the package install is unaffected.

Also updates the manifest in chemgraph.academy.campaigns.__init__, the
example-002 notes, and tmpfile names in three academy tests for naming
consistency.

Verified:
  - chemgraph.academy.campaigns.resolve_campaign returns the new path.
  - load_campaign reads it cleanly (5 agents, 1 mcp server).
  - tests/test_academy_campaign.py, test_academy_compute_launcher.py,
    test_academy_exchange_registration.py: 17 passed.

Adds an optional `allowed_tools` field to ChemGraphAgentSpec that
filters the tools an agent sees from its declared MCP servers. Empty
(the default) keeps today's behavior of exposing every tool the agent's
servers advertise. Non-empty restricts the agent to the named tools.

Why: the MCP-server-per-agent contract gates capability at the server
level only. An agent declaring `mcp_servers: ["general"]` sees every
tool that server exposes, even when only one or two are relevant to
its mission. That weakens per-agent sandboxing (mission prompt becomes
the enforcement) and bloats the LangChain tool catalog the LLM has to
choose from.

Changes:
  - ChemGraphAgentSpec gains `allowed_tools: tuple[str, ...] = ()`.
  - Validator rejects duplicate entries and the case where allowed_tools
    is non-empty but mcp_servers is empty.
  - MCPServerSupervisor.get_tools accepts allowed_tools: frozenset|None;
    when set, tools whose name is not in the whitelist are skipped, and
    whitelist entries that match nothing log a warning (so typos surface
    without failing the run).
  - daemon.run threads agent_spec.allowed_tools through.
  - example-002 campaign demonstrates the field: structure agents see
    only the SMILES tools, mace-agent sees only run_ase + extract_output_json.
  - 4 new tests in test_academy_campaign.py for parse + validation.
  - 4 new tests in test_academy_mcp_supervisor.py for filter behavior.
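The filter behaviour described above can be sketched in a few lines. filter_tools and the Tool dataclass are hypothetical stand-ins for MCPServerSupervisor.get_tools and the real tool objects, and the third tool name in the usage note is made up for illustration.

```python
import logging
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional

logger = logging.getLogger("allowed_tools_sketch")


@dataclass(frozen=True)
class Tool:
    """Minimal stand-in for a tool object; only the name matters here."""

    name: str


def filter_tools(
    tools: Iterable[Tool], allowed_tools: Optional[FrozenSet[str]] = None
) -> List[Tool]:
    """Keep only whitelisted tools; an empty or missing whitelist keeps all."""
    tools = list(tools)
    if not allowed_tools:
        return tools
    kept = [t for t in tools if t.name in allowed_tools]
    # Warn instead of failing so that typos surface without aborting the run.
    for name in sorted(allowed_tools - {t.name for t in tools}):
        logger.warning("allowed_tools entry %r matched no advertised tool", name)
    return kept
```

For a server advertising run_ase, extract_output_json, and a hypothetical smiles_to_xyz, a whitelist of {"run_ase", "extract_output_json"} would hide the third tool from that agent.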
…ACE downloads

The in-process MACE path uses mace_mp(model="medium-mpa-0") which
downloads its foundation model from GitHub on first use. Aurora and
Polaris compute nodes can reach external sites only through the ALCF
outbound proxy (proxy.alcf.anl.gov:3128); without these env vars the
download hangs and the mace-agent reports failure.

Add http_proxy / https_proxy / no_proxy to the compute env block, and
to the optional MACE pre-warm snippet, so the documented commands work
out of the box on both systems.

No code changes.

run_ase_core opens output_results_file for write at the end of the
simulation without first ensuring its parent directory exists. Agents
and CLI users routinely point the output at a not-yet-created nested
subdirectory of a shared run dir; the simulation then runs to
completion only to fail with FileNotFoundError: [Errno 2] when it
tries to persist results. The compute time is wasted, and the error
message blames the wrong layer.

Add a single os.makedirs(..., exist_ok=True) on the resolved parent
right after the .json extension check. Idempotent, harmless when the
directory already exists, and surfaces any permission problem before
the calculator gets loaded.

Hit this on example-002 polaris run 012: mace-agent received a
mace_output_directory resource pointing at academy_mace_outputs/, the
agent passed output_results_file as academy_mace_outputs/MOL-002.json,
the directory did not exist on disk yet, every run_ase call failed
identically, mace-agent retried, kept failing.

Test in tests/test_mcp.py mocks load_calculator and runs run_ase_core
against a tmp_path / "deeply/nested/output.json" target.

resolve_campaign_resources rewrites shared_run resource paths to
absolute locations under <run_dir>/shared/ but never actually creates
the directories on disk. Tools that get pointed at one of these
resources have to guess whether their parent exists; the in-process
run_ase tool, for example, did not, and example-002 polaris run 012
saw every mace-agent call fail with FileNotFoundError for a path
under the campaign-declared mace_output_directory.

Make academy uphold the natural contract: if a campaign declares a
shared_run resource, the runtime guarantees the on-disk parent exists
before any agent touches it. Specifically, after resolving the path:

  - kind: directory  -> mkdir -p the resolved path itself
  - kind: file / json -> mkdir -p the resolved parent

The file itself stays the responsibility of the agent that writes it.
mkdir is idempotent so per-rank repetition is harmless.
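A minimal sketch of that contract, assuming a hypothetical helper named ensure_resource_parent that receives the already-resolved path and the resource kind:

```python
from pathlib import Path


def ensure_resource_parent(resolved_path, kind: str) -> Path:
    """Create the directory a shared_run resource needs before agents use it."""
    path = Path(resolved_path)
    # Directories are created themselves; for files only the parent is created.
    target = path if kind == "directory" else path.parent
    target.mkdir(parents=True, exist_ok=True)  # idempotent, safe per rank
    return target
```

The file itself is never created here, which matches the rule that writing it remains the agent's responsibility.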

Test extends test_campaign_resources_resolve_to_shared_run_artifacts
with on-disk assertions, and adds test_resolve_campaign_resources_skips_
non_shared_run_paths confirming we do not create absolute / external
paths.

The dashboard launcher previously required a separate `academy` source
checkout on every remote system because it referenced
`${academy_repo_root}/examples/09-polaris-lm-swarm/uan_http_relay.py`
from the system profile. That dependency was undocumented in the
example-002 e2e guide (which only tells users to sync ChemGraph), so a
fresh user on Aurora hit "No such file or directory" trying to start
the Mac-relay path.

Move the relay script into the chemgraph package as a runtime template
(stdlib-only, no imports beyond socket/threading). The dashboard
launcher now materializes it onto the remote at
$REMOTE_ROOT/.chemgraph/uan_http_relay.py before starting the relay,
via a one-line ssh stdin pipe. start_relay accepts the resulting path
as an argument instead of computing it from profile state.

Side cleanup:
* SystemProfile no longer has academy_repo_root; both
  aurora.template.json and polaris.template.json drop the field.
* Polaris's relay_host_file used to land inside the academy checkout
  (`/academy/uan-relay-18186.host`); normalize to the same shape Aurora
  already used: directly under remote_root.
* Dashboard metadata no longer writes academy_repo_root either; nothing
  downstream consumed it.

Result: the second `academy` source checkout is no longer required on
remote systems. Users only need ChemGraph synced. The Mac-relay path
works the same way on any new host as long as the chemgraph package is
installed.

102 academy + synth tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add ChemGraph Academy persistent agent runtime and dashboard

The MACE MCP server embedded each local structure inline and had the
worker re-materialise it to /tmp on every tool call, regardless of
backend. This is only needed for Globus Compute, whose workers run on a
remote host. For local/parsl/ensemble_launcher backends (shared FS with
the server) it was pure overhead -- extra serialization, redundant disk
I/O, and a full_output read-back.

Add a shares_filesystem capability to ExecutionBackend (True by default;
False for Globus Compute, config-overridable via the shares_filesystem
kwarg). The MACE transport hook now embeds inline only when the backend
does not share the filesystem; the worker already no-ops its inline
branch when the key is absent, so shared-FS backends read the input path
directly.
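The capability check can be sketched roughly as follows. This is an illustration under assumptions, not the code in `base.py`: the class names, the `prepare_payload` hook, and the `input_structure_file` / `inline_structure` keys are all hypothetical. Only the `shares_filesystem` name, its defaults, and the kwarg override are taken from the commit message.

```python
from pathlib import Path
from typing import Optional


class BackendSketch:
    """Stand-in for ExecutionBackend; shares the filesystem by default."""

    default_shares_filesystem = True

    def __init__(self, shares_filesystem: Optional[bool] = None):
        # An explicit kwarg (e.g. from config) overrides the class default.
        self._override = shares_filesystem

    @property
    def shares_filesystem(self) -> bool:
        if self._override is not None:
            return self._override
        return self.default_shares_filesystem


class GlobusComputeSketch(BackendSketch):
    """Remote workers cannot see the server's disk unless told otherwise."""

    default_shares_filesystem = False


def prepare_payload(backend: BackendSketch, params: dict) -> dict:
    """Hypothetical transport hook: embed the structure only when required."""
    payload = dict(params)
    if not backend.shares_filesystem:
        # Key names are assumptions made for this sketch.
        payload["inline_structure"] = Path(params["input_structure_file"]).read_text()
    return payload
```

With a shared filesystem the payload carries only the path, so the worker's inline branch is never triggered and the file is read in place.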
The worker embedded the entire output JSON into the returned result as
full_output when an inline structure was used. Results are already
persisted to output_result_file, so this just bloated the tool response.
Return only what run_mace_core produces; drop the now-unused json import.
Under a stdio MCP server, the server's stdout is the JSON-RPC channel.
EnsembleLauncher prints lifecycle notices ("Sent SIGTERM to launcher
process ...") to stdout during orchestrator shutdown, which corrupted the
protocol stream and crashed the client's message parser with a
ValidationError / BrokenResourceError after an otherwise successful run.

Add an fd-level stdout->stderr redirect context manager and wrap the
client/orchestrator teardown calls in shutdown() with it, so the notices
go to stderr instead of the JSON-RPC channel. fd-level dup2 (matching
LocalBackend's worker-stdout guard) catches library/subprocess writes,
not just Python-level sys.stdout.
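An fd-level guard of this kind could look roughly like the sketch below. The name `stdout_to_stderr` is hypothetical and the backend's real implementation may differ; the sketch only relies on the standard `os.dup` / `os.dup2` calls.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def stdout_to_stderr():
    """Temporarily point file descriptor 1 at file descriptor 2."""
    sys.stdout.flush()      # push out anything Python has already buffered
    saved = os.dup(1)       # keep a handle on the real stdout
    try:
        os.dup2(2, 1)       # fd 1 now refers to the same target as stderr
        yield
    finally:
        sys.stdout.flush()  # anything buffered inside the block goes to stderr
        os.dup2(saved, 1)   # restore the original stdout
        os.close(saved)
```

Because the swap happens on the descriptor itself, writes from C libraries and child processes that inherit fd 1 are redirected too, which replacing `sys.stdout` alone would not achieve.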
Layer an fd-level stdout->stderr redirect around shutdown() on top of the
subprocess-based orchestrator rework. The orchestrator subprocess already
redirects to DEVNULL, but two teardown paths can still print to the
JSON-RPC stdout under a stdio MCP server: the in-process client.teardown()
and the `el stop` helper (which inherits the parent's stdout). Wrapping
the whole shutdown() in the fd guard covers both, preventing the
ValidationError / BrokenResourceError crash after an otherwise successful
run.
# Conflicts:
#	src/chemgraph/execution/ensemble_launcher_backend.py
Merge el_test into dev-globus-hpc: EL subprocess rework + HPC fixes
# Conflicts:
#	src/chemgraph/hpc_configs/aurora_parsl.py
#	src/chemgraph/tools/ase_core.py
demo_parsl_in_job_agent.py forced model_name="argo:gpt-5.4" and a
local argoapi base_url, mirroring the temp config that had also landed
in demo_ensemble_launcher_in_job_agent.py. Restore model selection from
the --model flag (the amain(model=...) parameter) and drop the hardcoded
base_url so the demo honours the user's chosen model.
TestELSystemConfigCrux asserted the exact Crux SystemConfig shape
(ncpus==128, CPU-only) and the registry membership of "crux". These
are tied to one specific machine's hardware layout and don't belong in
the portable unit-test suite. The polaris references elsewhere are left
as-is since they only pass "polaris" as an arbitrary system string to
exercise generic GlobusCompute behaviour.
EnsembleLauncher is an optional, HPC-only dependency (not on PyPI for
Python 3.12), so instantiating EnsembleLauncherBackend() raised
ImportError and hard-failed all nine TestELBackend cases in any env
without it. Add pytest.importorskip("ensemble_launcher") in setup_class
so the class skips cleanly, matching the guard already used by the
GlobusCompute tests.
Update existing backends (Parsl, EL, Globus Compute, Globus Transfer) to ChemGraph
@tdpham2 tdpham2 changed the title Add pluggable execution backends and backend-agnostic HPC MCP servers Add pluggable execution backends, HPC MCP servers, and Academy multi-agent campaigns Jun 18, 2026