Add pluggable execution backends, HPC MCP servers, and Academy multi-agent campaigns by tdpham2 · Pull Request #120 · argonne-lcf/ChemGraph

tdpham2 · 2026-05-04T17:04:51Z

What this does

This PR makes ChemGraph's compute layer portable across HPC systems and adds a multi-agent campaign runtime on top of it. It has three parts:

Pluggable execution backends — decouple MCP servers from any single workflow manager.
Backend-agnostic HPC MCP servers — MACE / gRASPA / XANES that submit work through the backend abstraction, with async job tracking and Globus file staging.
Academy campaigns + dashboard — persistent multi-agent screening across nodes, with live event observability.

The original *_parsl.py servers and the ase_core tool layer are untouched.

1. Execution backends (`src/chemgraph/execution/`)

A clean ExecutionBackend interface with four implementations:

Local — ProcessPoolExecutor, zero external deps, good for dev/CI
Parsl — system-aware configs (local, Polaris, Aurora, Crux)
EnsembleLauncher — cluster-mode API for dynamic task submission
Globus Compute — submit to a persistent remote endpoint with no active allocation

Key modules:

base.py — ExecutionBackend (ABC) + TaskSpec (unified Python-callable / shell task with resource hints). Backends expose is_async_remote and shares_filesystem so callers adapt without knowing the implementation.
config.py — get_backend() / get_transfer_manager() factories. Selection hierarchy: explicit arg > env var (CHEMGRAPH_EXECUTION_BACKEND, COMPUTE_SYSTEM) > config.toml [execution] > "local".
local_backend.py, parsl_backend.py, ensemble_launcher_backend.py, globus_compute_backend.py
utils.py — structure-file resolution, async future gathering (bridges concurrent.futures.Future to asyncio), JSONL result writing
job_tracker.py — thread-safe tracking of async-remote batches, persisted to JSON so results survive across sessions
globus_transfer.py — GlobusTransferManager stages files between collections instead of embedding large structures in payloads
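
To make the base.py description above more concrete, here is a minimal sketch of what such an interface could look like. Only the names ExecutionBackend, TaskSpec, is_async_remote, shares_filesystem, and task.callable appear in this PR; the remaining fields, the method signatures, and the thread-based ToyLocalBackend are assumptions for illustration (the PR's Local backend uses ProcessPoolExecutor).

```python
from abc import ABC, abstractmethod
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class TaskSpec:
    """Hypothetical task description: a Python callable plus resource hints."""

    task_id: str
    callable: Callable[..., Any]
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)
    num_nodes: int = 1  # resource hint (assumed field); ignored by the toy backend


class ExecutionBackend(ABC):
    """Assumed shape of the interface; real signatures may differ."""

    # True when submit() returns before the remote work has finished.
    is_async_remote: bool = False
    # True when workers see the same filesystem as the submitting process.
    shares_filesystem: bool = True

    @abstractmethod
    def submit(self, task: TaskSpec) -> Future:
        ...

    @abstractmethod
    def shutdown(self) -> None:
        ...


class ToyLocalBackend(ExecutionBackend):
    """Thread-based stand-in used only to keep this sketch self-contained."""

    def __init__(self, max_workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def submit(self, task: TaskSpec) -> Future:
        return self._pool.submit(task.callable, *task.args, **task.kwargs)

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


def run_blocking_or_defer(backend: ExecutionBackend, task: TaskSpec) -> Any:
    """Callers branch on the capability flag, not on the backend class."""
    future = backend.submit(task)
    if backend.is_async_remote:
        # Remote backends hand back a receipt; results are polled later.
        return {"status": "submitted", "task_id": task.task_id}
    return future.result()
```

The point of the two flags is that tool code never needs an isinstance check: a server can decide whether to block, and whether to embed inputs inline, purely from the backend's declared capabilities.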

hpc_configs/loader.py replaces the duplicated load_parsl_config() helpers, dispatching to per-system factories (local_parsl.py, aurora_parsl.py, polaris_parsl.py, crux_parsl.py) and resolving worker_init from env > submitting env > system fallback.
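
The selection hierarchy described for config.py (explicit arg > env var > config.toml [execution] > "local") can be sketched as a small resolver. This is not the PR's get_backend(); the function name, the `backend` key under [execution], and passing the parsed config as a dict are assumptions, and the role of COMPUTE_SYSTEM is left out because the description does not say how it is combined with the backend name.

```python
import os
from typing import Any, Mapping, Optional


def resolve_backend_name(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    config: Optional[Mapping[str, Any]] = None,
) -> str:
    """Return the backend name using: explicit > env var > config > 'local'."""
    env = os.environ if env is None else env
    config = {} if config is None else config

    # 1. An explicit argument always wins.
    if explicit:
        return explicit

    # 2. Environment variable named in the PR description.
    from_env = env.get("CHEMGRAPH_EXECUTION_BACKEND")
    if from_env:
        return from_env

    # 3. Parsed config.toml; the "backend" key is an assumed name.
    from_config = config.get("execution", {}).get("backend")
    if from_config:
        return from_config

    # 4. Zero-dependency default.
    return "local"
```

Treating empty strings as unset at every level mirrors the later commit in this PR that makes an empty-string endpoint_id fall through to the next source.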

2. HPC MCP servers (`src/chemgraph/mcp/`)

cg_fastmcp.py — CGFastMCP (FastMCP subclass) submits tool calls to the configured backend as TaskSpec objects; centralizes transport concerns (inline embedding vs. remote paths) and adds a fanout decorator for ensemble tools.
mace_mcp_hpc.py, graspa_mcp_hpc.py, xanes_mcp_hpc.py — build TaskSpec and submit via get_backend() instead of calling Parsl directly.
job_tools.py — registers check_job_status / get_job_results / list_jobs / cancel_job / check_endpoint_status on any server.
transfer_tools.py — registers transfer_files / check_transfer_status / list_remote_files (Globus staging).
hpc_misc_mcp.py — generic JSON artifact inspection.

Schemas now support pre-staged HPC inputs: remote_structure_directory on the MACE/gRASPA ensemble schemas lets workers read directly from a remote path instead of embedding structures inline.

3. Academy multi-agent campaigns (`src/chemgraph/academy/`)

A runtime for persistent multi-agent screening campaigns over MPI with Redis-backed messaging:

core/ — logical agent wrapping the ChemGraph turn primitive, JSONC campaign specs with per-agent allowed_tools whitelists, peer-messaging action tools.
runtime/ — MPI daemon, multi-node compute launcher, and a local dashboard launcher that mirrors remote run state and relays the Argo endpoint.
observability/ — append-only JSONL event log + run-status artifacts (tolerant parsing for live polling).
dashboard/ — stateless HTTP server + web UI reading from a run directory.
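
The "append-only JSONL event log" with "tolerant parsing for live polling" mentioned under observability/ could work roughly as follows. This is a sketch under assumptions: the function names and the event fields are hypothetical, and the real module may handle partial lines differently.

```python
import json
from pathlib import Path
from typing import Any


def append_event(log_path: Path, event: dict[str, Any]) -> None:
    """Append one event as a single JSON line."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")


def read_events(log_path: Path) -> list[dict[str, Any]]:
    """Read all complete events, skipping lines that do not parse.

    A reader polling a live log can observe a half-written final line;
    skipping it is safe because the next poll will see it complete.
    """
    events: list[dict[str, Any]] = []
    if not log_path.exists():
        return events
    with log_path.open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                events.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # truncated or corrupt line; ignore for this poll
    return events
```

Because the reader keeps no state between calls, a stateless HTTP dashboard can simply re-read the run directory on each request.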

For single-agent CLI runs, --trace-dir (via cli/trace.py) emits the same event schema so runs are viewable in the dashboard. New CLI subcommands wire up dashboard, academy run-compute, academy mpi-daemon, academy dashboard, and academy campaigns.

Agent event plumbing was refactored: event translation lives in agent/events.py, the turn loop (with on_event / terminal_tool_names hooks) in agent/turn.py, keeping llm_agent.py clean for CLI users.

Supporting changes

Models — new models/settings.py (LLMSettings) centralizes endpoint config; openai.py normalizes Argo model names by endpoint type (local shim vs. hosted wire format, overridable via CHEMGRAPH_ARGO_MODEL_FORMAT).
Tools — thread-safe MACE calculator loading (avoids torch symbolic_trace crashes under parallel workers); output parent directories created before writing in ase_core.py and cheminformatics_core.py.
Bug fixes — aurora_parsl.py unreachable code + misleading error removed; EnsembleLauncherBackend.shutdown() no longer leaves _initialized=True on partial failure.

New dependencies (all optional extras)

ensemble_launcher = ["ensemble-launcher"]
globus_compute   = ["globus-compute-sdk"]
academy          = ["academy-py", "httpx", "redis"]

The local backend works with zero additional installs.

Testing

New suites cover execution backends, job tracking, model normalization, tool-adapter validation, MCP discovery, and the Academy campaign/runtime/dashboard paths:

test_execution.py, test_job_tracker.py
test_openai_model_normalization.py, test_tool_adapter_validation.py, test_mcp.py
test_academy_*.py (campaign, mcp_supervisor, reasoning, dashboard, launcher, exchange, payloads)

Live Globus Compute integration tests are opt-in via --run-globus-compute. Backend-specific machine tests (Polaris/Aurora Parsl, EnsembleLauncher multi-node) are skipped when their dependencies/allocations are unavailable.

Test plan

pytest tests/ — execution, job-tracker, model, MCP, and Academy suites pass; existing tests green
Import chain (from chemgraph.execution import get_backend, TaskSpec)
Live Globus Compute endpoint (--run-globus-compute)
Parsl backend on Polaris/Aurora
EnsembleLauncher on a multi-node allocation
End-to-end Academy campaign (example-002-mace-ensemble-screening) + dashboard

…us Compute

Introduce a unified execution module with an abstract ExecutionBackend interface and TaskSpec model, supporting four backends: local (ProcessPoolExecutor), Parsl, EnsembleLauncher, and Globus Compute. Includes config factory with resolution order (args > env > config.toml), HPC configs loader, comprehensive tests, and pytest --run-globus-compute option for live endpoint tests.

Remove dead num_nodes=1 after raise in aurora_parsl.py and fix misleading error message. Set _initialized=False at start of EnsembleLauncherBackend.shutdown() to prevent submitting to a partially torn-down backend.

…emote

When backend=globus_compute, MCP tools now return immediately after submitting jobs to the remote HPC endpoint instead of blocking until completion. A new JobTracker tracks submitted futures across tool calls, and new MCP tools (check_job_status, get_job_results, list_jobs, cancel_job) let the LLM agent poll for progress and retrieve results. Non-Globus backends (local, Parsl, EnsembleLauncher) are unchanged and continue to block until results are ready.

Key changes:
- Add is_async_remote property to ExecutionBackend (True for Globus)
- Add check_endpoint_status() health check to GlobusComputeBackend
- Add JobTracker with batch registration, status, results, cleanup
- Add submit_or_gather() utility that branches on backend type
- Add optional timeout parameter to gather_futures()
- Add register_job_tools() to wire job tools into any MCP server
- Integrate tracker into MACE, XANES, and gRASPA MCP servers
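
As an illustration of the gather_futures() timeout mentioned above, bridging concurrent.futures.Future objects into asyncio can be done with asyncio.wrap_future. The signature below is an assumption; only the function name and the existence of an optional timeout come from this commit message.

```python
import asyncio
import concurrent.futures
from typing import Any, Optional


async def gather_futures(
    futures: list[concurrent.futures.Future],
    timeout: Optional[float] = None,
) -> list[Any]:
    """Await executor futures from async code, preserving submission order.

    Raises asyncio.TimeoutError if the whole batch is not done in time.
    """
    # wrap_future turns each thread/process future into an awaitable.
    wrapped = [asyncio.wrap_future(f) for f in futures]
    return await asyncio.wait_for(asyncio.gather(*wrapped), timeout=timeout)
```

A blocking backend would await this helper inside the tool call, whereas an async-remote backend would skip it and register the futures with the tracker instead.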

- Add CGFastMCP: FastMCP subclass with integrated execution backend, lazy init, built-in job tools, @tool() and @ensemble_tool() decorators
- Refactor EnsembleLauncherBackend with client-only mode (shared orchestrator via checkpoint_dir) and managed mode
- Update get_backend() to route client_only vs managed EL initialization
- Rewrite mace_mcp_hpc.py to use CGFastMCP decorators
- Clean up parsl_tools.py: remove dead code, use stdlib logging
- Fix __main__ pickle issue via _fix_module_for_pickle + sys.modules alias
- Add client-only mode demo cell to notebook 3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Modified the EL backend implementations and added an EL backend test

…mport

- parsl_tools.run_mace_core: stop swallowing exceptions and returning None. run_ase_core already returns a structured failure dict on simulation errors, and programmer errors should propagate.
- cg_fastmcp.ensemble_tool: raise TypeError with a clear message when the decorated function does not have exactly one parameter, instead of crashing with IndexError at decoration time.
- ensemble_launcher_backend: soft-import ensemble_launcher and defer the failure to construction / call time. SYSTEM_CONFIG_REGISTRY is now a lazy view backed by builder functions so the module loads cleanly without EL installed, restoring the deferred-error behaviour callers of chemgraph.execution.config expected.

- persist_file parameter: when set, batch metadata and Globus Compute task UUIDs are written to JSON after registration and after results are cached, and loaded on init. Allows MCP servers to recover job state across restarts.
- TrackedTask.globus_task_id and TrackedTask.future are both optional; loaded-from-disk batches have no in-memory Future and are queried via the Globus Compute Client directly in get_status.
- Lazy Globus Compute Client with a separate gc_lock for thread safety.
- _wait_for_globus_task_ids polls each ComputeFuture briefly after submission to capture the Globus task_id assigned asynchronously by the Executor background thread.
- cancel_batch / cleanup_old_batches handle the no-future case.
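
The persist_file behaviour can be pictured with a much-reduced tracker that writes its state after registration and after results are cached, and reloads it on init. MiniJobTracker, its method names, and the on-disk layout are all hypothetical; the real JobTracker also holds in-memory futures and talks to the Globus Compute Client, which this sketch omits.

```python
import json
import threading
import uuid
from pathlib import Path
from typing import Any, Optional


class MiniJobTracker:
    """Hypothetical, simplified tracker: batch_id -> remote task IDs + results."""

    def __init__(self, persist_file: Optional[Path] = None) -> None:
        self._lock = threading.Lock()
        self._persist_file = persist_file
        self._batches: dict[str, dict[str, Any]] = {}
        # Reload earlier state so a restarted server still knows its batches.
        if persist_file is not None and persist_file.exists():
            self._batches = json.loads(persist_file.read_text(encoding="utf-8"))

    def register_batch(self, tool_name: str, task_ids: list[str]) -> str:
        batch_id = uuid.uuid4().hex
        with self._lock:
            self._batches[batch_id] = {
                "tool": tool_name,
                "task_ids": task_ids,
                "results": None,
            }
            self._save_locked()
        return batch_id

    def cache_results(self, batch_id: str, results: list[Any]) -> None:
        with self._lock:
            self._batches[batch_id]["results"] = results
            self._save_locked()

    def get_batch(self, batch_id: str) -> dict[str, Any]:
        with self._lock:
            return dict(self._batches[batch_id])

    def _save_locked(self) -> None:
        """Write state to disk; the caller must already hold the lock."""
        if self._persist_file is None:
            return
        # Write to a temp file, then rename, so readers never see a partial file.
        tmp = self._persist_file.with_suffix(".tmp")
        tmp.write_text(json.dumps(self._batches), encoding="utf-8")
        tmp.replace(self._persist_file)
```

A second instance constructed with the same persist_file sees the batches registered by the first, which is the property that lets an agent ask for results after the MCP server has restarted.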

- init_backend now accepts tracker_kwargs= and forwards it to JobTracker(...) in _ensure_backend. Callers can pass persist_file= so MCP servers recover job state across restarts.
- set_pre_submit_hook(hook): hook receives each TaskSpec before backend.submit() and returns a (possibly mutated) one. Lets a server centralise transport concerns -- inline-structure embedding for local-submit-to-remote-worker, remote-path rewriting -- instead of repeating that logic in every tool body. Wired into the @tool, @ensemble_tool, and @schema_fanout_tool submit paths.
- @schema_fanout_tool(worker=...): the decorated function is an expander (ensemble schema -> list of per-item args). The framework calls worker(item) on the backend for each item and gathers results. Preserves the ensemble schema as the agent-facing API (one tool call, server-side fanout), complementing @ensemble_tool which exposes list[Schema] for callers that want client-side enumeration.
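
The expander/worker split and the pre-submit hook described above can be sketched without FastMCP. Everything here is illustrative: schema_fanout, the dict-shaped jobs, and the thread pool stand in for the PR's @schema_fanout_tool, TaskSpec, and execution backend, whose real signatures are not shown in this PR text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Optional

Job = dict[str, Any]


def schema_fanout(
    worker: Callable[[Job], Any],
    pool: ThreadPoolExecutor,
    pre_submit_hook: Optional[Callable[[Job], Job]] = None,
):
    """Decorator factory: the wrapped function expands one request into jobs."""

    def decorator(expander: Callable[[Job], list[Job]]):
        def tool(request: Job) -> list[Any]:
            jobs = expander(request)
            if pre_submit_hook is not None:
                # One place to rewrite every job before it is submitted.
                jobs = [pre_submit_hook(job) for job in jobs]
            futures = [pool.submit(worker, job) for job in jobs]
            return [f.result() for f in futures]

        return tool

    return decorator


pool = ThreadPoolExecutor(max_workers=2)


def _worker(job: Job) -> str:
    return f"{job['path']}:{job['model']}"


def _normalise_model(job: Job) -> Job:
    # Example transport concern: map the calculator type to a model name.
    job = dict(job)
    if job["model"] == "mace_mp":
        job["model"] = "medium-mpa-0"
    return job


@schema_fanout(worker=_worker, pool=pool, pre_submit_hook=_normalise_model)
def run_ensemble(request: Job) -> list[Job]:
    """Expander: one ensemble request becomes one job per structure file."""
    return [{"path": p, "model": request["model"]} for p in request["paths"]]
```

The caller makes a single call with the ensemble-shaped request; enumeration, rewriting, and gathering all happen on the server side.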

…ersistence

- mace_input_schema_ensemble / graspa_input_schema_ensemble: new remote_structure_directory field for pre-staged HPC files (paired with the upcoming transfer_files tool). input_structure_directory now defaults to empty string so callers can pass either.
- mace_input_schema/_ensemble model description spells out that 'mace_mp' is the calculator type, not a model name -- LLMs were confusing the two.
- Nullable schema fields (driver, model, wall_time) typed as str|None / float|None for correct OpenAPI schema generation.
- GlobusComputeBackend._ensure_executor re-creates the Executor when it has been shut down (e.g. after a remote task failure). Uses getattr() so we don't depend on the SDK's private _stopped attr existing.
- check_endpoint_status logs exc_info on failure for easier debugging.
- xanes_mcp_hpc: JobTracker(persist_file=~/.chemgraph/xanes_jobs.json) so XANES job state survives MCP server restarts. Instructions updated to tell the LLM to surface batch_ids to the user.

- execution/globus_transfer.py: GlobusTransferManager wraps the globus_sdk TransferClient with token caching, batched transfer_files / wait_for_transfer / check_transfer_status / list_remote_directory. Lazy globus_sdk import, lazy auth.
- execution/config.get_transfer_manager(): builds a manager from [execution.globus_transfer] in config.toml with env var overrides (GLOBUS_TRANSFER_SOURCE_ENDPOINT_ID, _DESTINATION_ENDPOINT_ID, _DESTINATION_BASE_PATH). Returns None when not configured so MCP servers can skip registration silently.
- mcp/transfer_tools.register_transfer_tools(): registers transfer_files, check_transfer_status, list_remote_files on a FastMCP/CGFastMCP server. Uses mcp.add_tool() (not the backend-submitting @tool() decorator) because these are orchestration tools, not compute tasks -- they call the Globus Transfer API directly from the MCP server process.
- get_backend() globus_compute endpoint_id fallback now treats empty-string endpoint_id as unset, matching the GLOBUS_COMPUTE_ENDPOINT_ID env-var override behaviour.

…GFastMCP

run_mace_single and run_mace_ensemble were collapsed to bare run_mace_core(params) calls in PR #127, dropping inline-structure embedding, remote-path support, JobTracker persistence, and the Globus Transfer registration that 51ba171 had built. This restores all of that on top of the new CGFastMCP framework.

- Worker is now a separate function _mace_worker(job: dict) that handles two transport keys on the worker FS: remote_structure_file (use the path directly) and inline_structure (materialise an AtomsData dict to a temp XYZ). Embeds full_output back into the result for inline calls so callers do not need remote FS access.
- Pre-submit hook _mace_transport_hook centralises the schema -> job-dict conversion, mace_mp -> medium-mpa-0 model normalisation, and inline embedding (when the input file exists on the submitting host). Hook rewrites task.callable from run_mace_single to _mace_worker so the LLM still sees a clean schema-shaped tool.
- run_mace_ensemble switches to @schema_fanout_tool with a server-side expander, preserving the directory-driven UX (single LLM call instead of N). Local mode enumerates files via resolve_structure_files; remote mode submits a backend probe to ls remote_structure_directory and builds remote_structure_file per item.
- extract_output_json registered via mcp.add_tool() (orchestration, no backend wrap). transfer_files/check_transfer_status/list_remote_files registered conditionally when get_transfer_manager() finds [execution.globus_transfer] config.
- __main__ now wires tracker_kwargs={persist_file: ~/.chemgraph/mace_jobs.json} so MACE batches survive MCP server restarts.
- Drop `from __future__ import annotations`: forward refs break FastMCP's signature introspection because the wrapper's __globals__ is cg_fastmcp's, not the tool module's.

Wraps the Academy distributed agent framework with ChemGraph LLM agents for federated HPC screening workflows. Decoupled from the existing pipeline -- no chemgraph.cli / chemgraph.agent / chemgraph.eval references; only the lazily imported chemgraph.agent.llm_agent.ChemGraph.

- ChemGraphAgent: Academy Agent wrapping a single ChemGraph instance, exposes run_query / get_info actions.
- ScreeningAgent: iterates a molecule batch, writes per-result JSONs for fault-tolerant aggregation. Failed-molecule records now store str(exc) so the actual exception message survives.
- CoordinatorAgent: polls a results dir, optionally analyses results via an LLM, suggests follow-up molecules.
- AcademyConfig + build_manager: bridge config.toml to Academy Manager / Exchange / Launcher (local, Redis, Parsl, Globus Compute).
- RateLimiter: stdlib async token-bucket for shared per-provider LLM quotas across agents.

Lazy imports in __init__.py let the package load without the optional academy-py dependency; ChemGraphAgent / ScreeningAgent / CoordinatorAgent raise ModuleNotFoundError on access if academy-py is missing, while AcademyConfig and RateLimiter remain usable.

pyproject's academy optional-dep + pytest marker are already in HEAD (commit 04bcc8a). tests/test_academy.py and scripts/academy_example/ remain untracked and will land in follow-ups.
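
For readers unfamiliar with the "stdlib async token-bucket" idea behind RateLimiter, a minimal version might look like the following. The class name, constructor parameters, and method name are assumptions; the PR's RateLimiter may expose a different API.

```python
import asyncio
import time


class TokenBucket:
    """Hypothetical async token bucket: `rate` tokens/second, burst of `capacity`."""

    def __init__(self, rate: float, capacity: int) -> None:
        self.rate = rate
        self.capacity = capacity
        self._tokens = float(capacity)
        self._updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        """Wait until one token is available, then consume it."""
        # Holding the lock while sleeping serialises waiters in arrival order.
        async with self._lock:
            while True:
                now = time.monotonic()
                # Refill proportionally to elapsed time, capped at capacity.
                self._tokens = min(
                    self.capacity,
                    self._tokens + (now - self._updated) * self.rate,
                )
                self._updated = now
                if self._tokens >= 1:
                    self._tokens -= 1
                    return
                await asyncio.sleep((1 - self._tokens) / self.rate)


async def demo() -> float:
    """Four acquisitions against a burst of two: the last two must wait."""
    bucket = TokenBucket(rate=100.0, capacity=2)
    start = time.monotonic()
    for _ in range(4):
        await bucket.acquire()
    return time.monotonic() - start
```

Sharing one bucket per LLM provider across agents in the same process is what would make the quota collective rather than per-agent; agents in separate processes would each need their own share of the quota.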

- globus_transfer.py: disambiguate same-basename inputs with a numeric suffix so two files that share a name (e.g. /a/in.cif and /b/in.cif) don't silently overwrite each other on the remote collection.
- job_tracker.py: promote the "no Globus task_id within timeout" message to a warning at submit time, and emit a per-task warning at reload time for batches restored without a task_id (those tasks cannot be queried via the Globus Compute API and would otherwise be silently orphaned across server restarts).
- globus_compute_backend.py: catch "executor stopped" exceptions in submit(), rebuild the Executor, and retry once. The previous _ensure_executor relied on the SDK's private _stopped attribute, which fails silently if the SDK exposes the shutdown state differently.
- cg_fastmcp.py: wrap _apply_pre_submit_hook in try/except and re-raise hook failures as a ValueError naming the hook and task_id so they surface as a structured tool error instead of an opaque traceback.
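
The same-basename fix in globus_transfer.py can be illustrated with a small helper. The function name and the `_N` suffix format are assumptions; the commit only says that a numeric suffix is used.

```python
from pathlib import PurePosixPath


def unique_remote_names(sources: list[str]) -> dict[str, str]:
    """Map each source path to a remote file name that is unique in the batch."""
    counts: dict[str, int] = {}
    used: set[str] = set()
    mapping: dict[str, str] = {}
    for src in sources:
        path = PurePosixPath(src)
        candidate = path.name
        count = counts.get(path.name, 0)
        # Bump the suffix until the name is free in the destination directory.
        while candidate in used:
            count += 1
            candidate = f"{path.stem}_{count}{path.suffix}"
        counts[path.name] = count
        used.add(candidate)
        mapping[src] = candidate
    return mapping
```

With the example from the commit message, /a/in.cif keeps the name in.cif and /b/in.cif is staged as in_1.cif, so neither upload overwrites the other.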

Both servers now mirror the mace_mcp_hpc.py pattern:
- CGFastMCP with lazy backend initialisation via init_backend(); the worker subprocesses re-importing the module no longer instantiate a backend at import time.
- Job-management tools (check_job_status, get_job_results, list_jobs, cancel_job, check_endpoint_status) are auto-registered by CGFastMCP._register_job_tools; the external register_job_tools call is dropped.
- __main__ wires init_backend(tracker_kwargs={"persist_file": ...}) and pairs run_mcp_server with shutdown_backend in finally. This also closes a real bug in graspa_mcp_hpc.py, which was instantiating JobTracker() with no persist_file and silently losing job state across restarts despite the server's instructions promising persistence.
- Globus Transfer tools (transfer_files, check_transfer_status, list_remote_files) are registered on both servers when the transfer manager is configured, matching the existing MACE behaviour.
- gRASPA expander now supports remote_structure_directory the same way MACE does: a one-shot probe task lists CIFs on the remote endpoint and the worker reads them directly from the staged path.
- Ensemble flows use the schema_fanout_tool decorator; per-job structure metadata is propagated through the worker output (since the framework meta is only the index).

Legacy *_mcp_parsl.py modules now raise a DeprecationWarning at import pointing to the *_hpc.py replacement; they remain functional because scripts/mcp_xanes_example/ still imports xanes_mcp_parsl.

- Update ChemGraph container and add Kubernetes deployment (Add Kubernetes deployment support #123).
- Add execution layer to ChemGraph, including EnsembleLauncher, Parsl, Globus Compute, and Globus Transfer (Add pluggable execution backends and backend-agnostic HPC MCP servers #120; Modified the EL backend implementations and added an EL backend test #127).
- Add initial Academy integration (Add pluggable execution backends and backend-agnostic HPC MCP servers #120).
- Updates package metadata to version 0.5.0 and fixes source-checkout version reporting so UI deployments do not show unknown.
- Modernizes the Streamlit UI with modular pages, improved chat/session behavior, available-calculator display, better math/report rendering, HTML report downloads, and build/host metadata.
- Adds calculator availability detection during agent initialization and improves calculator selection, including xTB/TBLite alias handling.
- Expands agent workflows with human-in-the-loop support, single-agent routing tests, retry/session fixes, and safer state serialization.
- Adds and improves CLI, memory/session persistence, model routing, MCP client support, RAG, XANES, and evaluation tooling.
- Adds Kubernetes, GHCR, Streamlit Cloud, HPC, and MCP deployment documentation and examples.
- Improves CI reliability with dependency pins, Windows serializer fixes, Ruff cleanup, and expanded tests.

…tent

The file has always contained JSONC-style // comments and is loaded via _load_jsonc in chemgraph.academy.core.campaign. The .json extension was making IDEs flag the comments as parse errors. Rename to .jsonc so the extension matches the content; the package-data glob in pyproject.toml already includes *.jsonc, so the package install is unaffected.

Also updates the manifest in chemgraph.academy.campaigns.__init__, the example-002 notes, and tmpfile names in three academy tests for naming consistency.

Verified:
- chemgraph.academy.campaigns.resolve_campaign returns the new path.
- load_campaign reads it cleanly (5 agents, 1 mcp server).
- tests/test_academy_campaign.py, test_academy_compute_launcher.py, test_academy_exchange_registration.py: 17 passed.

Adds an optional `allowed_tools` field to ChemGraphAgentSpec that filters the tools an agent sees from its declared MCP servers. Empty (the default) keeps today's behavior of exposing every tool the agent's servers advertise. Non-empty restricts the agent to the named tools.

Why: the MCP-server-per-agent contract gates capability at the server level only. An agent declaring `mcp_servers: ["general"]` sees every tool that server exposes, even when only one or two are relevant to its mission. That weakens per-agent sandboxing (mission prompt becomes the enforcement) and bloats the LangChain tool catalog the LLM has to choose from.

Changes:
- ChemGraphAgentSpec gains `allowed_tools: tuple[str, ...] = ()`.
- Validator rejects duplicate entries and the case where allowed_tools is non-empty but mcp_servers is empty.
- MCPServerSupervisor.get_tools accepts allowed_tools: frozenset|None; when set, tools whose name is not in the whitelist are skipped, and whitelist entries that match nothing log a warning (so typos surface without failing the run).
- daemon.run threads agent_spec.allowed_tools through.
- example-002 campaign demonstrates the field: structure agents see only the SMILES tools, mace-agent sees only run_ase + extract_output_json.
- 4 new tests in test_academy_campaign.py for parse + validation.
- 4 new tests in test_academy_mcp_supervisor.py for filter behavior.
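
The validation and filtering rules listed above can be sketched in a few lines. The Tool dataclass, the function names, and the logger name are hypothetical stand-ins; the real logic lives in ChemGraphAgentSpec and MCPServerSupervisor.get_tools.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("allowed_tools_sketch")


@dataclass(frozen=True)
class Tool:
    """Stand-in for a tool object; only the name matters here."""

    name: str


def validate_allowed_tools(allowed_tools: tuple, mcp_servers: tuple) -> None:
    """Reject duplicates, and a whitelist with no servers to apply it to."""
    if len(set(allowed_tools)) != len(allowed_tools):
        raise ValueError("allowed_tools contains duplicate entries")
    if allowed_tools and not mcp_servers:
        raise ValueError("allowed_tools requires at least one MCP server")


def filter_tools(tools: list, allowed_tools: Optional[frozenset]) -> list:
    """Keep whitelisted tools; an empty or missing whitelist keeps everything."""
    if not allowed_tools:
        return list(tools)
    kept = [t for t in tools if t.name in allowed_tools]
    # Entries matching nothing are probably typos: warn, but do not fail.
    for name in sorted(allowed_tools - {t.name for t in tools}):
        logger.warning("allowed_tools entry %r matched no tool", name)
    return kept
```

Warning rather than raising on unmatched names keeps a long campaign running when one agent's whitelist has a typo, at the cost of that agent silently lacking a tool until someone reads the log.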

…ACE downloads

The in-process MACE path uses mace_mp(model="medium-mpa-0") which downloads its foundation model from GitHub on first use. Aurora and Polaris compute nodes can reach external sites only through the ALCF outbound proxy (proxy.alcf.anl.gov:3128); without these env vars the download hangs and the mace-agent reports failure.

Add http_proxy / https_proxy / no_proxy to the compute env block, and to the optional MACE pre-warm snippet, so the documented commands work out of the box on both systems. No code changes.

run_ase_core opens output_results_file for write at the end of the simulation without first ensuring its parent directory exists. Agents and CLI users routinely point the output at a not-yet-created nested subdirectory of a shared run dir; the simulation then runs to completion only to fail with FileNotFoundError: [Errno 2] when it tries to persist results. Compute time wasted, error message blames the wrong layer. Add a single os.makedirs(..., exist_ok=True) on the resolved parent right after the .json extension check. Idempotent, harmless when the directory already exists, and surfaces any permission problem before the calculator gets loaded. Hit this on example-002 polaris run 012: mace-agent received a mace_output_directory resource pointing at academy_mace_outputs/, the agent passed output_results_file as academy_mace_outputs/MOL-002.json, the directory did not exist on disk yet, every run_ase call failed identically, mace-agent retried, kept failing. Test in tests/test_mcp.py mocks load_calculator and runs run_ase_core against a tmp_path / "deeply/nested/output.json" target.
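
The fix amounts to creating the output file's parent directory up front. A minimal sketch, with hypothetical helper names (the real change is a single os.makedirs call inside run_ase_core):

```python
import json
import os


def ensure_parent_dir(output_results_file: str) -> str:
    """Create the parent directory of the output file if it is missing."""
    parent = os.path.dirname(os.path.abspath(output_results_file))
    # exist_ok=True makes the call idempotent when the directory exists.
    os.makedirs(parent, exist_ok=True)
    return parent


def save_results(output_results_file: str, results: dict) -> None:
    """Persist results, creating nested output directories as needed."""
    ensure_parent_dir(output_results_file)
    with open(output_results_file, "w", encoding="utf-8") as fh:
        json.dump(results, fh)
```

Calling ensure_parent_dir before the expensive step, as the commit does, means a permission problem shows up immediately instead of after the simulation has finished.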

resolve_campaign_resources rewrites shared_run resource paths to absolute locations under <run_dir>/shared/ but never actually creates the directories on disk. Tools that get pointed at one of these resources have to guess whether their parent exists; the in-process run_ase tool, for example, did not, and example-002 polaris run 012 saw every mace-agent call fail with FileNotFoundError for a path under the campaign-declared mace_output_directory.

Make academy uphold the natural contract: if a campaign declares a shared_run resource, the runtime guarantees the on-disk parent exists before any agent touches it. Specifically, after resolving the path:
- kind: directory -> mkdir -p the resolved path itself
- kind: file / json -> mkdir -p the resolved parent

The file itself stays the responsibility of the agent that writes it. mkdir is idempotent so per-rank repetition is harmless.

Test extends test_campaign_resources_resolve_to_shared_run_artifacts with on-disk assertions, and adds test_resolve_campaign_resources_skips_non_shared_run_paths confirming we do not create absolute / external paths.

The dashboard launcher previously required a separate `academy` source checkout on every remote system because it referenced `${academy_repo_root}/examples/09-polaris-lm-swarm/uan_http_relay.py` from the system profile. That dependency was undocumented in the example-002 e2e guide (which only tells users to sync ChemGraph), so a fresh user on Aurora hit "No such file or directory" trying to start the Mac-relay path.

Move the relay script into the chemgraph package as a runtime template (stdlib-only, no imports beyond socket/threading). The dashboard launcher now materializes it onto the remote at $REMOTE_ROOT/.chemgraph/uan_http_relay.py before starting the relay, via a one-line ssh stdin pipe. start_relay accepts the resulting path as an argument instead of computing it from profile state.

Side cleanup:
* SystemProfile no longer has academy_repo_root; both aurora.template.json and polaris.template.json drop the field.
* Polaris's relay_host_file used to land inside the academy checkout (`/academy/uan-relay-18186.host`); normalize to the same shape Aurora already used: directly under remote_root.
* Dashboard metadata no longer writes academy_repo_root either; nothing downstream consumed it.

Result: the second `academy` source checkout is no longer required on remote systems. Users only need ChemGraph synced. The Mac-relay path works the same way on any new host as long as the chemgraph package is installed.

102 academy + synth tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add ChemGraph Academy persistent agent runtime and dashboard

The MACE MCP server embedded each local structure inline and had the worker re-materialise it to /tmp on every tool call, regardless of backend. This is only needed for Globus Compute, whose workers run on a remote host. For local/parsl/ensemble_launcher backends (shared FS with the server) it was pure overhead -- extra serialization, redundant disk I/O, and a full_output read-back. Add a shares_filesystem capability to ExecutionBackend (True by default; False for Globus Compute, config-overridable via the shares_filesystem kwarg). The MACE transport hook now embeds inline only when the backend does not share the filesystem; the worker already no-ops its inline branch when the key is absent, so shared-FS backends read the input path directly.

The worker embedded the entire output JSON into the returned result as full_output when an inline structure was used. Results are already persisted to output_result_file, so this just bloated the tool response. Return only what run_mace_core produces; drop the now-unused json import.

Under a stdio MCP server, the server's stdout is the JSON-RPC channel. EnsembleLauncher prints lifecycle notices ("Sent SIGTERM to launcher process ...") to stdout during orchestrator shutdown, which corrupted the protocol stream and crashed the client's message parser with a ValidationError / BrokenResourceError after an otherwise successful run. Add a fd-level stdout->stderr redirect context manager and wrap the client/orchestrator teardown calls in shutdown() with it, so the notices go to stderr instead of the JSON-RPC channel. fd-level dup2 (matching LocalBackend's worker-stdout guard) catches library/subprocess writes, not just Python-level sys.stdout.
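
An fd-level redirect of the kind described here can be written with os.dup and os.dup2. The context manager below is a sketch, not the PR's implementation; its name is hypothetical, and it hard-codes file descriptors 1 and 2 rather than asking sys.stdout for its descriptor.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def stdout_to_stderr():
    """Temporarily point file descriptor 1 (stdout) at fd 2 (stderr).

    Working at the descriptor level also catches writes from C libraries
    and child processes that inherit fd 1, which replacing sys.stdout
    alone would miss.
    """
    sys.stdout.flush()  # do not let buffered text cross the redirect
    saved_fd = os.dup(1)  # remember where stdout originally pointed
    try:
        os.dup2(2, 1)  # fd 1 now refers to the same file as fd 2
        yield
    finally:
        sys.stdout.flush()
        os.dup2(saved_fd, 1)  # restore the original stdout
        os.close(saved_fd)
```

Wrapping only the teardown calls in this guard keeps the JSON-RPC channel clean during shutdown while leaving normal tool responses on stdout untouched.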

Layer an fd-level stdout->stderr redirect around shutdown() on top of the subprocess-based orchestrator rework. The orchestrator subprocess already redirects to DEVNULL, but two teardown paths can still print to the JSON-RPC stdout under a stdio MCP server: the in-process client.teardown() and the `el stop` helper (which inherits the parent's stdout). Wrapping the whole shutdown() in the fd guard covers both, preventing the ValidationError / BrokenResourceError crash after an otherwise successful run.

# Conflicts:
#   src/chemgraph/execution/ensemble_launcher_backend.py

Merge el_test into dev-globus-hpc: EL subprocess rework + HPC fixes

# Conflicts:
#   src/chemgraph/hpc_configs/aurora_parsl.py
#   src/chemgraph/tools/ase_core.py

demo_parsl_in_job_agent.py forced model_name="argo:gpt-5.4" and a local argoapi base_url, mirroring the temp config that had also landed in demo_ensemble_launcher_in_job_agent.py. Restore model selection from the --model flag (the amain(model=...) parameter) and drop the hardcoded base_url so the demo honours the user's chosen model.

TestELSystemConfigCrux asserted the exact Crux SystemConfig shape (ncpus==128, CPU-only) and the registry membership of "crux". These are tied to one specific machine's hardware layout and don't belong in the portable unit-test suite. The polaris references elsewhere are left as-is since they only pass "polaris" as an arbitrary system string to exercise generic GlobusCompute behaviour.

EnsembleLauncher is an optional, HPC-only dependency (not on PyPI for Python 3.12), so instantiating EnsembleLauncherBackend() raised ImportError and hard-failed all nine TestELBackend cases in any env without it. Add pytest.importorskip("ensemble_launcher") in setup_class so the class skips cleanly, matching the guard already used by the GlobusCompute tests.
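A class-level guard of this kind might look like the sketch below. `pytest.importorskip` is a real pytest helper; the test body and the stored `cls.el` attribute are placeholders for illustration and are not taken from the ChemGraph test suite.

```python
import pytest


class TestELBackend:
    @classmethod
    def setup_class(cls):
        # Skips every test in the class when the optional dependency is
        # missing, instead of failing each one with ImportError.
        cls.el = pytest.importorskip("ensemble_launcher")

    def test_module_available(self):
        # Placeholder assertion; the real cases exercise EnsembleLauncherBackend.
        assert self.el is not None
```

Because the skip is raised in `setup_class`, pytest reports the whole class as skipped with the missing module named in the reason, which keeps CI output readable on machines without the HPC-only package.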

Update existing backends (Parsl, EL, Globus Compute, Globus Transfer) to ChemGraph

tdpham2 mentioned this pull request May 7, 2026

Merge dev to main #124

Merged

tdpham2 added 2 commits May 7, 2026 16:20

tdpham2 force-pushed the dev-globus branch from 1831aa9 to 4878740 May 7, 2026 21:21

tdpham2 and others added 15 commits May 14, 2026 13:42

Update Globus config

a8feddb

Add inline structure for file transferring between local and globus remote

3144c6a

Modified the EL backend implementations, and added an EL backend test

ae7963d

Merge pull request #127 from argonne-lcf/dev-globus_HT

28432be

Modified the EL backend implementations, and added an EL backend test

tdpham2 and others added 10 commits June 2, 2026 09:56

Update Globus config

39f28a1

Add inline structure for file transferring between local and globus remote

b13fc53

Modified the EL backend implementations, and added an EL backend test

35a2d65

JinchuLi2002 and others added 29 commits June 15, 2026 09:51

added a one off logging to run_ase_core

a2e27a9

added ppn to task spec in demo el

f56713d

added try except block in demo chemistry

aefed79

Merge pull request #131 from JinchuLi2002/dev-globus

d3f3d41

Add ChemGraph Academy persistent agent runtime and dashboard

added -ppn and --ngpus_per_process to mcp demos

64d087a

added a counter in cg mcp to make task_ids unique

09f972b

adding some temp cg config for argo

507d518

moved el orchestrator to a subprocess

debdab4

added logging in el backend

b43d390

added better cleanup of el subprocess

3feedcd

Merge remote-tracking branch 'origin/dev-globus-hpc' into el_test

855be82

# Conflicts:
#   src/chemgraph/execution/ensemble_launcher_backend.py

Merge pull request #133 from argonne-lcf/el_test

7ec8dec

Merge el_test into dev-globus-hpc: EL subprocess rework + HPC fixes

Merge remote-tracking branch 'origin/dev-globus' into dev-globus-hpc

6741d2e

# Conflicts:
#   src/chemgraph/hpc_configs/aurora_parsl.py
#   src/chemgraph/tools/ase_core.py

Merge pull request #132 from argonne-lcf/dev-globus-hpc

3a541a4

Update existing backends (Parsl, EL, Globus Compute, Globus Transfer) to ChemGraph

tdpham2 changed the title ~~Add pluggable execution backends and backend-agnostic HPC MCP servers~~ Add pluggable execution backends, HPC MCP servers, and Academy multi-agent campaigns Jun 18, 2026

Add pluggable execution backends, HPC MCP servers, and Academy multi-agent campaigns #120
tdpham2 wants to merge 123 commits into dev from dev-globus

3 participants


What this does

1. Execution backends (src/chemgraph/execution/)

2. HPC MCP servers (src/chemgraph/mcp/)

3. Academy multi-agent campaigns (src/chemgraph/academy/)

Supporting changes

New dependencies (all optional extras)

Testing

Test plan
