Add pluggable execution backends, HPC MCP servers, and Academy multi-agent campaigns#120
Open
tdpham2 wants to merge 123 commits into
Merged
…us Compute Introduce a unified execution module with an abstract ExecutionBackend interface and TaskSpec model, supporting four backends: local (ProcessPoolExecutor), Parsl, EnsembleLauncher, and Globus Compute. Includes config factory with resolution order (args > env > config.toml), HPC configs loader, comprehensive tests, and pytest --run-globus-compute option for live endpoint tests.
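The resolution order named above (args > env > config.toml) can be sketched as follows. This is a minimal illustration, not ChemGraph's actual `get_backend()`: the function name `resolve_backend_name` and the `backend` key under `[execution]` are assumptions made for the example, while `CHEMGRAPH_EXECUTION_BACKEND` and the `"local"` fallback come from the PR description.

```python
import os
from typing import Any, Mapping, Optional


def resolve_backend_name(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    config: Optional[Mapping[str, Any]] = None,
) -> str:
    """Pick a backend name: explicit arg > env var > config table > "local"."""
    if explicit:
        return explicit
    env = os.environ if env is None else env
    from_env = env.get("CHEMGRAPH_EXECUTION_BACKEND")
    if from_env:
        return from_env
    # `config` stands in for a parsed config.toml; the "backend" key is assumed.
    section = (config or {}).get("execution") or {}
    return section.get("backend") or "local"
```

Passing `env` and `config` as plain mappings keeps the precedence logic testable without touching the real environment or the filesystem.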
Remove dead num_nodes=1 after raise in aurora_parsl.py and fix misleading error message. Set _initialized=False at start of EnsembleLauncherBackend.shutdown() to prevent submitting to a partially torn-down backend.
When backend=globus_compute, MCP tools now return immediately after submitting jobs to the remote HPC endpoint instead of blocking until completion. A new JobTracker tracks submitted futures across tool calls, and new MCP tools (check_job_status, get_job_results, list_jobs, cancel_job) let the LLM agent poll for progress and retrieve results. Non-Globus backends (local, Parsl, EnsembleLauncher) are unchanged and continue to block until results are ready.
Key changes:
- Add is_async_remote property to ExecutionBackend (True for Globus)
- Add check_endpoint_status() health check to GlobusComputeBackend
- Add JobTracker with batch registration, status, results, cleanup
- Add submit_or_gather() utility that branches on backend type
- Add optional timeout parameter to gather_futures()
- Add register_job_tools() to wire job tools into any MCP server
- Integrate tracker into MACE, XANES, and gRASPA MCP servers
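The branch on backend type might look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: `SketchBackend` and `SketchTracker` are hypothetical stand-ins built on a thread pool, and the signature of `submit_or_gather` here is invented for the example. Only the idea (return a batch id when `is_async_remote` is true, otherwise block and return results) is taken from the commit message.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable, List


class SketchBackend:
    """Hypothetical backend: a thread pool plus an is_async_remote flag."""

    def __init__(self, is_async_remote: bool) -> None:
        self.is_async_remote = is_async_remote
        self._pool = ThreadPoolExecutor(max_workers=2)

    def submit(self, fn: Callable[[Any], Any], item: Any) -> Future:
        return self._pool.submit(fn, item)

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


class SketchTracker:
    """Hypothetical tracker: remembers futures per batch across calls."""

    def __init__(self) -> None:
        self._batches: Dict[str, List[Future]] = {}

    def register(self, futures: Iterable[Future]) -> str:
        batch_id = uuid.uuid4().hex[:8]
        self._batches[batch_id] = list(futures)
        return batch_id

    def status(self, batch_id: str) -> Dict[str, int]:
        futures = self._batches[batch_id]
        return {"total": len(futures), "done": sum(f.done() for f in futures)}

    def results(self, batch_id: str) -> List[Any]:
        return [f.result() for f in self._batches[batch_id]]


def submit_or_gather(backend, tracker, fn, items) -> Dict[str, Any]:
    futures = [backend.submit(fn, item) for item in items]
    if backend.is_async_remote:
        # Return at once; the caller polls the tracker with the batch id.
        return {"status": "submitted", "batch_id": tracker.register(futures)}
    # Blocking backends wait here and hand back the results directly.
    return {"status": "completed", "results": [f.result() for f in futures]}
```

In the async case the tool response carries only the batch id, which is what lets a later, separate tool call look up status or results.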
- Add CGFastMCP: FastMCP subclass with integrated execution backend, lazy init, built-in job tools, @tool() and @ensemble_tool() decorators
- Refactor EnsembleLauncherBackend with client-only mode (shared orchestrator via checkpoint_dir) and managed mode
- Update get_backend() to route client_only vs managed EL initialization
- Rewrite mace_mcp_hpc.py to use CGFastMCP decorators
- Clean up parsl_tools.py: remove dead code, use stdlib logging
- Fix __main__ pickle issue via _fix_module_for_pickle + sys.modules alias
- Add client-only mode demo cell to notebook 3
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Modified the EL backend implementations, and added an EL backend test
…mport
- parsl_tools.run_mace_core: stop swallowing exceptions and returning None. run_ase_core already returns a structured failure dict on simulation errors, and programmer errors should propagate.
- cg_fastmcp.ensemble_tool: raise TypeError with a clear message when the decorated function does not have exactly one parameter, instead of crashing with IndexError at decoration time.
- ensemble_launcher_backend: soft-import ensemble_launcher and defer the failure to construction / call time. SYSTEM_CONFIG_REGISTRY is now a lazy view backed by builder functions so the module loads cleanly without EL installed, restoring the deferred-error behaviour callers of chemgraph.execution.config expected.
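The parameter-count check on the ensemble decorator can be illustrated with `inspect.signature`. This is a sketch only: `ensemble_tool_sketch` is a hypothetical stand-in, and the real `cg_fastmcp.ensemble_tool` submits to a backend rather than looping in-process.

```python
import functools
import inspect
from typing import Any, Callable, List


def ensemble_tool_sketch(func: Callable[[Any], Any]) -> Callable[[List[Any]], List[Any]]:
    """Wrap a one-argument function so it maps over a list of inputs."""
    params = inspect.signature(func).parameters
    if len(params) != 1:
        # Fail at decoration time with a readable message instead of an
        # IndexError from indexing into an empty parameter list.
        raise TypeError(
            f"{func.__name__} must take exactly one parameter, got {len(params)}"
        )

    @functools.wraps(func)
    def wrapper(items: List[Any]) -> List[Any]:
        return [func(item) for item in items]

    return wrapper
```

Counting the parameters before touching any of them is what turns the opaque failure into a message that names the offending function.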
- persist_file parameter: when set, batch metadata and Globus Compute task UUIDs are written to JSON after registration and after results are cached, and loaded on init. Allows MCP servers to recover job state across restarts.
- TrackedTask.globus_task_id and TrackedTask.future are both optional; loaded-from-disk batches have no in-memory Future and are queried via the Globus Compute Client directly in get_status.
- Lazy Globus Compute Client with a separate gc_lock for thread safety.
- _wait_for_globus_task_ids polls each ComputeFuture briefly after submission to capture the Globus task_id assigned asynchronously by the Executor background thread.
- cancel_batch / cleanup_old_batches handle the no-future case.
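A persist-and-reload cycle of this kind could be sketched as below. The class `PersistentBatchStore`, its JSON layout, and the write-then-rename step are assumptions chosen for illustration; the actual JobTracker stores richer metadata and talks to the Globus Compute Client.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


class PersistentBatchStore:
    """Hypothetical store mapping batch_id -> task ids, mirrored to JSON."""

    def __init__(self, persist_file: Optional[str] = None) -> None:
        self._path = Path(persist_file).expanduser() if persist_file else None
        self.batches: Dict[str, List[str]] = {}
        # Load whatever an earlier server process left behind.
        if self._path is not None and self._path.exists():
            self.batches = json.loads(self._path.read_text())

    def register(self, batch_id: str, task_ids: List[str]) -> None:
        self.batches[batch_id] = list(task_ids)
        self._save()

    def _save(self) -> None:
        if self._path is None:
            return  # in-memory only, nothing to persist
        self._path.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temp file, then rename, so a crash mid-write does not
        # leave a truncated JSON file behind.
        fd, tmp = tempfile.mkstemp(dir=str(self._path.parent), suffix=".tmp")
        with os.fdopen(fd, "w") as handle:
            json.dump(self.batches, handle)
        os.replace(tmp, self._path)
```

A second instance created with the same `persist_file` sees the batches registered by the first, which is the restart-recovery behaviour the commit describes.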
- init_backend now accepts tracker_kwargs= and forwards it to JobTracker(...) in _ensure_backend. Callers can pass persist_file= so MCP servers recover job state across restarts.
- set_pre_submit_hook(hook): hook receives each TaskSpec before backend.submit() and returns a (possibly mutated) one. Lets a server centralise transport concerns -- inline-structure embedding for local-submit-to-remote-worker, remote-path rewriting -- instead of repeating that logic in every tool body. Wired into the @tool, @ensemble_tool, and @schema_fanout_tool submit paths.
- @schema_fanout_tool(worker=...): the decorated function is an expander (ensemble schema -> list of per-item args). The framework calls worker(item) on the backend for each item and gathers results. Preserves the ensemble schema as the agent-facing API (one tool call, server-side fanout), complementing @ensemble_tool which exposes list[Schema] for callers that want client-side enumeration.
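The pre-submit hook contract (the hook receives a task spec and returns a possibly modified one before submission) can be sketched as follows. `TaskSpecSketch`, `ServerSketch`, the `/remote/stage/` prefix, and the hook `rewrite_to_remote` are all hypothetical; the real TaskSpec carries more fields, and submission goes to an execution backend rather than running inline.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, Optional, Tuple


@dataclass(frozen=True)
class TaskSpecSketch:
    """Hypothetical, reduced task spec: an id, a callable, and its args."""

    task_id: str
    callable: Callable[..., Any]
    args: Tuple[Any, ...] = ()


Hook = Callable[[TaskSpecSketch], TaskSpecSketch]


class ServerSketch:
    def __init__(self) -> None:
        self._hook: Optional[Hook] = None

    def set_pre_submit_hook(self, hook: Hook) -> None:
        self._hook = hook

    def submit(self, task: TaskSpecSketch) -> Any:
        # Every submit path funnels through the hook, so transport logic
        # lives in one place instead of in each tool body.
        if self._hook is not None:
            task = self._hook(task)
        # A real backend would ship the task elsewhere; here it runs inline.
        return task.callable(*task.args)


def rewrite_to_remote(task: TaskSpecSketch) -> TaskSpecSketch:
    """Example hook: point every path argument at an assumed staging dir."""
    remote_args = tuple(f"/remote/stage/{arg}" for arg in task.args)
    return replace(task, args=remote_args)
```

Because the sketch uses a frozen dataclass, the hook returns a new spec via `dataclasses.replace` and the caller's original object is left unchanged.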
…ersistence
- mace_input_schema_ensemble / graspa_input_schema_ensemble: new remote_structure_directory field for pre-staged HPC files (paired with the upcoming transfer_files tool). input_structure_directory now defaults to empty string so callers can pass either.
- mace_input_schema/_ensemble model description spells out that 'mace_mp' is the calculator type, not a model name -- LLMs were confusing the two.
- Nullable schema fields (driver, model, wall_time) typed as str|None / float|None for correct OpenAPI schema generation.
- GlobusComputeBackend._ensure_executor re-creates the Executor when it has been shut down (e.g. after a remote task failure). Uses getattr() so we don't depend on the SDK's private _stopped attr existing.
- check_endpoint_status logs exc_info on failure for easier debugging.
- xanes_mcp_hpc: JobTracker(persist_file=~/.chemgraph/xanes_jobs.json) so XANES job state survives MCP server restarts. Instructions updated to tell the LLM to surface batch_ids to the user.
- execution/globus_transfer.py: GlobusTransferManager wraps the globus_sdk TransferClient with token caching, batched transfer_files / wait_for_transfer / check_transfer_status / list_remote_directory. Lazy globus_sdk import, lazy auth.
- execution/config.get_transfer_manager(): builds a manager from [execution.globus_transfer] in config.toml with env var overrides (GLOBUS_TRANSFER_SOURCE_ENDPOINT_ID, _DESTINATION_ENDPOINT_ID, _DESTINATION_BASE_PATH). Returns None when not configured so MCP servers can skip registration silently.
- mcp/transfer_tools.register_transfer_tools(): registers transfer_files, check_transfer_status, list_remote_files on a FastMCP/CGFastMCP server. Uses mcp.add_tool() (not the backend-submitting @tool() decorator) because these are orchestration tools, not compute tasks -- they call the Globus Transfer API directly from the MCP server process.
- get_backend() globus_compute endpoint_id fallback now treats empty-string endpoint_id as unset, matching the GLOBUS_COMPUTE_ENDPOINT_ID env-var override behaviour.
…GFastMCP
run_mace_single and run_mace_ensemble were collapsed to bare run_mace_core(params) calls in PR #127, dropping inline-structure embedding, remote-path support, JobTracker persistence, and the Globus Transfer registration that 51ba171 had built. This restores all of that on top of the new CGFastMCP framework.
- Worker is now a separate function _mace_worker(job: dict) that handles two transport keys on the worker FS: remote_structure_file (use the path directly) and inline_structure (materialise an AtomsData dict to a temp XYZ). Embeds full_output back into the result for inline calls so callers do not need remote FS access.
- Pre-submit hook _mace_transport_hook centralises the schema -> job-dict conversion, mace_mp -> medium-mpa-0 model normalisation, and inline embedding (when the input file exists on the submitting host). Hook rewrites task.callable from run_mace_single to _mace_worker so the LLM still sees a clean schema-shaped tool.
- run_mace_ensemble switches to @schema_fanout_tool with a server-side expander, preserving the directory-driven UX (single LLM call instead of N). Local mode enumerates files via resolve_structure_files; remote mode submits a backend probe to ls remote_structure_directory and builds remote_structure_file per item.
- extract_output_json registered via mcp.add_tool() (orchestration, no backend wrap). transfer_files/check_transfer_status/list_remote_files registered conditionally when get_transfer_manager() finds [execution.globus_transfer] config.
- __main__ now wires tracker_kwargs={persist_file: ~/.chemgraph/mace_jobs.json} so MACE batches survive MCP server restarts.
- Drop `from __future__ import annotations`: forward refs break FastMCP's signature introspection because the wrapper's __globals__ is cg_fastmcp's, not the tool module's.
Wraps the Academy distributed agent framework with ChemGraph LLM agents for federated HPC screening workflows. Decoupled from the existing pipeline -- no chemgraph.cli / chemgraph.agent / chemgraph.eval references; only the lazily imported chemgraph.agent.llm_agent.ChemGraph.
- ChemGraphAgent: Academy Agent wrapping a single ChemGraph instance, exposes run_query / get_info actions.
- ScreeningAgent: iterates a molecule batch, writes per-result JSONs for fault-tolerant aggregation. Failed-molecule records now store str(exc) so the actual exception message survives.
- CoordinatorAgent: polls a results dir, optionally analyses results via an LLM, suggests follow-up molecules.
- AcademyConfig + build_manager: bridge config.toml to Academy Manager / Exchange / Launcher (local, Redis, Parsl, Globus Compute).
- RateLimiter: stdlib async token-bucket for shared per-provider LLM quotas across agents.
Lazy imports in __init__.py let the package load without the optional academy-py dependency; ChemGraphAgent / ScreeningAgent / CoordinatorAgent raise ModuleNotFoundError on access if academy-py is missing, while AcademyConfig and RateLimiter remain usable.
pyproject's academy optional-dep + pytest marker are already in HEAD (commit 04bcc8a). tests/test_academy.py and scripts/academy_example/ remain untracked and will land in follow-ups.
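For readers unfamiliar with the technique, a stdlib-only async token bucket can be sketched as below. This is not the PR's RateLimiter: the class name `TokenBucketSketch`, its constructor parameters, and the choice to hold the lock while waiting are assumptions made for the example.

```python
import asyncio
import time


class TokenBucketSketch:
    """Async token bucket: `rate` tokens per second, at most `capacity` stored."""

    def __init__(self, rate: float, capacity: float) -> None:
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self._tokens = float(capacity)
        self._updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self, tokens: float = 1.0) -> None:
        if tokens > self.capacity:
            raise ValueError("cannot acquire more tokens than the bucket holds")
        async with self._lock:
            while True:
                # Refill according to the time elapsed since the last update.
                now = time.monotonic()
                elapsed = now - self._updated
                self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
                self._updated = now
                if self._tokens >= tokens:
                    self._tokens -= tokens
                    return
                # Sleep roughly until the deficit is refilled, then re-check.
                await asyncio.sleep((tokens - self._tokens) / self.rate)
```

Several agents sharing one bucket instance inside the same event loop would draw from a common quota; waiters queue on the lock, so requests are served one at a time. Creating the bucket inside a running coroutine avoids event-loop binding issues on older Python versions.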
- globus_transfer.py: disambiguate same-basename inputs with a numeric suffix so two files that share a name (e.g. /a/in.cif and /b/in.cif) don't silently overwrite each other on the remote collection.
- job_tracker.py: promote the "no Globus task_id within timeout" message to a warning at submit time, and emit a per-task warning at reload time for batches restored without a task_id (those tasks cannot be queried via the Globus Compute API and would otherwise be silently orphaned across server restarts).
- globus_compute_backend.py: catch "executor stopped" exceptions in submit(), rebuild the Executor, and retry once. The previous _ensure_executor relied on the SDK's private _stopped attribute, which fails silently if the SDK exposes the shutdown state differently.
- cg_fastmcp.py: wrap _apply_pre_submit_hook in try/except and re-raise hook failures as a ValueError naming the hook and task_id so they surface as a structured tool error instead of an opaque traceback.
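The basename disambiguation in the first item can be illustrated with a short helper. The function name `disambiguate_basenames` and the `_<n>` suffix format are assumptions; the commit only states that a numeric suffix is used.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, Set


def disambiguate_basenames(paths: Iterable[str]) -> Dict[str, str]:
    """Map each source path to a remote file name that is unique in the batch."""
    used: Set[str] = set()
    mapping: Dict[str, str] = {}
    for path in paths:
        parsed = PurePosixPath(path)
        candidate = parsed.name
        counter = 0
        # Keep bumping the suffix until the name collides with nothing so far,
        # including names generated for earlier files.
        while candidate in used:
            counter += 1
            candidate = f"{parsed.stem}_{counter}{parsed.suffix}"
        used.add(candidate)
        mapping[path] = candidate
    return mapping
```

With the commit's own example, `/a/in.cif` keeps the name `in.cif` and `/b/in.cif` becomes `in_1.cif`, so neither overwrites the other in a flat destination directory.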
Both servers now mirror the mace_mcp_hpc.py pattern:
- CGFastMCP with lazy backend initialisation via init_backend(); the
worker subprocesses re-importing the module no longer instantiate a
backend at import time.
- Job-management tools (check_job_status, get_job_results, list_jobs,
cancel_job, check_endpoint_status) are auto-registered by
CGFastMCP._register_job_tools; the external register_job_tools call
is dropped.
- __main__ wires init_backend(tracker_kwargs={"persist_file": ...})
and pairs run_mcp_server with shutdown_backend in finally. This also
closes a real bug in graspa_mcp_hpc.py, which was instantiating
JobTracker() with no persist_file and silently losing job state
across restarts despite the server's instructions promising
persistence.
- Globus Transfer tools (transfer_files, check_transfer_status,
list_remote_files) are registered on both servers when the transfer
manager is configured, matching the existing MACE behaviour.
- gRASPA expander now supports remote_structure_directory the same way
MACE does: a one-shot probe task lists CIFs on the remote endpoint
and the worker reads them directly from the staged path.
- Ensemble flows use the schema_fanout_tool decorator; per-job structure
metadata is propagated through the worker output (since the framework
meta is only the index).
Legacy *_mcp_parsl.py modules now raise a DeprecationWarning at import
pointing to the *_hpc.py replacement; they remain functional because
scripts/mcp_xanes_example/ still imports xanes_mcp_parsl.
keceli added a commit that referenced this pull request Jun 2, 2026
- Update ChemGraph container and add Kubernetes deployment. (Add Kubernetes deployment support #123)
- Add execution layer to ChemGraph, including EnsembleLauncher, Parsl, Globus compute and Globus transfer. (Add pluggable execution backends and backend-agnostic HPC MCP servers #120; Modified the EL backend implementations, and added an EL backend test #127)
- Add initial Academy integration. (Add pluggable execution backends and backend-agnostic HPC MCP servers #120)
- Updates package metadata to version 0.5.0 and fixes source-checkout version reporting so UI deployments do not show unknown.
- Modernizes the Streamlit UI with modular pages, improved chat/session behavior, available-calculator display, better math/report rendering, HTML report downloads, and build/host metadata.
- Adds calculator availability detection during agent initialization and improves calculator selection, including xTB/TBLite alias handling.
- Expands agent workflows with human-in-the-loop support, single-agent routing tests, retry/session fixes, and safer state serialization.
- Adds and improves CLI, memory/session persistence, model routing, MCP client support, RAG, XANES, and evaluation tooling.
- Adds Kubernetes, GHCR, Streamlit Cloud, HPC, and MCP deployment documentation and examples.
- Improves CI reliability with dependency pins, Windows serializer fixes, Ruff cleanup, and expanded tests.
…tent
The file has always contained JSONC-style // comments and is loaded via
_load_jsonc in chemgraph.academy.core.campaign. The .json extension was
making IDEs flag the comments as parse errors. Rename to .jsonc so the
extension matches the content; the package-data glob in pyproject.toml
already includes *.jsonc, so the package install is unaffected.
Also updates the manifest in chemgraph.academy.campaigns.__init__, the
example-002 notes, and tmpfile names in three academy tests for naming
consistency.
Verified:
- chemgraph.academy.campaigns.resolve_campaign returns the new path.
- load_campaign reads it cleanly (5 agents, 1 mcp server).
- tests/test_academy_campaign.py, test_academy_compute_launcher.py,
test_academy_exchange_registration.py: 17 passed.
Adds an optional `allowed_tools` field to ChemGraphAgentSpec that
filters the tools an agent sees from its declared MCP servers. Empty
(the default) keeps today's behavior of exposing every tool the agent's
servers advertise. Non-empty restricts the agent to the named tools.
Why: the MCP-server-per-agent contract gates capability at the server
level only. An agent declaring `mcp_servers: ["general"]` sees every
tool that server exposes, even when only one or two are relevant to
its mission. That weakens per-agent sandboxing (mission prompt becomes
the enforcement) and bloats the LangChain tool catalog the LLM has to
choose from.
Changes:
- ChemGraphAgentSpec gains `allowed_tools: tuple[str, ...] = ()`.
- Validator rejects duplicate entries and the case where allowed_tools
is non-empty but mcp_servers is empty.
- MCPServerSupervisor.get_tools accepts allowed_tools: frozenset|None;
when set, tools whose name is not in the whitelist are skipped, and
whitelist entries that match nothing log a warning (so typos surface
without failing the run).
- daemon.run threads agent_spec.allowed_tools through.
- example-002 campaign demonstrates the field: structure agents see
only the SMILES tools, mace-agent sees only run_ase + extract_output_json.
- 4 new tests in test_academy_campaign.py for parse + validation.
- 4 new tests in test_academy_mcp_supervisor.py for filter behavior.
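The whitelist filter described in the changes above might be sketched like this. `ToolSketch` and `filter_tools` are hypothetical names; the real logic lives in MCPServerSupervisor.get_tools and operates on LangChain tool objects.

```python
import logging
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional

logger = logging.getLogger("allowed_tools_sketch")


@dataclass(frozen=True)
class ToolSketch:
    """Hypothetical stand-in for a tool object that exposes a name."""

    name: str


def filter_tools(
    tools: Iterable[ToolSketch],
    allowed_tools: Optional[FrozenSet[str]] = None,
) -> List[ToolSketch]:
    tools = list(tools)
    if not allowed_tools:
        # Empty or None keeps the default: expose every advertised tool.
        return tools
    kept = [tool for tool in tools if tool.name in allowed_tools]
    # Whitelist entries that matched nothing are probably typos; warn, don't fail.
    unmatched = allowed_tools - {tool.name for tool in kept}
    for name in sorted(unmatched):
        logger.warning("allowed_tools entry %r matched no advertised tool", name)
    return kept
```

Warning instead of raising means a misspelled entry shrinks the agent's tool set without aborting the campaign, which matches the stated intent that typos surface without failing the run.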
…ACE downloads The in-process MACE path uses mace_mp(model="medium-mpa-0") which downloads its foundation model from GitHub on first use. Aurora and Polaris compute nodes can reach external sites only through the ALCF outbound proxy (proxy.alcf.anl.gov:3128); without these env vars the download hangs and the mace-agent reports failure. Add http_proxy / https_proxy / no_proxy to the compute env block, and to the optional MACE pre-warm snippet, so the documented commands work out of the box on both systems. No code changes.
run_ase_core opens output_results_file for write at the end of the simulation without first ensuring its parent directory exists. Agents and CLI users routinely point the output at a not-yet-created nested subdirectory of a shared run dir; the simulation then runs to completion only to fail with FileNotFoundError: [Errno 2] when it tries to persist results. Compute time wasted, error message blames the wrong layer. Add a single os.makedirs(..., exist_ok=True) on the resolved parent right after the .json extension check. Idempotent, harmless when the directory already exists, and surfaces any permission problem before the calculator gets loaded. Hit this on example-002 polaris run 012: mace-agent received a mace_output_directory resource pointing at academy_mace_outputs/, the agent passed output_results_file as academy_mace_outputs/MOL-002.json, the directory did not exist on disk yet, every run_ase call failed identically, mace-agent retried, kept failing. Test in tests/test_mcp.py mocks load_calculator and runs run_ase_core against a tmp_path / "deeply/nested/output.json" target.
resolve_campaign_resources rewrites shared_run resource paths to absolute locations under <run_dir>/shared/ but never actually creates the directories on disk. Tools that get pointed at one of these resources have to guess whether their parent exists; the in-process run_ase tool, for example, did not, and example-002 polaris run 012 saw every mace-agent call fail with FileNotFoundError for a path under the campaign-declared mace_output_directory.
Make academy uphold the natural contract: if a campaign declares a shared_run resource, the runtime guarantees the on-disk parent exists before any agent touches it. Specifically, after resolving the path:
- kind: directory -> mkdir -p the resolved path itself
- kind: file / json -> mkdir -p the resolved parent
The file itself stays the responsibility of the agent that writes it. mkdir is idempotent so per-rank repetition is harmless.
Test extends test_campaign_resources_resolve_to_shared_run_artifacts with on-disk assertions, and adds test_resolve_campaign_resources_skips_non_shared_run_paths confirming we do not create absolute / external paths.
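The directory-versus-parent rule can be sketched in a few lines. The helper name `ensure_resource_location` is an assumption; the three kinds (`directory`, `file`, `json`) are the ones named in the commit message.

```python
from pathlib import Path


def ensure_resource_location(resolved: str, kind: str) -> Path:
    """Create the directory a resolved shared_run resource needs on disk."""
    path = Path(resolved)
    if kind == "directory":
        target = path  # the resource itself is a directory
    elif kind in ("file", "json"):
        target = path.parent  # the writing agent creates the file itself
    else:
        raise ValueError(f"unknown resource kind: {kind!r}")
    # Equivalent to `mkdir -p`: idempotent, so repeated calls are harmless.
    target.mkdir(parents=True, exist_ok=True)
    return target
```

Only the directory is created for `file` and `json` resources, so an agent can still distinguish "no result written yet" from "result present".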
The dashboard launcher previously required a separate `academy` source
checkout on every remote system because it referenced
`${academy_repo_root}/examples/09-polaris-lm-swarm/uan_http_relay.py`
from the system profile. That dependency was undocumented in the
example-002 e2e guide (which only tells users to sync ChemGraph), so a
fresh user on Aurora hit "No such file or directory" trying to start
the Mac-relay path.
Move the relay script into the chemgraph package as a runtime template
(stdlib-only, no imports beyond socket/threading). The dashboard
launcher now materializes it onto the remote at
$REMOTE_ROOT/.chemgraph/uan_http_relay.py before starting the relay,
via a one-line ssh stdin pipe. start_relay accepts the resulting path
as an argument instead of computing it from profile state.
Side cleanup:
* SystemProfile no longer has academy_repo_root; both
aurora.template.json and polaris.template.json drop the field.
* Polaris's relay_host_file used to land inside the academy checkout
(`/academy/uan-relay-18186.host`); normalize to the same shape Aurora
already used: directly under remote_root.
* Dashboard metadata no longer writes academy_repo_root either; nothing
downstream consumed it.
Result: the second `academy` source checkout is no longer required on
remote systems. Users only need ChemGraph synced. The Mac-relay path
works the same way on any new host as long as the chemgraph package is
installed.
102 academy + synth tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add ChemGraph Academy persistent agent runtime and dashboard
The MACE MCP server embedded each local structure inline and had the worker re-materialise it to /tmp on every tool call, regardless of backend. This is only needed for Globus Compute, whose workers run on a remote host. For local/parsl/ensemble_launcher backends (shared FS with the server) it was pure overhead -- extra serialization, redundant disk I/O, and a full_output read-back. Add a shares_filesystem capability to ExecutionBackend (True by default; False for Globus Compute, config-overridable via the shares_filesystem kwarg). The MACE transport hook now embeds inline only when the backend does not share the filesystem; the worker already no-ops its inline branch when the key is absent, so shared-FS backends read the input path directly.
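The capability check reduces to one branch in the transport hook. The sketch below is illustrative only: `build_job` and the `input_structure_file` key are assumptions, and the inline payload is plain file text here, whereas the real hook embeds an AtomsData dict.

```python
from pathlib import Path
from typing import Any, Dict


def build_job(input_path: str, shares_filesystem: bool) -> Dict[str, Any]:
    """Build a worker job dict, embedding the structure only when needed."""
    job: Dict[str, Any] = {"input_structure_file": input_path}
    # Workers on a shared filesystem can open the path themselves; only a
    # remote worker needs the content shipped inside the payload.
    if not shares_filesystem and Path(input_path).is_file():
        job["inline_structure"] = Path(input_path).read_text()
    return job
```

When the `inline_structure` key is absent the worker has nothing to materialise, which is how the shared-filesystem backends avoid the extra serialization and disk I/O.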
The worker embedded the entire output JSON into the returned result as full_output when an inline structure was used. Results are already persisted to output_result_file, so this just bloated the tool response. Return only what run_mace_core produces; drop the now-unused json import.
Under a stdio MCP server, the server's stdout is the JSON-RPC channel.
EnsembleLauncher prints lifecycle notices ("Sent SIGTERM to launcher
process ...") to stdout during orchestrator shutdown, which corrupted the
protocol stream and crashed the client's message parser with a
ValidationError / BrokenResourceError after an otherwise successful run.
Add a fd-level stdout->stderr redirect context manager and wrap the
client/orchestrator teardown calls in shutdown() with it, so the notices
go to stderr instead of the JSON-RPC channel. fd-level dup2 (matching
LocalBackend's worker-stdout guard) catches library/subprocess writes,
not just Python-level sys.stdout.
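A file-descriptor-level redirect of this kind can be written with `os.dup` and `os.dup2`. The following is a generic sketch of the technique, not the code added in this commit; the name `stdout_to_stderr_fd` is an assumption.

```python
import contextlib
import os
import sys
from typing import Iterator


@contextlib.contextmanager
def stdout_to_stderr_fd() -> Iterator[None]:
    """Temporarily point fd 1 (stdout) at whatever fd 2 (stderr) refers to."""
    sys.stdout.flush()  # drain Python-level buffers before swapping the fd
    saved_stdout = os.dup(1)  # keep a handle on the real stdout
    try:
        os.dup2(2, 1)  # from here on, any write to fd 1 lands on stderr
        yield
    finally:
        sys.stdout.flush()
        os.dup2(saved_stdout, 1)  # restore the original stdout
        os.close(saved_stdout)
```

Because the swap happens on the descriptor rather than on `sys.stdout`, it also affects C libraries and child processes that inherit fd 1, which reassigning `sys.stdout` alone would not catch.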
Layer an fd-level stdout->stderr redirect around shutdown() on top of the subprocess-based orchestrator rework. The orchestrator subprocess already redirects to DEVNULL, but two teardown paths can still print to the JSON-RPC stdout under a stdio MCP server: the in-process client.teardown() and the `el stop` helper (which inherits the parent's stdout). Wrapping the whole shutdown() in the fd guard covers both, preventing the ValidationError / BrokenResourceError crash after an otherwise successful run.
# Conflicts: # src/chemgraph/execution/ensemble_launcher_backend.py
Merge el_test into dev-globus-hpc: EL subprocess rework + HPC fixes
# Conflicts: # src/chemgraph/hpc_configs/aurora_parsl.py # src/chemgraph/tools/ase_core.py
demo_parsl_in_job_agent.py forced model_name="argo:gpt-5.4" and a local argoapi base_url, mirroring the temp config that had also landed in demo_ensemble_launcher_in_job_agent.py. Restore model selection from the --model flag (the amain(model=...) parameter) and drop the hardcoded base_url so the demo honours the user's chosen model.
TestELSystemConfigCrux asserted the exact Crux SystemConfig shape (ncpus==128, CPU-only) and the registry membership of "crux". These are tied to one specific machine's hardware layout and don't belong in the portable unit-test suite. The polaris references elsewhere are left as-is since they only pass "polaris" as an arbitrary system string to exercise generic GlobusCompute behaviour.
EnsembleLauncher is an optional, HPC-only dependency (not on PyPI for
Python 3.12), so instantiating EnsembleLauncherBackend() raised
ImportError and hard-failed all nine TestELBackend cases in any env
without it. Add pytest.importorskip("ensemble_launcher") in setup_class
so the class skips cleanly, matching the guard already used by the
GlobusCompute tests.
Update existing backends (Parsl, EL, Globus Compute, Globus Transfer) to ChemGraph
What this does
This PR makes ChemGraph's compute layer portable across HPC systems and adds a multi-agent campaign runtime on top of it. It has three parts:
- Execution backends (`src/chemgraph/execution/`)
- HPC MCP servers (`src/chemgraph/mcp/`)
- Academy multi-agent campaigns (`src/chemgraph/academy/`)
The original `*_parsl.py` servers and the `ase_core` tool layer are untouched.
1. Execution backends (`src/chemgraph/execution/`)
A clean `ExecutionBackend` interface with four implementations: local, Parsl, EnsembleLauncher, and Globus Compute. The local backend uses `ProcessPoolExecutor`, has zero external deps, and is good for dev/CI.
Key modules:
- `base.py`: `ExecutionBackend` (ABC) + `TaskSpec` (unified Python-callable / shell task with resource hints). Backends expose `is_async_remote` and `shares_filesystem` so callers adapt without knowing the implementation.
- `config.py`: `get_backend()` / `get_transfer_manager()` factories. Selection hierarchy: explicit arg > env var (`CHEMGRAPH_EXECUTION_BACKEND`, `COMPUTE_SYSTEM`) > `config.toml` `[execution]` > `"local"`.
- `local_backend.py`, `parsl_backend.py`, `ensemble_launcher_backend.py`, `globus_compute_backend.py`
- `utils.py`: structure-file resolution, async future gathering (bridges `concurrent.futures.Future` to asyncio), JSONL result writing
- `job_tracker.py`: thread-safe tracking of async-remote batches, persisted to JSON so results survive across sessions
- `globus_transfer.py`: `GlobusTransferManager` stages files between collections instead of embedding large structures in payloads
`hpc_configs/loader.py` replaces the duplicated `load_parsl_config()` helpers, dispatching to per-system factories (`local_parsl.py`, `aurora_parsl.py`, `polaris_parsl.py`, `crux_parsl.py`) and resolving `worker_init` from env > submitting env > system fallback.
2. HPC MCP servers (`src/chemgraph/mcp/`)
- `cg_fastmcp.py`: `CGFastMCP` (FastMCP subclass) submits tool calls to the configured backend as `TaskSpec` objects; centralizes transport concerns (inline embedding vs. remote paths) and adds a fanout decorator for ensemble tools.
- `mace_mcp_hpc.py`, `graspa_mcp_hpc.py`, `xanes_mcp_hpc.py`: build `TaskSpec` and submit via `get_backend()` instead of calling Parsl directly.
- `job_tools.py`: registers `check_job_status` / `get_job_results` / `list_jobs` / `cancel_job` / `check_endpoint_status` on any server.
- `transfer_tools.py`: registers `transfer_files` / `check_transfer_status` / `list_remote_files` (Globus staging).
- `hpc_misc_mcp.py`: generic JSON artifact inspection.
Schemas now support pre-staged HPC inputs: `remote_structure_directory` on the MACE/gRASPA ensemble schemas lets workers read directly from a remote path instead of embedding structures inline.
3. Academy multi-agent campaigns (`src/chemgraph/academy/`)
A runtime for persistent multi-agent screening campaigns over MPI with Redis-backed messaging:
- `core/`: logical agent wrapping the ChemGraph turn primitive, JSONC campaign specs with per-agent `allowed_tools` whitelists, peer-messaging action tools.
- `runtime/`: MPI daemon, multi-node compute launcher, and a local dashboard launcher that mirrors remote run state and relays the Argo endpoint.
- `observability/`: append-only JSONL event log + run-status artifacts (tolerant parsing for live polling).
- `dashboard/`: stateless HTTP server + web UI reading from a run directory.
For single-agent CLI runs, `--trace-dir` (via `cli/trace.py`) emits the same event schema so runs are viewable in the dashboard. New CLI subcommands wire up `dashboard`, `academy run-compute`, `academy mpi-daemon`, `academy dashboard`, and `academy campaigns`.
Agent event plumbing was refactored: event translation lives in `agent/events.py`, the turn loop (with `on_event` / `terminal_tool_names` hooks) in `agent/turn.py`, keeping `llm_agent.py` clean for CLI users.
Supporting changes
- `models/settings.py` (`LLMSettings`) centralizes endpoint config; `openai.py` normalizes Argo model names by endpoint type (local shim vs. hosted wire format, overridable via `CHEMGRAPH_ARGO_MODEL_FORMAT`).
- … `symbolic_trace` crashes under parallel workers); output parent directories created before writing in `ase_core.py` and `cheminformatics_core.py`.
- `aurora_parsl.py` unreachable code + misleading error removed; `EnsembleLauncherBackend.shutdown()` no longer leaves `_initialized=True` on partial failure.
New dependencies (all optional extras)
The local backend works with zero additional installs.
Testing
New suites cover execution backends, job tracking, model normalization, tool-adapter validation, MCP discovery, and the Academy campaign/runtime/dashboard paths:
- `test_execution.py`, `test_job_tracker.py`
- `test_openai_model_normalization.py`, `test_tool_adapter_validation.py`, `test_mcp.py`
- `test_academy_*.py` (campaign, mcp_supervisor, reasoning, dashboard, launcher, exchange, payloads)
Live Globus Compute integration tests are opt-in via `--run-globus-compute`. Backend-specific machine tests (Polaris/Aurora Parsl, EnsembleLauncher multi-node) are skipped when their dependencies/allocations are unavailable.
Test plan
- `pytest tests/`: execution, job-tracker, model, MCP, and Academy suites pass; existing tests green
- … `from chemgraph.execution import get_backend, TaskSpec`)
- … `--run-globus-compute`)
- … `example-002-mace-ensemble-screening`) + dashboard