Update existing backends (Parsl, EL, Globus Compute, Globus Transfer) to ChemGraph by tdpham2 · Pull Request #132 · argonne-lcf/ChemGraph

tdpham2 · 2026-06-11T19:33:11Z

Make sure all backends work across ALCF machines and tools

Backend/Systems tested:

Stdio MCP servers use stdout as the JSON-RPC channel; ProcessPoolExecutor workers inherit that fd. An unguarded worker print (e.g. mace/tools/cg.py's "cuequivariance ... will be disabled" notice) corrupts the protocol stream and aborts the client session on teardown with a JSONRPCMessage ValidationError. LocalBackend.initialize now accepts a silence_worker_stdout kwarg and also reads CHEMGRAPH_LOCAL_SILENCE_STDOUT=1. When set, it passes a module-level _silence_worker_stdout initializer to ProcessPoolExecutor that runs os.dup2(stderr_fd, stdout_fd) in each child, so worker prints land on stderr (logged) instead of the JSON-RPC pipe. server_utils.run_mcp_server now sets the env var via setdefault before mcp.run(transport='stdio'), so every stdio-launched MCP server gets the fix automatically. Override with CHEMGRAPH_LOCAL_SILENCE_STDOUT=0 to restore raw stdout for debugging. Default behavior unchanged for direct LocalBackend users (notebooks, CLI): env var defaults to off; the 9 TestLocalBackend pytest cases still pass.
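A minimal sketch of the worker-stdout guard described above, not the PR's actual code. The initializer name, kwarg, and env var come from the commit message; `should_silence`, `make_pool`, and the rule that an explicit kwarg beats the env var are assumptions made for illustration.

```python
import os
import sys
from concurrent.futures import ProcessPoolExecutor

ENV_VAR = "CHEMGRAPH_LOCAL_SILENCE_STDOUT"
STDOUT_FD = 1
STDERR_FD = 2


def _silence_worker_stdout():
    """Pool initializer: make the worker's stdout fd an alias of stderr."""
    sys.stdout.flush()
    # After dup2, anything written to fd 1 (print, C extensions, child
    # processes) lands on stderr instead of the JSON-RPC pipe.
    os.dup2(STDERR_FD, STDOUT_FD)


def should_silence(kwarg=None, environ=None):
    """Hypothetical helper: explicit kwarg wins, else the env var (default off)."""
    if kwarg is not None:
        return bool(kwarg)
    environ = os.environ if environ is None else environ
    return environ.get(ENV_VAR, "0") == "1"


def make_pool(max_workers=2, silence_worker_stdout=None):
    """Hypothetical helper: build a pool whose workers optionally drop stdout."""
    initializer = None
    if should_silence(silence_worker_stdout):
        initializer = _silence_worker_stdout
    return ProcessPoolExecutor(max_workers=max_workers, initializer=initializer)
```

A stdio server would opt in before starting its transport with `os.environ.setdefault(ENV_VAR, "1")`, which leaves an explicit `CHEMGRAPH_LOCAL_SILENCE_STDOUT=0` from the user untouched.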

scripts/smoke/ -- pass/fail validators, one per backend (local, globus_compute, globus_transfer, parsl_in_job, ensemble_launcher_in_job) plus a shared _smoke_utils.py and README. Trivial water payload, prints [PASS]/[FAIL] per check, exits nonzero on failure. Drives the production get_backend() / GlobusTransferManager / _mace_worker code paths.

scripts/demo/ -- real-chemistry demonstrations, 10 scripts covering each backend with both direct (no LLM) and agent (LLM + MCP) flavours. Each demo runs a 5-molecule (H2O, CH4, NH3, CO2, ethanol) MACE driver='thermo' screen and prints an electronic energy / enthalpy / Gibbs free energy table plus CSV. Shared _demo_chemistry.py helper handles inline-vs-remote structure embedding and JSON round-trip. In-job (Parsl + EnsembleLauncher) scripts target qsub interactive allocations on Polaris/Aurora; EL scripts include a client-only mode for the orchestrator-connection pathway added in bc54083.

Validated locally: smoke_local.py 7/7 pass; demo_local_direct.py screens the 5 molecules (water G=-13.69 eV, ethanol G=-44.91 eV, ethanol lowest); demo_local_agent.py round-trips via stdio MCP with clean teardown.
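The pass/fail convention used by the smoke scripts (one `[PASS]`/`[FAIL]` line per check, nonzero exit on any failure) can be sketched as follows. `run_checks` is a hypothetical name; the real helpers live in `_smoke_utils.py` and are not shown in this PR description.

```python
def run_checks(checks):
    """Run (name, callable) pairs, print one status line each, return an exit code."""
    failures = 0
    for name, check in checks:
        try:
            ok = bool(check())
            detail = ""
        except Exception as exc:  # a crashing check counts as a failure
            ok = False
            detail = f" ({exc!r})"
        status = "PASS" if ok else "FAIL"
        print(f"[{status}] {name}{detail}")
        if not ok:
            failures += 1
    # Nonzero exit code if anything failed, so CI and PBS wrappers can react.
    return 1 if failures else 0
```

A script would end with `sys.exit(run_checks([...]))` so that shell wrappers can rely on the exit status.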

Crux is a CPU-only AMD EPYC system (no GPUs), so the new configs drop accelerators and use a conda-based worker_init. Wires "crux" through the loader dispatch, the EL SystemConfig registry, and the in-job smoke/demo allowlists (defaulting device to "cpu" instead of cuda/xpu).

- src/chemgraph/hpc_configs/crux_parsl.py: new HighThroughputExecutor config, requires PBS_NODEFILE, max_workers_per_node=16
- src/chemgraph/hpc_configs/loader.py: dispatch "crux" to get_crux_config
- src/chemgraph/execution/ensemble_launcher_backend.py: add get_crux_system_config (ncpus=128, no GPUs) and register it
- scripts/smoke/*, scripts/demo/*: accept --system crux and resolve device defaults to cpu for Crux
- tests/test_execution.py: TestELSystemConfigCrux asserts registry membership and CPU-only SystemConfig shape

Closes the submitter PicklingError / worker AttributeError on run_mace_singleArguments by making FastMCP's dynamic <tool>Arguments and <tool>Output classes picklable by qualname, and by making top-level MCP-server callables pickle by reference even under the runpy double-module case (sys.modules["__main__"] vs sys.modules["pkg.mod"] are distinct objects when launched with python -m, and the leaf module isn't attached to its parent package).

- cg_fastmcp.py: _register_fastmcp_dynamic_models() injects dynamic arg/output models into the func_metadata module namespace and rebinds the captured local in tools.base / prompts.base / resources.templates. _fix_module_for_pickle now also sets the function attr on the resolved target module and attaches the leaf module to its parent package, so dill's by-qualname lookup succeeds. Backend wrappers route args/kwargs through to_picklable.
- mace_mcp_hpc.py: apply _fix_module_for_pickle to _mace_worker and _ls_remote_files; debug-log the transport hook.
- execution/utils.py: add to_picklable() helper that recursively serializes Pydantic BaseModel instances via model_dump().
- execution/parsl_backend.py: wrap task.args / task.kwargs with to_picklable() before dispatching to the python app.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
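A rough sketch of what a `to_picklable()` helper of this kind might look like. To stay dependency-free it detects models by duck typing (anything with a callable `model_dump`) rather than importing Pydantic, which is an assumption; the PR's helper checks for `BaseModel` instances. `survives_pickle` is a hypothetical convenience for the example.

```python
import pickle


def to_picklable(obj):
    """Recursively replace model-like objects with plain dicts/lists/tuples."""
    dump = getattr(obj, "model_dump", None)
    if callable(dump):
        # Pydantic v2 models expose model_dump(); recurse in case the dump
        # still contains nested model-like values.
        return to_picklable(dump())
    if isinstance(obj, dict):
        return {key: to_picklable(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [to_picklable(value) for value in obj]
    if isinstance(obj, tuple):
        return tuple(to_picklable(value) for value in obj)
    return obj


def survives_pickle(obj):
    """Hypothetical check: does obj round-trip through pickle unchanged?"""
    return pickle.loads(pickle.dumps(obj)) == obj
```

Converting arguments to plain containers before dispatch sidesteps the by-reference class lookup that fails for dynamically generated argument models.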

parsl.clear() only removes the DFK from the global registry; it does not stop executors. Without parsl.dfk().cleanup(), Parsl logs "Python is exiting with a DFK still running" at interpreter exit and relies on atexit hooks for executor teardown. Call cleanup() before clear() and log (but do not raise) on cleanup failure. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
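The teardown order described above, sketched under the assumption that the caller passes in the imported `parsl` module (so the snippet runs without Parsl installed). `shutdown_parsl` and the logger name are hypothetical; `parsl.dfk()`, `DataFlowKernel.cleanup()`, and `parsl.clear()` are the calls named in the commit message.

```python
import logging

logger = logging.getLogger("chemgraph.sketch.parsl")


def shutdown_parsl(parsl_module):
    """Stop executors via cleanup(), then drop the DFK from the registry."""
    try:
        # cleanup() is what actually shuts down the executors.
        parsl_module.dfk().cleanup()
    except Exception:
        # Log but do not raise: teardown should still reach clear().
        logger.exception("Parsl DFK cleanup failed; continuing with clear()")
    # clear() only removes the DFK from the global registry.
    parsl_module.clear()
```

Typical use would be `import parsl; shutdown_parsl(parsl)` in the backend's shutdown path.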

Adds resolve_worker_init(run_dir, fallback) in hpc_configs/loader.py with three-tier precedence: CHEMGRAPH_WORKER_INIT env override → submitter env auto-detect (VIRTUAL_ENV, then CONDA_PREFIX) → caller-provided per-system fallback. Every config now accepts an optional worker_init kwarg and routes through this helper so Parsl workers land in the same Python environment as the submitter without requiring code edits per HPC system. Per-system fallbacks: Crux "module load conda; conda activate base", Aurora "module load frameworks", Polaris/Local "true". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
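The three-tier precedence can be illustrated with a simplified stand-in. This is a sketch only: it omits the `run_dir` parameter of the real helper, and the exact activation commands generated for `VIRTUAL_ENV` and `CONDA_PREFIX` are assumptions.

```python
import os


def resolve_worker_init_sketch(fallback, environ=None):
    """Return the shell snippet Parsl workers should run before starting."""
    env = os.environ if environ is None else environ

    # Tier 1: explicit override always wins.
    override = env.get("CHEMGRAPH_WORKER_INIT")
    if override:
        return override

    # Tier 2: mirror the submitter's environment (venv first, then conda).
    venv = env.get("VIRTUAL_ENV")
    if venv:
        return f"source {venv}/bin/activate"  # assumed command form
    conda_prefix = env.get("CONDA_PREFIX")
    if conda_prefix:
        return f"conda activate {conda_prefix}"  # assumed command form

    # Tier 3: per-system fallback supplied by the caller.
    return fallback
```

For example, on Crux with no environment active, `resolve_worker_init_sketch("module load conda; conda activate base", {})` returns the Crux fallback unchanged.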

- demo_parsl_in_job_agent.py: accept crux as a supported system (default device=cpu); forward VIRTUAL_ENV, CONDA_PREFIX, CONDA_DEFAULT_ENV, CHEMGRAPH_WORKER_INIT, PBS_NODEFILE, and PBS_O_WORKDIR to the MCP stdio subprocess so the Parsl workers re-activate the submitter's Python env.
- _demo_chemistry.py: include wall-time column in the agent prompt.
- README.md: document the Crux PBS workflow.
- run_crux_demo.sh: PBS-side wrapper that activates the venv and invokes the Parsl + EnsembleLauncher demos with system=crux.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- _smoke_utils.py: add ensure_on_worker_pythonpath() so Parsl workers can import _smoke_utils from the script directory.
- smoke_parsl_in_job.py / smoke_ensemble_launcher_in_job.py: call ensure_on_worker_pythonpath() at import time.
- README.md: document the Crux PBS workflow.
- run_crux_smoke.sh: PBS-side wrapper that activates the venv and runs both smoke entrypoints with system=crux.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

`get_launcher_config`'s default `mpi_flavour="test"` only works for single-host runs: its `write_file_to_nodes` does not actually distribute the per-child JSON spec to remote `/tmp`, so when `main.mN` tries to launch a worker on a different node it dies with `FileNotFoundError` on `/tmp/.mpiexec_tmp/child_<uuid>.json` and the demo hangs. Flip the default to `"mpich"` (hydra `mpiexec`), widen the Literal to cover every flavour EL knows about, and pick the right one per system in `get_backend("ensemble_launcher", system=...)`: `aurora`/`polaris`/`crux` → `"mpich"`, `local` → `"test"`. An explicit `[execution.ensemble_launcher] mpi_flavour` in `config.toml` still overrides.
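The per-system selection amounts to a small lookup with an explicit-config override. The sketch below uses hypothetical names (`SYSTEM_MPI_FLAVOUR`, `pick_mpi_flavour`); only the mapping itself and the new `"mpich"` default come from the commit message.

```python
# System name -> MPI flavour, as listed in the commit message.
SYSTEM_MPI_FLAVOUR = {
    "aurora": "mpich",
    "polaris": "mpich",
    "crux": "mpich",
    "local": "test",
}


def pick_mpi_flavour(system, configured=None):
    """Explicit config value wins; otherwise choose by system, defaulting to mpich."""
    if configured is not None:
        return configured
    return SYSTEM_MPI_FLAVOUR.get(system, "mpich")
```

Here `configured` stands in for the `mpi_flavour` value read from `[execution.ensemble_launcher]` in `config.toml`, when present.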

The existing `threading.Lock` in `mace_calc._mace_lock` only protected threads inside one process. The EnsembleLauncher `async_processpool` spawns multiple Python workers in parallel on the same node, so sibling processes raced on the torch.load + symbolic_trace path (see issue #110) and tripped the same NameError / hang the lock was introduced to prevent. Add `mace_loading_lock()`, a context manager that holds the existing in-process lock and an `fcntl.flock` on a per-uid lockfile under `$CHEMGRAPH_MACE_LOCK_DIR` → `$TMPDIR` → `tempfile.gettempdir()` → `~/.cache/chemgraph`. Degrades gracefully to thread-only locking where `fcntl` is unavailable or no writable directory exists. Move the lock acquisition into `load_calculator` so every entry path (Parsl, EnsembleLauncher, local, agent) is covered, not just `ase_tools.run_ase`. Drop the now-redundant `with _mace_lock:` in `run_ase`.

The MACE MCP server embedded each local structure inline and had the worker re-materialise it to /tmp on every tool call, regardless of backend. This is only needed for Globus Compute, whose workers run on a remote host. For local/parsl/ensemble_launcher backends (shared FS with the server) it was pure overhead -- extra serialization, redundant disk I/O, and a full_output read-back. Add a shares_filesystem capability to ExecutionBackend (True by default; False for Globus Compute, config-overridable via the shares_filesystem kwarg). The MACE transport hook now embeds inline only when the backend does not share the filesystem; the worker already no-ops its inline branch when the key is absent, so shared-FS backends read the input path directly.
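The capability check can be pictured as below. This is an illustrative sketch: `BackendSketch`, `prepare_payload`, and the `inline_structure` key are hypothetical names, while the `shares_filesystem` flag and its defaults are taken from the commit message.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class BackendSketch:
    """Stand-in for ExecutionBackend with the new capability flag."""
    name: str
    shares_filesystem: bool = True  # False for Globus Compute


def prepare_payload(backend, payload, structure_path):
    """Embed the structure inline only when workers cannot see the server's files."""
    prepared = dict(payload)
    prepared["input_path"] = str(structure_path)
    if not backend.shares_filesystem:
        # Remote workers get the file contents shipped with the task.
        prepared["inline_structure"] = Path(structure_path).read_text()
    return prepared
```

With a shared-filesystem backend the worker receives just the path and reads the file itself, which avoids the extra serialization and the temporary copy.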

The worker embedded the entire output JSON into the returned result as full_output when an inline structure was used. Results are already persisted to output_result_file, so this just bloated the tool response. Return only what run_mace_core produces; drop the now-unused json import.

Under a stdio MCP server, the server's stdout is the JSON-RPC channel. EnsembleLauncher prints lifecycle notices ("Sent SIGTERM to launcher process ...") to stdout during orchestrator shutdown, which corrupted the protocol stream and crashed the client's message parser with a ValidationError / BrokenResourceError after an otherwise successful run. Add a fd-level stdout->stderr redirect context manager and wrap the client/orchestrator teardown calls in shutdown() with it, so the notices go to stderr instead of the JSON-RPC channel. fd-level dup2 (matching LocalBackend's worker-stdout guard) catches library/subprocess writes, not just Python-level sys.stdout.
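An fd-level redirect of this kind might look like the following sketch. `redirect_fd` and `stdout_to_stderr` are hypothetical names; the technique (save the fd with `os.dup`, alias it with `os.dup2`, restore afterwards) is the one the commit message describes.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def redirect_fd(src_fd, dst_fd):
    """Temporarily make src_fd an alias of dst_fd at the OS level."""
    sys.stdout.flush()  # do not let buffered text cross the switch
    saved = os.dup(src_fd)  # remember where src_fd pointed
    try:
        os.dup2(dst_fd, src_fd)
        yield
    finally:
        sys.stdout.flush()
        os.dup2(saved, src_fd)  # restore the original target
        os.close(saved)


def stdout_to_stderr():
    """Send everything written to fd 1 (including subprocess output) to fd 2."""
    return redirect_fd(1, 2)
```

Wrapping teardown as `with stdout_to_stderr(): ...` also covers writes from libraries and child processes, which `contextlib.redirect_stdout` would miss because it only swaps the Python-level `sys.stdout` object.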

Layer an fd-level stdout->stderr redirect around shutdown() on top of the subprocess-based orchestrator rework. The orchestrator subprocess already redirects to DEVNULL, but two teardown paths can still print to the JSON-RPC stdout under a stdio MCP server: the in-process client.teardown() and the `el stop` helper (which inherits the parent's stdout). Wrapping the whole shutdown() in the fd guard covers both, preventing the ValidationError / BrokenResourceError crash after an otherwise successful run.

# Conflicts:
#   src/chemgraph/execution/ensemble_launcher_backend.py

Merge el_test into dev-globus-hpc: EL subprocess rework + HPC fixes

# Conflicts:
#   src/chemgraph/hpc_configs/aurora_parsl.py
#   src/chemgraph/tools/ase_core.py

demo_parsl_in_job_agent.py forced model_name="argo:gpt-5.4" and a local argoapi base_url, mirroring the temp config that had also landed in demo_ensemble_launcher_in_job_agent.py. Restore model selection from the --model flag (the amain(model=...) parameter) and drop the hardcoded base_url so the demo honours the user's chosen model.

TestELSystemConfigCrux asserted the exact Crux SystemConfig shape (ncpus==128, CPU-only) and the registry membership of "crux". These are tied to one specific machine's hardware layout and don't belong in the portable unit-test suite. The polaris references elsewhere are left as-is since they only pass "polaris" as an arbitrary system string to exercise generic GlobusCompute behaviour.

EnsembleLauncher is an optional, HPC-only dependency (not on PyPI for Python 3.12), so instantiating EnsembleLauncherBackend() raised ImportError and hard-failed all nine TestELBackend cases in any env without it. Add pytest.importorskip("ensemble_launcher") in setup_class so the class skips cleanly, matching the guard already used by the GlobusCompute tests.

tdpham2 and others added 8 commits June 4, 2026 08:51

tdpham2 self-assigned this Jun 11, 2026

tdpham2 and others added 21 commits June 12, 2026 20:38

fix static sync location

8043fe9

added a one off logging to run_ase_core

a2e27a9

added ppn to task spec in demo el

f56713d

added try except block in demo chemistry

aefed79

added -ppn and --ngpus_per_process to mcp demos

64d087a

added a counter in cg mcp to make task_ids unique

09f972b

adding some temp cg config for argo

507d518

moved el orchestrator to a subprocess

debdab4

added logging in el backend

b43d390

added better cleanup of el subprocess

3feedcd

Merge remote-tracking branch 'origin/dev-globus-hpc' into el_test

855be82

# Conflicts:
#   src/chemgraph/execution/ensemble_launcher_backend.py

Merge pull request #133 from argonne-lcf/el_test

7ec8dec

Merge el_test into dev-globus-hpc: EL subprocess rework + HPC fixes

Merge remote-tracking branch 'origin/dev-globus' into dev-globus-hpc

6741d2e

# Conflicts:
#   src/chemgraph/hpc_configs/aurora_parsl.py
#   src/chemgraph/tools/ase_core.py

tdpham2 added 3 commits June 17, 2026 23:35

tdpham2 merged commit 3a541a4 into dev-globus Jun 18, 2026

tdpham2 merged 32 commits into dev-globus from dev-globus-hpc
