Update existing backends (Parsl, EL, Globus Compute, Globus Transfer) to ChemGraph #132

Merged
tdpham2 merged 32 commits into dev-globus from dev-globus-hpc
Jun 18, 2026

tdpham2 (Collaborator) commented Jun 11, 2026

Make sure all backends work across ALCF machines and tools

Backend/Systems tested:

  • Parsl + Crux
  • EL + Crux
  • Globus Compute + Crux
  • GlobusTransfer + Crux
  • Parsl + Polaris
  • EL + Polaris
  • Globus Compute + Polaris
  • GlobusTransfer + Polaris
  • Parsl + Aurora
  • EL + Aurora
  • Globus Compute + Aurora
  • GlobusTransfer + Aurora

tdpham2 and others added 8 commits June 4, 2026 08:51
Stdio MCP servers use stdout as the JSON-RPC channel; ProcessPoolExecutor
workers inherit that fd. An unguarded worker print (e.g. mace/tools/cg.py's
"cuequivariance ... will be disabled" notice) corrupts the protocol stream
and aborts the client session on teardown with a JSONRPCMessage
ValidationError.

LocalBackend.initialize now accepts a silence_worker_stdout kwarg and also
reads CHEMGRAPH_LOCAL_SILENCE_STDOUT=1. When set, it passes a module-level
_silence_worker_stdout initializer to ProcessPoolExecutor that runs
os.dup2(stderr_fd, stdout_fd) in each child, so worker prints land on
stderr (logged) instead of the JSON-RPC pipe.

server_utils.run_mcp_server now sets the env var via setdefault before
mcp.run(transport='stdio'), so every stdio-launched MCP server gets the
fix automatically. Override with CHEMGRAPH_LOCAL_SILENCE_STDOUT=0 to
restore raw stdout for debugging.

Default behavior unchanged for direct LocalBackend users (notebooks, CLI):
env var defaults to off; the 9 TestLocalBackend pytest cases still pass.
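
The initializer mechanism described in this commit can be sketched roughly as follows. This is a minimal illustration, not the ChemGraph implementation: `should_silence` and `make_pool` are hypothetical names, and treating the kwarg and the environment variable as an either/or switch is an assumption.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from typing import Optional

ENV_VAR = "CHEMGRAPH_LOCAL_SILENCE_STDOUT"  # variable name from the commit message


def _silence_worker_stdout() -> None:
    """Pool initializer: make fd 1 (stdout) an alias of fd 2 (stderr) in the child."""
    os.dup2(2, 1)


def should_silence(silence_worker_stdout: Optional[bool] = None) -> bool:
    """Assumed rule: silence when the kwarg is true or the env var equals "1"."""
    return bool(silence_worker_stdout) or os.environ.get(ENV_VAR, "0") == "1"


def make_pool(max_workers: int = 2,
              silence_worker_stdout: Optional[bool] = None) -> ProcessPoolExecutor:
    """Hypothetical factory: attach the initializer only when silencing is requested."""
    initializer = _silence_worker_stdout if should_silence(silence_worker_stdout) else None
    return ProcessPoolExecutor(max_workers=max_workers, initializer=initializer)
```

A stdio server would call `os.environ.setdefault(ENV_VAR, "1")` before starting, which leaves an explicit `0` set by the user untouched.
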
scripts/smoke/ -- pass/fail validators, one per backend (local,
globus_compute, globus_transfer, parsl_in_job, ensemble_launcher_in_job)
plus a shared _smoke_utils.py and README. Trivial water payload, prints
[PASS]/[FAIL] per check, exits nonzero on failure. Drives the production
get_backend() / GlobusTransferManager / _mace_worker code paths.

scripts/demo/ -- real-chemistry demonstrations, 10 scripts covering each
backend with both direct (no LLM) and agent (LLM + MCP) flavours. Each
demo runs a 5-molecule (H2O, CH4, NH3, CO2, ethanol) MACE driver='thermo'
screen and prints an electronic energy / enthalpy / Gibbs free energy
table plus CSV. Shared _demo_chemistry.py helper handles inline-vs-remote
structure embedding and JSON round-trip.

In-job (Parsl + EnsembleLauncher) scripts target qsub interactive
allocations on Polaris/Aurora; EL scripts include a client-only mode for
the orchestrator-connection pathway added in bc54083.

Validated locally: smoke_local.py 7/7 pass; demo_local_direct.py screens
the 5 molecules (water G=-13.69 eV, ethanol G=-44.91 eV, ethanol lowest);
demo_local_agent.py round-trips via stdio MCP with clean teardown.
Crux is a CPU-only AMD EPYC system (no GPUs), so the new configs drop
accelerators and use a conda-based worker_init. Wires "crux" through
the loader dispatch, the EL SystemConfig registry, and the in-job
smoke/demo allowlists (defaulting device to "cpu" instead of cuda/xpu).

- src/chemgraph/hpc_configs/crux_parsl.py: new HighThroughputExecutor
  config, requires PBS_NODEFILE, max_workers_per_node=16
- src/chemgraph/hpc_configs/loader.py: dispatch "crux" to get_crux_config
- src/chemgraph/execution/ensemble_launcher_backend.py: add
  get_crux_system_config (ncpus=128, no GPUs) and register it
- scripts/smoke/*, scripts/demo/*: accept --system crux and resolve
  device defaults to cpu for Crux
- tests/test_execution.py: TestELSystemConfigCrux asserts registry
  membership and CPU-only SystemConfig shape
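
As a rough illustration of the device-default resolution mentioned above (a sketch only: `DEFAULT_DEVICE` and `resolve_device` are hypothetical names, and the cuda/xpu assignments for Polaris and Aurora are inferred from the commit message rather than quoted from the code):

```python
from typing import Optional

# Assumed per-system defaults; only "crux" -> "cpu" is stated outright in the commit.
DEFAULT_DEVICE = {"polaris": "cuda", "aurora": "xpu", "crux": "cpu"}


def resolve_device(system: str, device: Optional[str] = None) -> str:
    """Return the explicit device if one was given, else the system default."""
    if device:
        return device
    try:
        return DEFAULT_DEVICE[system.lower()]
    except KeyError:
        raise ValueError(f"unsupported system: {system!r}") from None
```
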
Closes the submitter PicklingError / worker AttributeError on
run_mace_singleArguments by making FastMCP's dynamic <tool>Arguments
and <tool>Output classes picklable by qualname, and by making
top-level MCP-server callables pickle by reference even under the
runpy double-module case (sys.modules["__main__"] vs
sys.modules["pkg.mod"] are distinct objects when launched with
python -m, and the leaf module isn't attached to its parent package).

- cg_fastmcp.py: _register_fastmcp_dynamic_models() injects dynamic
  arg/output models into the func_metadata module namespace and
  rebinds the captured local in tools.base / prompts.base /
  resources.templates. _fix_module_for_pickle now also sets the
  function attr on the resolved target module and attaches the leaf
  module to its parent package, so dill's by-qualname lookup
  succeeds. Backend wrappers route args/kwargs through to_picklable.
- mace_mcp_hpc.py: apply _fix_module_for_pickle to _mace_worker and
  _ls_remote_files; debug-log the transport hook.
- execution/utils.py: add to_picklable() helper that recursively
  serializes Pydantic BaseModel instances via model_dump().
- execution/parsl_backend.py: wrap task.args / task.kwargs with
  to_picklable() before dispatching to the python app.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
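
A helper along the lines of `to_picklable()` might look like the sketch below. It is an approximation under stated assumptions: the check is duck-typed on `model_dump()` so the example runs without Pydantic installed (the commit describes handling Pydantic `BaseModel` instances), and `DemoArguments` is a hypothetical stand-in for one of FastMCP's dynamic argument models.

```python
from typing import Any


def to_picklable(obj: Any) -> Any:
    """Recursively convert model-like objects and containers to plain data."""
    # Duck-typed stand-in for an isinstance(obj, BaseModel) check (assumption).
    if not isinstance(obj, type) and callable(getattr(obj, "model_dump", None)):
        return to_picklable(obj.model_dump())
    if isinstance(obj, dict):
        return {key: to_picklable(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [to_picklable(item) for item in obj]
    if isinstance(obj, tuple):
        return tuple(to_picklable(item) for item in obj)
    return obj


class DemoArguments:
    """Hypothetical stand-in for a dynamically created Pydantic model."""

    def __init__(self, **fields: Any) -> None:
        self._fields = fields

    def model_dump(self) -> dict:
        return dict(self._fields)
```

Applying the helper to `task.args` and `task.kwargs` before dispatch means only dicts, lists, tuples, and scalars cross the process boundary, so the worker never has to import the dynamic classes.
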
parsl.clear() only removes the DFK from the global registry; it does
not stop executors. Without parsl.dfk().cleanup(), Parsl logs
"Python is exiting with a DFK still running" at interpreter exit and
relies on atexit hooks for executor teardown. Call cleanup() before
clear() and log (but do not raise) on cleanup failure.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
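
The cleanup-then-clear ordering can be sketched as below. Passing the Parsl module in as a parameter is purely an illustration device so the snippet runs without Parsl installed; `shutdown_parsl` is a hypothetical name, while `dfk()`, `cleanup()`, and `clear()` are the calls named in the commit message.

```python
import logging

logger = logging.getLogger(__name__)


def shutdown_parsl(parsl_module) -> None:
    """Stop executors via DFK cleanup, then drop the DFK from the registry."""
    try:
        parsl_module.dfk().cleanup()
    except Exception:
        # Log and continue: a failed cleanup should not raise to the caller.
        logger.warning("Parsl DFK cleanup failed", exc_info=True)
    finally:
        parsl_module.clear()
```

In real code this would be called as `shutdown_parsl(parsl)` after `import parsl`.
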
Adds resolve_worker_init(run_dir, fallback) in hpc_configs/loader.py
with three-tier precedence: CHEMGRAPH_WORKER_INIT env override →
submitter env auto-detect (VIRTUAL_ENV, then CONDA_PREFIX) → caller-
provided per-system fallback. Every config now accepts an optional
worker_init kwarg and routes through this helper so Parsl workers
land in the same Python environment as the submitter without
requiring code edits per HPC system.

Per-system fallbacks: Crux "module load conda; conda activate base",
Aurora "module load frameworks", Polaris/Local "true".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
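
The three-tier precedence can be sketched as follows. Only the ordering comes from the commit message; the exact activation strings, the trailing `cd` into `run_dir`, and the extra `env` parameter (added here for testability) are assumptions.

```python
import os
import shlex
from typing import Mapping, Optional


def resolve_worker_init(run_dir: str, fallback: str,
                        env: Optional[Mapping[str, str]] = None) -> str:
    """Pick an activation command by precedence, then cd into run_dir."""
    env = os.environ if env is None else env
    override = env.get("CHEMGRAPH_WORKER_INIT")
    if override:                       # tier 1: explicit override
        activate = override
    elif env.get("VIRTUAL_ENV"):       # tier 2a: submitter's virtualenv
        activate = "source " + shlex.quote(env["VIRTUAL_ENV"] + "/bin/activate")
    elif env.get("CONDA_PREFIX"):      # tier 2b: submitter's conda env
        activate = "conda activate " + shlex.quote(env["CONDA_PREFIX"])
    else:                              # tier 3: per-system fallback
        activate = fallback
    return f"{activate}; cd {shlex.quote(run_dir)}"
```

For example, with neither an override nor an active environment, a Polaris config passing `fallback="true"` would get `true; cd <run_dir>` under these assumptions.
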
- demo_parsl_in_job_agent.py: accept crux as a supported system
  (default device=cpu); forward VIRTUAL_ENV, CONDA_PREFIX,
  CONDA_DEFAULT_ENV, CHEMGRAPH_WORKER_INIT, PBS_NODEFILE, and
  PBS_O_WORKDIR to the MCP stdio subprocess so the Parsl workers
  re-activate the submitter's Python env.
- _demo_chemistry.py: include wall-time column in the agent prompt.
- README.md: document the Crux PBS workflow.
- run_crux_demo.sh: PBS-side wrapper that activates the venv and
  invokes the Parsl + EnsembleLauncher demos with system=crux.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
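
Forwarding a fixed set of variables to the MCP stdio subprocess could look like the sketch below. The variable list is taken from the commit message; `build_subprocess_env` and its `source` parameter are hypothetical.

```python
import os
from typing import Dict, Mapping, Optional

# Variables the commit message says are forwarded to the stdio subprocess.
FORWARDED_VARS = (
    "VIRTUAL_ENV", "CONDA_PREFIX", "CONDA_DEFAULT_ENV",
    "CHEMGRAPH_WORKER_INIT", "PBS_NODEFILE", "PBS_O_WORKDIR",
)


def build_subprocess_env(source: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy only the forwarded variables that are actually set."""
    source = os.environ if source is None else source
    return {name: source[name] for name in FORWARDED_VARS if name in source}
```

The resulting dict would then be supplied as the environment of the server subprocess, so that worker-init resolution inside the server sees the submitter's settings.
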
- _smoke_utils.py: add ensure_on_worker_pythonpath() so Parsl
  workers can import _smoke_utils from the script directory.
- smoke_parsl_in_job.py / smoke_ensemble_launcher_in_job.py: call
  ensure_on_worker_pythonpath() at import time.
- README.md: document the Crux PBS workflow.
- run_crux_smoke.sh: PBS-side wrapper that activates the venv and
  runs both smoke entrypoints with system=crux.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tdpham2 self-assigned this Jun 11, 2026
tdpham2 and others added 21 commits June 12, 2026 20:38
`get_launcher_config`'s default `mpi_flavour="test"` only works for
single-host runs: its `write_file_to_nodes` does not actually distribute
the per-child JSON spec to remote `/tmp`, so when `main.mN` tries to
launch a worker on a different node it dies with `FileNotFoundError`
on `/tmp/.mpiexec_tmp/child_<uuid>.json` and the demo hangs.

Flip the default to `"mpich"` (hydra `mpiexec`), widen the Literal to
cover every flavour EL knows about, and pick the right one per system
in `get_backend("ensemble_launcher", system=...)`: `aurora`/`polaris`/
`crux` → `"mpich"`, `local` → `"test"`. An explicit
`[execution.ensemble_launcher] mpi_flavour` in `config.toml` still
overrides.
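
The per-system selection with a config override can be sketched as below. The mapping follows the commit message; `pick_mpi_flavour` and its `configured` parameter are hypothetical names, and falling back to `"mpich"` for unlisted systems is an assumption based on the new default.

```python
from typing import Optional

# System -> flavour mapping as described in the commit message.
SYSTEM_MPI_FLAVOUR = {
    "aurora": "mpich",
    "polaris": "mpich",
    "crux": "mpich",
    "local": "test",
}


def pick_mpi_flavour(system: str, configured: Optional[str] = None) -> str:
    """An explicit config value wins; otherwise use the per-system default."""
    if configured:
        return configured
    return SYSTEM_MPI_FLAVOUR.get(system.lower(), "mpich")
```
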
The existing `threading.Lock` in `mace_calc._mace_lock` only protected
threads inside one process. The EnsembleLauncher `async_processpool`
spawns multiple Python workers in parallel on the same node, so
sibling processes raced on the torch.load + symbolic_trace path
(see issue #110) and tripped the same NameError / hang the lock was
introduced to prevent.

Add `mace_loading_lock()`, a context manager that holds the existing
in-process lock and an `fcntl.flock` on a per-uid lockfile under
`$CHEMGRAPH_MACE_LOCK_DIR` → `$TMPDIR` → `tempfile.gettempdir()` →
`~/.cache/chemgraph`. Degrades gracefully to thread-only locking
where `fcntl` is unavailable or no writable directory exists.

Move the lock acquisition into `load_calculator` so every entry path
(Parsl, EnsembleLauncher, local, agent) is covered, not just
`ase_tools.run_ase`. Drop the now-redundant `with _mace_lock:` in
`run_ase`.
The MACE MCP server embedded each local structure inline and had the
worker re-materialise it to /tmp on every tool call, regardless of
backend. This is only needed for Globus Compute, whose workers run on a
remote host. For local/parsl/ensemble_launcher backends (shared FS with
the server) it was pure overhead -- extra serialization, redundant disk
I/O, and a full_output read-back.

Add a shares_filesystem capability to ExecutionBackend (True by default;
False for Globus Compute, config-overridable via the shares_filesystem
kwarg). The MACE transport hook now embeds inline only when the backend
does not share the filesystem; the worker already no-ops its inline
branch when the key is absent, so shared-FS backends read the input path
directly.
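
The capability flag and the conditional embedding can be sketched as below. `ExecutionBackend` and `shares_filesystem` come from the commit message; the `GlobusComputeBackend` class name, `prepare_payload`, and the payload keys are hypothetical.

```python
from pathlib import Path
from typing import Optional


class ExecutionBackend:
    """Base class: backends share the server's filesystem unless told otherwise."""

    shares_filesystem: bool = True

    def __init__(self, shares_filesystem: Optional[bool] = None) -> None:
        if shares_filesystem is not None:  # config override
            self.shares_filesystem = shares_filesystem


class GlobusComputeBackend(ExecutionBackend):
    """Workers run on a remote host, so local paths are not visible to them."""

    shares_filesystem = False


def prepare_payload(backend: ExecutionBackend, structure_path: str) -> dict:
    """Embed the structure inline only when the worker cannot read the path."""
    payload = {"input_structure_file": structure_path}
    if not backend.shares_filesystem:
        payload["inline_structure"] = Path(structure_path).read_text()
    return payload
```

Because the worker skips its inline branch when the key is absent, shared-filesystem backends need no worker-side change in this scheme.
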
The worker embedded the entire output JSON into the returned result as
full_output when an inline structure was used. Results are already
persisted to output_result_file, so this just bloated the tool response.
Return only what run_mace_core produces; drop the now-unused json import.
Under a stdio MCP server, the server's stdout is the JSON-RPC channel.
EnsembleLauncher prints lifecycle notices ("Sent SIGTERM to launcher
process ...") to stdout during orchestrator shutdown, which corrupted the
protocol stream and crashed the client's message parser with a
ValidationError / BrokenResourceError after an otherwise successful run.

Add an fd-level stdout->stderr redirect context manager and wrap the
client/orchestrator teardown calls in shutdown() with it, so the notices
go to stderr instead of the JSON-RPC channel. fd-level dup2 (matching
LocalBackend's worker-stdout guard) catches library/subprocess writes,
not just Python-level sys.stdout.
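
An fd-level redirect of this kind can be sketched as a small context manager. This is a minimal illustration with a hypothetical name (`stdout_to_stderr`), not the code in `shutdown()`.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def stdout_to_stderr():
    """Temporarily point fd 1 at whatever fd 2 refers to, then restore it."""
    sys.stdout.flush()        # do not strand buffered Python-level output
    saved_fd = os.dup(1)      # keep a handle on the real stdout
    try:
        os.dup2(2, 1)         # fd 1 now aliases stderr
        yield
    finally:
        sys.stdout.flush()    # buffered writes made inside go to stderr
        os.dup2(saved_fd, 1)  # restore the original stdout
        os.close(saved_fd)
```

Because the swap happens on the file descriptor rather than on `sys.stdout`, it also covers writes from C libraries and from subprocesses that inherit fd 1 while the context is active.
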
Layer an fd-level stdout->stderr redirect around shutdown() on top of the
subprocess-based orchestrator rework. The orchestrator subprocess already
redirects to DEVNULL, but two teardown paths can still print to the
JSON-RPC stdout under a stdio MCP server: the in-process client.teardown()
and the `el stop` helper (which inherits the parent's stdout). Wrapping
the whole shutdown() in the fd guard covers both, preventing the
ValidationError / BrokenResourceError crash after an otherwise successful
run.
# Conflicts:
#	src/chemgraph/execution/ensemble_launcher_backend.py
Merge el_test into dev-globus-hpc: EL subprocess rework + HPC fixes
# Conflicts:
#	src/chemgraph/hpc_configs/aurora_parsl.py
#	src/chemgraph/tools/ase_core.py
tdpham2 added 3 commits June 17, 2026 23:35
demo_parsl_in_job_agent.py forced model_name="argo:gpt-5.4" and a
local argoapi base_url, mirroring the temp config that had also landed
in demo_ensemble_launcher_in_job_agent.py. Restore model selection from
the --model flag (the amain(model=...) parameter) and drop the hardcoded
base_url so the demo honours the user's chosen model.
TestELSystemConfigCrux asserted the exact Crux SystemConfig shape
(ncpus==128, CPU-only) and the registry membership of "crux". These
are tied to one specific machine's hardware layout and don't belong in
the portable unit-test suite. The polaris references elsewhere are left
as-is since they only pass "polaris" as an arbitrary system string to
exercise generic GlobusCompute behaviour.
EnsembleLauncher is an optional, HPC-only dependency (not on PyPI for
Python 3.12), so instantiating EnsembleLauncherBackend() raised
ImportError and hard-failed all nine TestELBackend cases in any env
without it. Add pytest.importorskip("ensemble_launcher") in setup_class
so the class skips cleanly, matching the guard already used by the
GlobusCompute tests.
tdpham2 merged commit 3a541a4 into dev-globus on Jun 18, 2026
2 participants