Update existing backends (Parsl, EL, Globus Compute, Globus Transfer) to ChemGraph#132
Merged
Stdio MCP servers use stdout as the JSON-RPC channel; ProcessPoolExecutor workers inherit that fd. An unguarded worker print (e.g. mace/tools/cg.py's "cuequivariance ... will be disabled" notice) corrupts the protocol stream and aborts the client session on teardown with a JSONRPCMessage ValidationError. LocalBackend.initialize now accepts a silence_worker_stdout kwarg and also reads CHEMGRAPH_LOCAL_SILENCE_STDOUT=1. When set, it passes a module-level _silence_worker_stdout initializer to ProcessPoolExecutor that runs os.dup2(stderr_fd, stdout_fd) in each child, so worker prints land on stderr (logged) instead of the JSON-RPC pipe. server_utils.run_mcp_server now sets the env var via setdefault before mcp.run(transport='stdio'), so every stdio-launched MCP server gets the fix automatically. Override with CHEMGRAPH_LOCAL_SILENCE_STDOUT=0 to restore raw stdout for debugging. Default behavior unchanged for direct LocalBackend users (notebooks, CLI): env var defaults to off; the 9 TestLocalBackend pytest cases still pass.
scripts/smoke/ -- pass/fail validators, one per backend (local, globus_compute, globus_transfer, parsl_in_job, ensemble_launcher_in_job) plus a shared _smoke_utils.py and README. Trivial water payload, prints [PASS]/[FAIL] per check, exits nonzero on failure. Drives the production get_backend() / GlobusTransferManager / _mace_worker code paths.

scripts/demo/ -- real-chemistry demonstrations, 10 scripts covering each backend with both direct (no LLM) and agent (LLM + MCP) flavours. Each demo runs a 5-molecule (H2O, CH4, NH3, CO2, ethanol) MACE driver='thermo' screen and prints an electronic energy / enthalpy / Gibbs free energy table plus CSV. Shared _demo_chemistry.py helper handles inline-vs-remote structure embedding and JSON round-trip. In-job (Parsl + EnsembleLauncher) scripts target qsub interactive allocations on Polaris/Aurora; EL scripts include a client-only mode for the orchestrator-connection pathway added in bc54083.

Validated locally: smoke_local.py 7/7 pass; demo_local_direct.py screens the 5 molecules (water G=-13.69 eV, ethanol G=-44.91 eV, ethanol lowest); demo_local_agent.py round-trips via stdio MCP with clean teardown.
Crux is a CPU-only AMD EPYC system (no GPUs), so the new configs drop accelerators and use a conda-based worker_init. Wires "crux" through the loader dispatch, the EL SystemConfig registry, and the in-job smoke/demo allowlists (defaulting device to "cpu" instead of cuda/xpu).
- src/chemgraph/hpc_configs/crux_parsl.py: new HighThroughputExecutor config, requires PBS_NODEFILE, max_workers_per_node=16
- src/chemgraph/hpc_configs/loader.py: dispatch "crux" to get_crux_config
- src/chemgraph/execution/ensemble_launcher_backend.py: add get_crux_system_config (ncpus=128, no GPUs) and register it
- scripts/smoke/*, scripts/demo/*: accept --system crux and resolve device defaults to cpu for Crux
- tests/test_execution.py: TestELSystemConfigCrux asserts registry membership and CPU-only SystemConfig shape
Closes the submitter PicklingError / worker AttributeError on run_mace_singleArguments by making FastMCP's dynamic <tool>Arguments and <tool>Output classes picklable by qualname, and by making top-level MCP-server callables pickle by reference even under the runpy double-module case (sys.modules["__main__"] vs sys.modules["pkg.mod"] are distinct objects when launched with python -m, and the leaf module isn't attached to its parent package).
- cg_fastmcp.py: _register_fastmcp_dynamic_models() injects dynamic arg/output models into the func_metadata module namespace and rebinds the captured local in tools.base / prompts.base / resources.templates. _fix_module_for_pickle now also sets the function attr on the resolved target module and attaches the leaf module to its parent package, so dill's by-qualname lookup succeeds. Backend wrappers route args/kwargs through to_picklable.
- mace_mcp_hpc.py: apply _fix_module_for_pickle to _mace_worker and _ls_remote_files; debug-log the transport hook.
- execution/utils.py: add to_picklable() helper that recursively serializes Pydantic BaseModel instances via model_dump().
- execution/parsl_backend.py: wrap task.args / task.kwargs with to_picklable() before dispatching to the python app.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
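A helper like to_picklable() could look roughly like the sketch below. It is an illustration under assumptions, not the code in execution/utils.py: it detects models by duck typing on `model_dump()` (the Pydantic v2 method) so the example runs without Pydantic installed, and it only walks dicts, lists, and plain tuples.

```python
from typing import Any


def to_picklable(obj: Any) -> Any:
    """Recursively replace Pydantic-style model instances with plain data."""
    # Duck-typed check: Pydantic v2 BaseModel instances expose model_dump().
    # Classes themselves are skipped; only instances are converted.
    if not isinstance(obj, type) and callable(getattr(obj, "model_dump", None)):
        return to_picklable(obj.model_dump())
    if isinstance(obj, dict):
        return {key: to_picklable(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [to_picklable(item) for item in obj]
    if type(obj) is tuple:  # plain tuples only; namedtuples pass through as-is
        return tuple(to_picklable(item) for item in obj)
    return obj
```

Converting to plain dicts before dispatch means the worker never needs to import the dynamically generated model class, which is the failure mode the commit describes.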
parsl.clear() only removes the DFK from the global registry; it does not stop executors. Without parsl.dfk().cleanup(), Parsl logs "Python is exiting with a DFK still running" at interpreter exit and relies on atexit hooks for executor teardown. Call cleanup() before clear() and log (but do not raise) on cleanup failure. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds resolve_worker_init(run_dir, fallback) in hpc_configs/loader.py with three-tier precedence: CHEMGRAPH_WORKER_INIT env override → submitter env auto-detect (VIRTUAL_ENV, then CONDA_PREFIX) → caller-provided per-system fallback. Every config now accepts an optional worker_init kwarg and routes through this helper so Parsl workers land in the same Python environment as the submitter without requiring code edits per HPC system. Per-system fallbacks: Crux "module load conda; conda activate base", Aurora "module load frameworks", Polaris/Local "true".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
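The three-tier precedence can be sketched as below. `pick_worker_init` is a hypothetical stand-in for the real helper: it omits the run_dir argument (whose role the message does not describe), and the exact activation commands it emits for a venv or conda prefix are assumptions.

```python
import os
from typing import Mapping, Optional


def pick_worker_init(fallback: str,
                     env: Optional[Mapping[str, str]] = None) -> str:
    """Return the shell snippet a worker should run before starting."""
    env = os.environ if env is None else env

    # Tier 1: an explicit override always wins.
    override = env.get("CHEMGRAPH_WORKER_INIT")
    if override:
        return override

    # Tier 2: mirror the submitter's environment, venv before conda.
    venv = env.get("VIRTUAL_ENV")
    if venv:
        return f"source {venv}/bin/activate"  # assumed activation command
    conda_prefix = env.get("CONDA_PREFIX")
    if conda_prefix:
        return f"conda activate {conda_prefix}"  # assumed activation command

    # Tier 3: the per-system fallback supplied by the caller.
    return fallback
```

Passing `env` explicitly keeps the precedence easy to test without mutating the process environment.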
- demo_parsl_in_job_agent.py: accept crux as a supported system (default device=cpu); forward VIRTUAL_ENV, CONDA_PREFIX, CONDA_DEFAULT_ENV, CHEMGRAPH_WORKER_INIT, PBS_NODEFILE, and PBS_O_WORKDIR to the MCP stdio subprocess so the Parsl workers re-activate the submitter's Python env.
- _demo_chemistry.py: include wall-time column in the agent prompt.
- README.md: document the Crux PBS workflow.
- run_crux_demo.sh: PBS-side wrapper that activates the venv and invokes the Parsl + EnsembleLauncher demos with system=crux.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- _smoke_utils.py: add ensure_on_worker_pythonpath() so Parsl workers can import _smoke_utils from the script directory.
- smoke_parsl_in_job.py / smoke_ensemble_launcher_in_job.py: call ensure_on_worker_pythonpath() at import time.
- README.md: document the Crux PBS workflow.
- run_crux_smoke.sh: PBS-side wrapper that activates the venv and runs both smoke entrypoints with system=crux.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`get_launcher_config`'s default `mpi_flavour="test"` only works for
single-host runs: its `write_file_to_nodes` does not actually distribute
the per-child JSON spec to remote `/tmp`, so when `main.mN` tries to
launch a worker on a different node it dies with `FileNotFoundError`
on `/tmp/.mpiexec_tmp/child_<uuid>.json` and the demo hangs.
Flip the default to `"mpich"` (hydra `mpiexec`), widen the Literal to
cover every flavour EL knows about, and pick the right one per system
in `get_backend("ensemble_launcher", system=...)`: `aurora`/`polaris`/
`crux` → `"mpich"`, `local` → `"test"`. An explicit
`[execution.ensemble_launcher] mpi_flavour` in `config.toml` still
overrides.
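The per-system selection reads roughly like the sketch below. `select_mpi_flavour` and the lookup table are hypothetical names for illustration; only the system-to-flavour pairs, the `"mpich"` default, and the config override come from the message above.

```python
from typing import Optional

DEFAULT_MPI_FLAVOUR = "mpich"

# System name -> flavour, as listed in the commit message.
SYSTEM_MPI_FLAVOUR = {
    "aurora": "mpich",
    "polaris": "mpich",
    "crux": "mpich",
    "local": "test",
}


def select_mpi_flavour(system: Optional[str],
                       configured: Optional[str] = None) -> str:
    """Pick the MPI flavour, letting an explicit config value win."""
    if configured:
        # Corresponds to [execution.ensemble_launcher] mpi_flavour in config.toml.
        return configured
    if system is None:
        return DEFAULT_MPI_FLAVOUR
    return SYSTEM_MPI_FLAVOUR.get(system.lower(), DEFAULT_MPI_FLAVOUR)
```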
The existing `threading.Lock` in `mace_calc._mace_lock` only protected threads inside one process. The EnsembleLauncher `async_processpool` spawns multiple Python workers in parallel on the same node, so sibling processes raced on the torch.load + symbolic_trace path (see issue #110) and tripped the same NameError / hang the lock was introduced to prevent. Add `mace_loading_lock()`, a context manager that holds the existing in-process lock and an `fcntl.flock` on a per-uid lockfile under `$CHEMGRAPH_MACE_LOCK_DIR` → `$TMPDIR` → `tempfile.gettempdir()` → `~/.cache/chemgraph`. Degrades gracefully to thread-only locking where `fcntl` is unavailable or no writable directory exists. Move the lock acquisition into `load_calculator` so every entry path (Parsl, EnsembleLauncher, local, agent) is covered, not just `ase_tools.run_ase`. Drop the now-redundant `with _mace_lock:` in `run_ase`.
The MACE MCP server embedded each local structure inline and had the worker re-materialise it to /tmp on every tool call, regardless of backend. This is only needed for Globus Compute, whose workers run on a remote host. For local/parsl/ensemble_launcher backends (shared FS with the server) it was pure overhead -- extra serialization, redundant disk I/O, and a full_output read-back. Add a shares_filesystem capability to ExecutionBackend (True by default; False for Globus Compute, config-overridable via the shares_filesystem kwarg). The MACE transport hook now embeds inline only when the backend does not share the filesystem; the worker already no-ops its inline branch when the key is absent, so shared-FS backends read the input path directly.
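The capability check can be illustrated with the sketch below. The class names, `build_task_kwargs`, and the `input_structure_file` / `inline_structure` keys are all hypothetical; only the `shares_filesystem` attribute, its defaults, and the kwarg override come from the message above.

```python
from pathlib import Path
from typing import Any, Dict, Optional


class BackendSketch:
    """Stand-in for an execution backend with a filesystem capability flag."""

    shares_filesystem: bool = True  # default: workers see the server's files

    def __init__(self, shares_filesystem: Optional[bool] = None) -> None:
        if shares_filesystem is not None:
            # Config override takes precedence over the class default.
            self.shares_filesystem = shares_filesystem


class RemoteBackendSketch(BackendSketch):
    """Stand-in for a backend whose workers run on a remote host."""

    shares_filesystem = False


def build_task_kwargs(backend: BackendSketch, input_path: str) -> Dict[str, Any]:
    """Embed the structure inline only when the worker cannot read the path."""
    kwargs: Dict[str, Any] = {"input_structure_file": input_path}
    if not backend.shares_filesystem:
        kwargs["inline_structure"] = Path(input_path).read_text()
    return kwargs
```

With a shared filesystem the inline key is simply absent, so a worker that already skips its inline branch in that case needs no change.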
The worker embedded the entire output JSON into the returned result as full_output when an inline structure was used. Results are already persisted to output_result_file, so this just bloated the tool response. Return only what run_mace_core produces; drop the now-unused json import.
Under a stdio MCP server, the server's stdout is the JSON-RPC channel.
EnsembleLauncher prints lifecycle notices ("Sent SIGTERM to launcher
process ...") to stdout during orchestrator shutdown, which corrupted the
protocol stream and crashed the client's message parser with a
ValidationError / BrokenResourceError after an otherwise successful run.
Add a fd-level stdout->stderr redirect context manager and wrap the
client/orchestrator teardown calls in shutdown() with it, so the notices
go to stderr instead of the JSON-RPC channel. fd-level dup2 (matching
LocalBackend's worker-stdout guard) catches library/subprocess writes,
not just Python-level sys.stdout.
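A minimal sketch of such a context manager is shown below. The name `stdout_to_stderr_fd` and its parameters are hypothetical; the descriptors are parameterized only so the behaviour can be exercised with pipes.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def stdout_to_stderr_fd(stdout_fd: int = 1, stderr_fd: int = 2):
    """Temporarily make stdout_fd an alias of stderr_fd at the OS level."""
    sys.stdout.flush()          # drain Python's buffer before swapping fds
    saved = os.dup(stdout_fd)   # keep a handle on the original stdout target
    try:
        os.dup2(stderr_fd, stdout_fd)
        yield
    finally:
        sys.stdout.flush()      # anything buffered during the block goes to stderr
        os.dup2(saved, stdout_fd)  # restore the original stdout target
        os.close(saved)
```

Wrapping teardown in `with stdout_to_stderr_fd(): ...` redirects writes from child processes and C libraries as well, since they share the same descriptor table entry, whereas `contextlib.redirect_stdout` would only affect Python-level `sys.stdout`.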
Layer an fd-level stdout->stderr redirect around shutdown() on top of the subprocess-based orchestrator rework. The orchestrator subprocess already redirects to DEVNULL, but two teardown paths can still print to the JSON-RPC stdout under a stdio MCP server: the in-process client.teardown() and the `el stop` helper (which inherits the parent's stdout). Wrapping the whole shutdown() in the fd guard covers both, preventing the ValidationError / BrokenResourceError crash after an otherwise successful run.
# Conflicts:
#	src/chemgraph/execution/ensemble_launcher_backend.py
Merge el_test into dev-globus-hpc: EL subprocess rework + HPC fixes
# Conflicts:
#	src/chemgraph/hpc_configs/aurora_parsl.py
#	src/chemgraph/tools/ase_core.py
demo_parsl_in_job_agent.py forced model_name="argo:gpt-5.4" and a local argoapi base_url, mirroring the temp config that had also landed in demo_ensemble_launcher_in_job_agent.py. Restore model selection from the --model flag (the amain(model=...) parameter) and drop the hardcoded base_url so the demo honours the user's chosen model.
TestELSystemConfigCrux asserted the exact Crux SystemConfig shape (ncpus==128, CPU-only) and the registry membership of "crux". These are tied to one specific machine's hardware layout and don't belong in the portable unit-test suite. The polaris references elsewhere are left as-is since they only pass "polaris" as an arbitrary system string to exercise generic GlobusCompute behaviour.
EnsembleLauncher is an optional, HPC-only dependency (not on PyPI for
Python 3.12), so instantiating EnsembleLauncherBackend() raised
ImportError and hard-failed all nine TestELBackend cases in any env
without it. Add pytest.importorskip("ensemble_launcher") in setup_class
so the class skips cleanly, matching the guard already used by the
GlobusCompute tests.
Make sure all backends work across ALCF machines and tools
Backend/Systems tested: