Skip to content

Publish capability snapshots to the relay for Galaxy BYOC#458

Merged
jmchilton merged 2 commits into
galaxyproject:masterfrom
mvdbeek:pulsar-capabilities
May 26, 2026
Merged

Publish capability snapshots to the relay for Galaxy BYOC#458
jmchilton merged 2 commits into
galaxyproject:masterfrom
mvdbeek:pulsar-capabilities

Conversation

@mvdbeek

@mvdbeek mvdbeek commented May 15, 2026

Copy link
Copy Markdown
Member

Summary

Adds a one-shot capability snapshot that the Pulsar daemon publishes to its message-queue relay on startup. Galaxy's "bring your own compute" (BYOC) integration uses it to:

  • auto-fill destination params (staging_directory, runtime flags, manager type) when a user bootstraps a new resource, and
  • downgrade per-job requests at dispatch time when the operator-configured destination asks for something the remote daemon doesn't actually have (e.g. singularity_enabled: true but no singularity on $PATH).

The publish is fire-and-forget, gated by the new message_queue_publish_capabilities setting (default True), and runs once per manager during messaging.bind_app. Failures are logged and swallowed — capabilities are advisory and must not block manager bind.

What the snapshot contains (schema_version=1)

Collected once during PulsarApp.__init__ from existing pulsar state, no external probes beyond shutil.which:

  • pulsar_version, manager_name, manager_type, num_concurrent_jobs, work_threads
  • staging_directory, persistence_directory, tool_dependency_dir
  • dependency_resolvers: list of {resolver_type, versionless, prefix?, ensure_channels?} for each configured resolver (conda fields read through isinstance(r, CondaDependencyResolver) rather than string-typed probes)
  • container_runtime: {docker_available, singularity_available, apptainer_available} from shutil.which
  • conda_available: convenience flag derived from the resolver list

Published once to pulsar_capabilities (or <prefix>_pulsar_capabilities[_<manager>] for non-default managers, mirroring __make_capabilities_topic_name for the other relay topics).

Files

  • pulsar/capabilities.py — collector + Pydantic-typed PulsarCapabilities dataclass. Reads through typed attributes (PulsarApp / StatefulManagerProxy via TYPE_CHECKING) rather than getattr probes; the few remaining getattr sites are for fields that legitimately vary across third-party manager/resolver subclasses and are commented as such.
  • pulsar/core.py — wires the collector into PulsarApp init.
  • pulsar/messaging/bind_relay.py — adds __make_capabilities_topic_name, _publish_capabilities, and the one-shot publish from bind_app.
  • pulsar/messaging/__init__.py — re-exports.
  • app.yml.sample — documents message_queue_publish_capabilities.
  • test/capabilities_test.py — 25 unit tests for the collector + topic-name convention.
  • test/messaging_capabilities_test.py — 4 messaging-side tests (publish, error-swallowing, topic-name shape, disabled-flag respect). Guarded with importlib.util.find_spec('pulsar_relay_client') because the relay client requires Python >=3.10.
  • test/resilience/scenarios/test_capabilities_publish.py — end-to-end against a live relay.

Test plan

  • tox -e py310-test test/capabilities_test.py test/messaging_capabilities_test.py — 29 pass locally
  • tox -e lint
  • tox -e mypy — clean, 183 source files
  • CI pulsar job (unit + messaging + resilience harness): green
  • Galaxy-side e2e: test/integration/compute_resources/test_compute_resource_tool_execution.py consumes the snapshot via the relay's GET /api/v1/topics/{topic}/messages?limit=1&order=desc; tracked in the Galaxy BYOC PR.

Notes

  • This PR's contract is one-way: pulsar publishes, the relay stores, Galaxy reads. The relay is dumb storage; no protocol change there. The relay client (pulsar-relay-client>=0.2.2) already wraps the read side via HttpRelayClient.fetch_messages.
  • On Python <3.10 the messaging_capabilities_test module is silently skipped (the relay client is python_requires=">=3.10"); the rest of the codebase is unaffected.
  • No changes to job execution paths — every existing manager binds the same way; the only new thing they emit is a single advisory message.

@mvdbeek mvdbeek force-pushed the pulsar-capabilities branch 2 times, most recently from 6b51584 to 1294838 Compare May 15, 2026 13:17
Comment thread pulsar/capabilities.py Outdated

if TYPE_CHECKING:
# Imported lazily to avoid the runtime cycle: pulsar.core imports this
# module to populate the cache.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need that comment around TYPE_CHECKING. This feels like over correction in your Claude memory? Lazy imports in the body of the Python file is different than TYPE_CHECKING imports IMO. Not a blocker - just trying to adjust future development if this is what is happening.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, i'll drop that, i keep telling claude to remove unmotivated local imports and i guess that's what happened as a response.

@jmchilton

Copy link
Copy Markdown
Member

🤖 Posted by Claude (AI assistant) on John's behalf — not authored by John personally.

Minor: the comment on PulsarApp.__setup_capabilities in pulsar/core.py references a get_capabilities symbol that does not exist anywhere in this PR (or the codebase):

def __setup_capabilities(self):
    # Snapshot of static config + host probes that get_capabilities
    # publishes to the relay. ...

The actual publisher is publish_manager_capabilities_to_relay in pulsar/messaging/bind_relay.py. Suggest renaming the reference (or dropping the symbol) so it does not send readers hunting for a method that is not there.

@mvdbeek mvdbeek force-pushed the pulsar-capabilities branch 2 times, most recently from 7d78702 to a9dae16 Compare May 15, 2026 18:09
@jmchilton

Copy link
Copy Markdown
Member

🤖 Posted by Claude (AI assistant) on John's behalf — not authored by John personally.

Follow-up to the earlier comment about the unconditional capability collection. Proposing a concrete change and would like to request it be folded into this PR (or a fast follow-up):

Detailed, reviewed plan: https://gist.github.com/jmchilton/e86be28bdeb331d3e0437b58f261fbda

It rolls together two coupled changes:

  • Lazy collection (the substantive one). Move collect_capabilities out of PulsarApp.__init__ and into the relay publisher, so collection happens iff relay mode is active and message_queue_publish_capabilities is true and per actually-bound manager. Today every deployment mode (AMQP, web-only, coexecution) computes and caches a snapshot only the relay path ever reads, and a per-manager collection hiccup logs an ERROR stack trace at startup for operators not using the feature. This also strengthens the "advisory, must not block bind" guarantee (collection failure is now contained in the publisher’s try/except) and is byte-for-byte wire/topic compatible (schema_version stays 1). It incidentally deletes the stale get_capabilities comment from my earlier comment for free, since that comment lives in the method being removed.

  • Re-anchoring the new tests on real objects. capabilities_test.py / messaging_capabilities_test.py lean on bespoke duck-typed _App/_Manager/_FakeApp stand-ins that mirror the same assumptions the code makes (so they stay green through a real StatefulManagerProxy/QueueManager contract drift). The plan reuses existing suite abstractions — real StatefulManagerProxy(QueueManager(...)) per the persistence_test.py pattern, plus small additive extensions to minimal_app_for_managers()/TestDependencyManager and a shared RecordingRelayTransport fake — so the doubles that remain are only at genuine boundaries (relay HTTP, shutil.which, and one explicitly-justified seam patch). The pure-logic tests (_conda_*, container-runtime cross-product, real CondaDependencyResolver) are kept as-is.

The plan was reviewed by a separate agent against the actual implementation; verdict was mostly yes (sound design, real test improvement), with two framing corrections it flagged already applied to the linked version — notably keeping the num_concurrent_jobs-fallback branch tests (that branch is unreachable by any in-tree manager, so a minimal stub there is correct, not mock theater).

Happy to open this as a PR against the branch if the direction looks right to you, @mvdbeek. Flagging the lazy-collection piece as the part worth an explicit decision; the test rework can stand alone if you prefer to take only the first half.

@mvdbeek

mvdbeek commented May 15, 2026

Copy link
Copy Markdown
Member Author

Yes, direction looks good, you can directly push to my branch if you like. Did you manually prompt for the plan or is there something I can add to my workflow ?

@jmchilton

Copy link
Copy Markdown
Member

🤖 Posted by Claude (AI assistant) on John's behalf — not authored by John personally.

One more follow-up: this PR adds a capability snapshot to the relay but the prose docs (docs/configure.rst) only get app.yml.sample. Two consistency/completeness gaps result (not new design — just describing what this PR already does):

  1. The numbered Architecture enumeration in the "Message Queue (pulsar-relay)" section lists every interaction crossing the relay (1–5); the capability publish is a real additional Pulsar→Relay interaction that’s now missing from it.
  2. Config-option parity — every other relay knob (message_queue_url, message_queue_username/password, relay_topic_prefix) is in the prose docs; message_queue_publish_capabilities lives only in app.yml.sample.

Drafted drop-in RST for both (matching the existing heading style), plus one explicitly-optional error_handling.rst line so readers don’t assume the snapshot POST is delivery-guaranteed:

https://gist.github.com/jmchilton/84b6dcda87a395f899041848f1743dd5

Deliberately excluded as speculative: Galaxy-side BYOC consumer documentation (how Galaxy fetches/interprets the snapshot) and a field-by-field wire-schema reference — that contract is still in flux on the separate unmerged Galaxy PR and arguably belongs in Galaxy’s docs.

Happy to push these into the branch if the direction looks right, @mvdbeek.

@mvdbeek

mvdbeek commented May 15, 2026

Copy link
Copy Markdown
Member Author

yes, also good!

mvdbeek added 2 commits May 19, 2026 10:22
Adds a static config + host-probe snapshot (staging dirs, dependency
resolvers, container runtimes, manager type) collected once during
PulsarApp init and POSTed once to a per-manager relay topic from
messaging.bind_app.

Galaxy can fetch the latest snapshot synchronously via the relay's
existing /api/v1/topics/{topic}/messages?limit=1&order=desc endpoint
and use it to auto-fill destination params on BYOC bootstrap and to
downgrade requests at job-build time when the remote pulsar doesn't
actually offer what the destination asks for.

The publish is fire-and-forget and gated by message_queue_publish_capabilities
(default True). Failures are logged and swallowed — capabilities are
advisory and must not block manager bind.
Add Architecture item 6 and a "Capability Snapshot" sub-subsection to
docs/configure.rst, and a one-line advisory-exclusion note to the
pulsar-relay durability section of docs/error_handling.rst. The
``message_queue_publish_capabilities`` knob now has the same prose
documentation coverage as the other relay configuration options.
@mvdbeek mvdbeek force-pushed the pulsar-capabilities branch from a9dae16 to 4962f59 Compare May 19, 2026 08:31
@jmchilton

Copy link
Copy Markdown
Member

I missed these follow ups - as I always do - sorry.

Did you manually prompt for the plan or is there something I can add to my workflow ?

I was in there asking very specific questions. It was definitely not something I could just hand off - I found a thread and pulled on it.

@jmchilton jmchilton merged commit 9ab40ff into galaxyproject:master May 26, 2026
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants