Publish capability snapshots to the relay for Galaxy BYOC #458
Force-pushed from 6b51584 to 1294838.
    if TYPE_CHECKING:
        # Imported lazily to avoid the runtime cycle: pulsar.core imports this
        # module to populate the cache.
I don't think we need that comment around TYPE_CHECKING. This feels like an overcorrection in your Claude memory? Lazy imports in the body of a Python file are different from TYPE_CHECKING imports, IMO. Not a blocker - just trying to adjust future development if this is what is happening.
yep, i'll drop that, i keep telling claude to remove unmotivated local imports and i guess that's what happened as a response.
🤖 Posted by Claude (AI assistant) on John's behalf — not authored by John personally. Minor: the comment on def __setup_capabilities(self):
# Snapshot of static config + host probes that get_capabilities
# publishes to the relay. ...The actual publisher is |
Force-pushed from 7d78702 to a9dae16.
🤖 Posted by Claude (AI assistant) on John's behalf — not authored by John personally. Follow-up to the earlier comment about the unconditional capability collection. Proposing a concrete change and would like to request it be folded into this PR (or a fast follow-up): Detailed, reviewed plan: https://gist.github.com/jmchilton/e86be28bdeb331d3e0437b58f261fbda It rolls together two coupled changes:
The plan was reviewed by a separate agent against the actual implementation; verdict was mostly yes (sound design, real test improvement), with two framing corrections it flagged already applied to the linked version, notably keeping the

Happy to open this as a PR against the branch if the direction looks right to you, @mvdbeek. Flagging the lazy-collection piece as the part worth an explicit decision; the test rework can stand alone if you prefer to take only the first half.
Yes, direction looks good, you can directly push to my branch if you like. Did you manually prompt for the plan, or is there something I can add to my workflow?
🤖 Posted by Claude (AI assistant) on John's behalf — not authored by John personally. One more follow-up: this PR adds a capability snapshot to the relay but the prose docs (
Drafted drop-in RST for both (matching the existing heading style), plus one explicitly-optional https://gist.github.com/jmchilton/84b6dcda87a395f899041848f1743dd5

Deliberately excluded as speculative: Galaxy-side BYOC consumer documentation (how Galaxy fetches/interprets the snapshot) and a field-by-field wire-schema reference; that contract is still in flux on the separate unmerged Galaxy PR and arguably belongs in Galaxy’s docs.

Happy to push these into the branch if the direction looks right, @mvdbeek.
yes, also good! |
Adds a static config + host-probe snapshot (staging dirs, dependency
resolvers, container runtimes, manager type) collected once during
PulsarApp init and POSTed once to a per-manager relay topic from
messaging.bind_app.
Galaxy can fetch the latest snapshot synchronously via the relay's
existing /api/v1/topics/{topic}/messages?limit=1&order=desc endpoint
and use it to auto-fill destination params on BYOC bootstrap and to
downgrade requests at job-build time when the remote pulsar doesn't
actually offer what the destination asks for.
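For reference, a minimal sketch of how a consumer could build that "latest snapshot" request. The helper name `latest_snapshot_url` and the relay host are hypothetical; only the endpoint path and query parameters come from the description above, and the real relay client may construct the request differently.

```python
from urllib.parse import quote, urlencode


def latest_snapshot_url(relay_base_url: str, topic: str) -> str:
    """Build the URL that asks the relay for only the newest message on a topic.

    Hypothetical helper: the path and query string mirror the endpoint named
    in the commit message; everything else is an illustrative assumption.
    """
    path = f"/api/v1/topics/{quote(topic, safe='')}/messages"
    # limit=1 + order=desc means "newest message only".
    query = urlencode({"limit": 1, "order": "desc"})
    return f"{relay_base_url.rstrip('/')}{path}?{query}"


# Example with a placeholder relay host and the default topic name.
print(latest_snapshot_url("https://relay.example.org/", "pulsar_capabilities"))
```

Because the snapshot is published once per bind, asking for a single message in descending order should be enough to see the current state without replaying the topic.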
The publish is fire-and-forget and gated by message_queue_publish_capabilities
(default True). Failures are logged and swallowed — capabilities are
advisory and must not block manager bind.
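The gate-then-swallow behaviour described above can be sketched roughly as follows. This is not the code in `bind_relay.py`: the function name `publish_capabilities_once`, the dict-style `conf`, and the `publish` callable are assumptions made for illustration; only the setting name and its default of `True` come from the commit message.

```python
import logging
from typing import Any, Callable, Mapping

log = logging.getLogger(__name__)


def publish_capabilities_once(
    conf: Mapping[str, Any],
    snapshot: Mapping[str, Any],
    publish: Callable[[Mapping[str, Any]], Any],
) -> bool:
    """Attempt a single publish; return True only if the snapshot was sent.

    `publish` is a hypothetical stand-in for whatever posts to the relay topic.
    """
    # Gate: the setting defaults to True when it is absent from the config.
    if not conf.get("message_queue_publish_capabilities", True):
        return False
    try:
        publish(snapshot)
    except Exception:
        # Capabilities are advisory, so log the failure and carry on
        # instead of letting it abort manager bind.
        log.exception("Failed to publish capability snapshot; continuing")
        return False
    return True
```

The boolean return value is only there to make the sketch easy to test; the caller is expected to ignore it, which is what makes the publish fire-and-forget.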
Add Architecture item 6 and a "Capability Snapshot" sub-subsection to docs/configure.rst, and a one-line advisory-exclusion note to the pulsar-relay durability section of docs/error_handling.rst. The ``message_queue_publish_capabilities`` knob now has the same prose documentation coverage as the other relay configuration options.
Force-pushed from a9dae16 to 4962f59.
I missed these follow-ups - as I always do - sorry.
I was in there asking very specific questions. It was definitely not something I could just hand off - I found a thread and pulled on it.
Summary
Adds a one-shot capability snapshot that the Pulsar daemon publishes to its message-queue relay on startup. Galaxy's "bring your own compute" (BYOC) integration uses it to:

- auto-fill destination params (`staging_directory`, runtime flags, manager type) when a user bootstraps a new resource, and
- downgrade requests at job-build time when the remote Pulsar doesn't actually offer what the destination asks for (for example, `singularity_enabled: true` but no singularity on `$PATH`).

The publish is fire-and-forget, gated by the new `message_queue_publish_capabilities` setting (default `True`), and runs once per manager during `messaging.bind_app`. Failures are logged and swallowed; capabilities are advisory and must not block manager bind.

What the snapshot contains (schema_version=1)
Collected once during `PulsarApp.__init__` from existing pulsar state, no external probes beyond `shutil.which`:

- `pulsar_version`, `manager_name`, `manager_type`, `num_concurrent_jobs`, `work_threads`
- `staging_directory`, `persistence_directory`, `tool_dependency_dir`
- `dependency_resolvers`: list of `{resolver_type, versionless, prefix?, ensure_channels?}` for each configured resolver (conda fields read through `isinstance(r, CondaDependencyResolver)` rather than string-typed probes)
- `container_runtime`: `{docker_available, singularity_available, apptainer_available}` from `shutil.which`
- `conda_available`: convenience flag derived from the resolver list

Published once to `pulsar_capabilities` (or `<prefix>_pulsar_capabilities[_<manager>]` for non-default managers, mirroring `__make_capabilities_topic_name` for the other relay topics).

Files
- `pulsar/capabilities.py`: collector + Pydantic-typed `PulsarCapabilities` dataclass. Reads through typed attributes (`PulsarApp` / `StatefulManagerProxy` via `TYPE_CHECKING`) rather than getattr probes; the few remaining getattr sites are for fields that legitimately vary across third-party manager/resolver subclasses and are commented as such.
- `pulsar/core.py`: wires the collector into `PulsarApp` init.
- `pulsar/messaging/bind_relay.py`: adds `__make_capabilities_topic_name`, `_publish_capabilities`, and the one-shot publish from `bind_app`.
- `pulsar/messaging/__init__.py`: re-exports.
- `app.yml.sample`: documents `message_queue_publish_capabilities`.
- `test/capabilities_test.py`: 25 unit tests for the collector + topic-name convention.
- `test/messaging_capabilities_test.py`: 4 messaging-side tests (publish, error-swallowing, topic-name shape, disabled-flag respect). Guarded with `importlib.util.find_spec('pulsar_relay_client')` because the relay client requires Python >=3.10.
- `test/resilience/scenarios/test_capabilities_publish.py`: end-to-end against a live relay.

Test plan
- `tox -e py310-test test/capabilities_test.py test/messaging_capabilities_test.py`: 29 pass locally
- `tox -e lint`
- `tox -e mypy`: clean, 183 source files
- `pulsar` job (unit + messaging + resilience harness): green
- `test/integration/compute_resources/test_compute_resource_tool_execution.py` consumes the snapshot via the relay's `GET /api/v1/topics/{topic}/messages?limit=1&order=desc`; tracked in the Galaxy BYOC PR.

Notes
- The relay client (`pulsar-relay-client>=0.2.2`) already wraps the read side via `HttpRelayClient.fetch_messages`.
- On Python < 3.10, the `messaging_capabilities_test` module is silently skipped (the relay client is `python_requires=">=3.10"`); the rest of the codebase is unaffected.
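
For readers unfamiliar with the skip guard, here is a minimal sketch of the `importlib.util.find_spec` pattern mentioned for `test/messaging_capabilities_test.py`. The class and test names are hypothetical, and `unittest.skipUnless` is used only for illustration; the actual module may skip through a different mechanism.

```python
import importlib.util
import unittest

# True only when the optional relay client can be imported by this interpreter.
# find_spec returns None (rather than raising) for a missing top-level package.
RELAY_CLIENT_AVAILABLE = importlib.util.find_spec("pulsar_relay_client") is not None


@unittest.skipUnless(RELAY_CLIENT_AVAILABLE, "pulsar_relay_client is not installed")
class MessagingCapabilitiesGuardExample(unittest.TestCase):  # hypothetical test case
    def test_relay_client_importable(self):
        # Only reached when the guard above passed, so the import is safe here.
        import pulsar_relay_client  # noqa: F401
```

Checking with `find_spec` avoids importing the package at collection time, so environments without the relay client skip these tests instead of failing with an `ImportError`.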