Bootstrap pulsar-relay credentials via OIDC device flow by mvdbeek · Pull Request #454 · galaxyproject/pulsar

mvdbeek · 2026-05-11T09:34:27Z

Summary

Adds pulsar-config --login --relay-url URL, a one-command happy path for configuring a fresh Pulsar instance against a pulsar-relay server. The CLI runs the RFC 8628 device-authorization grant in the operator's browser, scaffolds a relay-flavoured config directory, and writes a relay_credentials.json (mode 0600) holding a rotating refresh token. The daemon then exchanges the refresh token for short-lived access JWTs automatically — no shared password to manage and no manual app.yml edits.

The branch is 3 self-contained commits:

438713e Bootstrap pulsar-relay credentials via OIDC device flow — adds the device-flow client (relay_device_flow.py), the credentials-file abstraction (relay_credentials.py), and refactors RelayAuthManager into pluggable Authenticator strategies (PasswordAuthenticator, RefreshTokenAuthenticator, DeviceFlowAuthenticator). The legacy RelayAuthManager(url, username, password) signature is preserved.
ded6cfe Let pulsar-config --login compose with --mq — when both flags are passed, the scaffolded app.yml is written with message_queue_url + message_queue_credentials_file pointing at the file --login is about to produce, so the directory is ready to start Pulsar against without further edits.
eae46dc Make pulsar-config --login the do-everything happy path — --login now implies --mq, and refuses to overwrite an existing app.yml unless --force is set. A new --login-only flag is added for operators with hand-rolled app.yml who only want to bootstrap the credentials file.

CLI surface

# Greenfield: scaffold app.yml + local_env.sh + dependencies/ + sign in
pulsar-config --login --relay-url https://relay.example.org:9000 --directory .

# Operator already has app.yml — just (re)bootstrap the credentials file
pulsar-config --login-only --relay-url https://relay.example.org:9000 --directory .

# Refresh against an existing scaffolded directory
pulsar-config --login --relay-url ... --force --directory .

Why

Previously the only way to authenticate to pulsar-relay was a pre-shared username/password baked into app.yml, which is awkward to provision and to rotate. The device flow leans on the relay's existing OIDC integration: an operator signs in interactively once, and the daemon receives a rotating refresh token it can use indefinitely without further human intervention.

Backwards compatibility

RelayAuthManager(url, username, password) keeps working for existing deployments.
pulsar-config --mq without --relay-url still produces the AMQP-flavoured app.yml it always did; only --mq --relay-url switches to the relay template.
app.yml.sample is updated to document the new message_queue_credentials_file key alongside the legacy message_queue_username / message_queue_password, with a comment recommending the credentials-file path for new deployments.

Adds `pulsar-config --login --relay-url URL`, which runs the RFC 8628 device-authorization-grant against the relay (printing a verification URL + user code), waits for an operator to complete an OIDC sign-in in any browser, and writes a `relay_credentials.json` (mode 0600) holding a rotating refresh token. The pulsar daemon then picks up the file via `message_queue_credentials_file: ./relay_credentials.json` and rotates the token automatically — no shared password to manage. RelayAuthManager is refactored into pluggable Authenticator strategies (Password / RefreshToken / DeviceFlow). The legacy `RelayAuthManager(url, username, password)` signature is preserved so existing deployments keep working without changes; new deployments should prefer the credentials file path.

Previously --login early-returned and just wrote relay_credentials.json, leaving operators to hand-edit a fresh app.yml. Combine the two: when --mq and --relay-url are both passed, the scaffolded app.yml gets message_queue_url + message_queue_credentials_file pointing at the file --login is about to write. After the device flow completes the config directory is ready to start pulsar against without further edits. Standalone --login (no --mq) still works for the case where app.yml already exists.

Previously --login required also passing --mq, which surprised operators who reasonably assumed `--login --relay-url URL` was self-contained. Two behavioral changes: - --login now implies --mq. Running `pulsar-config --login --relay-url URL` in an empty directory scaffolds app.yml (with message_queue_url + message_queue_credentials_file), local_env.sh, dependencies/, then runs the device flow. - --login refuses to clobber an existing app.yml unless --force is passed. The error points at the new --login-only flag, which only bootstraps relay_credentials.json and leaves any existing config untouched — for operators who hand-roll their own app.yml. The legacy bare `pulsar-config --mq` (no --relay-url) still produces an AMQP-flavoured config; only `--mq --relay-url` switches to the relay template.

relay_auth_test.py imports the responses lib to stub the relay HTTP endpoints, but the package was never declared as a dev dependency. The inline comment claimed it was "already a dev dep"; it wasn't, so ``tox -e mypy`` failed with an import-not-found error on a clean venv.

Introduces the pulsar-side pieces for Bring-Your-Own-Compute registration with a Galaxy server: * ``pulsar-config --register-with-galaxy <url> --galaxy-token <token>`` drives the relay's RFC 8628 device flow with ``pair=true``, decodes the ``sub`` from the access token as the BYOC manager_name, posts the *secondary* refresh token to Galaxy's bootstrap endpoint, and writes a local ``relay_credentials.json`` + ``app.yml`` so the Pulsar daemon can immediately connect to the same topics Galaxy publishes to. * ``pulsar.client.galaxy_byoc.register_with_galaxy()`` is the orchestrator (callable from tests and the CLI). * ``RelayClientManager`` accepts ``relay_refresh_token`` + ``on_refresh_token_rotated`` for in-memory, callback-persisted credentials (Galaxy's multi-tenant runner holds tokens in its vault rather than on disk). * ``_per_handler_cursor_path`` learns about ``manager_name`` so one Galaxy handler driving many BYOC managers gets a distinct cursor file per tenant. * ``RelayAuthManager`` and ``RelayDeviceFlowAuthenticator`` gain ``credentials_store=`` / ``pair=`` knobs to support the in-memory and pair-issuance paths.

The pulsar-relay HTTP wire contract (auth, credentials, device flow, transport, topic ownership) has been extracted into a standalone ``pulsar-relay-client`` distribution shipped from the pulsar-relay repo. This commit retires the in-tree copies and depends on the package instead. * Adds ``pulsar-relay-client>=1.0,<2`` to ``requirements.txt``. * Replaces ``pulsar.client.relay_auth`` / ``relay_credentials`` / ``relay_device_flow`` / ``transport.relay`` with imports from ``pulsar_relay_client``. The modules and their tests are deleted outright — no re-export shim — so external consumers update at the same time. * ``RelayClientManager`` and ``pulsar-config --register-with-galaxy`` consume the same APIs through the package; the daemon-specific bits (``RelayClientManager`` subscriber lifecycle, ``_per_handler_cursor_path``, the BYOC CLI orchestrator) stay here. See ``docs/extracting-client.md`` in the pulsar-relay repo for the full extraction plan and rationale.

The migration commit pinned `>=1.0,<2`, but the package's published PyPI release line is 0.x (latest 0.2.0). The previous pin failed to resolve against PyPI; this matches what is actually shipped.

* `_validate_relay_args` extracts the `--login` / `--login-only` / `--register-with-galaxy` validation out of `main()` so the function drops back under the McCabe complexity ceiling (17 → below 14). * `**kwds: Dict[str, Any]` was always wrong (it claims each value is itself a dict). The latent error stayed quiet until the migration routed `self.manager_name` through the relay-client's strictly-typed `_per_handler_cursor_path`, which then surfaced the bogus `dict[str, Any] | str`. Use `**kwds: Any` everywhere in `pulsar/client/manager.py`.

The published pulsar-relay-client wheel is gated to ``Requires-Python >=3.10``, so a flat top-level dependency broke install on Pulsar's 3.7 test matrix entries (test-ci, test-unit, lint, docs). * requirements.txt: mark pulsar-relay-client conditional on ``python_version >= "3.10"`` so older interpreters stop trying to resolve it. * Move the ``from pulsar_relay_client import ...`` lines in ``messaging/bind_relay.py``, ``client/galaxy_byoc.py``, and ``client/manager.py`` into the function/method that actually uses the symbols, so importing those modules no longer pulls the unavailable dependency. The relay code paths themselves remain unreachable on <3.10, which matches the dependency's reality.

The three end-to-end ``register_with_galaxy`` tests instantiate the function under test, which lazy-imports ``pulsar_relay_client``. That dependency is now ``python_version >= "3.10"``, so on the 3.7 matrix entries the import blows up at test time. Skip them when ``find_spec("pulsar_relay_client")`` returns None — the pure-Python ``_decode_jwt_sub`` tests in the same module still run everywhere.

0.2.1 (PR galaxyproject/pulsar-relay#8) adds PULSAR_RELAY_ALLOW_INSECURE=1 to disable the plaintext-to-non-localhost rejection in the URL guard. The resilience compose stack routes pulsar/mock-galaxy through http://toxiproxy:9000 (and http://relay:8080) so faults can be injected without TLS termination — exactly the controlled context the bypass is for. Inject the env var on both services and bump the requirement so the bypass is actually present at install time.

The ``;`` filter in ``setup.py`` is vestigial: it dates from a Python-2.7-fallback split (``py27_requirements`` was the other half) that was deleted in ac9bfc7. The filter stayed and silently dropped every PEP 508 marker-bearing requirement from ``install_requires`` — which is why ``pulsar-relay-client>=0.2.1; python_version >= "3.10"`` was missing from the resilience-suite Pulsar image and the container crashed at startup with ``ModuleNotFoundError``. setuptools handles markers natively, so feed it the file as-is.

The pulsar-relay image post-0.2.0 added a secrets guard plus a supervisord template that substitutes PULSAR_VALKEY_PASSWORD into the valkey-server command line. The resilience compose definition provided neither, so the relay container was crash-looping with a supervisord templating error and never came up — every relay-mode scenario then failed downstream with DNS / poll errors against the absent service. Set the two secrets the image actually needs (valkey password, JWT key) and turn on PULSAR_ALLOW_INSECURE_DEFAULTS=1, which is the documented CI / local-dev opt-out for the rest of the guard. The network is a private docker bridge in CI; this is the controlled context the flag was added for.

The relay's Settings model parses PULSAR_ALLOWED_ORIGINS / PULSAR_TRUSTED_HOSTS as ``list[str]`` via pydantic-settings, which expects a JSON literal in the env (the relay's own startup-error documents ``'["https://..."]'`` as the canonical form). The bare ``*`` value blew up with ``SettingsError: error parsing value for field "allowed_origins"`` and the relay never came up.

Before: ``wait_until_consuming`` returned for relay mode as soon as pulsar logged ``bind_manager_to_relay called``. After the migration to pulsar-relay-client (which now performs an OIDC device-flow auth before the first long-poll), the bind log lands ~70 ms before pulsar's first poll-waiter actually exists on the relay. The relay only delivers messages published *after* a waiter is created, so a setup that beats pulsar's first poll by a few ms is permanently lost — and B3, the only relay-mode test with a ``time.sleep(0.6)`` between publish and proxy-disable, started losing the race deterministically: mock-galaxy publishes at T+~70 ms, pulsar registers at T+~73 ms, message gone, job never starts, test times out at 120 s. Wait for ``Acquired pulsar-relay access token`` in addition to ``bind_manager_to``: the token-acquired log is emitted by pulsar_relay_client.auth immediately before the first long_poll() call hits the wire, so the guard collapses the race window from ~70 ms to ~3 ms — well under the harness's 100 ms poll cadence.

Replace the previous log-grep guard (``Acquired pulsar-relay access token``) with a query against the relay's ``/messages/poll/stats`` endpoint until pulsar's poll-waiter is registered for the ``job_setup`` topic. The log-grep narrowed the publish-before-first-poll window from ~70 ms to ~3 ms, but it was still racy: ``token acquired`` fires in pulsar a few ms before the first ``POST /messages/poll`` lands on the relay and creates the waiter. Any future change to pulsar's startup sequence (more handshakes, retries, slower auth) would re-open the gap. The stats endpoint reports the relay's *server-side* waiter map, so ``waiter exists for */job_setup`` is the actual condition we need to assert before the test resumes — there is no remaining timing gap between the assertion and the next ``_publish_setup`` from the test. Future startup-latency changes are now harmless. Compose: relay's port 8080 is host-mapped on 8081 so the harness can reach the stats endpoint while a fault scenario has toxiproxy disabled. Stats and login go directly, like RABBITMQ_MGMT.

jmchilton · 2026-05-13T14:17:20Z

Looking really good architecturally so far - thank you so much!

mvdbeek · 2026-05-13T14:56:11Z

Happy to merge ? I have some followup work to enable pushing the app config back to relay, but it's kind of a different thing we could do separately ?

mvdbeek added 4 commits May 11, 2026 11:38

mvdbeek force-pushed the pulsar-config-login branch from b6461c4 to 9b6c9f4 Compare May 13, 2026 09:10

mvdbeek mentioned this pull request May 13, 2026

client 0.2.1: PULSAR_RELAY_ALLOW_INSECURE bypass + dead-code fix galaxyproject/pulsar-relay#8

Merged

3 tasks

mvdbeek added 12 commits May 13, 2026 15:02

Loosen pulsar-relay-client pin to >=0.2

65de919

The migration commit pinned `>=1.0,<2`, but the package's published PyPI release line is 0.x (latest 0.2.0). The previous pin failed to resolve against PyPI; this matches what is actually shipped.

mvdbeek force-pushed the pulsar-config-login branch from fbc9354 to 4481bdd Compare May 13, 2026 13:09

mvdbeek mentioned this pull request May 13, 2026

Bring you own compute with pulsar + pulsar-relay galaxyproject/galaxy#22685

Open

jmchilton merged commit 28a76f2 into galaxyproject:master May 13, 2026
21 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bootstrap pulsar-relay credentials via OIDC device flow#454

Bootstrap pulsar-relay credentials via OIDC device flow#454
jmchilton merged 16 commits into
galaxyproject:masterfrom
mvdbeek:pulsar-config-login

mvdbeek commented May 11, 2026

Uh oh!

jmchilton commented May 13, 2026

Uh oh!

mvdbeek commented May 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mvdbeek commented May 11, 2026

Summary

CLI surface

Why

Backwards compatibility

Uh oh!

jmchilton commented May 13, 2026

Uh oh!

mvdbeek commented May 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants