Skip to content

Bootstrap pulsar-relay credentials via OIDC device flow#454

Merged
jmchilton merged 16 commits into
galaxyproject:masterfrom
mvdbeek:pulsar-config-login
May 13, 2026
Merged

Bootstrap pulsar-relay credentials via OIDC device flow#454
jmchilton merged 16 commits into
galaxyproject:masterfrom
mvdbeek:pulsar-config-login

Conversation

@mvdbeek

@mvdbeek mvdbeek commented May 11, 2026

Copy link
Copy Markdown
Member

Summary

Adds pulsar-config --login --relay-url URL, a one-command happy path for configuring a fresh Pulsar instance against a pulsar-relay server. The CLI runs the RFC 8628 device-authorization grant in the operator's browser, scaffolds a relay-flavoured config directory, and writes a relay_credentials.json (mode 0600) holding a rotating refresh token. The daemon then exchanges the refresh token for short-lived access JWTs automatically — no shared password to manage and no manual app.yml edits.

The branch is 3 self-contained commits:

  1. 438713e Bootstrap pulsar-relay credentials via OIDC device flow — adds the device-flow client (relay_device_flow.py), the credentials-file abstraction (relay_credentials.py), and refactors RelayAuthManager into pluggable Authenticator strategies (PasswordAuthenticator, RefreshTokenAuthenticator, DeviceFlowAuthenticator). The legacy RelayAuthManager(url, username, password) signature is preserved.
  2. ded6cfe Let pulsar-config --login compose with --mq — when both flags are passed, the scaffolded app.yml is written with message_queue_url + message_queue_credentials_file pointing at the file --login is about to produce, so the directory is ready to start Pulsar against without further edits.
  3. eae46dc Make pulsar-config --login the do-everything happy path--login now implies --mq, and refuses to overwrite an existing app.yml unless --force is set. A new --login-only flag is added for operators with hand-rolled app.yml who only want to bootstrap the credentials file.

CLI surface

# Greenfield: scaffold app.yml + local_env.sh + dependencies/ + sign in
pulsar-config --login --relay-url https://relay.example.org:9000 --directory .

# Operator already has app.yml — just (re)bootstrap the credentials file
pulsar-config --login-only --relay-url https://relay.example.org:9000 --directory .

# Refresh against an existing scaffolded directory
pulsar-config --login --relay-url ... --force --directory .

Why

Previously the only way to authenticate to pulsar-relay was a pre-shared username/password baked into app.yml, which is awkward to provision and to rotate. The device flow leans on the relay's existing OIDC integration: an operator signs in interactively once, and the daemon receives a rotating refresh token it can use indefinitely without further human intervention.

Backwards compatibility

  • RelayAuthManager(url, username, password) keeps working for existing deployments.
  • pulsar-config --mq without --relay-url still produces the AMQP-flavoured app.yml it always did; only --mq --relay-url switches to the relay template.
  • app.yml.sample is updated to document the new message_queue_credentials_file key alongside the legacy message_queue_username / message_queue_password, with a comment recommending the credentials-file path for new deployments.

mvdbeek added 4 commits May 11, 2026 11:38
Adds `pulsar-config --login --relay-url URL`, which runs the RFC 8628
device-authorization-grant against the relay (printing a verification
URL + user code), waits for an operator to complete an OIDC sign-in in
any browser, and writes a `relay_credentials.json` (mode 0600) holding
a rotating refresh token. The pulsar daemon then picks up the file via
`message_queue_credentials_file: ./relay_credentials.json` and rotates
the token automatically — no shared password to manage.

RelayAuthManager is refactored into pluggable Authenticator strategies
(Password / RefreshToken / DeviceFlow). The legacy
`RelayAuthManager(url, username, password)` signature is preserved so
existing deployments keep working without changes; new deployments
should prefer the credentials file path.
Previously --login early-returned and just wrote relay_credentials.json,
leaving operators to hand-edit a fresh app.yml. Combine the two: when
--mq and --relay-url are both passed, the scaffolded app.yml gets
message_queue_url + message_queue_credentials_file pointing at the
file --login is about to write. After the device flow completes the
config directory is ready to start pulsar against without further
edits.

Standalone --login (no --mq) still works for the case where app.yml
already exists.
Previously --login required also passing --mq, which surprised
operators who reasonably assumed `--login --relay-url URL` was
self-contained. Two behavioral changes:

- --login now implies --mq. Running `pulsar-config --login --relay-url
  URL` in an empty directory scaffolds app.yml (with
  message_queue_url + message_queue_credentials_file), local_env.sh,
  dependencies/, then runs the device flow.

- --login refuses to clobber an existing app.yml unless --force is
  passed. The error points at the new --login-only flag, which only
  bootstraps relay_credentials.json and leaves any existing config
  untouched — for operators who hand-roll their own app.yml.

The legacy bare `pulsar-config --mq` (no --relay-url) still produces an
AMQP-flavoured config; only `--mq --relay-url` switches to the relay
template.
relay_auth_test.py imports the responses lib to stub the relay HTTP
endpoints, but the package was never declared as a dev dependency. The
inline comment claimed it was "already a dev dep"; it wasn't, so
``tox -e mypy`` failed with an import-not-found error on a clean venv.
mvdbeek added 12 commits May 13, 2026 15:02
Introduces the pulsar-side pieces for Bring-Your-Own-Compute registration
with a Galaxy server:

* ``pulsar-config --register-with-galaxy <url> --galaxy-token <token>``
  drives the relay's RFC 8628 device flow with ``pair=true``, decodes the
  ``sub`` from the access token as the BYOC manager_name, posts the
  *secondary* refresh token to Galaxy's bootstrap endpoint, and writes a
  local ``relay_credentials.json`` + ``app.yml`` so the Pulsar daemon can
  immediately connect to the same topics Galaxy publishes to.

* ``pulsar.client.galaxy_byoc.register_with_galaxy()`` is the
  orchestrator (callable from tests and the CLI).

* ``RelayClientManager`` accepts ``relay_refresh_token`` +
  ``on_refresh_token_rotated`` for in-memory, callback-persisted
  credentials (Galaxy's multi-tenant runner holds tokens in its vault
  rather than on disk).

* ``_per_handler_cursor_path`` learns about ``manager_name`` so one
  Galaxy handler driving many BYOC managers gets a distinct cursor file
  per tenant.

* ``RelayAuthManager`` and ``RelayDeviceFlowAuthenticator`` gain
  ``credentials_store=`` / ``pair=`` knobs to support the in-memory and
  pair-issuance paths.
The pulsar-relay HTTP wire contract (auth, credentials, device flow,
transport, topic ownership) has been extracted into a standalone
``pulsar-relay-client`` distribution shipped from the pulsar-relay
repo. This commit retires the in-tree copies and depends on the
package instead.

* Adds ``pulsar-relay-client>=1.0,<2`` to ``requirements.txt``.
* Replaces ``pulsar.client.relay_auth`` / ``relay_credentials`` /
  ``relay_device_flow`` / ``transport.relay`` with imports from
  ``pulsar_relay_client``. The modules and their tests are deleted
  outright — no re-export shim — so external consumers update at the
  same time.
* ``RelayClientManager`` and ``pulsar-config --register-with-galaxy``
  consume the same APIs through the package; the daemon-specific
  bits (``RelayClientManager`` subscriber lifecycle,
  ``_per_handler_cursor_path``, the BYOC CLI orchestrator) stay here.

See ``docs/extracting-client.md`` in the pulsar-relay repo for the
full extraction plan and rationale.
The migration commit pinned `>=1.0,<2`, but the package's published
PyPI release line is 0.x (latest 0.2.0). The previous pin failed to
resolve against PyPI; this matches what is actually shipped.
* `_validate_relay_args` extracts the `--login` / `--login-only` /
  `--register-with-galaxy` validation out of `main()` so the function
  drops back under the McCabe complexity ceiling (17 → below 14).

* `**kwds: Dict[str, Any]` was always wrong (it claims each value is
  itself a dict). The latent error stayed quiet until the migration
  routed `self.manager_name` through the relay-client's strictly-typed
  `_per_handler_cursor_path`, which then surfaced the bogus
  `dict[str, Any] | str`. Use `**kwds: Any` everywhere in
  `pulsar/client/manager.py`.
The published pulsar-relay-client wheel is gated to ``Requires-Python
>=3.10``, so a flat top-level dependency broke install on Pulsar's
3.7 test matrix entries (test-ci, test-unit, lint, docs).

* requirements.txt: mark pulsar-relay-client conditional on
  ``python_version >= "3.10"`` so older interpreters stop trying to
  resolve it.
* Move the ``from pulsar_relay_client import ...`` lines in
  ``messaging/bind_relay.py``, ``client/galaxy_byoc.py``, and
  ``client/manager.py`` into the function/method that actually uses
  the symbols, so importing those modules no longer pulls the
  unavailable dependency. The relay code paths themselves remain
  unreachable on <3.10, which matches the dependency's reality.
The three end-to-end ``register_with_galaxy`` tests instantiate the
function under test, which lazy-imports ``pulsar_relay_client``.
That dependency is now ``python_version >= "3.10"``, so on the 3.7
matrix entries the import blows up at test time. Skip them when
``find_spec("pulsar_relay_client")`` returns None — the pure-Python
``_decode_jwt_sub`` tests in the same module still run everywhere.
0.2.1 (PR galaxyproject/pulsar-relay#8) adds PULSAR_RELAY_ALLOW_INSECURE=1
to disable the plaintext-to-non-localhost rejection in the URL guard.
The resilience compose stack routes pulsar/mock-galaxy through
http://toxiproxy:9000 (and http://relay:8080) so faults can be
injected without TLS termination — exactly the controlled context
the bypass is for. Inject the env var on both services and bump the
requirement so the bypass is actually present at install time.
The ``;`` filter in ``setup.py`` is vestigial: it dates from a
Python-2.7-fallback split (``py27_requirements`` was the other half)
that was deleted in ac9bfc7. The filter stayed and silently dropped
every PEP 508 marker-bearing requirement from ``install_requires`` —
which is why ``pulsar-relay-client>=0.2.1; python_version >= "3.10"``
was missing from the resilience-suite Pulsar image and the container
crashed at startup with ``ModuleNotFoundError``. setuptools handles
markers natively, so feed it the file as-is.
The pulsar-relay image post-0.2.0 added a secrets guard plus a
supervisord template that substitutes PULSAR_VALKEY_PASSWORD into
the valkey-server command line. The resilience compose definition
provided neither, so the relay container was crash-looping with a
supervisord templating error and never came up — every relay-mode
scenario then failed downstream with DNS / poll errors against the
absent service.

Set the two secrets the image actually needs (valkey password, JWT
key) and turn on PULSAR_ALLOW_INSECURE_DEFAULTS=1, which is the
documented CI / local-dev opt-out for the rest of the guard. The
network is a private docker bridge in CI; this is the controlled
context the flag was added for.
The relay's Settings model parses PULSAR_ALLOWED_ORIGINS /
PULSAR_TRUSTED_HOSTS as ``list[str]`` via pydantic-settings, which
expects a JSON literal in the env (the relay's own startup-error
documents ``'["https://..."]'`` as the canonical form). The bare
``*`` value blew up with ``SettingsError: error parsing value for
field "allowed_origins"`` and the relay never came up.
Before: ``wait_until_consuming`` returned for relay mode as soon as
pulsar logged ``bind_manager_to_relay called``. After the migration
to pulsar-relay-client (which now performs an OIDC device-flow
auth before the first long-poll), the bind log lands ~70 ms before
pulsar's first poll-waiter actually exists on the relay.

The relay only delivers messages published *after* a waiter is
created, so a setup that beats pulsar's first poll by a few ms is
permanently lost — and B3, the only relay-mode test with a
``time.sleep(0.6)`` between publish and proxy-disable, started
losing the race deterministically: mock-galaxy publishes at
T+~70 ms, pulsar registers at T+~73 ms, message gone, job never
starts, test times out at 120 s.

Wait for ``Acquired pulsar-relay access token`` in addition to
``bind_manager_to``: the token-acquired log is emitted by
pulsar_relay_client.auth immediately before the first long_poll()
call hits the wire, so the guard collapses the race window from
~70 ms to ~3 ms — well under the harness's 100 ms poll cadence.
Replace the previous log-grep guard (``Acquired pulsar-relay access
token``) with a query against the relay's ``/messages/poll/stats``
endpoint until pulsar's poll-waiter is registered for the
``job_setup`` topic.

The log-grep narrowed the publish-before-first-poll window from
~70 ms to ~3 ms, but it was still racy: ``token acquired`` fires
in pulsar a few ms before the first ``POST /messages/poll`` lands
on the relay and creates the waiter. Any future change to pulsar's
startup sequence (more handshakes, retries, slower auth) would
re-open the gap.

The stats endpoint reports the relay's *server-side* waiter map, so
``waiter exists for */job_setup`` is the actual condition we need
to assert before the test resumes — there is no remaining timing
gap between the assertion and the next ``_publish_setup`` from the
test. Future startup-latency changes are now harmless.

Compose: relay's port 8080 is host-mapped on 8081 so the harness
can reach the stats endpoint while a fault scenario has toxiproxy
disabled. Stats and login go directly, like RABBITMQ_MGMT.
@jmchilton

Copy link
Copy Markdown
Member

Looking really good architecturally so far - thank you so much!

@mvdbeek

mvdbeek commented May 13, 2026

Copy link
Copy Markdown
Member Author

Happy to merge ? I have some followup work to enable pushing the app config back to relay, but it's kind of a different thing we could do separately ?

@jmchilton jmchilton merged commit 28a76f2 into galaxyproject:master May 13, 2026
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants