Bootstrap pulsar-relay credentials via OIDC device flow#454
Merged
Conversation
Adds `pulsar-config --login --relay-url URL`, which runs the RFC 8628 device-authorization-grant against the relay (printing a verification URL + user code), waits for an operator to complete an OIDC sign-in in any browser, and writes a `relay_credentials.json` (mode 0600) holding a rotating refresh token. The pulsar daemon then picks up the file via `message_queue_credentials_file: ./relay_credentials.json` and rotates the token automatically — no shared password to manage. RelayAuthManager is refactored into pluggable Authenticator strategies (Password / RefreshToken / DeviceFlow). The legacy `RelayAuthManager(url, username, password)` signature is preserved so existing deployments keep working without changes; new deployments should prefer the credentials file path.
Previously --login early-returned and just wrote relay_credentials.json, leaving operators to hand-edit a fresh app.yml. Combine the two: when --mq and --relay-url are both passed, the scaffolded app.yml gets message_queue_url + message_queue_credentials_file pointing at the file --login is about to write. After the device flow completes the config directory is ready to start pulsar against without further edits. Standalone --login (no --mq) still works for the case where app.yml already exists.
Previously --login required also passing --mq, which surprised operators who reasonably assumed `--login --relay-url URL` was self-contained. Two behavioral changes: - --login now implies --mq. Running `pulsar-config --login --relay-url URL` in an empty directory scaffolds app.yml (with message_queue_url + message_queue_credentials_file), local_env.sh, dependencies/, then runs the device flow. - --login refuses to clobber an existing app.yml unless --force is passed. The error points at the new --login-only flag, which only bootstraps relay_credentials.json and leaves any existing config untouched — for operators who hand-roll their own app.yml. The legacy bare `pulsar-config --mq` (no --relay-url) still produces an AMQP-flavoured config; only `--mq --relay-url` switches to the relay template.
relay_auth_test.py imports the responses lib to stub the relay HTTP endpoints, but the package was never declared as a dev dependency. The inline comment claimed it was "already a dev dep"; it wasn't, so ``tox -e mypy`` failed with an import-not-found error on a clean venv.
b6461c4 to
9b6c9f4
Compare
Merged
3 tasks
Introduces the pulsar-side pieces for Bring-Your-Own-Compute registration with a Galaxy server: * ``pulsar-config --register-with-galaxy <url> --galaxy-token <token>`` drives the relay's RFC 8628 device flow with ``pair=true``, decodes the ``sub`` from the access token as the BYOC manager_name, posts the *secondary* refresh token to Galaxy's bootstrap endpoint, and writes a local ``relay_credentials.json`` + ``app.yml`` so the Pulsar daemon can immediately connect to the same topics Galaxy publishes to. * ``pulsar.client.galaxy_byoc.register_with_galaxy()`` is the orchestrator (callable from tests and the CLI). * ``RelayClientManager`` accepts ``relay_refresh_token`` + ``on_refresh_token_rotated`` for in-memory, callback-persisted credentials (Galaxy's multi-tenant runner holds tokens in its vault rather than on disk). * ``_per_handler_cursor_path`` learns about ``manager_name`` so one Galaxy handler driving many BYOC managers gets a distinct cursor file per tenant. * ``RelayAuthManager`` and ``RelayDeviceFlowAuthenticator`` gain ``credentials_store=`` / ``pair=`` knobs to support the in-memory and pair-issuance paths.
The pulsar-relay HTTP wire contract (auth, credentials, device flow, transport, topic ownership) has been extracted into a standalone ``pulsar-relay-client`` distribution shipped from the pulsar-relay repo. This commit retires the in-tree copies and depends on the package instead. * Adds ``pulsar-relay-client>=1.0,<2`` to ``requirements.txt``. * Replaces ``pulsar.client.relay_auth`` / ``relay_credentials`` / ``relay_device_flow`` / ``transport.relay`` with imports from ``pulsar_relay_client``. The modules and their tests are deleted outright — no re-export shim — so external consumers update at the same time. * ``RelayClientManager`` and ``pulsar-config --register-with-galaxy`` consume the same APIs through the package; the daemon-specific bits (``RelayClientManager`` subscriber lifecycle, ``_per_handler_cursor_path``, the BYOC CLI orchestrator) stay here. See ``docs/extracting-client.md`` in the pulsar-relay repo for the full extraction plan and rationale.
The migration commit pinned `>=1.0,<2`, but the package's published PyPI release line is 0.x (latest 0.2.0). The previous pin failed to resolve against PyPI; this matches what is actually shipped.
* `_validate_relay_args` extracts the `--login` / `--login-only` / `--register-with-galaxy` validation out of `main()` so the function drops back under the McCabe complexity ceiling (17 → below 14). * `**kwds: Dict[str, Any]` was always wrong (it claims each value is itself a dict). The latent error stayed quiet until the migration routed `self.manager_name` through the relay-client's strictly-typed `_per_handler_cursor_path`, which then surfaced the bogus `dict[str, Any] | str`. Use `**kwds: Any` everywhere in `pulsar/client/manager.py`.
The published pulsar-relay-client wheel is gated to ``Requires-Python >=3.10``, so a flat top-level dependency broke install on Pulsar's 3.7 test matrix entries (test-ci, test-unit, lint, docs). * requirements.txt: mark pulsar-relay-client conditional on ``python_version >= "3.10"`` so older interpreters stop trying to resolve it. * Move the ``from pulsar_relay_client import ...`` lines in ``messaging/bind_relay.py``, ``client/galaxy_byoc.py``, and ``client/manager.py`` into the function/method that actually uses the symbols, so importing those modules no longer pulls the unavailable dependency. The relay code paths themselves remain unreachable on <3.10, which matches the dependency's reality.
The three end-to-end ``register_with_galaxy`` tests instantiate the
function under test, which lazy-imports ``pulsar_relay_client``.
That dependency is now ``python_version >= "3.10"``, so on the 3.7
matrix entries the import blows up at test time. Skip them when
``find_spec("pulsar_relay_client")`` returns None — the pure-Python
``_decode_jwt_sub`` tests in the same module still run everywhere.
0.2.1 (PR galaxyproject/pulsar-relay#8) adds PULSAR_RELAY_ALLOW_INSECURE=1 to disable the plaintext-to-non-localhost rejection in the URL guard. The resilience compose stack routes pulsar/mock-galaxy through http://toxiproxy:9000 (and http://relay:8080) so faults can be injected without TLS termination — exactly the controlled context the bypass is for. Inject the env var on both services and bump the requirement so the bypass is actually present at install time.
The ``;`` filter in ``setup.py`` is vestigial: it dates from a Python-2.7-fallback split (``py27_requirements`` was the other half) that was deleted in ac9bfc7. The filter stayed and silently dropped every PEP 508 marker-bearing requirement from ``install_requires`` — which is why ``pulsar-relay-client>=0.2.1; python_version >= "3.10"`` was missing from the resilience-suite Pulsar image and the container crashed at startup with ``ModuleNotFoundError``. setuptools handles markers natively, so feed it the file as-is.
The pulsar-relay image post-0.2.0 added a secrets guard plus a supervisord template that substitutes PULSAR_VALKEY_PASSWORD into the valkey-server command line. The resilience compose definition provided neither, so the relay container was crash-looping with a supervisord templating error and never came up — every relay-mode scenario then failed downstream with DNS / poll errors against the absent service. Set the two secrets the image actually needs (valkey password, JWT key) and turn on PULSAR_ALLOW_INSECURE_DEFAULTS=1, which is the documented CI / local-dev opt-out for the rest of the guard. The network is a private docker bridge in CI; this is the controlled context the flag was added for.
The relay's Settings model parses PULSAR_ALLOWED_ORIGINS / PULSAR_TRUSTED_HOSTS as ``list[str]`` via pydantic-settings, which expects a JSON literal in the env (the relay's own startup-error documents ``'["https://..."]'`` as the canonical form). The bare ``*`` value blew up with ``SettingsError: error parsing value for field "allowed_origins"`` and the relay never came up.
Before: ``wait_until_consuming`` returned for relay mode as soon as pulsar logged ``bind_manager_to_relay called``. After the migration to pulsar-relay-client (which now performs an OIDC device-flow auth before the first long-poll), the bind log lands ~70 ms before pulsar's first poll-waiter actually exists on the relay. The relay only delivers messages published *after* a waiter is created, so a setup that beats pulsar's first poll by a few ms is permanently lost — and B3, the only relay-mode test with a ``time.sleep(0.6)`` between publish and proxy-disable, started losing the race deterministically: mock-galaxy publishes at T+~70 ms, pulsar registers at T+~73 ms, message gone, job never starts, test times out at 120 s. Wait for ``Acquired pulsar-relay access token`` in addition to ``bind_manager_to``: the token-acquired log is emitted by pulsar_relay_client.auth immediately before the first long_poll() call hits the wire, so the guard collapses the race window from ~70 ms to ~3 ms — well under the harness's 100 ms poll cadence.
Replace the previous log-grep guard (``Acquired pulsar-relay access token``) with a query against the relay's ``/messages/poll/stats`` endpoint until pulsar's poll-waiter is registered for the ``job_setup`` topic. The log-grep narrowed the publish-before-first-poll window from ~70 ms to ~3 ms, but it was still racy: ``token acquired`` fires in pulsar a few ms before the first ``POST /messages/poll`` lands on the relay and creates the waiter. Any future change to pulsar's startup sequence (more handshakes, retries, slower auth) would re-open the gap. The stats endpoint reports the relay's *server-side* waiter map, so ``waiter exists for */job_setup`` is the actual condition we need to assert before the test resumes — there is no remaining timing gap between the assertion and the next ``_publish_setup`` from the test. Future startup-latency changes are now harmless. Compose: relay's port 8080 is host-mapped on 8081 so the harness can reach the stats endpoint while a fault scenario has toxiproxy disabled. Stats and login go directly, like RABBITMQ_MGMT.
fbc9354 to
4481bdd
Compare
Member
|
Looking really good architecturally so far - thank you so much! |
Member
Author
|
Happy to merge ? I have some followup work to enable pushing the app config back to relay, but it's kind of a different thing we could do separately ? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds
pulsar-config --login --relay-url URL, a one-command happy path for configuring a fresh Pulsar instance against apulsar-relayserver. The CLI runs the RFC 8628 device-authorization grant in the operator's browser, scaffolds a relay-flavoured config directory, and writes arelay_credentials.json(mode 0600) holding a rotating refresh token. The daemon then exchanges the refresh token for short-lived access JWTs automatically — no shared password to manage and no manualapp.ymledits.The branch is 3 self-contained commits:
438713eBootstrap pulsar-relay credentials via OIDC device flow — adds the device-flow client (relay_device_flow.py), the credentials-file abstraction (relay_credentials.py), and refactorsRelayAuthManagerinto pluggableAuthenticatorstrategies (PasswordAuthenticator,RefreshTokenAuthenticator,DeviceFlowAuthenticator). The legacyRelayAuthManager(url, username, password)signature is preserved.ded6cfeLet pulsar-config --login compose with --mq — when both flags are passed, the scaffoldedapp.ymlis written withmessage_queue_url+message_queue_credentials_filepointing at the file--loginis about to produce, so the directory is ready to start Pulsar against without further edits.eae46dcMake pulsar-config --login the do-everything happy path —--loginnow implies--mq, and refuses to overwrite an existingapp.ymlunless--forceis set. A new--login-onlyflag is added for operators with hand-rolledapp.ymlwho only want to bootstrap the credentials file.CLI surface
Why
Previously the only way to authenticate to
pulsar-relaywas a pre-shared username/password baked intoapp.yml, which is awkward to provision and to rotate. The device flow leans on the relay's existing OIDC integration: an operator signs in interactively once, and the daemon receives a rotating refresh token it can use indefinitely without further human intervention.Backwards compatibility
RelayAuthManager(url, username, password)keeps working for existing deployments.pulsar-config --mqwithout--relay-urlstill produces the AMQP-flavouredapp.ymlit always did; only--mq --relay-urlswitches to the relay template.app.yml.sampleis updated to document the newmessage_queue_credentials_filekey alongside the legacymessage_queue_username/message_queue_password, with a comment recommending the credentials-file path for new deployments.