Add documentation for NEMO Pulsar relay mode #82
pending messages in **Valkey** (Redis-compatible) so an in-flight job is not
lost across a relay restart.
- **Pulsar** runs on a NEMO login node. It long-polls the relay for new job
  setup / status / kill messages, submits the actual work to **Slurm**, and
After sending it back to the relay, Pulsar also transfers data: job input data to Slurm, and results back to Galaxy.
| Component | Host | Notes |
|-----------|------|-------|
| Galaxy runner `pulsar_eu_nemo` | usegalaxy.eu | Defined in `infrastructure-playbook` `job_conf.yml`; creds from vault |
| TPV destination `pulsar_nemo_tpv` | usegalaxy.eu | Defined in `infrastructure-playbook` `tpv/destinations.yml.j2`; tag `nemo-pulsar` |
Would it make sense to use real links here?
Linked to the infrastructure-playbook, at repo level for now. If it makes sense to dig up the exact `job_conf.yml` and `tpv/destinations.yml.j2` paths, I'll link those directly.
| pulsar-relay | bw-cloud VM | systemd service `pulsar-relay`, listens on `:9000`, Valkey backend |
Where is this deployment defined?
Added a link. It is deployed by pulsar-relay-role (now under usegalaxy-eu, yay 🎉).
this happens today:

1. **User-level opt-in**, a user selects the NEMO compute resource in
   *User → Preferences → Manage Information → Use distributed compute
   resources* ("Freiburg (Germany) - bwForCluster NEMO 2").
2. **Per-user TPV rule**, an entry in `tpv/users.yml` that attaches the
There are multiple other ways, e.g. we could tag a specific tool to always go to NEMO.
Added per-tool routing as a third option, though I've only actually used the user opt-in and per-user rule myself.
## Pulsar (NEMO login node)

NEMO does not provide user-level systemd, so Pulsar is kept alive by a small
Where is this wrapper script, how complex is it and should we maybe use supervisord instead?
It's in pulsar-nemo-login-role (`templates/`). It is a ~10 line `while true` loop, needed because NEMO has no user-level systemd. Added a note proposing supervisord as a cleaner replacement.
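For readers who have not seen the wrapper, the keep-alive idea can be sketched roughly as below. This is not the script from the role (that one is a shell loop); it is a minimal Python illustration under stated assumptions. The function name `keep_alive`, its parameters, and the restart delay are all hypothetical and chosen only for the example.

```python
import subprocess
import sys
import time


def keep_alive(cmd, max_restarts=None, delay=5.0):
    """Run cmd and restart it whenever it exits.

    Hypothetical helper for illustration only. max_restarts=None loops
    forever (like `while true`); an integer bounds the number of restarts
    so the loop can be exercised in a test.
    Returns the list of exit codes observed.
    """
    exit_codes = []
    restarts = 0
    while True:
        # Block until the child process exits, then record its exit code.
        exit_codes.append(subprocess.call(cmd))
        if max_restarts is not None and restarts >= max_restarts:
            return exit_codes
        restarts += 1
        time.sleep(delay)  # brief pause so a crash loop does not spin


if __name__ == "__main__":
    # Stand-in command that exits immediately with status 3; a real
    # deployment would pass the Pulsar start command instead.
    demo = [sys.executable, "-c", "raise SystemExit(3)"]
    print(keep_alive(demo, max_restarts=1, delay=0.0))  # prints [3, 3]
```

A process supervisor such as supervisord would replace a loop like this entirely, which is the direction the note in the doc proposes.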
message_queue_url: http://<relay-host>:9000/
message_queue_username: admin
message_queue_password: <in vault>
staging_directory: /home/.../pulsar/jobs_directory
Is this HOME dir configurable in the playbook?
Yes, it's the pulsar_nemo_home role variable (default ~/pulsar), so it's configurable per deployment. Noted that in the updated doc.
```bash
# is Pulsar running?
ps aux | grep pulsar-main | grep -v grep
```
We could then use `supervisorctl` here instead.
Agree, flagged supervisord/supervisorctl as a future improvement.
**Job stuck in "queued"/"running" forever, but Slurm shows COMPLETED**
Pulsar is submitting and the job finishes, but completion is not propagating
back. Confirm the Slurm CLI status plugin maps the "job no longer in squeue"
case to `complete`. (This was a real bug when Galaxy is importable in the same
Why would you ever have Galaxy and Pulsar in the same env? Galaxy is not installed on the login node, is it?
That surprised me too, so I verified it on NEMO. Galaxy is importable because the deployment uses `pulsar-galaxy-lib` (0.15.14), which bundles the Galaxy libs (`galaxy-schema`, `galaxy-data`, `galaxy-tool-util`, …). So in `slurm.py` the `try: from galaxy.model import Job` succeeds and `job_states` becomes Galaxy's enum: `inspect.getfile(job_states)` returns `galaxy/schema/schema.py`, where `OK.value == 'ok'`. The stateful manager compares against `status.COMPLETE == 'complete'` (`pulsar/managers/status.py`), so `'ok' != 'complete'` and the job never deactivates.
A clean `pulsar-app` install hits the `ImportError` fallback (`OK = 'complete'`) and never sees this, which is why org/AU are fine. The fix is at galaxyproject/pulsar#460.
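The mismatch is easier to see in isolation. The sketch below does not import Pulsar or Galaxy. `GalaxyLikeJobStates`, `FallbackJobStates`, `reported_status`, and `is_deactivated` are hypothetical stand-ins that only mirror the two values quoted above (`'ok'` versus `'complete'`), not the real classes or the real plugin logic.

```python
from enum import Enum

# Stand-in for the constant the stateful manager compares against
# (cited above as living in pulsar/managers/status.py).
COMPLETE = "complete"


class GalaxyLikeJobStates(Enum):
    """Mimics the case where Galaxy is importable: OK has the value 'ok'."""
    OK = "ok"


class FallbackJobStates:
    """Mimics the ImportError fallback: OK is the plain string 'complete'."""
    OK = "complete"


def reported_status(job_states):
    """Return the status string a plugin would report for a finished job."""
    ok = job_states.OK
    # Enum members expose .value; plain strings are used as-is.
    return ok.value if isinstance(ok, Enum) else ok


def is_deactivated(job_states):
    """True when the reported status matches what the manager expects."""
    return reported_status(job_states) == COMPLETE


if __name__ == "__main__":
    print(is_deactivated(FallbackJobStates))    # prints True
    print(is_deactivated(GalaxyLikeJobStates))  # prints False
```

With the fallback states the comparison succeeds and the job is deactivated; with the Galaxy-like enum it never does, which matches the "stuck forever" symptom.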
**`No such transport: http`**
The installed Pulsar version is routing the relay URL through the AMQP/kombu
path. Use a Pulsar build with relay support that is compatible with the NEMO
Which version is that? Add something like: make sure you are running Pulsar >= x.x.
Known good is 0.15.15.dev0 on Python 3.9, noted in the troubleshooting section and pinned in the login role.
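A quick way to see which Pulsar distribution and version an environment actually has is sketched below. It uses only the standard library. The helper name `installed_versions` is hypothetical, and the two distribution names are the ones mentioned in this thread (`pulsar-app` and `pulsar-galaxy-lib`). Whether a given version includes relay support still has to be checked against the known-good version noted above.

```python
from importlib import metadata


def installed_versions(names=("pulsar-app", "pulsar-galaxy-lib")):
    """Return {distribution name: version} for the names that are installed.

    Hypothetical helper; distributions that are not installed are skipped.
    """
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in this environment; leave it out of the result.
            continue
    return found


if __name__ == "__main__":
    # Run inside the Pulsar virtualenv on the login node.
    print(installed_versions())
```

Seeing `pulsar-galaxy-lib` in the output also indicates that the bundled Galaxy libs are present in that environment.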
Adds operations documentation for the bwForCluster NEMO Pulsar endpoint,
which uses pulsar-relay (HTTP) instead of AMQP.