Add documentation for NEMO Pulsar relay mode #82

Open
dSizovs wants to merge 2 commits into usegalaxy-eu:main from dSizovs:main

dSizovs commented Jun 11, 2026

Adds operations documentation for the bwForCluster NEMO Pulsar endpoint,
which uses pulsar-relay (HTTP) instead of AMQP.

Comment thread nemo_pulsar_relay.md
pending messages in **Valkey** (Redis-compatible) so an in-flight job is not
lost across a relay restart.
- **Pulsar** runs on a NEMO login node. It long-polls the relay for new job
setup / status / kill messages, submits the actual work to **Slurm**, and

Member

After it sends it back to the relay, it also transfers data: job input data to Slurm, and results back to Galaxy.

Author

Good point, updated!

Comment thread nemo_pulsar_relay.md Outdated
| Component | Host | Notes |
|-----------|------|-------|
| Galaxy runner `pulsar_eu_nemo` | usegalaxy.eu | Defined in `infrastructure-playbook` `job_conf.yml`; creds from vault |
| TPV destination `pulsar_nemo_tpv` | usegalaxy.eu | Defined in `infrastructure-playbook` `tpv/destinations.yml.j2`; tag `nemo-pulsar` |

Member

would it make sense to use real links here?

Author

Linked to the infrastructure-playbook, at repo level for now. If it makes sense to dig out the exact job_conf.yml and tpv/destinations.yml.j2 paths, I'll link those directly.

Comment thread nemo_pulsar_relay.md Outdated
|-----------|------|-------|
| Galaxy runner `pulsar_eu_nemo` | usegalaxy.eu | Defined in `infrastructure-playbook` `job_conf.yml`; creds from vault |
| TPV destination `pulsar_nemo_tpv` | usegalaxy.eu | Defined in `infrastructure-playbook` `tpv/destinations.yml.j2`; tag `nemo-pulsar` |
| pulsar-relay | bw-cloud VM | systemd service `pulsar-relay`, listens on `:9000`, Valkey backend |

Member

Where is this deployment defined?

Author

Added a link; it is deployed by pulsar-relay-role (now under usegalaxy-eu, yay🎉).

Comment thread nemo_pulsar_relay.md Outdated
this happens today:

1. **User-level opt-in**, a user selects the NEMO compute resource in
*User → Preferences → Manage Information → Use distributed compute

Member

Also this can be a link.

Author

Done!

Comment thread nemo_pulsar_relay.md Outdated
1. **User-level opt-in**, a user selects the NEMO compute resource in
*User → Preferences → Manage Information → Use distributed compute
resources* ("Freiburg (Germany) - bwForCluster NEMO 2").
2. **Per-user TPV rule**, an entry in `tpv/users.yml` that attaches the

Member

There are multiple other ways, e.g. we could tag a specific tool to always go to NEMO.

Author

Added per-tool routing as a third option, though I've only actually used the user opt-in and the per-user rule myself.

Comment thread nemo_pulsar_relay.md Outdated

## Pulsar (NEMO login node)

NEMO does not provide user-level systemd, so Pulsar is kept alive by a small

Member

Where is this wrapper script, how complex is it and should we maybe use supervisord instead?

Author

It's in pulsar-nemo-login-role (templates/): a ~10-line while-true loop, since NEMO has no user-level systemd. Added a note proposing supervisord as a cleaner replacement.

Comment thread nemo_pulsar_relay.md Outdated
message_queue_url: http://<relay-host>:9000/
message_queue_username: admin
message_queue_password: <in vault>
staging_directory: /home/.../pulsar/jobs_directory

Member

Is this HOME dir configurable in the playbook?

Author

Yes, it's the pulsar_nemo_home role variable (default ~/pulsar), so it's configurable per deployment. Noted that in the updated doc.

Comment thread nemo_pulsar_relay.md

```bash
# is Pulsar running?
ps aux | grep pulsar-main | grep -v grep
```

Member

We could then use supervisorctl here instead.

Author

Agree, flagged supervisord/supervisorctl as a future improvement.

Comment thread nemo_pulsar_relay.md Outdated
**Job stuck in "queued"/"running" forever, but Slurm shows COMPLETED**
Pulsar is submitting and the job finishes, but completion is not propagating
back. Confirm the Slurm CLI status plugin maps the "job no longer in squeue"
case to `complete`. (This was a real bug when Galaxy is importable in the same

Member

Why would you ever have Galaxy and Pulsar in the same env? Galaxy is not installed on the login node, is it?

Author

That was my surprise too, so I verified it on NEMO. Galaxy is importable because the deployment uses pulsar-galaxy-lib (0.15.14), which bundles the Galaxy libs (galaxy-schema, galaxy-data, galaxy-tool-util, …). So in `slurm.py` the `try: from galaxy.model import Job` succeeds and `job_states` becomes Galaxy's enum: `inspect.getfile(job_states)` returns `galaxy/schema/schema.py`, where `OK.value == 'ok'`. The stateful manager compares against `status.COMPLETE == 'complete'` (`pulsar/managers/status.py`), so `'ok' != 'complete'` and the job never deactivates.
A clean pulsar-app install hits the ImportError fallback (`OK = 'complete'`) and never sees this, which is why org/AU are fine. The fix is at galaxyproject/pulsar#460.
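
To make the mismatch concrete, here is a minimal Python sketch. The names `GalaxyLikeJobStates`, `FallbackJobStates` and `job_is_complete` are hypothetical stand-ins, not Pulsar's or Galaxy's actual code; only the string values `'ok'` and `'complete'` are taken from the explanation above.

```python
from enum import Enum

# Stand-in for the value the stateful manager compares against
# (described above as status.COMPLETE == 'complete').
COMPLETE = "complete"


class GalaxyLikeJobStates(str, Enum):
    """Hypothetical stand-in for the Galaxy enum used when `galaxy` is importable."""
    OK = "ok"


class FallbackJobStates:
    """Hypothetical stand-in for the ImportError fallback of a clean install."""
    OK = "complete"


def job_is_complete(plugin_state: str) -> bool:
    """Mimic the comparison that decides whether a job can be deactivated."""
    return plugin_state == COMPLETE


# With the Galaxy libs bundled, the plugin reports 'ok', which never matches.
galaxy_env_result = job_is_complete(GalaxyLikeJobStates.OK.value)
# With the fallback, the plugin reports 'complete' and the job is deactivated.
clean_env_result = job_is_complete(FallbackJobStates.OK)

print(galaxy_env_result, clean_env_result)  # False True
```

This only reproduces the symptom; it does not show what galaxyproject/pulsar#460 actually changes.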

Comment thread nemo_pulsar_relay.md Outdated

**`No such transport: http`**
The installed Pulsar version is routing the relay URL through the AMQP/kombu
path. Use a Pulsar build with relay support that is compatible with the NEMO

Member

Which version is that? E.g. "make sure you are running Pulsar >= x.x".

Author

Known good is 0.15.15.dev0 on Python 3.9, noted in the troubleshooting section and pinned in the login role.
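
If the doc ends up stating a minimum version, a guard along these lines could accompany it. This is a rough sketch that assumes the known-good 0.15.15 release line can be treated as a lower bound, which this thread does not confirm. The helper names are hypothetical, and the comparison deliberately ignores `.dev` and other suffixes.

```python
import re

# Assumption: the known-good 0.15.15.dev0 build is treated as a lower bound.
MINIMUM_RELEASE = (0, 15, 15)


def release_tuple(version: str) -> tuple:
    """Return the leading numeric release segment, e.g. '0.15.15.dev0' -> (0, 15, 15)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version: str, minimum: tuple = MINIMUM_RELEASE) -> bool:
    """True if the release segment is at least `minimum`; suffixes like .dev0 are ignored."""
    return release_tuple(version) >= minimum


for candidate in ("0.15.14", "0.15.15.dev0"):
    print(candidate, meets_minimum(candidate))
```

On the login node the version string could come from `importlib.metadata.version(...)`; which distribution name to query (pulsar-app or pulsar-galaxy-lib) depends on how Pulsar was installed there.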
