Hailstack is a command-line tool for building OpenStack images and provisioning Spark, Hadoop, Hail, and JupyterLab clusters on OpenStack. It validates a single TOML configuration file, uses compatibility bundles from `bundles.toml` to keep Hail and infrastructure versions aligned, stores Pulumi state and install rollout manifests in Ceph S3, and exposes the cluster services through an nginx-protected master node.
You need the following before you start:
- An OpenStack project with quota for one master VM, at least one worker VM, networking, floating IPs, and optional Cinder volumes.
- OpenStack credentials, either as a `~/.config/openstack/clouds.yaml` file or as exported `OS_*` environment variables (typically via an `openrc.sh` file from your OpenStack dashboard).
- A Ceph S3 bucket reserved for Pulumi state and install rollout manifests, with an access key and secret key for that bucket.
- One or more SSH public keys to inject into cluster nodes.
- A Linux host with internet access where you will run the CLI. The CLI shells out to `pulumi`, `packer`, `ansible-playbook`, and `openstack`, so those tools must be available on `PATH` (the Apptainer image bundles them; otherwise install them yourself, see below).
You will also need values for these secrets, which the example config reads from the environment:
| Variable | Required when | Purpose |
|---|---|---|
| `HAILSTACK_WEB_PASSWORD` | always (for `create`) | Jupyter password and nginx basic-auth password on the master. |
| `HAILSTACK_VOLUME_PASSWORD` | only when `volumes.create = true` or `volumes.existing_volume_id` is set; omit if you have no volume | LUKS passphrase for the data volume. |
| `CEPH_S3_ACCESS_KEY` / `CEPH_S3_SECRET_KEY` | always | Credentials used by the CLI on your workstation to read/write the Pulumi state bucket (`ceph_s3.bucket`) and install rollout manifests. |
| `S3A_ACCESS_KEY` / `S3A_SECRET_KEY` | optional; only when Spark/Hadoop jobs need to read or write `s3a://...` data | Credentials baked into Hadoop's `core-site.xml` on the cluster nodes so jobs can access your data buckets. |
The two S3 credential pairs serve different purposes and usually point at different buckets:
- `CEPH_S3_*` is for the state bucket used by Hailstack itself (Pulumi backend + install rollout manifests). Give it a key scoped to that one bucket.
- `S3A_*` is for the data buckets your analyses read and write from inside Spark/Hadoop. Omit it if your workloads never touch S3.
They can be the same key if you want, but keeping them separate lets you rotate them independently and grant the cluster's data access more broadly than the CLI's state access.
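As a sketch, the two pairs land in separate config sections; the endpoint and bucket values below are just the examples from the Configuration Reference:

```toml
# State bucket used by Hailstack itself (Pulumi backend + rollout manifests)
[ceph_s3]
endpoint = "cog.sanger.ac.uk"
bucket = "hailstack-state"
access_key = "$CEPH_S3_ACCESS_KEY"
secret_key = "$CEPH_S3_SECRET_KEY"

# Data buckets read and written by Spark/Hadoop jobs via s3a:// URLs
[s3]
endpoint = "cog.sanger.ac.uk"
access_key = "$S3A_ACCESS_KEY"
secret_key = "$S3A_SECRET_KEY"
```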
There is currently no published `hailstack.sif` artifact. Pick one of the two installation paths below.
This path is the fastest if `pulumi`, `packer`, `ansible-playbook`, and `openstack` are already installed on your host (or you are happy to install them yourself).
1. Make sure the external tools are on `PATH`. Adjust the example below to match where your site keeps them, or install them from upstream:

   ```bash
   # Example: prepend site-local installs to PATH. Replace paths as appropriate.
   export PATH="/path/to/python3.14/bin:$PATH"
   export PATH="/path/to/packer:$PATH"
   export PATH="/path/to/pulumi:$PATH"
   export PATH="/path/to/ansible/venv/bin:$PATH"

   # Sanity check
   python3.14 --version
   packer --version
   pulumi version
   ansible-playbook --version
   openstack --version  # from the python-openstackclient package
   ```
2. Clone the repository and install Hailstack into a local virtual environment:

   ```bash
   git clone https://github.com/wtsi-hgi/hailstack.git
   cd hailstack
   python3.14 -m venv .venv
   source .venv/bin/activate
   pip install --upgrade pip
   pip install .
   ```

   `hailstack` is now on your `PATH` for as long as the venv is active.
3. Verify the install:

   ```bash
   hailstack --help
   ```
This path bundles Pulumi, Packer, Ansible, and the OpenStack client inside a single SIF, so the only host requirement is Apptainer (and Python 3.14 to build it).
1. Ensure Apptainer is available:

   ```bash
   apptainer --version
   ```
2. Clone the repository and build the image (root is required by `apptainer build`):

   ```bash
   git clone https://github.com/wtsi-hgi/hailstack.git
   cd hailstack
   mkdir -p dist
   sudo apptainer build dist/hailstack.sif Apptainer.def
   ```
3. Create a shell alias so the `hailstack` examples below resolve to the SIF:

   ```bash
   alias hailstack='apptainer run /absolute/path/to/dist/hailstack.sif'
   hailstack --help
   ```
When you use the SIF, you do not need a Python venv on the host.
The steps below assume you have completed Installation and that the `hailstack` command resolves to your venv install or your SIF alias.
1. Generate `~/.config/openstack/clouds.yaml` from your `openrc.sh`. Skip this step if you already have a working `clouds.yaml`.

   ```bash
   source ./openrc.sh
   hailstack convert-auth --write
   ```
2. Copy the example config and edit it. From inside a clone of the repository:

   ```bash
   cp example-config.toml my-cluster.toml
   ```

   Edit `my-cluster.toml` and at minimum set `cluster.name`, `cluster.master_flavour`, `cluster.network_name`, `cluster.bundle`, `ceph_s3.endpoint`, `ceph_s3.bucket`, and `ssh_keys.public_keys`. Review every section listed in Configuration Reference before running `create`.

   If you installed Hailstack via the SIF and do not have the repository checked out, copy `example-config.toml` out of the image first:

   ```bash
   apptainer exec dist/hailstack.sif cat /opt/hailstack/example-config.toml > my-cluster.toml
   ```
3. Create a `.env` file next to `my-cluster.toml` to hold the secrets your TOML references with `$VAR`. Keep this file out of version control. Only `HAILSTACK_WEB_PASSWORD` and the `CEPH_S3_*` pair are always required; omit `HAILSTACK_VOLUME_PASSWORD` if you are not using a data volume, and omit the `S3A_*` pair if your workloads do not need S3A access. Remove the matching `$VAR` references from `my-cluster.toml` as well when you drop a variable.

   ```bash
   cat > .env <<'EOF'
   HAILSTACK_WEB_PASSWORD=choose-a-strong-password
   CEPH_S3_ACCESS_KEY=your-ceph-state-access-key
   CEPH_S3_SECRET_KEY=your-ceph-state-secret-key
   # Optional: only needed if volumes.create=true or volumes.existing_volume_id is set
   HAILSTACK_VOLUME_PASSWORD=choose-a-volume-passphrase
   # Optional: only needed if Spark/Hadoop jobs access s3a:// data
   S3A_ACCESS_KEY=your-s3a-data-access-key
   S3A_SECRET_KEY=your-s3a-data-secret-key
   EOF
   chmod 600 .env
   ```
4. Build the OpenStack image for your chosen bundle. You only need to repeat this when you switch to a bundle that does not already have a matching `hailstack-<bundle-id>` image in Glance, or when you change the base image.

   ```bash
   hailstack build-image --config my-cluster.toml --dotenv .env
   ```
5. Preview the cluster (no resources are created):

   ```bash
   hailstack create --config my-cluster.toml --dotenv .env --dry-run
   ```
6. Create the cluster:

   ```bash
   hailstack create --config my-cluster.toml --dotenv .env
   ```
7. When `create` finishes it prints the master floating IP. Open JupyterLab through the nginx proxy:

   ```
   https://<master-ip>/jupyter/
   ```

   The basic-auth username is `hailstack` and the password is `HAILSTACK_WEB_PASSWORD`.
8. When you are done with the cluster, tear it down:

   ```bash
   hailstack destroy --config my-cluster.toml --dotenv .env
   ```

   `destroy` asks you to type the cluster name to confirm.
The CLI reads TOML with `tomllib`, substitutes `$VAR` and `${VAR}` in string values, optionally loads a `.env` file before parsing, and then validates the result with Pydantic. Defaults below are the current implementation defaults, not recommendations.
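Both substitution forms work in any string value; for example:

```toml
[ceph_s3]
access_key = "$CEPH_S3_ACCESS_KEY"    # bare form
secret_key = "${CEPH_S3_SECRET_KEY}"  # braced form
```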
| Field | Description | Type | Default | Example |
|---|---|---|---|---|
| `cluster.name` | Cluster identifier used in OpenStack resource names and the Pulumi stack name. Must match `^[a-z][a-z0-9-]{1,62}$`. | string | required | `my-cluster` |
| `cluster.bundle` | Compatibility bundle ID. Leave empty to use `[default].bundle` from `bundles.toml` for bundle-aware commands. | string | `""` | `hail-0.2.137-gnomad-3.0.4-r2` |
| `cluster.num_workers` | Number of worker nodes to create. Must be at least 1. | integer | `2` | `4` |
| `cluster.master_flavour` | OpenStack flavour for the master VM. | string | required | `m2.2xlarge` |
| `cluster.worker_flavour` | OpenStack flavour for workers. If omitted, Hailstack reuses `cluster.master_flavour`. | string | `""`, then resolved to `cluster.master_flavour` | `m2.2xlarge` |
| `cluster.network_name` | Required OpenStack network for the management interface on all nodes. | string | `cloudforms_network` | `cloudforms_network` |
| `cluster.lustre_network` | Optional second network name added to every node for Lustre access. | string | `""` | `lustre_network` |
| `cluster.lustre_mount_target` | Lustre mount target written to `/etc/fstab` when `cluster.lustre_network` is set. | string | `10.1.0.1@tcp:/fsx` | `192.0.2.10@tcp:/fsx` |
| `cluster.ssh_username` | Login user used by SSH-based commands and cloud-init paths. | string | `ubuntu` | `ubuntu` |
| `cluster.monitoring` | Monitoring mode. Only `netdata` and `none` are accepted. | string | `netdata` | `netdata` |
| `cluster.floating_ip` | Existing unassociated IPv4 address to attach to the master. Leave empty to allocate one. | string | `""` | `1.2.3.4` |
| `cluster.floating_ip_pool` | Floating IP pool name used when `create` needs to allocate a master IP instead of reusing `cluster.floating_ip`. | string | `""` | `public` |
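A minimal `[cluster]` sketch using the example values above (flavour and network names are site-specific placeholders):

```toml
[cluster]
name = "my-cluster"
bundle = "hail-0.2.137-gnomad-3.0.4-r2"
num_workers = 4
master_flavour = "m2.2xlarge"
network_name = "cloudforms_network"
```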
`[packer]` is optional overall, but `build-image` requires it.
| Field | Description | Type | Default | Example |
|---|---|---|---|---|
| `packer.base_image` | Source image name that Packer should boot before provisioning the Hailstack image. | string | required when `[packer]` is present | `ubuntu-22.04` |
| `packer.flavour` | OpenStack flavour used during the image build. | string | `m2.medium` | `m2.medium` |
| `packer.floating_ip_pool` | Floating IP pool name used by Packer while building an image. `create` uses `cluster.floating_ip_pool`, but the legacy `packer` field is still accepted for backwards compatibility. | string | `""` | `public` |
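A matching `[packer]` sketch for `build-image`, again using the example values from the table (the base image must exist in your Glance catalogue):

```toml
[packer]
base_image = "ubuntu-22.04"
flavour = "m2.medium"
```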
| Field | Description | Type | Default | Example |
|---|---|---|---|---|
| `volumes.create` | Create and attach a new Cinder data volume on the master. Mutually exclusive with `volumes.existing_volume_id`. | boolean | `false` | `true` |
| `volumes.name` | Name to assign to a newly created Cinder volume. | string | `""` | `my-data-vol` |
| `volumes.size_gb` | Size of a newly created Cinder volume in GiB. | integer | `100` | `500` |
| `volumes.preserve_on_destroy` | Keep a newly created volume when `destroy` removes the cluster. | boolean | `false` | `false` |
| `volumes.existing_volume_id` | Existing Cinder volume UUID to attach instead of creating a new one. | string | `""` | `uuid` |
| Field | Description | Type | Default | Example |
|---|---|---|---|---|
| `s3.endpoint` | S3A endpoint injected into Hadoop configuration on cluster nodes. | string | `""` | `cog.sanger.ac.uk` |
| `s3.access_key` | Access key for runtime S3A access from Spark and Hadoop. Supports env-var substitution. | string | `""` | `$S3A_ACCESS_KEY` |
| `s3.secret_key` | Secret key for runtime S3A access from Spark and Hadoop. Supports env-var substitution. | string | `""` | `$S3A_SECRET_KEY` |
| Field | Description | Type | Default | Example |
|---|---|---|---|---|
| `ceph_s3.endpoint` | Ceph S3 endpoint used for the Pulumi backend and install rollout uploads. | string | `""` | `cog.sanger.ac.uk` |
| `ceph_s3.bucket` | Bucket that stores Pulumi state and rollout manifests. | string | `hailstack-state` | `hailstack-state` |
| `ceph_s3.access_key` | Ceph S3 access key. Required for Pulumi-backed commands such as `create`, `destroy`, `status`, `reboot`, and for rollout uploads during `install`. | string | `""` | `$CEPH_S3_ACCESS_KEY` |
| `ceph_s3.secret_key` | Ceph S3 secret key paired with `ceph_s3.access_key`. Required anywhere the Pulumi backend or rollout storage is used. | string | `""` | `$CEPH_S3_SECRET_KEY` |
| Field | Description | Type | Default | Example |
|---|---|---|---|---|
| `ssh_keys.public_keys` | Public keys written into `authorized_keys` on all nodes. At least one non-empty key is required. The first key is also used for the OpenStack keypair resource. | list[string] | empty list, but validation requires at least one key | `["ssh-rsa AAAA... user@host1", "ssh-rsa BBBB... user@host2"]` |
| Field | Description | Type | Default | Example |
|---|---|---|---|---|
| `security_groups.master.ssh` | Expose TCP 22 on the master. | boolean | `true`; if you provide a partial `[security_groups.master]` table, omitted fields fall back to the base security-group defaults instead of the master preset | `true` |
| `security_groups.master.http` | Expose TCP 80 on the master. | boolean | `true`; omitted inside a partial `[security_groups.master]` table resolves to `false` | `true` |
| `security_groups.master.https` | Expose TCP 443 on the master. | boolean | `true`; omitted inside a partial `[security_groups.master]` table resolves to `false` | `true` |
| `security_groups.master.spark_master` | Expose the Spark master RPC port on TCP 7077. | boolean | `true`; omitted inside a partial `[security_groups.master]` table resolves to `false` | `true` |
| `security_groups.master.jupyter` | Expose the Jupyter backend port on TCP 8888. The usual public path is still the nginx proxy at `/jupyter/`. | boolean | `true`; omitted inside a partial `[security_groups.master]` table resolves to `false` | `true` |
| `security_groups.master.hdfs` | Expose the HDFS namenode service port on TCP 9820. | boolean | `true`; omitted inside a partial `[security_groups.master]` table resolves to `false` | `true` |
| `security_groups.master.netdata` | Expose Netdata on TCP 19999 when monitoring is enabled. | boolean | `true`; omitted inside a partial `[security_groups.master]` table resolves to `false` | `true` |
| Field | Description | Type | Default | Example |
|---|---|---|---|---|
| `security_groups.worker.spark_worker` | Expose TCP 7078-7099 on workers for Spark worker processes. | boolean | `true`; omitted inside a partial `[security_groups.worker]` table resolves to `false` | `true` |
| `security_groups.worker.hdfs` | Expose TCP 9866 on workers for HDFS datanodes. | boolean | `true`; omitted inside a partial `[security_groups.worker]` table resolves to `false` | `true` |
| `security_groups.worker.all_tcp_internal` | Allow all internal TCP traffic from master and worker security groups to worker nodes. | boolean | `true`; omitted inside a partial `[security_groups.worker]` table resolves to `false` | `true` |
| Field | Description | Type | Default | Example |
|---|---|---|---|---|
| `dns.search_domains` | Search domain string appended to `/etc/resolv.conf` on cluster nodes. | string | `""` | `internal.sanger.ac.uk` |
| Field | Description | Type | Default | Example |
|---|---|---|---|---|
| `extras.system_packages` | Extra Debian packages installed by cloud-init during cluster creation. | list[string] | empty list | `["libpq-dev", "mc"]` |
| `extras.python_packages` | Extra Python packages installed into the overlay virtual environment during cluster creation. | list[string] | empty list | `["pandas", "scikit-learn"]` |
Every command below is available from `hailstack --help`. Examples assume you are running the Apptainer alias from Quick Start.
Synopsis: Create a new cluster from a TOML configuration file.
Options:
| Option | Meaning |
|---|---|
| `--config PATH` | Path to the cluster configuration TOML file. |
| `--dry-run` | Validate the config, resolve the bundle, run OpenStack pre-flight checks, and show the Pulumi preview without creating resources. |
| `--dotenv PATH` | Load environment variables from a `.env` file before parsing the config. |
Examples:
```bash
hailstack create --config my-cluster.toml --dotenv .env
hailstack create --config my-cluster.toml --dotenv .env --dry-run
```

Synopsis: Destroy an existing cluster from a TOML configuration file.
Options:
| Option | Meaning |
|---|---|
| `--config PATH` | Path to the cluster configuration TOML file. |
| `--dry-run` | Show the destroy preview without deleting resources. |
| `--dotenv PATH` | Load environment variables from a `.env` file before parsing the config. |
Examples:
```bash
hailstack destroy --config my-cluster.toml --dotenv .env
hailstack destroy --config my-cluster.toml --dotenv .env --dry-run
```

`destroy` requires an exact cluster-name confirmation unless `--dry-run` is used.
Synopsis: Reboot worker nodes in an existing cluster.
Options:
| Option | Meaning |
|---|---|
| `--config PATH` | Path to the cluster configuration TOML file. Defaults to `./hailstack.toml`. |
| `--node TEXT` | Optional worker name selector. If omitted, all workers are rebooted. The master cannot be rebooted through this command. |
| `--ssh-key PATH` | SSH private key path. If omitted, the command relies on your SSH agent. |
| `--dotenv PATH` | Load environment variables from a `.env` file before parsing the config. |
Examples:
```bash
hailstack reboot --config my-cluster.toml --dotenv .env
hailstack reboot --config my-cluster.toml --dotenv .env --ssh-key ~/.ssh/my-cluster-key
hailstack reboot --config my-cluster.toml --dotenv .env --node my-cluster-worker-02
```

Synopsis: Build a Hailstack image for a selected compatibility bundle.
Options:
| Option | Meaning |
|---|---|
| `--config PATH` | Path to the cluster configuration TOML file. Defaults to `./hailstack.toml`. |
| `--bundle TEXT` | Bundle ID override. If omitted, the command uses `cluster.bundle` or the default bundle. |
| `--dotenv PATH` | Load environment variables from a `.env` file before parsing the config. |
Examples:
```bash
hailstack build-image --config my-cluster.toml
hailstack build-image --config my-cluster.toml --bundle hail-0.2.137-gnomad-3.0.4-r2
```

Synopsis: Install extra packages on all existing cluster nodes and upload a rollout manifest to Ceph S3.
Options:
| Option | Meaning |
|---|---|
| `--config PATH` | Path to the cluster configuration TOML file. Defaults to `./hailstack.toml`. |
| `--system TEXT` | One or more system packages to install with apt. Repeat the option to add more packages. |
| `--python TEXT` | One or more Python packages to install with uv. Repeat the option to add more packages. |
| `--file PATH` | TOML package list file merged with `--system` and `--python`. |
| `--smoke-test TEXT` | Command to run on every node after installation. |
| `--ssh-key PATH` | SSH private key path. If omitted, the command relies on your SSH agent. |
| `--dotenv PATH` | Load environment variables from a `.env` file before parsing the config. |
Examples:
```bash
hailstack install --config my-cluster.toml --dotenv .env --system mc --system tree
hailstack install --config my-cluster.toml --dotenv .env --python pandas --smoke-test "python -c 'import pandas'"
```

Synopsis: Show the current cluster summary and, optionally, detailed SSH health information.
Options:
| Option | Meaning |
|---|---|
| `--config PATH` | Path to the cluster configuration TOML file. Defaults to `./hailstack.toml`. |
| `--detailed` | Probe services and resource usage over SSH. |
| `--json` | Emit JSON instead of the plain-text summary. |
| `--dotenv PATH` | Load environment variables from a `.env` file before parsing the config. |
Examples:
```bash
hailstack status --config my-cluster.toml --dotenv .env
hailstack status --config my-cluster.toml --dotenv .env --detailed --json
```

Synopsis: Convert current `OS_*` environment variables into an OpenStack `clouds.yaml` document.
Options:
| Option | Meaning |
|---|---|
| `--write` | Write the generated YAML to `~/.config/openstack/clouds.yaml`, backing up any existing file as `clouds.yaml.bak.<timestamp>`, without echoing credentials to stdout. |
Examples:
```bash
hailstack convert-auth
# prints the generated YAML to stdout
hailstack convert-auth --write
# writes ~/.config/openstack/clouds.yaml only
```

Hailstack ships a flat compatibility matrix in `bundles.toml`. Each bundle has an explicit ID such as `hail-0.2.137-gnomad-3.0.4-r2`, where the `-rN` suffix distinguishes infrastructure revisions that keep the same Hail and gnomAD pairing but change supporting components such as Spark or Hadoop.
Current checked-in bundles are:
| Bundle ID | Hail | Spark | Hadoop | Java | Python | Scala | gnomAD | Status |
|---|---|---|---|---|---|---|---|---|
| `hail-0.2.136-gnomad-3.0.4-r1` | 0.2.136 | 3.5.4 | 3.4.0 | 11 | 3.12 | 2.12.18 | 3.0.4 | supported |
| `hail-0.2.137-gnomad-3.0.4-r1` | 0.2.137 | 3.5.4 | 3.4.0 | 11 | 3.12 | 2.12.18 | 3.0.4 | supported |
| `hail-0.2.137-gnomad-3.0.4-r2` | 0.2.137 | 3.5.6 | 3.4.1 | 11 | 3.12 | 2.12.18 | 3.0.4 | latest |
How Hailstack uses bundles:
- `build-image` and `create` resolve `cluster.bundle`. If it is blank, they fall back to `[default].bundle`.
- OpenStack image names are expected to match `hailstack-<bundle-id>`.
- Bundle fields map directly to the image build inputs: `hail`, `spark`, `hadoop`, `java`, `python`, `scala`, `gnomad`, and `status`.
- There is no dedicated `hailstack` subcommand to list bundles today. Inspect the shipped `bundles.toml` in the repository or in the container image when choosing a bundle (see the snippet below).
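One way to inspect the shipped matrix, assuming the SIF keeps `bundles.toml` alongside `example-config.toml` under `/opt/hailstack` (the path used in Quick Start):

```bash
# From a repository checkout:
cat bundles.toml

# Or from the container image (path assumed, mirroring example-config.toml):
apptainer exec dist/hailstack.sif cat /opt/hailstack/bundles.toml
```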
Shared storage is driven by the `[volumes]` section.
- If `volumes.create = true`, Hailstack creates a new Cinder volume, attaches it to the master as `/dev/vdb`, encrypts it with LUKS using `HAILSTACK_VOLUME_PASSWORD`, formats it as ext4 on first use, and mounts it at `/home/<ssh_username>/data` (see the sketch after this list).
- If `volumes.existing_volume_id` is set, Hailstack attaches the existing volume and opens it with the same `HAILSTACK_VOLUME_PASSWORD`, but it does not run `luksFormat` again.
- When a data volume is enabled, the master exports `/home/<ssh_username>/data` over NFS and workers mount `master:/home/<ssh_username>/data` automatically.
- `volumes.preserve_on_destroy = true` only affects newly created volumes. It tells Pulumi to retain the created volume when the cluster is destroyed.
- If neither `volumes.create` nor `volumes.existing_volume_id` is set, no shared data volume or NFS export is configured.
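As a sketch, a `[volumes]` table that creates a fresh 500 GiB encrypted volume and retains it after `destroy` could look like this (`HAILSTACK_VOLUME_PASSWORD` must be set):

```toml
[volumes]
create = true
name = "my-data-vol"
size_gb = 500
preserve_on_destroy = true  # Pulumi retains the volume when the cluster is destroyed
```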
The current implementation exposes services through security-group toggles plus nginx on the master node.
- `HAILSTACK_WEB_PASSWORD` is mandatory for cluster creation because cloud-init generates both the Jupyter password hash and the nginx basic-auth file.
- nginx listens on `80` and `443`, proxies `/jupyter/`, `/spark/`, `/sparkhist/`, `/yarn/`, `/mapreduce/`, `/hdfs/`, per-worker `/nmNN/` pages, and `/netdata/` when monitoring is enabled.
- The basic-auth username is `hailstack`.
- cloud-init generates a self-signed TLS certificate on the master at `/etc/nginx/ssl/hailstack.crt` with the matching key at `/etc/nginx/ssl/hailstack.key`.
- All `ssh_keys.public_keys` entries are written to `authorized_keys` on the cluster nodes. The first key is also registered as the OpenStack keypair.
- Public ingress is controlled by the `security_groups.master.*` and `security_groups.worker.*` booleans. Worker nodes default to internal-only access apart from the Spark worker and HDFS data-node ports.
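Because the certificate is self-signed, command-line clients must skip verification. A quick smoke test of the proxy (with `<master-ip>` as a placeholder) might be:

```bash
# -k skips verification of the self-signed certificate; -u supplies the basic-auth pair
curl -k -u "hailstack:$HAILSTACK_WEB_PASSWORD" https://<master-ip>/jupyter/
```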
Monitoring is controlled by `cluster.monitoring`.
- `netdata` is the default. The master enables Netdata locally and accepts streaming data from workers through a shared API key generated at cluster creation time.
- Workers stream Netdata metrics to the master; they do not need a separate dashboard URL when the nginx proxy is in use.
- When monitoring is enabled, the dashboard is available at `https://<master-ip>/netdata/` behind the same nginx authentication as JupyterLab.
- If `cluster.monitoring = "none"`, Hailstack omits Netdata service setup and the `/netdata/` proxy path.
- `hailstack status --detailed` is separate from Netdata. It probes service health and resource usage over SSH even if Netdata is disabled.
| Symptom | Likely cause | What to check |
|---|---|---|
| `Bundle '<id>' not found` | `cluster.bundle` does not match any ID in `bundles.toml`. | Pick one of the bundle IDs listed in the Compatibility Bundles section or leave `cluster.bundle` empty to use the default bundle. |
| `Ceph S3 credentials required for Pulumi state backend` | `ceph_s3.endpoint`, `ceph_s3.bucket`, `ceph_s3.access_key`, or `ceph_s3.secret_key` is blank after env substitution. | Check the TOML values and the `.env` file passed through `--dotenv`. |
| `HAILSTACK_WEB_PASSWORD required` | The environment variable was not set before `create`. | Export the variable directly or add it to the `.env` file used by `create`. |
| `HAILSTACK_VOLUME_PASSWORD required when a data volume is attached` | A volume is enabled, but the LUKS passphrase is missing. | Add `HAILSTACK_VOLUME_PASSWORD` to your environment or disable the volume section. |
| `OpenStack CLI not found` | The host environment lacks the `openstack` client. | Use the packaged container entrypoint or install the expected OpenStack CLI in the environment that runs Hailstack. |
| `Network '<name>' not found`, missing image, or missing floating IP errors during `create` | Pre-flight resource checks failed. | Verify `cluster.network_name`, `cluster.lustre_network`, the built image name `hailstack-<bundle-id>`, and any fixed `cluster.floating_ip`. |
| `Timed out waiting for SSH connectivity to return` during `reboot` | The worker did not come back cleanly after reboot or SSH access is blocked. | Check the instance console, confirm the security-group SSH setting, and verify the configured `cluster.ssh_username`. |
| `Cluster not found` from `status` | The Pulumi stack for that cluster name does not exist in the configured Ceph backend. | Re-check `cluster.name`, Ceph S3 credentials, and whether the cluster was already destroyed. |
Do I need to rebuild the image every time I create a cluster?

No. Rebuild when you switch to a bundle that does not already have a matching `hailstack-<bundle-id>` image or when you change the base image and want a fresh artifact.

Can I reuse an existing floating IP?

Yes. Set `cluster.floating_ip` to an existing unassociated IPv4 address. Leave it blank to let Pulumi allocate one, optionally using `cluster.floating_ip_pool`.

Can I turn monitoring off?

Yes. Set `cluster.monitoring = "none"`. `status --detailed` still works because it uses SSH probes rather than Netdata.

Can I add software after the cluster is running?

Yes. Use `hailstack install` with `--system`, `--python`, or `--file`. The command uploads a rollout manifest to the configured Ceph S3 bucket.

Where does cluster state live?

Pulumi state and install rollout manifests live in the `ceph_s3.bucket` bucket, not in local state files.

Can I attach an existing data volume instead of creating one?

Yes. Set `volumes.existing_volume_id` and leave `volumes.create = false`. You still need `HAILSTACK_VOLUME_PASSWORD` so the cluster can open the LUKS volume on boot, as in the sketch below.
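A minimal sketch of that configuration (the UUID is a placeholder):

```toml
[volumes]
create = false
existing_volume_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # placeholder Cinder volume UUID
```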