Summary
At NERSC, testing of the gen3_workflow is confined to CVMFS because our Shifter images lack Slurm support. There is also an ongoing effort at Cambridge to run the workflow using Singularity images rather than being forced to set up the software locally. The ultimate goal is to let the Parsl workflow use our images while still using Slurm for batch submission. Ben and Tom have achieved this with their gen2 Run2.2i DR2/Run3.1i DR3 Parsl workflow, but in that setup neither the workflow nor Slurm knows about Shifter; the Parsl workers simply happen to run their commands in a Shifter container.
There are SPANK plugins for Slurm for both Shifter and Singularity. Using them requires sbatch scripts that request a specific image. I could imagine a modified SlurmProvider that writes the SBATCH directives needed to submit a job in a container, such as: #SBATCH --image=docker:yourRepo/yourImage:latest. Maybe that is enough to start.
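As a minimal sketch of that idea, the directive could be generated by a small helper and handed to the existing SlurmProvider through its scheduler_options string, without modifying the provider itself. The helper name container_scheduler_options and the partition name in the comment are hypothetical, and the sketch assumes that scheduler_options is copied into the submit script alongside the other #SBATCH lines. It only requests the image; the worker commands would presumably still need to be launched through shifter.

```python
def container_scheduler_options(image, extra_directives=()):
    """Build #SBATCH lines that request a container image for a job.

    `image` uses the form shown above, e.g. "docker:yourRepo/yourImage:latest".
    `extra_directives` are further sbatch options, written without the
    "#SBATCH " prefix.
    """
    if not image or any(ch.isspace() for ch in image):
        raise ValueError("image must be a non-empty string without whitespace")
    lines = [f"#SBATCH --image={image}"]
    lines.extend(f"#SBATCH {directive}" for directive in extra_directives)
    return "\n".join(lines)


options = container_scheduler_options("docker:yourRepo/yourImage:latest")
print(options)

# Hypothetical use with Parsl (not executed here; assumes parsl is installed
# and that "yourPartition" is replaced by a real partition name):
#
#   from parsl.providers import SlurmProvider
#   provider = SlurmProvider(partition="yourPartition",
#                            scheduler_options=options)
```

Keeping the directive in a plain string means the same helper could be tried against the stock provider first, before deciding whether a modified SlurmProvider is needed at all.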
Alternatively, we can seek to install Slurm inside our images, which would allow the submit-side code, where the workflow Python script runs, to run inside the container. It might be nice to have the same environment for both the submit side and the Parsl-executed tasks. It would be worthwhile to talk this through with the Parsl developers, especially Ben Clifford, to see how beneficial this might be. One suggested benefit would be enabling the workflow itself to interact with the Butler to determine which Parsl tasks are started and how they are configured.
Shifter
I have spent some time looking at installing Slurm inside Shifter and reached out to NERSC about it. The Shifter developers address this question directly in their documentation, noting that submitting jobs from within containers is not enabled. NERSC also has dedicated Parsl documentation that walks through examples, though none of them use Shifter.
After chatting with Brian Van Klaveren, I created a new LSST Science Pipelines Docker image based on opensuse/leap:15.2, which seems closest to the OS at NERSC. The source build of the LSST Science Pipelines was successful. The next step would be to install Slurm using the same version and general configuration as at NERSC.
I then stumbled upon this doc concerning installing NERSC libraries within a Shifter image. I attempted to use their script to gather NERSC's srun environment, but ran into the problem that scanelf (used in their shifterize.sh script) is not available. My attempts to install pax-utils (which includes scanelf) locally using zypper have failed. So I'm not sure this avenue will work, and given NERSC's reluctance about installing Slurm inside containers, I'm hesitant to reach out to NERSC support. I have now located the scanelf source code and will try that out.
Going back to installing Slurm into the opensuse/leap:15.2 image results in some errors: Failed to connect to bus: No such file or directory. I found an issue that seems related here, where they worked around it by using ssh to submit their Singularity jobs. Note that in the case of Shifter images, sshd is disabled unless you turn on the --ccm flag. More work is necessary to get the image set up appropriately.
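A rough sketch of what that ssh workaround might look like from the submit side is shown below: instead of calling sbatch inside the container, the submit code asks a host outside the container to run it. This is an illustration only, not the method used in the linked issue. The function names and the host login.example.org are hypothetical, and the sketch assumes an ssh client is present in the image and that non-interactive (key-based) login to the host works.

```python
import shlex
import subprocess


def ssh_sbatch_command(host, script_path, sbatch_args=()):
    """Return the argv list that runs sbatch on `host` through ssh."""
    # Quote each piece so the remote shell sees it as a single word.
    remote = " ".join(
        shlex.quote(part) for part in ["sbatch", *sbatch_args, script_path]
    )
    return ["ssh", host, remote]


def submit_via_ssh(host, script_path, sbatch_args=()):
    """Submit a batch script on `host` and return whatever sbatch printed."""
    result = subprocess.run(
        ssh_sbatch_command(host, script_path, sbatch_args),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if ssh or sbatch fails
    )
    return result.stdout.strip()


# Only build and show the command here; nothing is submitted.
print(ssh_sbatch_command("login.example.org", "/path/to/submit.sh"))
```

The batch script path has to be visible from the remote host, so in practice it would need to live on a shared filesystem rather than inside the image.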
A Shifter developer also responded to my NERSC ticket:
My understanding is that the submission clients and configuration have to be closely matched to the server. So if Slurm were just installed with standard RPMs for example, there could be protocol mismatches. I've toyed with an idea of how to work around this. It would involve having some light-weight daemon running outside the container listening to a socket file inside the container. This would allow requests to cross the boundary. So the Slurm clients would still be provided by the system. This approach could even be extended to allow running containers in containers which is not possible with Shifter at the moment. The only reason I haven't pursued this yet is just time. If there is a strong interest, I could revisit this and tried to carve out some cycles.
If we wish to pursue this, I think it would be helpful to involve the Parsl team, since they have already worked with NERSC to develop the documentation linked above, and to see whether the Shifter developers can be persuaded that this is worth their effort.
For the short term, I think we should update the SlurmProvider to use the Shifter images and look into getting some help from the Shifter developers on this front.
Singularity
I am a little more optimistic about this path, but have not pursued it myself. A web search turned up some discussion.
It would be interesting to hear whether James Perry and the folks at Cambridge have some thoughts on this. We could create a Docker image (or a Singularity image directly) based on an OS that is closer to what is running at Cambridge. What OS would be recommended?
To Do