Quick & Helpful Slurm Commands
A quick guide to using Slurm for distributed machine learning.
In the lab I work in, we have access to a High Performance Computing (HPC) environment that uses the Slurm Workload Manager.
I've been using it for a while now, and I've found a few commands that I use all the time. I thought I'd share them here in case they're useful to anyone else.
What if all of the compute nodes are allocated, or you don't want your job to exit as soon as your terminal connection closes? In that case, you can use `sbatch` to submit a job to the queue. It will run automatically as soon as Slurm can allocate the resources.
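For a quick one-off, `sbatch` can even take the command inline via its `--wrap` flag, no script file needed. A minimal sketch (the `python train.py` command is just a placeholder for your own workload):

```shell
# Submit a single command as a batch job without writing a script file.
# (train.py is a hypothetical placeholder, not from the original post.)
sbatch --wrap="python train.py"

# Check the state of your jobs in the queue.
squeue -u $USER
```

The job keeps running even after you log out, and `squeue` shows whether it's pending or running.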
This will take slightly more setup. Assume that the job we actually want to run is contained in `myjob.sh`. To submit that script as a job, we'll first create a Bash wrapper script for Slurm to run.
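Here's a minimal sketch of such a wrapper — call it `launch.sh` (the name and the resource values are assumptions for illustration; tune them to your cluster):

```shell
#!/bin/bash
# launch.sh -- wrapper script submitted to Slurm with `sbatch`.
# All resource values below are illustrative examples.

#SBATCH --job-name=myjob        # name shown in the queue
#SBATCH --nodes=2               # number of nodes to allocate
#SBATCH --ntasks-per-node=1     # one task (copy of the job) per node
#SBATCH --time=01:00:00         # wall-clock limit (HH:MM:SS)
#SBATCH --output=myjob-%j.out   # log file; %j expands to the job ID

# srun launches myjob.sh once per task, across the allocated nodes.
srun bash myjob.sh
```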
Note that we're using `#SBATCH` directives to pass in the parameters that we would have passed to `salloc` before. We're also using `srun` to run our actual job; it will handle running the script across multiple nodes, if we so desire.
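To see how `srun` fans a command out across the allocation, a classic sanity check is running `hostname` once per task:

```shell
# Inside an allocation (or a batch script), srun runs one copy per task.
# With e.g. two nodes and one task per node, you'd see two different hostnames.
srun hostname
```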
Finally, we'll launch everything by passing the wrapper script to `sbatch`.
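Submitting is then a single command (assuming the wrapper is saved as `launch.sh`, a name made up for illustration):

```shell
sbatch launch.sh
# Slurm responds with a line like "Submitted batch job 12345".
```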
That's it! I hope this was helpful. If you have any questions, you can ask ChatGPT or Bard (they'll give either incredibly helpful or completely incorrect answers, but it's worth a shot!).
If you liked the article, don't forget to share it and follow me at @nebrelbug on Twitter.