Enroot on Slurm for Distributed ML: Part 2
How to use Enroot on Slurm for containerized multi-node training.
This is part 2 of a 2-part series. Part 1 is available here.
In part 1, we covered how to use Enroot on Slurm for containerized single-node training using
salloc. In this post, we'll cover how to use Enroot on Slurm for containerized multi-node training, and transition to using
Step 1: Slurm Launch Script
We'll end up creating several Bash files, all of which should be in the same directory as your training script. The first will be a Slurm launch file that we'll run with
sbatch. This file will contain the same commands we ran with
salloc in part 1, but declared using
#SBATCH processing directives.
Note that we create a variable
CUR_DIR to store the current working directory (the directory where the
sbatch command was run). I use this variable to share the location of my training directory between scripts, so I don't have to hard-code paths. But it's not required.
Slurm will automatically pass local environment variables through to the
srun command, which will run the
stage1.sh script on each node.
Step 2. Enroot Launch Script
Next, we'll create a script that will be run on each node. This script will be responsible for launching the container and running the training script. We'll call this script
Note that we pass several important environment variables provided by Slurm, along with
CUR_DIR, into the container. The
MASTER_PORT variables are used by PyTorch's distributed training backend to coordinate communication between nodes.
We also mount a local file path into the container (make sure it contains your training script!).
Step 3. Training Script
Finally, we'll create a training script that will be run inside the container. We'll call this script
Here I've used accelerate as a launcher for my distributed training script, but you can use whatever launcher you want. Just make sure you pass relevant environment variables through!
For the sake of completeness, here's my
accelerate_config.yaml file. It utilizes FSDP (Fully Sharded Data Parallel) to split model parameters and gradients across processes. This is a great way to train large models that won't fit on just one GPU.
Step 4. Submit the Job
Now that we've created all the necessary scripts, we can submit the job to Slurm using
sbatch! From the directory containing the scripts, run:
Your job will be submitted to Slurm and run as soon as resources are available. Output logs will be stored at
slurm-<jobid>.out in the current directory.
I hope this was helpful! There are many parts involved in getting distributed training working, but it's not too difficult once you get over the initial learning curve.
If you liked the article, don't forget to share it and follow me at @nebrelbug on Twitter.