Multi-GPU Inference with Accelerate
Run inference faster by passing prompts to multiple GPUs in parallel.
Historically, more attention has been paid to distributed training than to distributed inference. After all, training is more computationally expensive. Larger and more complex LLMs, however, can take a long time to perform text completion tasks. Whether for research or in production, it's valuable to parallelize inference in order to maximize performance.
It's important to recognize that there's a difference between distributing the weights of one model across multiple GPUs and distributing prompts across multiple copies of a model. The former is relatively simple, while the latter (which I'll be focusing on) is slightly more involved.
A week ago, in version 0.20.0, Hugging Face Accelerate released a feature that significantly simplifies multi-GPU inference: Accelerator.split_between_processes(). It's built on top of torch.distributed, but is much simpler to use.
Let's look at how we can use this new feature with LLaMA. The code will be written assuming that you've saved LLaMA weights in the Hugging Face Transformers format.
First, start out by importing the required modules and initializing the tokenizer and model.
Notice how we pass in device_map="auto". This allows Accelerate to spread the model's weights evenly across the available GPUs.
If we wanted, we could instead call model.to(accelerator.device), which would move the model onto a specific GPU. Since accelerator.device differs for each process running in parallel, you could have one model loaded onto GPU 0, another loaded onto GPU 1, and so on. In this case, though, we'll stick with device_map="auto", which lets us use models larger than would fit on a single GPU.
Next, we'll write the code to perform inference!
Finally, all that remains is to launch the script with the Accelerate CLI:
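Assuming the script above is saved as run_llama.py (a hypothetical filename) on a machine with 4 GPUs, the launch command would look something like this:

```shell
# Spawn 4 parallel processes, each running the same script.
accelerate launch --num_processes 4 run_llama.py
```

Alternatively, you can run accelerate config once to set these options interactively and then launch with plain accelerate launch run_llama.py.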
When we run the above code with 4 processes, 4 copies of the model are loaded across the available GPUs. Our prompts are split evenly among those copies, which significantly improves throughput.
The output of the above code (after logs from loading the models) should look like this:
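Exact completions vary from run to run; schematically, each process prints the generated text for its slice of the prompts, so the combined output looks something like this (placeholders, not real model output):

```text
<prompt 1> <generated continuation...>
<prompt 2> <generated continuation...>
<prompt 3> <generated continuation...>
<prompt 4> <generated continuation...>
```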
I hope this was helpful! You can learn more about Accelerate and distributed inference in the Accelerate documentation if you're interested.
If you liked the article, don't forget to share it and follow me at @nebrelbug on Twitter.