Multi-GPU Inference with Accelerate

Run inference faster by passing prompts to multiple GPUs in parallel.


Historically, more attention has been paid to distributed training than to distributed inference. After all, training is more computationally expensive. Larger and more complex LLMs, however, can take a long time to perform text completion tasks. Whether for research or in production, it's valuable to parallelize inference in order to maximize performance.

It's important to recognize that there's a difference between distributing the weights of one model across multiple GPUs and distributing prompts (or other inputs) across multiple copies of a model. The former is relatively simple, while the latter (which I'll be focusing on) is slightly more involved.

A week ago, in version 0.20.0, Hugging Face Accelerate released a feature that significantly simplifies multi-GPU inference: Accelerator.split_between_processes(). It's built on top of torch.distributed but is much simpler to use.
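
To get a feel for the API before we bring in a model, here's a minimal sketch (the script name demo.py and the two-process launch are just assumptions for illustration):

from accelerate import Accelerator

accelerator = Accelerator()

# Launched with: accelerate launch --num_processes=2 demo.py
# Process 0 would see ["a", "b"] and process 1 would see ["c", "d"].
with accelerator.split_between_processes(["a", "b", "c", "d"]) as chunk:
    print(f"Process {accelerator.process_index}: {chunk}")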

Let's look at how we can use this new feature with LLaMA. The code will be written assuming that you've saved LLaMA weights in the Hugging Face Transformers format.

First, start out by importing the required modules and initializing the tokenizer and model.

from transformers import LlamaForCausalLM, LlamaTokenizer
from accelerate import Accelerator

accelerator = Accelerator()

MODEL_PATH = "path-to-llama-model"

tokenizer = LlamaTokenizer.from_pretrained(MODEL_PATH)

model = LlamaForCausalLM.from_pretrained(MODEL_PATH, device_map="auto")

Notice how we pass in device_map="auto". This allows Accelerate to evenly spread out the model's weights across available GPUs.
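
If you're curious where each layer actually ended up, Transformers records the placement chosen by Accelerate, so you can print it as a quick sanity check (the exact module names and device indices will depend on your model and hardware):

# dict mapping module names to devices, e.g. {"model.embed_tokens": 0, ..., "lm_head": 1}
print(model.hf_device_map)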

If we wanted, we could instead call model.to(accelerator.device), which would move the entire model onto a specific GPU. accelerator.device is different for each process running in parallel, so you could have one copy of the model on GPU 0, another on GPU 1, and so on. In this case, though, we'll stick with device_map="auto", which lets us use models too large to fit on a single GPU.
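
For comparison, the per-process version would look roughly like this (a sketch, assuming the model is small enough to fit entirely on one GPU):

# load the full model, then move it onto this process's assigned GPU
model = LlamaForCausalLM.from_pretrained(MODEL_PATH)
model = model.to(accelerator.device)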

Next, we'll write the code to perform inference!


data = [
    "a dog",
    "a cat",
    "a vole",
    "a bat",
    "a bird",
    "a fish",
    "a horse",
    "a cow",
    "a sheep",
    "a goat",
    "a pig",
    "a chicken",
]

# Accelerator will automatically split this data up between each running process.
# The above array has 12 items. So if we had 4 processes, each process
# would be assigned 3 strings as prompts.

with accelerator.split_between_processes(data) as prompts:
    for prompt in prompts:

        # tokenize the prompt and move the input tensors onto this process's GPU
        inputs = tokenizer(prompt, return_tensors="pt").to(accelerator.device)

        # inference
        generate_ids = model.generate(**inputs, max_length=30)

        # decoding
        result = tokenizer.batch_decode(
            generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )[0]

        # print the process number and the inference result
        print(
            f"Process {accelerator.process_index} - "
            + result.replace("\n", "\\n")
        )

Finally, all that remains is to launch Accelerate using the Accelerate CLI:

accelerate launch --num_processes=4 script.py

When we run the above command, 4 copies of the model are loaded across the available GPUs. Our prompts are evenly split between the 4 copies, which significantly improves throughput compared to feeding every prompt through a single model.

The output of the above code (after logs from loading the models) should look like this:

Process 1 - a bathtub with a shower head a bathtub with a shower head bathtub with shower head and handheld
Process 0 - a dog's life, a dog's life, a dog's life, a dog's life, a dog's life
Process 1 - a bird in the hand is worth two in the bush.\na bird in the hand is worth two in the bush.\na bird in
Process 2 - a horse, a horse, my kingdom for a horse!\na horse, a horse, my kingdom for a horse!\na horse,
Process 3 - a goat, a sheep, a cow, a pig, a dog, a cat, a horse, a chicken, a du
Process 0 - a catastrophic event that will change the world forever.\nThe world is in the grip of a global pandemic.\nThe
Process 1 - a fishing trip to the Bahamas.\nI'm not sure if I'm going to be able to make it.\n
Process 2 - a coworker of mine is a big fan of the show and he's been trying to get me to watch it. I've
Process 3 - a piggy bank, a piggy bank, a piggy bank, a piggy bank, a piggy bank
Process 0 - a vole, a mouse, a shrew, a hamster, a gerbil, a guinea pig, a rabbit,
Process 2 - a sheep, a goat, a ram, a bullock, a he-lamb, a turtle-dove, and
Process 3 - a chicken in every pot, a car in every garage, a house in every backyard, a job for every man, a college
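
One more detail: our 12 prompts divide evenly across 4 processes, but that won't always be the case. In the version of Accelerate I'm using, split_between_processes also accepts an apply_padding argument that repeats the last prompt so every process receives the same number of items, which is handy if you later want to gather results across processes. A quick sketch:

with accelerator.split_between_processes(data, apply_padding=True) as prompts:
    # if len(data) weren't divisible by the number of processes, the last prompt
    # would be repeated so that every process gets an equally sized chunk
    print(f"Process {accelerator.process_index} received {len(prompts)} prompts")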

I hope this was helpful! If you're interested, you can learn more about Accelerate and distributed inference in the Accelerate documentation.

If you liked the article, don't forget to share it and follow me at @nebrelbug on Twitter.
