Multi-GPU Fine-tuning for Llama 3.1 70B with FSDP and QLoRA

What you can do with only 2×24 GB GPUs and a lot of CPU RAM

Fine-tuning large language models (LLMs) with up to 35B parameters is relatively easy and cheap since it can be done with a single consumer GPU. Fine-tuning larger models with a single consumer GPU is possible in theory, as we can offload parts of the model to CPU memory. However, it would be extremely slow, even with high-end CPUs.

Using multiple GPUs is the only way to keep fine-tuning fast enough. A configuration with 2×24 GB GPUs opens up a lot of possibilities: 48 GB of GPU memory is enough to fine-tune 70B models such as Llama 3 70B and Qwen2 72B.
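To see roughly where these numbers come from, here is a back-of-the-envelope sketch. It assumes 4-bit quantized weights as used by QLoRA (about 0.5 bytes per parameter) and counts the weights only: quantization constants, LoRA adapters, activations, and optimizer states all add to this, so the real footprint is higher. The helper name `quantized_weight_gb` is mine, for illustration.

```python
def quantized_weight_gb(num_params_billion: float, bits_per_param: float = 4.0) -> float:
    """Approximate size of the quantized weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored with `bits_per_param` bits.
    Real checkpoints keep some layers in higher precision, so this is a lower bound.
    """
    total_bytes = num_params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    single_gpu_gb = 24  # one consumer GPU, e.g., an RTX 3090
    dual_gpu_gb = 48    # the 2x24 GB configuration discussed here

    for label, size_b in [("35B", 35), ("70B", 70), ("72B", 72)]:
        weights = quantized_weight_gb(size_b)
        print(
            f"{label}: ~{weights:.1f} GB of 4-bit weights | "
            f"fits in {single_gpu_gb} GB: {weights < single_gpu_gb} | "
            f"fits in {dual_gpu_gb} GB: {weights < dual_gpu_gb}"
        )
```

Under these assumptions, a 70B model needs about 35 GB for its weights, which exceeds a single 24 GB GPU but leaves some headroom within 48 GB for everything else.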

In this article, I explain how to fine-tune 70B LLMs using only two GPUs thanks to FSDP and QLoRA.

I first explain what FSDP is, and then we will see how to modify standard QLoRA fine-tuning code to run it on multiple GPUs. For the experiments and demonstrations, I use Llama 3.1 70B, but it would work similarly for other LLMs. For the hardware, I relied on 2 RTX 3090 GPUs provided by RunPod (referral link). Using 2 RTX 4090 GPUs would be faster but more expensive.

I also made a notebook implementing the code described in this article. It’s available here: