Simple Ways to Speed Up Your PyTorch Model Training | by Alex Dremov

Contents

Shard optimizer state (ZeRO 1)
Shard gradients (ZeRO 2)
Shard model parameters (ZeRO 3)
How to use FSDP?

How does it work?

As I said, when training with DDP on several GPUs, each process holds exact copies of the same data. We can optimize this by implementing several enhancements:

Shard optimizer state (ZeRO 1)

When training with DDP, each process holds a complete copy of the optimizer states. With ZeRO 1, we shard these optimizer states across all ranks so that each rank holds only a portion of them. During the optimization step, each rank updates only the parameters whose optimizer states it holds, and the updated parameters are then shared with the other ranks. This reduction in redundancy helps conserve memory.
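PyTorch ships an optimizer wrapper in this spirit, torch.distributed.optim.ZeroRedundancyOptimizer. Below is a minimal sketch of how it might be combined with DDP. The helper name build_zero1_optimizer and the learning rate are illustrative assumptions, and the process group is assumed to be initialized already (for example by launching with torchrun).

```python
import torch
import torch.nn as nn
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP


def build_zero1_optimizer(model: nn.Module, lr: float = 1e-3):
    """Wrap `model` in DDP and shard Adam's state across ranks.

    Hypothetical helper. Assumes torch.distributed.init_process_group(...)
    has already been called on every rank.
    """
    # Parameters and gradients stay fully replicated, as in plain DDP.
    ddp_model = DDP(model)
    # Each rank keeps Adam's state only for its own partition of parameters.
    optimizer = ZeroRedundancyOptimizer(
        ddp_model.parameters(),
        optimizer_class=torch.optim.Adam,
        lr=lr,
    )
    return ddp_model, optimizer
```

The training loop itself does not change: call backward() on the loss and then optimizer.step() as usual.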

💡 In the case of Adam, whose optimizer states take roughly twice the model size, sharding the optimizer state among 8 ranks means each rank stores optimizer state worth only one quarter (2/8) of the model size.
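The arithmetic can be sketched as a small calculator. It is a rough estimate under stated assumptions: optimizer state is about twice the model size (as for Adam above), gradients take as much memory as parameters, and activations, buffers, and communication overhead are ignored. The function name per_rank_memory is hypothetical.

```python
def per_rank_memory(model_size: float, world_size: int, stage: int,
                    optimizer_factor: float = 2.0) -> float:
    """Rough per-rank memory for parameters + gradients + optimizer state.

    The result is in the same unit as `model_size`.
    stage 0 is plain DDP; stages 1 to 3 correspond to ZeRO 1 to 3.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    # ZeRO 3 shards parameters, ZeRO 2 and above shard gradients,
    # ZeRO 1 and above shard optimizer state.
    params = model_size / world_size if stage >= 3 else model_size
    grads = model_size / world_size if stage >= 2 else model_size
    opt_state = optimizer_factor * model_size
    if stage >= 1:
        opt_state /= world_size
    return params + grads + opt_state


for stage in range(4):
    print(stage, per_rank_memory(model_size=1.0, world_size=8, stage=stage))
```

With a model size of 1 and 8 ranks, this estimate gives 4.0 for plain DDP, 2.25 for ZeRO 1, 1.375 for ZeRO 2, and 0.5 for ZeRO 3.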

Shard gradients (ZeRO 2)

We have sharded the optimizer states. Now, we will modify the optimizer step to shard gradients too. If one rank has optimizer states for a portion of parameters, then we will:

aggregate all gradients relevant to the states the rank holds
calculate the optimization step
send the optimization step for a portion of parameters to all other ranks

As you may have noticed, each rank no longer needs to hold a full replica of the gradients. We can send gradients to the relevant rank as soon as they are available, which reduces peak memory consumption even further.
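The three steps can be simulated in a single process. This is only a sketch: plain lists stand in for tensors, loops stand in for the reduce-scatter and all-gather collectives, and plain SGD is swapped in for Adam to keep it short. The names chunk_bounds and zero2_step are hypothetical.

```python
from typing import List


def chunk_bounds(n: int, world_size: int, rank: int) -> range:
    """Indices of the parameter shard owned by `rank` (contiguous split)."""
    per_rank = (n + world_size - 1) // world_size
    return range(rank * per_rank, min(n, (rank + 1) * per_rank))


def zero2_step(params: List[float], local_grads: List[List[float]],
               lr: float) -> List[float]:
    """Simulate one sharded SGD step; local_grads[r] holds rank r's gradients."""
    world_size = len(local_grads)
    updated_chunks = []
    for rank in range(world_size):
        owned = chunk_bounds(len(params), world_size, rank)
        # 1. Reduce-scatter: the rank receives averaged gradients
        #    for the shard it owns, and nothing else.
        shard_grads = [sum(g[i] for g in local_grads) / world_size for i in owned]
        # 2. The optimization step touches only the owned shard.
        updated_chunks.append(
            [params[i] - lr * g for i, g in zip(owned, shard_grads)]
        )
    # 3. All-gather: every rank receives the updated shards of all ranks.
    return [p for chunk in updated_chunks for p in chunk]
```

The result matches what a full, unsharded SGD step on the averaged gradients would produce. The only difference is which rank computes which slice.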

Shard model parameters (ZeRO 3)

This is about to be epic.

Why do we need to store a full copy of the model on each rank? Let’s shard model parameters between all ranks. Then, we’re going to fetch the required parameters just in time during forward and backward.
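The fetch pattern can be illustrated with a toy single-process simulation. It is a sketch under simplifying assumptions: lists stand in for tensors, concatenating shards stands in for an all-gather, and each "layer" is a weighted sum followed by ReLU. The names shard and sharded_forward are hypothetical.

```python
from typing import List, Tuple


def shard(layer: List[float], world_size: int) -> List[List[float]]:
    """Split one layer's weights into `world_size` contiguous shards."""
    per_rank = (len(layer) + world_size - 1) // world_size
    return [layer[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]


def sharded_forward(sharded_layers: List[List[List[float]]],
                    x: float) -> Tuple[float, int]:
    """Toy forward pass that materializes one full layer at a time.

    Returns the output and the peak number of unsharded weights held at once.
    """
    peak = 0
    for shards in sharded_layers:
        # All-gather: rebuild this layer's full weights from every rank's shard.
        full = [w for s in shards for w in s]
        peak = max(peak, len(full))
        # Toy layer: weighted sum of the input followed by ReLU.
        x = max(0.0, sum(w * x for w in full))
        # Free the gathered weights; only the local shard stays resident.
        del full
    return x, peak
```

For example, with layers [[0.5, 0.5, 1.0], [2.0, -1.0]] sharded over 2 ranks, the peak is 3 full weights at a time rather than all 5. The backward pass would gather each layer again in the same way.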

💡 In the case of large models, these optimisations can dramatically decrease memory consumption.

How to use FSDP?

Quite simple actually. All we need is to wrap the model with FSDP:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# `model` is an existing nn.Module; the process group must be initialized
model = FSDP(model)
# it's critical to get parameters from the wrapped model,
# as only a portion of them (the sharded part) is returned
optimizer = optim.Adam(model.parameters())
# conduct training as usual
train(model, optimizer)

You can also specify the sharding strategy of FSDP. For example, we can select the SHARD_GRAD_OP strategy to achieve behaviour similar to that of ZeRO 2. The other strategies are listed in the PyTorch documentation for ShardingStrategy.
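Selecting a strategy might look like the sketch below. ShardingStrategy is imported from torch.distributed.fsdp; the helper name wrap_like_zero2 is hypothetical, and an initialized process group with the model already on this rank's GPU is assumed.

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy


def wrap_like_zero2(model: nn.Module) -> FSDP:
    """Wrap `model` so that gradients and optimizer state are sharded.

    With SHARD_GRAD_OP, parameters are kept unsharded between the forward
    and the backward pass and are resharded only after backward.
    Assumes the process group is already initialized.
    """
    return FSDP(model, sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)
```

The default strategy, ShardingStrategy.FULL_SHARD, also shards parameters and so corresponds to the ZeRO 3 behaviour described above.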

You can also wrap submodules with FSDP. In the example above, only one FSDP module is used, which reduces both computation efficiency and memory efficiency. To see why, suppose your model contains 100 Linear layers. If you do FSDP(model), there will be only one FSDP unit, which wraps the entire model. In that case, the allgather collects the full parameters for all 100 Linear layers at once, so parameter sharding does not save any CUDA memory.

You can wrap submodules explicitly or define an auto-wrap policy. To learn more about FSDP, read the PyTorch guide.
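Both wrapping options might look like the sketch below. It assumes PyTorch 2.0 or newer (for ModuleWrapPolicy), an initialized process group, and a model built from nn.Linear layers. The helper names wrap_manually and wrap_automatically are hypothetical.

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import ModuleWrapPolicy


def wrap_manually(layers: nn.Sequential) -> FSDP:
    """Explicit wrapping: one FSDP unit per child module, plus a root unit."""
    wrapped = nn.Sequential(*(FSDP(layer) for layer in layers))
    return FSDP(wrapped)


def wrap_automatically(model: nn.Module) -> FSDP:
    """Auto wrapping: every nn.Linear submodule becomes its own FSDP unit."""
    policy = ModuleWrapPolicy({nn.Linear})
    return FSDP(model, auto_wrap_policy=policy)
```

Either way, each allgather now collects the parameters of a single unit rather than the whole model, which is what lets parameter sharding reduce peak memory.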