A Guide to Understanding GPUs and Maximizing GPU Utilization

Editor
25 Min Read


Introduction

Modern machine learning demands large-scale models and data, pushing compute hardware to its limits. Whether you are training models on complex images, processing long-context documents, or running high-throughput reinforcement learning environments, maximizing your GPU efficiency is critical. It is not uncommon to be training or running inference on models with billions of parameters across terabytes of data. An unoptimized setup can turn a quick experiment into hours or days of waiting.

When training or inference crawls, our instinct is often to blame the model size or mathematical complexity. Modern GPUs are fast calculators, but they depend on the CPU to schedule work and to stage data into the right place on the device. Usually, computation on the GPU is not the bottleneck. If your CPU is struggling to load, preprocess, and transfer your batches across the PCIe bridge, your GPU sits idle, starving for data.

The good news? You don’t need to write custom CUDA kernels or debug low-level GPU code to fix it. If you are an ML researcher, engineer, or hobbyist interested in optimizing GPU pipelines, this blog is for you! In this post, we explore the mechanics of this bottleneck and walk through actionable engineering decisions to maximize GPU utilization. We will cover everything from fundamental PyTorch pipeline tweaks to more advanced hardware optimizations and Hugging Face integrations.

💡 Note

We will be assuming a basic working knowledge of Python and PyTorch DataLoaders going forward. No deep understanding of GPU architecture is required as we will provide a high-level overview of the GPU and how it works. All techniques discussed will be applicable to training and inference unless explicitly stated.

GPU Overview

To start, Graphics Processing Units (GPUs) exploded in popularity alongside deep learning due to their ability to train and run models at lightning speeds with parallelizable operations. But what does a GPU actually do? Before optimizing our code, we need a shared mental model of what happens under the hood in a GPU, the differences from a CPU, and the dataflow between the two.

What is a GPU? How does it differ from a CPU?

GPUs do not universally outperform CPUs. CPUs are designed to solve highly sequential problems with low latency and complex branching (ideal qualities for running an operating system). GPUs, by contrast, consist of thousands of cores optimized to complete basic operations in parallel [1]. While a CPU would need to process a thousand matrix multiplications sequentially (or up to the limit of its cores), a GPU can run all of them in parallel in a fraction of a second. In machine learning, we rarely deal with highly sequential problems requiring a CPU. Most operations are matrix multiplications, a highly parallelizable task.

At a high level, a GPU consists of thousands of tiny processing cores grouped into Streaming Multiprocessors (SMs) designed for massive parallel computation. SMs manage, schedule, and execute hundreds of threads concurrently. A high-bandwidth memory pool, Video RAM (VRAM), surrounds the compute units alongside ultra-fast caches that temporarily hold data for quick access. VRAM is the main warehouse where your model weights, gradients, and incoming data batches live. CPUs and GPUs communicate via an interface bridge, which we analyze in more depth below, as this is where bottlenecks most often occur.

💡 Note

If you are using NVIDIA GPUs, there is another component within the GPU called Tensor Cores, which accelerate mixed-precision matrix math used in machine learning. They will come up again when we discuss Mixed Precision.

PCIe Bridge

As we just mentioned, data travels from the CPU to the GPU across an interface bridge called the Peripheral Component Interconnect Express (PCIe). Data originates on your disk, loads into CPU RAM, and then crosses the PCIe bus to reach the GPU’s VRAM. Every time you send a PyTorch tensor to the device using .to('cuda'), you are invoking a transfer across this bridge. If your CPU is constantly sending tiny tensors one by one instead of large, contiguous blocks, it quickly clogs the bridge with latency and overhead.
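To make the transfer pattern concrete, here is a minimal sketch (with made-up tensor sizes) contrasting many small transfers against one batched transfer; the batched version pays the PCIe latency and launch overhead once instead of a thousand times:

```python
import torch

# Fall back to CPU so the sketch runs anywhere; the transfer-cost argument
# only applies when a CUDA device is present.
device = "cuda" if torch.cuda.is_available() else "cpu"

samples = [torch.randn(64) for _ in range(1000)]

# Slow pattern: 1000 separate host-to-device copies, each paying
# transfer latency and overhead on the PCIe bridge.
tiny = [s.to(device) for s in samples]

# Fast pattern: stack into one contiguous block on the CPU first,
# then cross the bridge in a single copy.
batch = torch.stack(samples).to(device)
```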

What is GPU Utilization?

Now that we have covered GPU anatomy, we need to understand the metrics we are tracking. When using nvidia-smi, Weights and Biases, PyTorch profiler, NVIDIA Nsight Systems, or any other method of GPU tracking, we generally analyze two main percentages: Memory Usage and Volatile GPU-Util.

  • Memory Usage (VRAM): VRAM is the GPU’s physical memory. Your VRAM can be at 100% capacity while your GPU is doing nothing. High VRAM usage only means you have successfully loaded your model weights, gradients, and a batch of data onto the GPU’s physical memory.
  • Volatile GPU-Util (Compute Utilization): This is the crucial metric. It measures the percentage of time over the past sample period (usually 1 second) that the GPU’s computing kernels were actively executing instructions. The goal is to consistently maximize this percentage!
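PyTorch exposes the memory half of this picture directly. A quick sketch for checking how much VRAM your process has claimed; note that these numbers say nothing about whether the compute units are actually busy:

```python
import torch

if torch.cuda.is_available():
    # Bytes currently held by live tensors vs. bytes reserved by
    # PyTorch's caching allocator (reserved >= allocated).
    allocated_mb = torch.cuda.memory_allocated() / 1024**2
    reserved_mb = torch.cuda.memory_reserved() / 1024**2
    print(f"allocated: {allocated_mb:.1f} MiB, reserved: {reserved_mb:.1f} MiB")
else:
    print("No CUDA device visible.")
```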

CPU-GPU Bottleneck

Now that we have covered CPUs and GPUs, let’s look at how the CPU-GPU bottleneck occurs and what we can do to fix it. The GPU has thousands of cores ready to parallelize operations, but it needs the CPU to delegate tasks. When you train a model, your GPU cannot read directly from your SSD. The CPU must load and decode the raw data, apply augmentations, batch it, and hand it off. If your CPU takes 50 milliseconds to prepare a batch, and your GPU only takes 10 milliseconds to compute the forward and backward passes, your GPU spends 40 milliseconds idling.

Roofline Model

This problem is formalized in the Roofline Model. It plots performance (FLOPs/second) against arithmetic intensity (FLOPs/byte), FLOPs being floating-point operations. When arithmetic intensity is low (you load a massive amount of data but do very little math with it), you hit the slanted “Memory-Bound” roof. When arithmetic intensity is high (you load a small amount of data but do a massive amount of matrix multiplication with it), you hit the flat “Compute-Bound” roof.

GPU parallelism is rarely the bottleneck for research experiments. Typically, slowdowns occur in the memory regime: CPU data parsing, PCIe bus clogging, or VRAM bandwidth limits. The fix is almost always better dataflow management.

Optimizing the Data Pipeline

Tracking GPU Utilization

Before we can optimize the data pipeline, we must understand how to monitor GPU utilization and VRAM. The easiest way is to use nvidia-smi to get a table with all available GPUs, current VRAM, and volatile GPU Utilization.

Sample output of nvidia-smi. The CUDA and driver versions are shown in the header. Each row of the table represents a GPU. The columns show GPU ID, Power Usage, GPU Utilization, and Memory Usage. Image by Author.

With watch -n 1 nvidia-smi, metrics can be monitored and updated every second. However, the best way to get more detailed GPU metrics is either the PyTorch Profiler or Weights and Biases. NVIDIA Nsight Systems is also a great monitoring tool that is relatively fast to set up. Weights and Biases offers the easiest visualization of GPU utilization graphs for our purposes, and these graphs are the easiest way to diagnose poor GPU optimization. An example setup for Weights and Biases is shown below (taken directly from Weights and Biases’ documentation [2]):

import wandb

# Project that the run is recorded to
project = "my-awesome-project"

# Dictionary with hyperparameters
config = {"epochs": 1337, "lr": 3e-4}

# The `with` syntax marks the run as finished upon exiting the `with` block,
# and it marks the run "failed" if there's an exception.
#
# In a notebook, it may be more convenient to write `run = wandb.init()`
# and manually call `run.finish()` instead of using a `with` block.
with wandb.init(project=project, config=config) as run:
    # Training code here

    # Log values to W&B with run.log()
    run.log({"accuracy": 0.9, "loss": 0.1})

The easiest indicator of an unoptimized GPU pipeline is a sawtooth GPU utilization graph, where GPU utilization idles at 0%, briefly spikes to 100%, and then falls back to 0%, signifying a CPU-to-GPU bottleneck. Hitting periodic 100% utilization is not a sign that GPU utilization is maximized. The GPU is tearing through the available data in a fraction of a second, and the 0% valleys represent the wait for the CPU to prepare the next batch. The goal is continuous utilization: a flat, unbroken line near 100%, meaning the GPU never has to wait. An example of a sawtooth GPU utilization graph in the same format as Weights and Biases is shown below:

Graph of sawtooth GPU utilization commonly seen in unoptimized ML pipelines. Image by Author.

Let’s see why this happens in a basic PyTorch DataLoader. By default, a PyTorch DataLoader could be defined as follows:

DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0, pin_memory=False)

With num_workers=0 and pin_memory=False (the default values in a DataLoader), the main Python process has to do everything sequentially:

  1. Fetch the files from the disk.
  2. Apply image augmentations or preprocess text.
  3. Move the batch to the GPU.
  4. GPU computes the forward and backward passes.

This is the worst-case scenario for GPU utilization. During steps 1–3, GPU utilization sits at 0%. When step 3 completes, GPU utilization spikes to 100% for step 4, and then steps 1–3 repeat for the next batch.

The next few sections discuss how to optimize the data pipeline.

num_workers (Parallelizing the CPU)

The most impactful fix is parallelizing data preparation. By increasing num_workers, you tell PyTorch to spawn dedicated subprocesses for batch fetching and preparation in the background while the GPU computes. However, more workers do not always mean more speed. If you have an 8-core CPU and set num_workers=16, you will slow down your training due to context-switching overhead and Inter-Process Communication (IPC). Each worker creates a copy of the dataset in memory. Too many workers can cause memory thrashing and crash your system. A good rule of thumb is starting at num_workers=4 and profiling from there.

💡 Note

Optimal num_workers won’t fix a slow Dataset implementation. To keep __getitem__ efficient, avoid instantiating objects, per-item DB connections, or heavy preprocessing in the function call. Its only job is to fetch raw bytes, convert them to a tensor, and return.
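As a sketch of that advice, here is a hypothetical Dataset that keeps __getitem__ lean (the random tensor stands in for decoding real bytes from self.paths[idx]):

```python
import torch
from torch.utils.data import Dataset

class LeanDataset(Dataset):
    def __init__(self, paths):
        # One-time setup belongs here (build the index, open a memory map),
        # never inside __getitem__.
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Placeholder for: read raw bytes for self.paths[idx],
        # convert them to a tensor, return. Nothing heavier.
        return torch.randn(3, 32, 32)
```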

pin_memory=True (Optimized Data Transfer)

Even with background workers, how the data physically transfers across the PCIe bridge matters. Normally, a tensor doesn’t go straight to the GPU when you transfer it. It is first read from disk into pageable system RAM and then copied by the CPU into a special page-locked (pinned) region of RAM before crossing the PCIe bus to GPU VRAM.

Setting pin_memory=True creates a data fast lane. It instructs your DataLoader to allocate batches directly into page-locked memory. This allows the GPU to use Direct Memory Access (DMA) to pull the data directly across the bridge without the CPU having to act as a middleman for the final transfer, significantly reducing latency.

pin_memory=True comes with a hardware trade-off. The operating system can’t swap page-locked memory to disk if it runs out of space. Normally, data can be swapped to disk when memory runs out, but if you exhaust page-locked RAM, an Out of Memory error is thrown. If your script throws an OOM, investigate the pin_memory flag before moving on to more complex debugging. Furthermore, be careful combining pin_memory=True with a high num_workers count. Because each worker process is actively generating and holding batches in memory, this can rapidly inflate your system’s locked RAM footprint.
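Pinned memory pays off most when paired with asynchronous copies. A minimal sketch with synthetic data (non_blocking=True only helps when the source batch lives in page-locked RAM):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
dataset = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))

# pin_memory allocates each batch in page-locked RAM, ready for DMA.
loader = DataLoader(dataset, batch_size=32, pin_memory=use_cuda)

for x, y in loader:
    if use_cuda:
        # non_blocking=True lets the DMA copy overlap with CPU work.
        x = x.to("cuda", non_blocking=True)
        y = y.to("cuda", non_blocking=True)
    break  # one batch is enough for the sketch
```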

prefetch_factor (Queueing up Data)

Sometimes the bottleneck isn’t the CPU’s processing power but the disk itself. When reading thousands of files from a network drive, there may be sudden spikes in I/O latency through no fault of the user.

The prefetch_factor argument dictates how many batches each worker should prepare and hold in a queue on the CPU in advance. If you have 4 workers and prefetch_factor=2, the CPU will always try to keep 8 ready-to-go batches queued up. If the disk suddenly hangs for half a second on a corrupted file, the GPU won’t starve; it just pulls from the prefetch queue while the worker catches up.

💡 Note

Be careful not to set prefetch_factor too high. Every queued batch lives in CPU RAM (and in page-locked RAM if pin_memory=True), so an oversized queue can inflate the host memory footprint and, with large batches staged for the GPU, still trigger Out of Memory errors. A good rule of thumb is to set prefetch_factor to 2 or 3.

By adjusting these parameters, you can smooth sawtooth utilization into a high, continuous utilization curve. This is an example of what you should be aiming for:

Graph of continuous utilization seen after making optimizations to ML pipeline. Image by Author.

The GPU utilization is now a high, continuous line near 100%, meaning the GPU never has to wait! With data-loading parameters alone, we went from inefficient GPU utilization to effective, continuous utilization.
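Putting the three knobs together, a tuned DataLoader for this section might look like the sketch below (the exact values are starting points to profile from, not universal constants; prefetch_factor and persistent_workers require num_workers > 0):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for a real dataset
dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=2,                          # parallel CPU-side preparation
    pin_memory=torch.cuda.is_available(),   # page-locked staging for DMA
    prefetch_factor=2,                      # 2 batches queued per worker
    persistent_workers=True,                # keep workers alive across epochs
)
```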

Compute and Memory on the GPU

Once the DataLoader is optimized, data is flying across the PCIe bridge, no longer creating the sawtooth GPU utilization bottleneck. But once the data moves to GPU VRAM, how do we make sure it is being used efficiently?

Batch Size

Let’s revisit the Roofline Model. To escape the slanted “Memory-Bound” roof and reach the flat “Compute-Bound” maximum performance of your GPU, you need high Arithmetic Intensity. The easiest way to increase arithmetic intensity is to increase batch size. Loading a single massive matrix of size 1024×1024 and doing the math all at once is more efficient for the GPU’s streaming multiprocessors than loading 32 smaller matrices sequentially.

In theory, we could load all of our data at once as a single batch, but in practice this ends in the dreaded CUDA Out of Memory error: you are trying to load more data into VRAM than the GPU has space to allocate.

💡 Note

You may be wondering why batch size, num_workers, and other deep learning hyperparameters are commonly powers of 2. It’s not just a de facto convention but a result of NVIDIA hardware design.

  1. Inside an SM, a GPU does not execute threads individually. It groups them into units called Warps, and on NVIDIA GPUs, a warp always contains exactly 32 threads.
  2. Beyond the 32-thread warp limit, the physical memory controllers on a GPU fetch data from VRAM in power-of-2 byte chunks. If your tensor dimensions are not aligned with these chunks, the GPU has to perform multiple memory fetches to grab the overflowing data.

For maximum efficiency, choose multiples of 32 or 64 (or 8 if you have memory limitations) for batch size and powers of 2 for other tensor dimensions.
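A tiny helper (hypothetical, not a PyTorch API) makes the alignment rule concrete: round a batch size or layer width up to the nearest hardware-friendly multiple before building your pipeline.

```python
def round_up(n: int, multiple: int = 8) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple

print(round_up(50, 32))  # 64: pads a 50-sample batch to a full multiple of the warp size
print(round_up(100, 8))  # 104: the smallest multiple of 8 holding 100
```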

Mixed Precision

By default, PyTorch initializes all model weights and data in FP32 (32-bit floating point). For almost all deep learning tasks, this is overkill. The solution is mixed precision: casting tensors down to FP16 or BF16 (16-bit floating point). Why does this matter for utilization?

  1. It halves the memory bottleneck: Only half as many bytes are being moved across the PCIe bridge and within the GPU’s internal VRAM.
  2. It unlocks the hardware: Modern NVIDIA GPUs possess specialized silicon called Tensor Cores. These cores sit completely idle if you pass them FP32 math. They are specifically engineered to execute 16-bit matrix multiplications at high speeds.

Before casting to FP16, make sure performance is the same with a subsampled dataset to confirm the task at hand doesn’t require FP32. Another option is to use PyTorch’s torch.autocast. This built-in function wraps the forward pass in a context manager that automatically figures out which operations are safe to cast to 16-bit (like matrix multiplications) and which need to stay in 32-bit for numerical stability (like Softmax or LayerNorm). It is essentially a free 2x speedup.

However, FP16 is not always the best choice of reduced precision. On modern NVIDIA architectures (Ampere and newer, such as A100s and H100s), BF16 should be used instead of FP16 to avoid NaN losses and gradient underflow. Another good option is NVIDIA’s proprietary TF32 (TensorFloat-32) format, a 19-bit floating-point format that maintains FP32-level accuracy with up to a 10x speedup over FP32 on A100s and H100s [3].
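The pieces above can be combined in a short sketch: enable TF32 for any remaining FP32 matmuls and wrap the forward pass in autocast with BF16 (which, unlike FP16, keeps FP32’s exponent range and so generally needs no gradient scaler). The tiny linear model is a placeholder:

```python
import torch

# Opt in to TF32 for FP32 matmuls/convolutions on Ampere+ GPUs (no-op elsewhere)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(128, 10).to(device)
x = torch.randn(32, 128, device=device)

# autocast runs safe ops (matmuls) in bf16 and keeps numerically
# sensitive ops (softmax, norms) in fp32 automatically.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    out = model(x)

loss = out.float().mean()
loss.backward()
```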

Gradient Accumulation (Training Only)

When training a model, instead of trying to force a large batch size into VRAM and crashing, use a smaller “micro-batch”. When using “micro-batch” strategies, instead of updating the model’s weights immediately (calling optimizer.step()), accumulate the gradients (loss.backward()) over several consecutive forward passes, creating an “effective batch size”. For example, a batch size of 8 with 8 steps of gradient accumulation yields the same mathematical update as a batch size of 64 with a single update. Micro-batch strategies can stabilize training without blowing up your VRAM footprint.
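A sketch of the loop structure, with a throwaway linear model standing in for the real one: each micro-batch loss is divided by the number of accumulation steps so the summed gradient matches one large-batch update.

```python
import torch

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accum_steps = 8  # micro-batch of 8 * 8 steps = effective batch of 64
micro_batches = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(accum_steps)]

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches, start=1):
    loss = torch.nn.functional.mse_loss(model(x), y)
    # Scale down so the accumulated gradients average like one big batch
    (loss / accum_steps).backward()
    if step % accum_steps == 0:
        optimizer.step()       # one weight update per effective batch
        optimizer.zero_grad()
```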

Kernel Efficiency

The final efficiency concept concerns kernel efficiency in training and inference. This is an exploration of how CUDA kernels work within a GPU for those intending to build custom architectures. PyTorch 2.0+ (which is almost always used) abstracts this away from the user with a single command, shown below. Let’s start with a deeper dive into the GPU and how kernels function:

Every time you execute an operation in PyTorch (we will use d=a+b+c as a simple example), you are launching a “kernel” on the GPU. Abstracting away the intricacies of the hardware, the GPU can’t do this math in one go. It must:

  1. Read a and b from VRAM into the SM cache.
  2. Compute a+b.
  3. Write that intermediate result back to VRAM.
  4. Read that intermediate result and c from VRAM.
  5. Compute the final addition.
  6. Write d back to VRAM.

This is Kernel Overhead. When building custom architectures from scratch, it is easy to accidentally create hundreds of tiny, sequential reads and writes. GPU cores spend all their time waiting on the internal VRAM rather than doing math. Fusing custom CUDA kernels can reduce overhead, but thankfully, PyTorch 2.0+ implicitly handles this with torch.compile(). PyTorch analyzes the entire computational graph and uses OpenAI’s Triton to automatically write highly optimized, fused kernels that can shave hours off a long training run by shortening memory round-trips.
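In practice this is a one-line change; a minimal sketch (the suppress_errors flag is an optional safety net that falls back to eager mode if no compiler backend is available):

```python
import torch
import torch._dynamo

torch._dynamo.config.suppress_errors = True  # fall back to eager if compilation fails

model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)

# Compilation is lazy: the first forward pass traces the graph and fuses
# kernels (slow once), and subsequent calls reuse the optimized code.
compiled = torch.compile(model)

x = torch.randn(8, 64)
out = compiled(x)
```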

While torch.compile() is phenomenal for automatic, general-purpose operation fusion, sometimes squeezing out performance gains requires highly specialized kernels. Historically, integrating hand-written CUDA or Triton kernels into your research meant wrestling with complex C++ build systems and matching CUDA toolkit versions. Thankfully, the Hugging Face kernels library [4] treats low-level compute operations like pretrained models. Instead of compiling from source, you can fetch pre-compiled, hardware-optimized binaries directly from the Hub with a single Python function call. The library automatically detects your exact PyTorch and GPU environment and downloads the perfect match in seconds.

A simple example of using the Hugging Face kernels library is shown below (from the Hugging Face kernels library documentation [4]):

import torch

from kernels import get_kernel

# Download optimized kernels from the Hugging Face hub
activation = get_kernel("kernels-community/activation", version=1)

# Random tensor
x = torch.randn((10, 10), dtype=torch.float16, device="cuda")

# Run the kernel
y = torch.empty_like(x)
activation.gelu_fast(y, x)

print(y)

Hugging Face Transformers

As a quick aside, the Hugging Face transformers library [5] exposes all of the GPU optimizations we have discussed through its TrainingArguments and Trainer classes. At the highest level of abstraction, you can simply use the example below to run a model through Hugging Face.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    # Specify an output directory
    output_dir="./results",
    # Data Pipeline
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
    dataloader_prefetch_factor=2,
    # Compute and Memory
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    # Precision and Hardware
    bf16=True,
    tf32=True,
    # Kernel Efficiency
    torch_compile=True,
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()

Conclusion

Optimizing your GPU pipeline comes down to two core principles: keeping the GPU constantly loaded with data and making every operation count once the data arrives.

On the data pipeline side, tuning the DataLoader by increasing num_workers, enabling pin_memory, and setting a prefetch_factor, yields a more continuous GPU utilization. On the compute side, maximizing batch size (or utilizing gradient accumulation), dropping to mixed precision (FP16/BF16 or TF32), and fusing operations via torch.compile() or the Hugging Face kernels library drastically reduces VRAM traffic and kernel overhead.

Together, these tweaks turn hours of wasted, memory-bound idle time into a high-speed, fully utilized research pipeline.

References

[1] CPU vs. GPU layout – NVIDIA CUDA Programming Guide

[2] Weights and Biases Setup – Weights and Biases GitHub

[3] TensorFloat-32 Precision Format – NVIDIA Blog

[4] Hugging Face kernels library – Hugging Face GitHub

[5] Hugging Face transformers library – Hugging Face GitHub
