Learning Triton One Kernel at a Time: Matrix Multiplication



Matrix multiplication is undoubtedly the most common operation performed by GPUs. It is the fundamental building block of linear algebra and shows up across a wide spectrum of fields such as graphics, physics simulations and scientific computing, and it is ubiquitous in machine learning.

In today’s article, we’ll break down the conceptual implementation of general matrix-matrix multiplication (GEMM) while introducing several optimisation concepts such as tiling and memory coalescing. Finally, we’ll implement GEMM in Triton!

This article is the second in a series on Triton and GPU kernels. If you are not familiar with Triton or need a refresher on GPU basics, check out the previous article! All the code showcased in this article is available on GitHub.

Disclaimer: all the following figures and animations were made by the author unless stated otherwise.

Naive GEMM

Let’s start simple: we want to multiply two matrices X and Y with shapes (M,N) and (N,K) respectively. The output matrix Z=X@Y will therefore have shape (M,K).

This operation involves computing the dot product between every row of X and every column of Y. A straightforward NumPy implementation might look something like this:
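Here is a minimal sketch of such a naive version (the exact snippet in the companion repository may differ, but the structure is the same: two explicit loops and one dot product per output element):

import numpy as np

def naive_matmul(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    M, N = X.shape
    N_y, K = Y.shape
    assert N == N_y, "inner dimensions must match"
    Z = np.zeros((M, K), dtype=X.dtype)
    for m in range(M):      # for every row of X...
        for k in range(K):  # ...and every column of Y...
            # ...compute the dot product over the shared dimension N
            Z[m, k] = np.dot(X[m, :], Y[:, k])
    return Z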

While easy to write, read and understand, this implementation is highly inefficient in terms of memory access and caching. As mentioned in the first article of this series, a fundamental aspect of GPU optimisation is minimising data transfers.

However, our current implementation loads a row of X, then iteratively loads all K columns of Y and computes the corresponding dot products, repeating this process for every one of the M rows of X. This results in a total of M(K+1) load operations.

Naive Matrix Multiplication, purple and blue tiles represent the vectors involved in dot products at every time step and green cells the computed output values.

As seen in the animation, the memory access pattern is wasteful, as every column of Y is loaded M times. As an analogy: this is like running to the grocery store (global memory) every time you need a new ingredient for a dish instead of preparing all the ingredients on your kitchen counter (shared memory). Ideally, we would like to minimise the number of times each chunk of data is loaded and maximise its reusability once loaded. This leaves us with two main axes of optimisation:

  1. How can we improve the access pattern to minimise redundant loads?
  2. How much data can we load at once, and where should it be stored on the GPU?

Tiled GEMM

As mentioned previously, the naive approach to GEMM results in many redundant loads, which induces unnecessary overhead. Ideally, we’d like to load each segment of data only once and perform all the operations in which it is used before dropping it from memory.

An elegant approach to this problem is tiling, which involves dividing large matrices into smaller “tiles” or sub-matrices. Consider two matrices X and Y with shapes (4,6) and (6,4) respectively: X@Y results in a matrix Z with shape (4,4).

In order to compute the first element of Z, Z[0,0], we need to compute the dot product between the first row of X and the first column of Y: Z[0,0] = dot(X[0, :], Y[:, 0]). We can also break down the dot product into smaller chunks, for instance in groups of 3 elements: Z[0,0] = dot(X[0,0:3], Y[0:3, 0]) + dot(X[0,3:6], Y[3:6, 0])

Alternatively, we can expand this approach to two dimensions and compute an entire (2,2) block of Z at a time: Z[0:2, 0:2] = dot(X[0:2, 0:2], Y[0:2, 0:2]) + dot(X[0:2, 2:4], Y[2:4, 0:2]) + dot(X[0:2, 4:6], Y[4:6, 0:2])
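To convince ourselves that this block decomposition is exact, we can check it with a few lines of NumPy (a small illustrative snippet, not part of the original code):

import numpy as np

X = np.random.randn(4, 6)
Y = np.random.randn(6, 4)

# accumulate three (2,2) partial products along the shared dimension
Z_block = (
    X[0:2, 0:2] @ Y[0:2, 0:2]
    + X[0:2, 2:4] @ Y[2:4, 0:2]
    + X[0:2, 4:6] @ Y[4:6, 0:2]
)

# matches the top-left (2,2) block of the full product
assert np.allclose(Z_block, (X @ Y)[0:2, 0:2])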

Here’s a visual representation of tiled matrix multiplication:

Tiled Matrix Multiplication. The computation is split in several “tiles” of X and Y (highlighted in pale blue and purple), each containing several blocks (dark blue and purple). In each block, we compute dot products (green cells in X and Y). These dot products are accumulated across the blocks of a tile to compute the output values in Z (the accumulation is represented by colors from orange to green).

The above animation illustrates how data is reused in tiled GEMM. For each 2×2 block in X and Y, we compute 4 dot products, which results in a (2,2) output matrix in Z. Since each tile contains 3 blocks, we need to accumulate 3 of these matrices to compute the final (2,2) output in Z. This accumulation is represented by the colored cells in Z.

In the kitchen analogy, this is like fetching ingredients from the store and preparing them on the kitchen counter (i.e. small shared memory), reusing them several times before going back to the store.

Importantly, reusing loaded data over multiple steps allows this approach to drastically reduce the number of load operations. With (2,2) blocks, each row of X and each column of Y is used in two dot products, so we perform twice as many operations with each block of loaded data, roughly halving the number of load operations! Note that this generalises to larger blocks as well: using a (32,32) block would reduce the number of loads by a factor of around 32.
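Putting the idea together, here is a sketch of what a tiled version could look like in NumPy, assuming for simplicity that BLOCK divides all the matrix dimensions (the GPU version later in the article handles the general case with boundary checks):

import numpy as np

def tiled_matmul(X: np.ndarray, Y: np.ndarray, BLOCK: int = 2) -> np.ndarray:
    M, N = X.shape
    _, K = Y.shape
    Z = np.zeros((M, K), dtype=X.dtype)
    # compute one (BLOCK, BLOCK) output tile of Z at a time
    for m in range(0, M, BLOCK):
        for k in range(0, K, BLOCK):
            acc = np.zeros((BLOCK, BLOCK), dtype=X.dtype)
            # accumulate partial products along the shared dimension N
            for n in range(0, N, BLOCK):
                acc += X[m:m + BLOCK, n:n + BLOCK] @ Y[n:n + BLOCK, k:k + BLOCK]
            Z[m:m + BLOCK, k:k + BLOCK] = acc
    return Z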

Now you’re probably wondering: how large can these blocks be? To answer this question, let’s recall how memory is managed on modern GPUs.

GPU Memory Hierarchy

We distinguish four main types of memory in Nvidia GPUs. Here, we take the example of an A100:

  • Registers: The fastest and smallest type of memory on the GPU, residing directly within each Streaming Multiprocessor (SM). On the A100, each SM provides 256 KB of register file space (65,536 × 32-bit registers), distributed among its threads. Each thread gets its own private 32-bit registers for storing temporary variables and intermediate results, avoiding memory traffic altogether. However, register usage per thread directly affects occupancy, as using too many registers per thread limits how many threads can run concurrently.
  • L1/Shared Memory: On an A100, each SM has 192KB of SRAM that can be flexibly configured as either a hardware-managed L1 cache or a programmer-managed shared memory. For performance-critical kernels like matrix multiplication, we explicitly use this space as shared memory to stage data tiles close to the compute units, bypassing the L1 cache entirely. This gives us fine-grained control over data reuse.
  • L2 cache: This cache is slower than L1 but much larger, with around 40 MB shared across all SMs on the A100. It serves as a global cache for both data and instructions, reducing the number of accesses to high-latency HBM memory. The L2 cache is coherent across SMs, meaning that updates from one SM are visible to others, enabling synchronisation between thread blocks. Its bandwidth can reach several terabytes per second, acting as a buffer between the fast on-chip SRAM and the slower HBM.
  • High Bandwidth Memory (HBM): This is the device memory, it has a capacity of either 40GB or 80GB depending on the A100 model. It provides extremely high bandwidth (up to 2 TB/s on the 80 GB variant) but with much higher latency than on-chip caches. HBM is where large tensors, model weights, and datasets reside during execution. Since accessing HBM is expensive, efficient kernels aim to minimise data movement and maximise on-chip data reuse via registers and shared memory.

As you can see, the memory hierarchy generally trades off capacity with latency. Therefore, maximising performance boils down to loading data from HBM into shared memory efficiently and reusing it as much as possible.

GPU Memory Hierarchy, from fastest/smallest (top) to slowest/largest (bottom).

Choosing our block size is critical. We want blocks to be large enough to create a lot of parallel work, but small enough that their data fits in the SM’s shared memory and registers. A BLOCK_SIZE of 64 is a common starting point because it’s a multiple of the warp size (32 threads), ensuring full hardware utilisation.
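As a rough sanity check, we can estimate the on-chip footprint of such a block size with some back-of-the-envelope arithmetic (a sketch only: the Z accumulator typically lives in registers rather than shared memory):

BLOCK_SIZE = 64
bytes_per_element = 4  # float32

# one (BLOCK_SIZE, BLOCK_SIZE) block each for X, Y and the Z accumulator
tile_bytes = 3 * BLOCK_SIZE * BLOCK_SIZE * bytes_per_element
print(f"{tile_bytes / 1024:.0f} KB")  # 48 KB, well below the 192 KB of L1/shared memory per SM on an A100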

Parallel Tiled GEMM

With these considerations in mind, a natural follow-up to our tiled GEMM is to parallelise the computation of each pair of tiles over several thread blocks, as depicted in the following animation.

Parallel Tiled Matrix Multiplication. The iteration over tiles is replaced by a parallel operation over multiple thread blocks.

Memory Coalescing

Before writing tiled GEMM in Triton, we need to consider one last detail: memory coalescing, a technique that allows optimal use of global memory bandwidth. Memory coalescing is achieved when consecutive threads in a warp access consecutive memory addresses. Imagine a librarian who needs to fetch books for a client: if all the books are side by side on a shelf, they can be grabbed all at once. In contrast, if the books are spread across different shelves, they have to be fetched one by one, which takes significantly longer.

To understand how this applies to our case, note that matrices are stored linearly in memory; in other words, a (2,2) matrix is stored as a sequence of 4 consecutive elements. Frameworks like PyTorch adopt a row-major layout, meaning that the elements of each row are contiguous in memory. For instance, the elements of our (2,2) matrix would be stored as follows: [(0,0), (0,1), (1,0), (1,1)]. Notice that elements of the same row are adjacent, while elements of the same column are strided (here they are separated by one element, i.e. a stride of 2).

PyTorch stores matrices in row-major layout. Elements of a row are contiguous in memory, while elements of a column are strided.

This implies that we can load rows using coalesced loads, but columns do not satisfy this condition. However, we need to access columns of Y to compute dot products. In order to maximise performance, a good practice is to transpose Y so that we iterate on its rows rather than its columns. 

However, transposing Y isn’t enough to modify its layout in memory. As mentioned previously, PyTorch stores matrices in a flat array. Each matrix dimension is associated with a stride attribute, denoting the jump necessary to go from one element to the next one along this dimension. For instance, a (10,10) matrix would have strides=(10,1). Indeed, starting from element [0,0], element [1,0] is 10 memory slots (i.e. one row) away, whereas element [0,1] is adjacent. 

When transposing a tensor, PyTorch doesn’t modify the layout in memory but simply recomputes the strides. In order to make the transpose effective from a memory standpoint we need to call Y.T.contiguous().
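The effect of .T and .contiguous() on the strides is easy to verify directly in PyTorch (a small illustrative snippet):

import torch

Y = torch.randn(10, 10)
print(Y.stride())                 # (10, 1): row-major layout
print(Y.T.stride())               # (1, 10): same memory, only the strides are swapped
print(Y.T.is_contiguous())        # False: rows of Y.T are not contiguous in memory
print(Y.T.contiguous().stride())  # (10, 1): the data has actually been rearranged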

These are the steps required to load columns of Y efficiently. However, we’ll need to transpose the loaded blocks within the kernel to perform the dot product properly: z_block = tl.dot(X_block, Y_block.T).

Y, Y.T and Y.T.contiguous(), shown both as blocks and as memory layouts. The transpose operation changes the behaviour of the matrix but doesn’t modify its memory layout, which is why we need to add .contiguous() to enable coalesced reads on rows.

Triton Implementation

From here on, we first describe the kernel without memory coalescing to simplify the logic and pointer arithmetic before summarising the changes required to make the load operations coalesced on Y columns.

Let’s start by focusing on the PyTorch wrapper around the kernel. We need to read M, N, K from the input matrices and compute their strides since these constants will be useful later in the kernel. Then, we define the BLOCK_SIZE and declare the grid.
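Here is a minimal sketch of such a wrapper (the kernel name block_matmul_kernel, the BLOCK_SIZE of 64 and the 2D grid layout are assumptions; the version in the repository may differ slightly):

import torch
import triton

def block_matmul(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    M, N = X.shape
    _, K = Y.shape
    Z = torch.empty((M, K), device="cuda")

    # strides let the kernel navigate the flat memory layout of each matrix
    x_stride_m, x_stride_n = X.stride()
    y_stride_n, y_stride_k = Y.stride()
    z_stride_m, z_stride_k = Z.stride()

    BLOCK_SIZE = 64
    # one program instance per (BLOCK_SIZE, BLOCK_SIZE) output block of Z
    grid = (triton.cdiv(M, BLOCK_SIZE), triton.cdiv(K, BLOCK_SIZE))

    block_matmul_kernel[grid](
        X, x_stride_m, x_stride_n,
        Y, y_stride_n, y_stride_k,
        Z, z_stride_m, z_stride_k,
        M, N, K,
        BLOCK_SIZE,
    )
    return Z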

Now let’s dive into the actual kernel code. We’re going to make use of Triton’s make_block_ptr utility, which simplifies the pointer arithmetic. We create one block pointer per matrix and pass the matrix shape, its strides, and the block size as inputs. Additionally, we specify the offsets, i.e. the coordinates of the top-left element of the current block. For X, this corresponds to (m_idx * BLOCK_SIZE, 0), where m_idx is the index of the current block along the M dimension.

From there, we define z_acc, a zero matrix that will accumulate the partial dot products as we iterate through tiles. We then iterate over the shared dimension N, loading blocks of size (BLOCK_SIZE, BLOCK_SIZE) and accumulating their dot products in z_acc, moving the block pointers along the shared dimension with tl.advance.
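Putting these pieces together, the kernel body could look like the following sketch (argument names mirror the coalesced version shown further below; the exact code in the repository may differ):

import triton
import triton.language as tl

@triton.jit
def block_matmul_kernel(
    X_ptr, X_m_stride, X_n_stride,
    Y_ptr, Y_n_stride, Y_k_stride,
    Z_ptr, Z_m_stride, Z_k_stride,
    M, N, K,
    BLOCK_SIZE: tl.constexpr,
):
    # indices of the (BLOCK_SIZE, BLOCK_SIZE) output block computed by this program
    m_idx = tl.program_id(axis=0)
    k_idx = tl.program_id(axis=1)

    x_block_ptr = tl.make_block_ptr(
        base=X_ptr,
        shape=(M, N),
        strides=(X_m_stride, X_n_stride),
        offsets=(m_idx * BLOCK_SIZE, 0),
        block_shape=(BLOCK_SIZE, BLOCK_SIZE),
        order=(1, 0),
    )
    y_block_ptr = tl.make_block_ptr(
        base=Y_ptr,
        shape=(N, K),
        strides=(Y_n_stride, Y_k_stride),
        offsets=(0, k_idx * BLOCK_SIZE),
        block_shape=(BLOCK_SIZE, BLOCK_SIZE),
        order=(1, 0),
    )
    z_block_ptr = tl.make_block_ptr(
        base=Z_ptr,
        shape=(M, K),
        strides=(Z_m_stride, Z_k_stride),
        offsets=(m_idx * BLOCK_SIZE, k_idx * BLOCK_SIZE),
        block_shape=(BLOCK_SIZE, BLOCK_SIZE),
        order=(1, 0),
    )

    # accumulator for the partial dot products of this output block
    z_acc = tl.zeros((BLOCK_SIZE, BLOCK_SIZE), dtype=tl.float32)

    for _ in range(0, N, BLOCK_SIZE):
        x = tl.load(x_block_ptr, boundary_check=(0, 1), padding_option="zero")
        y = tl.load(y_block_ptr, boundary_check=(0, 1), padding_option="zero")
        z_acc += tl.dot(x, y)
        # move both block pointers one block further along the shared dimension N
        x_block_ptr = tl.advance(x_block_ptr, offsets=(0, BLOCK_SIZE))
        y_block_ptr = tl.advance(y_block_ptr, offsets=(BLOCK_SIZE, 0))

    tl.store(pointer=z_block_ptr, value=z_acc, boundary_check=(0, 1))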

You might have noticed that when loading data, we use boundary_check and padding_option instead of mask and other as in the previous article. These arguments are specific to block pointers: they specify which axes to check for out-of-bounds accesses (here (0,1) for x and y) and how to treat out-of-bounds values. Here we pad them with zeros so that they are ignored in the dot product.

We can now take a look at the performance of this kernel by using the following function:

import numpy as np
import torch
import triton
from tqdm import tqdm

def bench(fn: callable, x: torch.Tensor, y: torch.Tensor, repeat: int):
  flops = []
  med_latency = []

  for _ in tqdm(range(repeat), desc=f"Benchmarking {fn.__name__}"):
    latency_ms = triton.testing.do_bench(
      lambda: fn(x, y),
      quantiles=[0.5], # get the median latency
      return_mode="all",
      )
    M, N = x.shape # infer the problem sizes from the inputs
    _, K = y.shape
    n_flops = 2 * M * N * K # matmul roughly requires 2*M*N*K operations
    tflops = n_flops / (latency_ms / 1e3) / 1e12

    med_latency.append(latency_ms)
    flops.append(tflops)

  flops = np.array(flops)
  med_latency = np.array(med_latency)
  print(f"Absolute Error: {torch.sum(torch.abs(X@Y - fn(x, y)))}")
  print(f"Median Latency: {med_latency.mean():.4f} ± {med_latency.std():.3f} ms")
  print(f"Throughput: {flops.mean():.4f} ± {flops.std():.3f} TeraFLOPS")

M = 8192
N = 6144
K = 4096

X = torch.randn((M, N), device="cuda", dtype=torch.float32)
Y = torch.randn((N, K), device="cuda", dtype=torch.float32)

bench(block_matmul, X, Y, repeat=10)

We get the following outputs (using a T4 GPU on Colab):

Absolute Error: 0.0 # the kernel outputs the correct result!
Median Latency: 130.7831 ± 1.794 ms
Throughput: 3.1533 ± 0.043 TeraFLOPS

Now let’s review the changes required for coalesced loads on Y: we mainly need to flip the shape, strides and offsets when defining the block pointer for Y. Additionally, we update the block pointer to move along the column dimension (previously row dimension). The full code for this implementation is available on GitHub.

@triton.jit
def coalesced_block_matmul_kernel(
    X_ptr, X_m_stride, X_n_stride,
    Y_ptr, Y_k_stride, Y_n_stride,
    Z_ptr, Z_m_stride, Z_k_stride,
    M, N, K,
    BLOCK_SIZE: tl.constexpr,
):
    ... 
    y_block_ptr = tl.make_block_ptr(
        base=Y_ptr,
        # flip the shape, strides and offsets to match Y.T
        shape=(K, N),
        strides=(Y_k_stride, Y_n_stride), 
        offsets=(k_idx * BLOCK_SIZE, 0),
        block_shape=(BLOCK_SIZE, BLOCK_SIZE),
        order=(0, 1),
    )
    ...

    for _ in range(0, N, BLOCK_SIZE):
        ... # loads
        z_acc += tl.dot(x, y.T)  # transpose Y back for dot product
        x_block_ptr = tl.advance(x_block_ptr, offsets=(0, BLOCK_SIZE))
        # advance the block pointer along columns of Y.T (i.e rows of Y)
        y_block_ptr = tl.advance(y_block_ptr, offsets=(0, BLOCK_SIZE))

    tl.store(pointer=z_block_ptr, value=z_acc, boundary_check=(0, 1))

def coalesced_block_matmul(X, Y):
    Y = Y.T.contiguous()  # Y is now (K,N)
    M, N = X.shape
    K, _ = Y.shape
    Z = torch.empty((M, K), device="cuda")

    x_stride_m, x_stride_n = X.stride()
    y_stride_k, y_stride_n = Y.stride()
    z_stride_m, z_stride_k = Z.stride()

    ...  # define BLOCK_SIZE and grid

    coalesced_block_matmul_kernel[grid](
        X, x_stride_m, x_stride_n,
        Y, y_stride_k, y_stride_n,  # strides of the transposed (K,N) Y, in the order expected by the kernel
        Z, z_stride_m, z_stride_k,
        M, N, K,
        BLOCK_SIZE,
    )

    return Z

Here are the results of our benchmark for the kernel with coalesced loads for Y:

Absolute Error: 0.0 # Again, the kernel is correct!
Median Latency: 261.9420 ± 0.858 ms
Throughput: 1.5741 ± 0.005 TeraFLOPS

Surprisingly, the throughput of this second kernel is only half of what we obtained with the first one, despite improving the efficiency of load operations 🤔

A quick inspection using Nsight (Nvidia’s kernel profiler, more on that in a future article) reveals that the transpose operation within the kernel creates a “traffic jam”. Specifically, the transpose causes shared-memory bank conflicts, leaving threads idle most of the time. Notably, the warp scheduler has no eligible warp to dispatch 87.6% of the time, as warps are stalled waiting for the bank conflicts to resolve. Additionally, the report reads:

Metric Name              Metric Unit   Metric Value
-----------------------  -----------   ------------
DRAM Throughput          %             8.20
Compute (SM) Throughput  %             21.14

This indicates that the kernel is latency bound (i.e. neither memory nor compute bound; refer to the previous article for more details). In contrast, the first kernel is compute bound (i.e. its performance is limited by arithmetic throughput, so more compute would directly improve it), since its compute throughput is high relative to its DRAM throughput:

Metric Name              Metric Unit   Metric Value
-----------------------  -----------   ------------
DRAM Throughput          %             29.35
Compute (SM) Throughput  %             74.39

Conclusion

This experiment highlights the importance of profiling and empirical validation. Even well-intentioned optimisations like coalescing memory accesses can introduce new bottlenecks if not evaluated carefully. The first kernel, though simpler, was compute-bound and better matched the hardware characteristics.

In the next articles of this series, we’ll implement a softmax kernel, paying particular attention to integrating Triton with PyTorch’s autograd and profiling kernels using Nsight.

Until next time! 👋

