Explaining the Attention Mechanism | by Nikolaus Correll | Jan, 2025



Building a Transformer from scratch to create a simple generative model

The Transformer architecture has revolutionized the field of AI: it forms the basis not only of ChatGPT, but has also led to unprecedented performance in image recognition, scene understanding, and robotics. Unfortunately, the Transformer architecture itself is quite complex, making it hard to spot what really matters, particularly if you are new to machine learning. The best way to understand Transformers is to think about a problem as simple as generating random names, character by character. In a previous article, I explained all the tooling you will need for such a model, including training models in PyTorch and batch processing, by focusing on the simplest possible model: predicting the next character based on its frequency given the previous character in a dataset of common names.

In this article, we build on this baseline to introduce a state-of-the-art model, the Transformer. We will start by providing basic code to read and pre-process the data, then introduce the attention architecture by focusing on its key aspect first: cosine similarity between all tokens in a sequence. We will then add query, key, and value to build…
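To make the "cosine similarity between all tokens" idea concrete, here is a minimal sketch in PyTorch. The token count, embedding dimension, and random embeddings are placeholders for illustration; in the article's setting the embeddings would come from a learned character-embedding table:

```python
import torch
import torch.nn.functional as F

# Toy embeddings for a sequence of 4 tokens, each 8-dimensional.
# (Random values stand in for a learned embedding table.)
torch.manual_seed(0)
x = torch.randn(4, 8)  # (sequence_length, embedding_dim)

# Pairwise cosine similarity between every pair of tokens:
# normalize each embedding to unit length, then take all dot products.
x_norm = F.normalize(x, dim=-1)
sim = x_norm @ x_norm.T  # (4, 4) similarity matrix

# Each token is maximally similar to itself, so the diagonal is 1.
print(sim.diagonal())
```

The resulting 4×4 matrix scores how similar each token is to every other token in the sequence, which is the intuition attention builds on before queries, keys, and values are introduced.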
