Multi-Headed Self Attention — By Hand | by Daniel Warfield


Last updated: 2024/07/12 at 4:41 PM


Hand computing the cornerstone of modern AI

“Focus” By Daniel Warfield using MidJourney. All images by the author unless otherwise specified. Article originally made available on Intuitively and Exhaustively Explained.

Multi-Headed Attention is likely the most important architectural paradigm in machine learning. This summary goes over all the critical mathematical operations within multi-headed self attention, allowing you to understand its inner workings at a fundamental level. If you’d like to learn more about the intuition behind this topic, check out the IAEE article.

Multi-headed self attention (MHSA) is used in a variety of contexts, each of which might format the input differently. In a natural language processing context, one would likely use a word-to-vector embedding, paired with positional encoding, to calculate a vector that represents each word. Generally, regardless of the type of data, multi-headed self attention expects a sequence of vectors, where each vector represents one element of the input, such as a word.
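To make the input format concrete, here is a minimal sketch in plain Python of turning words into a sequence of vectors. It rests on several assumptions that are not from the article: the toy embedding table `EMBEDDINGS` and its values are made up, the helper names `positional_encoding` and `build_input` are hypothetical, and the sinusoidal scheme is only one common choice of positional encoding.

```python
import math

# Hypothetical toy embedding table: each word maps to a 4-dimensional vector.
# The values are made up for illustration; real embeddings are learned.
EMBEDDINGS = {
    "the": [0.1, 0.2, 0.3, 0.4],
    "cat": [0.5, 0.1, 0.0, 0.2],
    "sat": [0.3, 0.7, 0.2, 0.1],
}


def positional_encoding(position, dim):
    """Sinusoidal encoding (an assumed scheme): sine on even indices, cosine on odd."""
    encoding = []
    for i in range(dim):
        # Each pair of indices (2k, 2k + 1) shares one frequency.
        angle = position / (10000 ** (2 * (i // 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def build_input(words):
    """Return one vector per word: its embedding plus its positional encoding."""
    sequence = []
    for position, word in enumerate(words):
        embedding = EMBEDDINGS[word]
        encoding = positional_encoding(position, len(embedding))
        sequence.append([e + p for e, p in zip(embedding, encoding)])
    return sequence


inputs = build_input(["the", "cat", "sat"])
for vector in inputs:
    print([round(x, 3) for x in vector])
```

The result is a list of three vectors, one per word, each of length four. That sequence-of-vectors shape is what multi-headed self attention consumes, whatever the underlying data type.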