Multi-Headed Cross Attention — By Hand | by Daniel Warfield | Jan, 2025

Hand computing a fundamental component of multimodal models

“Crossing” by Daniel Warfield using MidJourney and Affinity Designer 2. All images by the author unless otherwise specified. Article originally made available on Intuitively and Exhaustively Explained.

Cross attention is a fundamental tool in creating AI models that can understand multiple forms of data simultaneously. Think of language models that can understand images, like the ones used in ChatGPT, or models that generate video based on text, like Sora.

This summary goes over all critical mathematical operations within cross attention, allowing you to understand its inner workings at a fundamental level.

Cross attention is used when modeling a variety of data types, each of which might be formatted differently as input. For natural language data, one would likely use a word-to-vector embedding, paired with a positional encoding, to calculate a vector that represents each word.
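As a sketch of that text pipeline, the snippet below looks up each word in a toy embedding table and adds the standard sinusoidal positional encoding. The vocabulary, embedding values, and dimension sizes are all hypothetical, chosen only for illustration; a real model learns its embedding table during training.

```python
import numpy as np

# Hypothetical toy vocabulary and randomly initialized embedding table.
vocab = {"a": 0, "red": 1, "ball": 2}
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])
    encoding[:, 1::2] = np.cos(angles[:, 1::2])
    return encoding

tokens = ["a", "red", "ball"]
token_ids = [vocab[t] for t in tokens]
word_vectors = embedding_table[token_ids]              # (3, d_model)
# One vector per word: embedding plus its positional encoding.
text_vectors = word_vectors + sinusoidal_positional_encoding(len(tokens), d_model)
```

Each row of `text_vectors` now carries both what the word means and where it sits in the sentence, which is the form attention layers consume.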

For visual data, one might pass the image through an encoder specifically designed to summarize the image into a vector representation.
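With vectors from both modalities in hand, cross attention lets one modality query the other: queries come from the text vectors, while keys and values come from the image vectors. The sketch below uses randomly initialized projection weights (a trained model learns these) and a single head; multi-headed cross attention repeats this with several independent weight sets and concatenates the results.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 8
text_vectors = rng.normal(size=(3, d_model))   # e.g. 3 word vectors (queries)
image_vectors = rng.normal(size=(5, d_model))  # e.g. 5 vectors from an image encoder

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries_from, keys_values_from, d_head=8):
    # Hypothetical projection weights; in a real model these are learned.
    Wq = rng.normal(size=(d_model, d_head))
    Wk = rng.normal(size=(d_model, d_head))
    Wv = rng.normal(size=(d_model, d_head))
    Q = queries_from @ Wq          # queries from one modality (text)
    K = keys_values_from @ Wk      # keys from the other modality (image)
    V = keys_values_from @ Wv      # values from the other modality (image)
    scores = Q @ K.T / np.sqrt(d_head)   # (3, 5): each word scores each image vector
    weights = softmax(scores, axis=-1)   # rows sum to 1
    return weights @ V                   # (3, d_head): image-informed word vectors

out = cross_attention(text_vectors, image_vectors)
```

The key difference from self attention is only where Q versus K and V originate: because the queries and the keys/values come from different modalities, each text position ends up as a weighted blend of image information.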
