Now, let’s assume you’re throwing a dinner party and it’s all about Hollywood and the big movies, and you want to seat people based on what they like. You could just calculate “distance” between their preferences (genres, perhaps even hobbies?) and find out who should sit together. But deciding how you measure that distance can be the difference between compelling conversations and annoyed participants. Or awkward silences.
And yes, that company party flashback is repeating itself. Sorry for that!
The same is true in the world of vectors. The distance metric defines how “similar” two vectors look, and therefore, ultimately, how well your system predicts an outcome.
Euclidean Distance: Straightforward, but Limited
Euclidean distance measures the straight-line distance between two points in space, making it easy to understand:
- Euclidean distance works well when vectors represent physical locations or other low-dimensional, similarly scaled measurements.
- In high-dimensional spaces (like vectors representing user behavior or preferences), however, this metric often falls short. Differences in scale or magnitude can skew results, rewarding similar activity levels over actual similarity in taste.
Example: Two vectors might represent how much your dinner guests use streaming services across genres:
vec1 = [5, 10, 5]
# Dinner guest A watches action, drama, and comedy, with drama favored.
vec2 = [1, 2, 1]
# Dinner guest B has the same taste profile but consumes less streaming overall.
While their preferences align, Euclidean distance would make them seem vastly different because of the disparity in overall activity.
But in higher-dimensional spaces, such as user behavior or textual meaning, Euclidean distance becomes even less informative. It overweights magnitude, which can obscure comparisons. Consider two moviegoers who like exactly the same genres: one has seen 200 action movies, the other only 10. Under Euclidean distance, the second viewer appears far less similar to the first purely because of the gap in activity, even though all either of them ever watches is Bruce Willis movies.
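A quick sketch in plain Python (using the hypothetical guest vectors from the example above) shows how the magnitude gap dominates the result:

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance: square root of the summed squared
    # component-wise differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

vec1 = [5, 10, 5]  # dinner guest A
vec2 = [1, 2, 1]   # dinner guest B: same taste profile, less viewing

print(euclidean_distance(vec1, vec2))  # ≈ 9.80 -- "far apart" despite matching tastes
```

Even though the two profiles have identical genre proportions, the distance is large simply because guest A streams roughly five times as much.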
Cosine Similarity: Focused on Direction
The cosine similarity method takes a different approach: it focuses on the angle between vectors, not their magnitudes. It’s like comparing the direction of two arrows. If they point the same way, they are aligned, no matter their lengths. That makes it a natural fit for high-dimensional data, where we care about relationships, not scale.
- If two vectors point in the same direction, they’re considered similar (cosine similarity ≈ 1).
- If they point in opposite directions, they differ (cosine similarity ≈ -1).
- If they’re perpendicular (at a 90° angle to one another), they are unrelated (cosine similarity ≈ 0).
This normalizing property ensures that the similarity score correctly measures alignment, regardless of how one vector is scaled in comparison to another.
Example: Returning to our streaming preferences, let’s take a look at what our dinner guests’ preferences look like as vectors:
vec1 = [5, 10, 5]
# Dinner guest A watches action, drama, and comedy, with drama favored.
vec2 = [1, 2, 1]
# Dinner guest B has the same taste profile but consumes less streaming overall.
Why is cosine similarity so effective in this case? When we compute cosine similarity for vec1 [5, 10, 5] and vec2 [1, 2, 1], we’re essentially measuring the angle between these vectors.
Cosine similarity divides the dot product of the two vectors by the product of their lengths, which is equivalent to first normalizing each vector, dividing every component by the vector’s length. This operation “cancels” the differences in magnitude:
- vec1 normalizes to approximately [0.41, 0.82, 0.41].
- vec2 also normalizes to approximately [0.41, 0.82, 0.41].
Because their normalized versions are identical, cosine similarity considers these vectors identical.
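To see the normalization at work, here is a short sketch in plain Python (the `normalize` helper is ours, for illustration only):

```python
import math

def normalize(v):
    # Divide each component by the vector's length (its L2 norm),
    # producing a unit vector that keeps only the direction.
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

vec1 = [5, 10, 5]
vec2 = [1, 2, 1]

print([round(x, 2) for x in normalize(vec1)])  # [0.41, 0.82, 0.41]
print([round(x, 2) for x in normalize(vec2)])  # [0.41, 0.82, 0.41]
```

Both guests collapse onto the same unit vector, which is exactly why their cosine similarity is 1.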
This tells us that even though dinner guest A views more total content, the proportion they allocate to any given genre perfectly mirrors dinner guest B’s preferences. It’s like saying both your guests dedicate 25% of their time to action, 50% to drama, and 25% to comedy, no matter the total hours viewed.
It’s this normalization that makes cosine similarity particularly effective for high-dimensional data such as text embeddings or user preferences.
When vectors have hundreds or thousands of dimensions (one component per movie feature, say), what usually matters most is the relative weight of each dimension within the complete profile, not the absolute values. Cosine similarity captures precisely this pattern of relative importance, making it a powerful tool for finding meaningful relationships in complex data.