Multimodal Embeddings: An Introduction | by Shaw Talebi

Contents

Use case 1: 0-shot Image Classification
Use case 2: Image Search

Use case 1: 0-shot Image Classification

The basic idea behind using CLIP for 0-shot image classification is to pass an image into the model along with a set of possible class labels. Then, a classification can be made by evaluating which text input is most similar to the input image.
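To make the idea concrete, here is a minimal sketch of the comparison step in plain Python. The three-dimensional vectors are made-up stand-ins for CLIP embeddings (real ones have hundreds of dimensions), and `cosine_similarity` and `classify` are illustrative helpers, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(image_embedding, label_embeddings):
    """Return the label whose embedding is most similar to the image embedding."""
    scores = {label: cosine_similarity(image_embedding, emb)
              for label, emb in label_embeddings.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores

# Hypothetical embeddings, invented for illustration only
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a photo of a dog": [0.1, 0.9, 0.3],
}

best_label, scores = classify(image_embedding, label_embeddings)
print(best_label)  # a photo of a cat
```

In the real workflow, the model produces the embeddings and the similarity scores, so none of this needs to be written by hand.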

We’ll start by importing the Hugging Face Transformers library so that the CLIP model can be downloaded locally. Additionally, the PIL library is used to load images in Python.

from transformers import CLIPProcessor, CLIPModel
from PIL import Image

Next, we can load a version of the CLIP model and its associated data processor. Note: the processor handles text tokenization and image preprocessing.

# import model
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
# import processor (handles text tokenization and image preprocessing)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

We load an image of a cat and create a list of two possible class labels: “a photo of a cat” or “a photo of a dog”.

# load image
image = Image.open("images/cat_cute.png")
# define text classes
text_classes = ["a photo of a cat", "a photo of a dog"]

Next, we’ll preprocess the image/text inputs and pass them into the model.

# pass image and text classes to processor
inputs = processor(text=text_classes, images=image, return_tensors="pt",
                   padding=True)
# pass inputs to CLIP
outputs = model(**inputs) # note: "**" unpacks dictionary items
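The `padding=True` argument matters when the text inputs tokenize to different lengths: shorter sequences are padded so the batch forms a rectangular tensor, and an attention mask marks which positions hold real tokens. The sketch below imitates that step in plain Python. The token IDs and the pad ID of 0 are made up for illustration and are not CLIP's actual vocabulary, and `pad_batch` is a hypothetical helper.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-ID lists to the length of the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Hypothetical token IDs for two prompts of different lengths
batch = [[101, 7, 42, 9, 102], [101, 7, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)       # [[101, 7, 42, 9, 102], [101, 7, 102, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

The two labels used here likely tokenize to the same length, so padding would have no visible effect. It becomes relevant once labels of different lengths are mixed.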

To make a class prediction, we must extract the image logits and evaluate which class corresponds to the maximum.

# image-text similarity score
logits_per_image = outputs.logits_per_image
# convert scores to probs via softmax
probs = logits_per_image.softmax(dim=1)
# print prediction
predicted_class = text_classes[probs.argmax()]
print(predicted_class, "| Probability = ",
      round(float(probs[0][probs.argmax()]),4))

>> a photo of a cat | Probability =  0.9979
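For intuition about the softmax step, here is a plain-Python version applied to made-up numbers. CLIP's logits are cosine similarities multiplied by a learned scale factor, which is why a modest gap in similarity can turn into a probability near 1. The similarity values and the scale of 100 below are assumptions chosen for illustration, not values read from the model above.

```python
import math

def softmax(values):
    """Convert a list of scores into probabilities that sum to 1."""
    shifted = [v - max(values) for v in values]  # subtract max for numerical stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities between one image and two labels
similarities = [0.28, 0.22]
scale = 100.0  # assumed scale factor, for illustration

probs_unscaled = softmax(similarities)
probs_scaled = softmax([scale * s for s in similarities])

print([round(p, 4) for p in probs_unscaled])
print([round(p, 4) for p in probs_scaled])
```

Without scaling, the two labels come out at roughly 51% versus 49%. With scaling, the same similarity gap puts the first label above 99%.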

The model nailed it with a 99.79% probability that it’s a cat photo. However, this was a super easy one. Let’s see what happens when we change the class labels to: “ugly cat” and “cute cat” for the same image.

>> cute cat | Probability =  0.9703

The model easily identified that the image was indeed a cute cat. Let’s do something more challenging like the labels: “cat meme” or “not cat meme”.

>> not cat meme | Probability =  0.5464

While the model is less confident about this prediction with a 54.64% probability, it correctly implies that the image is not a meme.
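A probability this close to 50% is worth treating with care: softmax probabilities only compare the labels that were supplied, so they indicate which label fits better, not whether either one fits well. One simple way to flag such cases is to check the gap between the top two probabilities. The helper below is a hypothetical sketch, and the 0.2 margin is an arbitrary threshold chosen for illustration.

```python
def top_prediction(labels, probs, min_margin=0.2):
    """Return the top label, its probability, and whether it beats the
    runner-up by at least min_margin. Expects two or more labels."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    (best_label, best_prob), (_, second_prob) = ranked[0], ranked[1]
    is_confident = (best_prob - second_prob) >= min_margin
    return best_label, best_prob, is_confident

# Probabilities reported above; the smaller value in each pair is
# inferred as 1 minus the reported one
print(top_prediction(["cat meme", "not cat meme"], [0.4536, 0.5464]))
# ('not cat meme', 0.5464, False)
print(top_prediction(["ugly cat", "cute cat"], [0.0297, 0.9703]))
# ('cute cat', 0.9703, True)
```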

Use case 2: Image Search

Another application of CLIP is essentially the inverse of Use Case 1. Rather than identifying which text label matches an input image, we can evaluate which image (in a set) best matches a text input (i.e., a query). In other words, we perform a search over images.

We start by storing a set of images in a list. Here, I have three images of a cat, dog, and goat, respectively.

# create list of images to search over
image_name_list = ["images/cat_cute.png", "images/dog.png", "images/goat.png"]
image_list = []
for image_name in image_name_list:
    image_list.append(Image.open(image_name))

Next, we can define a query like “a cute dog” and pass it and the images into CLIP.

# define a query
query = "a cute dog"
# pass images and query to CLIP
inputs = processor(text=query, images=image_list, return_tensors="pt",
                   padding=True)

We can then match the best image to the input text by extracting the text logits and evaluating the image corresponding to the maximum.

# compute logits and probabilities
outputs = model(**inputs)
logits_per_text = outputs.logits_per_text
probs = logits_per_text.softmax(dim=1)
# print best match
best_match = image_list[probs.argmax()]
prob_match = round(float(probs[0][probs.argmax()]),4)
print("Match probability: ",prob_match)
# note: display() is available in Jupyter notebooks
display(best_match)

>> Match probability:  0.9817

Best match for query “a cute dog”. Image from Canva.
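For a search setting, it is often useful to return a ranked list rather than only the single best match. The sketch below ranks image file names by match probability in plain Python. The probabilities are invented for illustration (only the 0.9817 comes from the output above, and it is assumed to belong to the dog image), and `rank_images` is a hypothetical helper.

```python
def rank_images(image_names, probs, top_k=2):
    """Return the top_k (image_name, probability) pairs, best match first."""
    ranked = sorted(zip(image_names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

image_names = ["images/cat_cute.png", "images/dog.png", "images/goat.png"]
# Hypothetical match probabilities for the query "a cute dog"
probs = [0.0121, 0.9817, 0.0062]

for name, prob in rank_images(image_names, probs):
    print(f"{name}: {prob:.4f}")
# images/dog.png: 0.9817
# images/cat_cute.png: 0.0121
```

With the real model, the list of probabilities could be obtained from the tensor computed earlier via `probs[0].tolist()`.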

We see that (again) the model nailed this simple example. But let’s try some trickier examples.

query = "something cute but metal 🤘"

>> Match probability:  0.7715

Best match for query “something cute but metal 🤘”. Image from Canva.

query = "a good boy"

>> Match probability:  0.8248

Best match for query “a good boy”. Image from Canva.

query = "the best pet in the world"

>> Match probability:  0.5664

Best match for query “the best pet in the world”. Image from Canva.

Although this last prediction is quite controversial, all the other matches were spot on! This is likely because images like these are ubiquitous on the internet and thus were seen many times in CLIP’s pre-training.

Multimodal embeddings unlock countless AI use cases that involve multiple data modalities. Here, we saw two such use cases, i.e., 0-shot image classification and image search using CLIP.

Another practical application of models like CLIP is multimodal RAG, which consists of automatically retrieving multimodal context for an LLM. In the next article of this series, we will see how this works under the hood and review a concrete example.
