Introducing Gemini Embeddings 2 Preview | Towards Data Science



Google has released a preview version of its latest embedding model, Gemini Embeddings 2. This model is notable for one main reason: it can embed text, PDFs, images, audio, and video, making it a one-stop shop for embedding just about anything you’d care to throw at it.

If you’re new to embedding, you might wonder what all the fuss is about, but embedding is one of the cornerstones of retrieval-augmented generation, or RAG as it’s known. In turn, RAG is one of the most fundamental applications of modern artificial intelligence.

A quick recap of RAG and embedding

RAG is a method of chunking, encoding and storing information that can then be searched using similarity functions that match search terms to the embedded information. The encoding part turns whatever you’re searching into a series of numbers called vectors — this is what embedding does. The vectors (embeddings) are then typically stored in a vector database.

When a user enters a search term, it is also encoded as embeddings, and the resulting vectors are compared with the contents of the vector database, usually using a process called cosine similarity. The closer the search term vectors are to parts of the information in the vector store, the more relevant the search terms are to those parts of the stored data. Large language models can interpret all this and retrieve and display the most relevant parts to the user.
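To make the comparison step concrete, here is a minimal sketch of cosine similarity using plain NumPy. The vectors here are made-up three-dimensional toys (real embeddings have hundreds or thousands of dimensions), but the arithmetic is exactly what a vector store performs.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" for two stored chunks and one query
chunk_about_cats = np.array([0.9, 0.1, 0.0])
chunk_about_finance = np.array([0.0, 0.2, 0.9])
query = np.array([0.8, 0.2, 0.1])  # a cat-related question

print(cosine_sim(query, chunk_about_cats))     # high score: topics align
print(cosine_sim(query, chunk_about_finance))  # low score: topics differ
```

The chunk whose vector points in nearly the same direction as the query vector wins the retrieval.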

There’s a whole bunch of other stuff that surrounds this, like how the input data should be split up or chunked, but the embedding, storing, and retrieval are the main features of RAG processing. To help you visualise, here’s a simplified schematic of a RAG process.

Image by Nano Banana

So, what’s special about Gemini Embedding?

Ok, so now that we know how crucial embedding is for RAG, why is Google’s new Gemini embedding model such a big deal? Simply this. Traditional embedding models — with a few exceptions — have been restricted to text, PDFs, and other document types, and maybe images at a push. 

What Gemini now offers is true multimodal input for embeddings: text, PDFs and docs, images, audio, and video. As this is a preview model, there are certain size limitations on the inputs right now, but you can see the direction of travel and how potentially useful this could be.

Input limitations

I mentioned that there are limitations on what we can input to the new Gemini embedding model. They are:

  • Text: Up to 8192 input tokens, which is about 6000 words
  • Images: Up to 6 images per request, supporting PNG and JPEG formats
  • Videos: A maximum of 2 minutes of video in MP4 and MOV formats
  • Audio: A maximum duration of 80 seconds, supporting MP3 and WAV formats
  • Documents: Up to 6 pages long
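If you want a cheap pre-flight guard against the text limit before calling the API, a simple heuristic works. The four-characters-per-token figure below is only a rough rule of thumb for English text, not an exact tokenizer:

```python
MAX_TOKENS = 8192
CHARS_PER_TOKEN = 4  # rough heuristic for English, not an exact tokenizer

def fits_text_limit(text: str) -> bool:
    """Estimate the token count from the character count and check the limit."""
    estimated_tokens = len(text) / CHARS_PER_TOKEN
    return estimated_tokens <= MAX_TOKENS

print(fits_text_limit("A short sentence."))  # True
print(fits_text_limit("word " * 50_000))     # False: far past the limit
```

For anything borderline you would want the API's own token counting rather than this approximation.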

Ok, time to see the new embedding model in practice with some Python coding examples.

Setting up a development environment

To begin, let’s set up a standard development environment to keep our projects separate. I’ll be using the UV tool for this, but feel free to use whichever methods you’re used to.

$ uv init embed-test --python 3.13
$ cd embed-test
$ uv venv
$ source .venv/bin/activate
$ uv add google-genai jupyter numpy scikit-learn pydub audioop-lts

# To run the notebook, type this in

$ uv run jupyter notebook

You’ll also need a Gemini API key, which you can get from Google’s AI Studio home page.

https://aistudio.google.com

Look for a Get API Key link near the bottom left of the screen after you’ve logged in. Take note of it as you’ll need it later.

Please note, other than being a user of their products, I have no association or affiliation with Google or any of its subsidiaries.

Setup Code

I won’t talk much about embedding text or PDF documents, as these are relatively straightforward and are covered extensively elsewhere. Instead, we’ll look at embedding images and audio, which are less common.

This is the setup code, which is common to all our examples.

import os
import numpy as np
from pydub import AudioSegment
from google import genai
from google.genai import types
from sklearn.metrics.pairwise import cosine_similarity

from IPython.display import display, Image as IPImage, Audio as IPAudio, Markdown

# Read the key from an environment variable rather than hard-coding it
client = genai.Client(api_key=os.environ.get("GEMINI_API_KEY"))

MODEL_ID = "gemini-embedding-2-preview"

Example 1 — Embedding images

For this example, we’ll embed 3 images: one of a ginger cat, one of a Labrador, and one of a yellow dolphin. We’ll then set up a series of questions or phrases, each one more specific to or related to one of the images, and see if the model can pick out the most appropriate image for each question. It does this by computing a similarity score between the question and each image. The higher this score, the more pertinent the question to the image.

Here are the images I’m using.

Image by Nano Banana

So, I have two questions and two phrases.

  • Which animal is yellow
  • Which is most likely called Rover
  • There’s something fishy going on here
  • A purrrfect image

# --- Helper functions ---

# Embed text
def embed_text(text: str) -> np.ndarray:
    """Encode a text string into an embedding vector.

    Simply pass the string directly to embed_content.
    """
    result = client.models.embed_content(
        model=MODEL_ID,
        contents=[text],
    )
    return np.array(result.embeddings[0].values)
    
# Embed an image
def embed_image(image_path: str) -> np.ndarray:
    """Encode an image file into an embedding vector."""

    # Determine MIME type from extension
    ext = image_path.lower().rsplit('.', 1)[-1]
    mime_map = {'png': 'image/png', 'jpg': 'image/jpeg', 'jpeg': 'image/jpeg'}
    mime_type = mime_map.get(ext, 'image/png')

    with open(image_path, 'rb') as f:
        image_bytes = f.read()

    result = client.models.embed_content(
        model=MODEL_ID,
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type=mime_type),
        ],
    )
    return np.array(result.embeddings[0].values)

# --- Define image files ---
image_files = ["dog.png", "cat.png", "dolphin.png"]
image_labels = ["dog","cat","dolphin"]

# Our questions
text_descriptions = [
    "Which animal is yellow",
    "Which is most likely called Rover",
    "There's something fishy going on here",
    "A purrrfect image"
]

# --- Compute embeddings ---
print("Embedding texts...")
text_embeddings = np.array([embed_text(t) for t in text_descriptions])

print("Embedding images...")
image_embeddings = np.array([embed_image(f) for f in image_files])

# Use cosine similarity for matches
text_image_sim = cosine_similarity(text_embeddings, image_embeddings)

# Print best matches for each text
print("\nBest image match for each text:")
for i, text in enumerate(text_descriptions):
    # np.argmax looks across the row (i) to find the highest score among the columns
    best_idx = np.argmax(text_image_sim[i, :])
    best_image = image_labels[best_idx]
    best_score = text_image_sim[i, best_idx]
    
    print(f"  \"{text}\" => {best_image} (score: {best_score:.3f})")

Here’s the output.

Embedding texts...
Embedding images...

Best image match for each text:
  "Which animal is yellow" => dolphin (score: 0.399)
  "Which is most likely called Rover" => dog (score: 0.357)
  "There's something fishy going on here" => dolphin (score: 0.302)
  "A purrrfect image" => cat (score: 0.368)

Not too shabby. The model came up with the same answers I would have given. How about you?

Example 2 — Embedding audio

For the audio example, I used a recording of a man’s voice describing a fishing trip in which he sees a bright yellow dolphin. The clip is about 37 seconds long.

Here is the full transcript.

Hi, my name is Glen, and I want to tell you about a fascinating sight I witnessed last Tuesday afternoon while out ocean fishing with some friends. It was a warm day with a yellow sun in the sky. We were fishing for Tuna and had no luck catching anything. Boy, we must have spent the best part of 5 hours out there. So, we were pretty glum as we headed back to dry land. But then, suddenly, and I swear this is no lie, we saw a school of dolphins. Not only that, but one of them was bright yellow in colour. We never saw anything like it in our lives, but I can tell you all thoughts of a bad fishing day went out the window. It was mesmerising.

Now, let’s see if we can narrow down where the speaker talks about seeing a yellow dolphin. 

Normally, when dealing with embeddings, we are only interested in general properties, ideas, and concepts contained in the source information. If we want to narrow down specific properties, such as where in an audio file a particular phrase occurs or where in a video a particular action or event occurs, this is a slightly more complex task. To do that in our example, we first have to chunk the audio into smaller pieces before embedding each chunk. We then perform a similarity search on each embedded chunk before producing our final answer.
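The core of that idea — score every chunk against the query and map the winning index back to a time window — can be sketched with synthetic vectors. The numbers here are made up for illustration; the full script below gets real embeddings from the Gemini API.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

CHUNK_SECONDS = 5

# Pretend embeddings: one query vector and four 5-second audio chunks
query_emb = np.array([[1.0, 0.0, 0.0]])
chunk_embs = np.array([
    [0.1, 0.9, 0.0],   # chunk 0: 0-5 s
    [0.2, 0.8, 0.1],   # chunk 1: 5-10 s
    [0.95, 0.1, 0.0],  # chunk 2: 10-15 s (closest to the query)
    [0.0, 0.1, 0.9],   # chunk 3: 15-20 s
])

# Score every chunk, pick the winner, convert its index to a time window
scores = cosine_similarity(query_emb, chunk_embs)[0]
best = int(np.argmax(scores))
start, end = best * CHUNK_SECONDS, (best + 1) * CHUNK_SECONDS
print(f"Best match: chunk {best}, between {start}s and {end}s")
```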


# --- HELPER FUNCTIONS ---

def embed_text(text: str) -> np.ndarray:
    result = client.models.embed_content(model=MODEL_ID, contents=[text])
    return np.array(result.embeddings[0].values)
    
def embed_audio(audio_path: str) -> np.ndarray:
    ext = audio_path.lower().rsplit('.', 1)[-1]
    mime_map = {'wav': 'audio/wav', 'mp3': 'audio/mp3'}
    mime_type = mime_map.get(ext, 'audio/wav')

    with open(audio_path, 'rb') as f:
        audio_bytes = f.read()

    result = client.models.embed_content(
        model=MODEL_ID,
        contents=[types.Part.from_bytes(data=audio_bytes, mime_type=mime_type)],
    )
    return np.array(result.embeddings[0].values)

# --- MAIN SEARCH SCRIPT ---

def search_audio_with_embeddings(audio_file_path: str, search_phrase: str, chunk_seconds: int = 5):
    print(f"Loading {audio_file_path}...")
    audio = AudioSegment.from_file(audio_file_path)
    
    # pydub works in milliseconds, so 5 seconds = 5000 ms
    chunk_length_ms = chunk_seconds * 1000 
    
    audio_embeddings = []
    temp_files = []
    
    print(f"Slicing audio into {chunk_seconds}-second pieces...")
    
    # 2. Chop the audio into pieces
    # We use a loop to jump forward by chunk_length_ms each time
    for i, start_ms in enumerate(range(0, len(audio), chunk_length_ms)):
        # Extract the slice
        chunk = audio[start_ms:start_ms + chunk_length_ms]
        
        # Save it temporarily to your folder so the Gemini API can read it
        chunk_name = f"temp_chunk_{i}.wav"
        chunk.export(chunk_name, format="wav")
        temp_files.append(chunk_name)
        
        # 3. Embed this specific chunk
        print(f"  Embedding chunk {i + 1}...")
        emb = embed_audio(chunk_name)
        audio_embeddings.append(emb)
        
    audio_embeddings = np.array(audio_embeddings)
    
    # 4. Embed the search text
    print(f"\nEmbedding your search: '{search_phrase}'...")
    text_emb = np.array([embed_text(search_phrase)])
    
    # 5. Compare the text against all the audio chunks
    print("Calculating similarities...")
    sim_scores = cosine_similarity(text_emb, audio_embeddings)[0]
    
    # Find the chunk with the highest score
    best_chunk_idx = np.argmax(sim_scores)
    best_score = sim_scores[best_chunk_idx]
    
    # Calculate the timestamp
    start_time = best_chunk_idx * chunk_seconds
    end_time = start_time + chunk_seconds
    
    print("\n--- Results ---")
    print(f"The concept '{search_phrase}' most closely matches the audio between {start_time}s and {end_time}s!")
    print(f"Confidence score: {best_score:.3f}")

    # Clean up the temporary chunk files
    for f in temp_files:
        os.remove(f)
    

# --- RUN IT ---

# Replace with whatever phrase you are looking for!
search_audio_with_embeddings("fishing2.mp3", "yellow dolphin", chunk_seconds=5)

Here is the output.

Loading fishing2.mp3...
Slicing audio into 5-second pieces...
  Embedding chunk 1...
  Embedding chunk 2...
  Embedding chunk 3...
  Embedding chunk 4...
  Embedding chunk 5...
  Embedding chunk 6...
  Embedding chunk 7...
  Embedding chunk 8...

Embedding your search: 'yellow dolphin'...
Calculating similarities...

--- Results ---
The concept 'yellow dolphin' most closely matches the audio between 25s and 30s!
Confidence score: 0.643

That’s pretty accurate. Listening to the audio again, the phrase “dolphin” is mentioned at the 25-second mark and “bright yellow” is mentioned at the 29-second mark. Earlier in the audio, I deliberately introduced the phrase “yellow sun” to see whether the model would be confused, but it handled the distraction well.

Summary

In this article, I introduced Gemini Embeddings 2 Preview, Google’s new all-in-one embedding model for text, PDFs, images, audio, and video, and explained why that matters for RAG systems, where embeddings turn content and search queries into vectors that can be compared for similarity.

I then walked through two Python examples showing how to generate embeddings for images and audio with the Google GenAI SDK, use similarity scoring to match text queries against images, and chunk audio into smaller segments to identify the part of a spoken recording that is semantically closest to a given search phrase.

The opportunity to perform semantic searches beyond just text and other documents is a real boon. Google’s new embedding model promises to open up a whole new raft of possibilities for multimodal search, retrieval, and recommendation systems, making it much easier to work with images, audio, video, and documents in a single pipeline. As the tooling matures, it could become a very practical foundation for richer RAG applications that understand far more than text alone.


You can find the original blog post announcing Gemini Embeddings 2 using the link below.

https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2
