How does the Segment-Anything Model’s (SAM’s) decoder work? | by Wei Yi

18 min read

19 hours ago

A deep dive into how the Segment-Anything model’s decoding procedure, with a focus on how its self-attention and cross-attention mechanism works.

The Segment-Anything (SAM) model is a 2D interactive segmentation model, or guided model. SAM requires user prompts to segment an image. These prompts tell the model where to segment. The inputs to the SAM model are a 2D image and a set of prompts. Users prompts tell the model where to focus. The output of the model is a set of segmentation masks at different levels and a confidence score associated with each mask.

A segmentation mask is a 2D binary array with the same size as the input image. In this 2D array, an entry at location (x, y) has a value 1 if the model thinks that the pixel at location (x, y) belongs to the segmented area. Otherwise, the entry is 0. Those confidence scores indicate model’s belief on the quality of each segmentation, higher score means higher quality.

The network architecture of SAM consists of an encoder and a decoder:

The encoder takes in the image and user prompt inputs to produce image embedding, image positional embedding and user prompt embeddings.
The decoder takes in the various embeddings to produce segmentation masks and confidence scores

This article focuses on how SAM’s decoder works. I will write another article about the encoder.

Together with the input image to segment, SAM also requires user prompts, and it support the following kinds of prompts.

Different kinds of input user prompts

Mouse clicks. A mouse click can be a positive click that tells the model to include the clicked location in the produced segmentation mask. It can also be a negative click telling the model to avoid the clicked location. SAM accept multiple clicks either positive or negative to the model.
Bounding boxes. Bounding boxes are always positive signals, telling the model that the produced segmentation mask should be inside the boxed area. There are no negative bounding boxes. SAM accepts multiple bounding boxes.