SAM: Segment Anything Model. Quickly customize your product landing page with SAM | by Rafael Guedes | Jan, 2024



Quickly customize your product landing page with SAM

Transformers have been widely applied to Natural Language Processing use cases, but they can also be applied to several other domains of Artificial Intelligence, such as time series forecasting and computer vision.

Great examples of Transformer models applied to computer vision are Stable Diffusion for image generation, Detection Transformer for object detection and, more recently, SAM for image segmentation. The great benefit these models bring is that we can use prompts to manipulate images without much effort; all it takes is a good prompt.

The use cases for this type of model are endless, especially if you work at an e-commerce company. A simple but time-consuming and expensive example is the process of taking an item from photograph to website listing. Companies need to photograph the items, remove the props used and, finally, inpaint the hole left by the prop before posting the item on the website. What if this entire process could be automated by AI, with humans handling only the complex cases and reviewing what the AI has done?

In this article, I provide a detailed explanation of SAM, an image segmentation model, and apply it to a hypothetical use case where we want to run an A/B test to understand which type of background would increase the conversion rate.
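To make the background A/B test concrete, here is a minimal sketch (my own illustration, not code from the article's repository) of how a product mask, such as one produced by SAM, can be used to composite the product onto an alternative background with NumPy:

```python
import numpy as np

def swap_background(image: np.ndarray, mask: np.ndarray,
                    background: np.ndarray) -> np.ndarray:
    """Composite the masked product onto a new background.

    image, background: (H, W, 3) uint8 arrays of the same size;
    mask: (H, W) boolean array where True marks product pixels.
    """
    # Broadcast the 2-D mask over the colour channels and pick
    # the product pixel where True, the new background elsewhere.
    return np.where(mask[..., None], image, background)

# Toy example: a 2x2 "photo" with the product in the top-left pixel.
image = np.full((2, 2, 3), 200, dtype=np.uint8)
mask = np.array([[True, False], [False, False]])
white_bg = np.full((2, 2, 3), 255, dtype=np.uint8)

variant = swap_background(image, mask, white_bg)
```

Each A/B variant is then just a different `background` array passed to the same mask.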

Figure 1: Segment Anything Model (image generated by the author with DALL-E)

As always, the code is available on Github.

Segment Anything Model (SAM) [1] is a segmentation model developed by Meta that aims to create masks of the objects in an image guided by a prompt that can be text, a mask, a bounding box or just a point in an image.
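As a sketch of what point-based prompting looks like with Meta's `segment-anything` package (the `vit_b` model size and checkpoint filename below are assumptions; the calls follow the library's `SamPredictor` interface):

```python
import numpy as np

def segment_with_point(image: np.ndarray, point_xy: tuple,
                       checkpoint: str = "sam_vit_b_01ec64.pth"):
    """Return the best candidate mask for the object at point_xy (x, y).

    Requires `pip install segment-anything` and a downloaded checkpoint;
    the imports are deferred so this sketch stays importable without them.
    """
    from segment_anything import SamPredictor, sam_model_registry

    sam = sam_model_registry["vit_b"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image)  # expects an HxWx3 uint8 RGB array

    masks, scores, _ = predictor.predict(
        point_coords=np.array([point_xy]),
        point_labels=np.array([1]),  # 1 = foreground point prompt
        multimask_output=True,       # return several candidate masks
    )
    return masks[np.argmax(scores)]  # keep the highest-scoring mask
```

Bounding-box prompts work the same way via the `box=` argument of `predictor.predict`.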

The inspiration comes from the latest developments in Natural Language Processing and, particularly, from Large Language Models: given an ambiguous prompt, the user still expects a coherent response. Following the same line of thought, the authors wanted to create a model that would return a valid segmentation mask even when the prompt is ambiguous and could refer to multiple objects in an image. This reasoning led to the development of a pre-trained algorithm and a…
