Vision foundation models serve as the building blocks for more complex and task-specific models in computer vision. Researchers and developers often use them as starting points, adapting or extending them to address specific challenges or to optimize for particular applications.
Vision models are extended to video data for action recognition, video captioning, and anomaly detection in surveillance footage. Their adaptability and efficacy in handling various computer vision tasks make them integral to modern AI applications.
Researchers at Kyung Hee University address the limitations of one such vision model, SAM (Segment Anything Model). Their method tackles two practical image segmentation tasks: segment anything (SegAny) and segment everything (SegEvery). As the names suggest, SegAny uses a given prompt to segment a single object of interest in the image, whereas SegEvery segments all objects in the image.
SAM consists of a ViT-based image encoder and a prompt-guided mask decoder. The mask decoder generates fine-grained masks by adopting two-way attention, enabling efficient interaction between the image embedding and the prompt tokens. SegEvery, however, is not a promptable segmentation task in itself: SAM performs it by prompting the mask decoder with a dense grid of points and then filtering the resulting masks.
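For readers unfamiliar with this two-stage split, here is a minimal sketch of SAM's promptable interface using the official segment_anything package; the checkpoint path and input image are placeholders, and the authors' own code may differ.

```python
# Minimal sketch of SAM's two-stage design: a heavy ViT image encoder that
# runs once per image, and a lightweight prompt-guided mask decoder that can
# be queried repeatedly. Checkpoint path and image are placeholders.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # ViT-based image encoder + mask decoder
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a real RGB image
predictor.set_image(image)  # the expensive image encoder runs here, once per image

# The mask decoder then interacts with the cached image embedding through
# prompt tokens (a single foreground point in this example).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),  # 1 marks a foreground point
)
```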
The researchers identify why SegEvery in SAM is slow and propose object-aware box prompts. Using these prompts instead of the default grid-search point prompts significantly speeds up mask generation. They also show that the object-aware prompt sampling strategy is compatible with the distilled image encoders in MobileSAM, which further contributes to a unified framework for efficient SegAny and SegEvery.
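As a rough illustration of the difference, the sketch below (continuing the example above) contrasts SAM's default grid-point SegEvery with an object-aware variant that prompts only where a detector found objects; detect_boxes is a hypothetical helper, not the authors' code.

```python
# Dense grid-point prompting vs. object-aware box prompting (illustrative only).
# `detect_boxes` is a hypothetical detector wrapper returning (N, 4) XYXY boxes.
from segment_anything import SamAutomaticMaskGenerator

# Default SegEvery: a 32x32 grid -> 1024 point prompts per image,
# many of which land on background or on the same object.
grid_generator = SamAutomaticMaskGenerator(sam, points_per_side=32)
grid_masks = grid_generator.generate(image)

# Object-aware sampling: one box prompt per detected object (often a few dozen).
boxes = detect_boxes(image)  # hypothetical helper, e.g. wrapping a YOLO-style detector
object_masks = [
    predictor.predict(box=box, multimask_output=False)[0]
    for box in boxes
]
```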
Their approach hinges on determining whether an object is present in a certain region of the image. Object detection already solves this problem, but many of the generated bounding boxes overlap, so they require pre-filtering (for example, with non-maximum suppression) before they can be used as valid prompts.
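A minimal, self-contained sketch of such pre-filtering with torchvision's non-maximum suppression follows; the boxes, scores, and IoU threshold are illustrative values, not taken from the paper.

```python
# Pre-filter overlapping detector boxes with non-maximum suppression (NMS)
# before using them as box prompts. Values below are purely illustrative.
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 100., 100.],
                      [12., 12., 102., 102.],    # near-duplicate of the first box
                      [200., 50., 300., 150.]])  # a separate object
scores = torch.tensor([0.90, 0.80, 0.95])

keep = nms(boxes, scores, iou_threshold=0.7)  # indices of boxes to keep
filtered_boxes = boxes[keep]                  # these become the valid box prompts
```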
The challenge with a point prompt is that it requires predicting three output masks to resolve ambiguity, which then demands further mask filtering. A box prompt, in contrast, carries more detailed information and yields higher-quality masks with less ambiguity. This removes the need to predict three masks, making the box prompt the more efficient choice for SegEvery.
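Continuing the earlier sketch, this difference shows up directly in the predictor call: a point prompt returns three candidate masks that must be scored and filtered, while a box prompt can return a single confident mask. The snippet is illustrative, not the authors' implementation.

```python
# Point prompt: ambiguity forces three candidate masks plus a filtering step.
point_masks, point_scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,                       # returns 3 masks of different granularity
)
best_mask = point_masks[point_scores.argmax()]   # extra filtering to resolve ambiguity

# Box prompt: less ambiguous, so a single mask suffices.
box = np.array([250, 180, 400, 320])             # illustrative XYXY box
box_masks, _, _ = predictor.predict(box=box, multimask_output=False)
```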
In conclusion, their work on MobileSAMv2 speeds up SegEvery by introducing an innovative prompt sampling method within the prompt-guided mask decoder. By replacing the conventional grid-search approach with object-aware prompt sampling, the researchers notably improve SegEvery's efficiency without compromising overall performance.
Check out the Paper and Github. All credit for this research goes to the researchers of this project.
Arshad is an intern at MarktechPost. He is currently pursuing his Integrated MSc in Physics at the Indian Institute of Technology Kharagpur. He believes that understanding things at a fundamental level leads to new discoveries, which in turn drive technological advancement, and he is passionate about understanding nature with the help of tools like mathematical models, ML models, and AI.