Introducing MobileSAM: A Lightweight, Speedy Approach to Segmenting Anything
The role of Artificial Intelligence (AI) in digital image processing is no secret. With its broad range of applications, AI is fundamentally changing how we manipulate images and extract value from them. In this space, the Segment Anything Model (SAM) has made a significant impact. Developed by the Meta research team, SAM can extract any object of interest from its background, a foundational step in many advanced vision applications such as image editing.
Paper: https://huggingface.co/papers/2306.14289
GitHub: https://github.com/ChaoningZhang/MobileSAM

However, the full potential of SAM often goes unrealized because of its heavy computational requirements. On resource-constrained edge devices such as mobile phones, running SAM is challenging. To make SAM more accessible and practical in these settings, we are thrilled to introduce MobileSAM.
MobileSAM: SAM's Lightweight Version
Our objective with MobileSAM is to retain the excellent zero-shot transfer performance and the high versatility of SAM, but in a more compact, mobile-friendly form. We achieve this by replacing the heavy image encoder with a lighter version.
In the original SAM, the image encoder and mask decoder are tightly coupled, so training a smaller SAM from scratch would require optimizing both jointly, which is difficult with limited resources. This is where our approach diverges: we employ a concept we call 'decoupled distillation'.
Decoupled Distillation: The Core of MobileSAM
Decoupled distillation distills the knowledge from the ViT-H image encoder used in the original SAM into a lightweight image encoder, which automatically stays aligned with the original mask decoder. This not only speeds up training, which completes within a single day on a single GPU, but also yields a model more than 60 times smaller than the original SAM.
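To make the idea concrete, here is a minimal sketch of a decoupled distillation step in PyTorch. This is illustrative, not the official training script: `teacher` stands for SAM's frozen ViT-H image encoder and `student` for any lightweight encoder that outputs image embeddings of the same shape, so the original mask decoder can be reused unchanged.

```python
# Minimal sketch of one decoupled distillation step (illustrative only).
# Assumptions: `teacher` is SAM's frozen ViT-H image encoder, `student` is a
# lightweight encoder producing embeddings of the same shape, e.g. (B, 256, 64, 64).
import torch
import torch.nn.functional as F

def distill_step(teacher, student, optimizer, images):
    teacher.eval()
    with torch.no_grad():
        target = teacher(images)        # frozen teacher embeddings, no gradients
    pred = student(images)              # lightweight student embeddings
    loss = F.mse_loss(pred, target)     # match embeddings; the mask decoder is untouched
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the image embeddings are matched, the distillation needs no mask-level supervision or prompt sampling, which is what keeps the training so lightweight.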

In terms of performance, MobileSAM doesn't compromise: despite being considerably smaller, it performs on par with the original SAM. Inference takes roughly 10ms per image, 8ms for the image encoder and a mere 2ms for the mask decoder.
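Because MobileSAM keeps the original SAM pipeline and mask decoder, it can be used as a near drop-in replacement for SAM. Below is a usage sketch following the SAM-style API exposed in the GitHub repository above; the checkpoint path, image file, and prompt coordinates are placeholder assumptions.

```python
# Usage sketch (the "vit_t" model key and SAM-style API follow the MobileSAM
# repository's README; paths and the point prompt below are placeholders).
import numpy as np
import torch
import cv2
from mobile_sam import sam_model_registry, SamPredictor

device = "cuda" if torch.cuda.is_available() else "cpu"
mobile_sam = sam_model_registry["vit_t"](checkpoint="./weights/mobile_sam.pt")
mobile_sam.to(device=device)
mobile_sam.eval()

predictor = SamPredictor(mobile_sam)
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)                # runs the image encoder once per image
masks, scores, _ = predictor.predict(     # each prompt only reruns the mask decoder
    point_coords=np.array([[512, 512]]),  # a single foreground point prompt
    point_labels=np.array([1]),
)
```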
MobileSAM vs. FastSAM: A Comparative Look
Compared with the concurrent FastSAM, MobileSAM stands out as the more efficient model: it is 7 times smaller and 4 times faster, making it highly suitable for mobile applications.
In conclusion, MobileSAM retains the strong performance of the original SAM while being dramatically smaller and faster. This makes it well suited to resource-constrained environments and brings the power of SAM to a much wider range of applications. MobileSAM is set to be a game-changer in mobile image processing, ushering in a new era of efficiency and versatility.