In biological imaging, we often deal with the "invisible" – microscopic cells, bacterial colonies, or subtle patterns that evade easy detection. Traditional image analysis required painstaking tuning of algorithms or training models from scratch on limited data. Today, a new wave of foundation models promises to change that. Meta's Segment Anything Model (SAM) and related vision transformers (ViTs) such as DINO (self-distillation with no labels) are generalist vision models trained on massive datasets. They can segment or describe objects in images zero-shot, without prior task-specific training. Even more exciting, Grounding DINO extends this capability to open-vocabulary object detection – finding objects in an image based on text prompts. These advances foreshadow a future of bioimage analysis where AI can segment anything we need, even in complex experimental contexts, with minimal human supervision.
But how well do these tools work on biological images, and how are labs leveraging them? In this post, we explore recent applications of SAM and DINO in biology – from microscopic cell imaging to high-throughput plate assays – and how pairing them with clever filtering and domain knowledge can unlock new workflows. We'll also highlight case studies (academic and industry) and some of my own projects using these models, giving a glimpse into the future of automated bioimage analysis.
Meta's AI team introduced SAM in 2023 as a general promptable segmentation model that can delineate any object in an image given minimal prompts (points, boxes, etc.). Trained on over a billion masks, SAM boasts broad generalization. Researchers wasted no time testing SAM on biomedical data. Early studies show a mix of promise and limitations: SAM achieved impressive zero-shot segmentation on some medical images but struggled on others without fine-tuning. For example, out-of-the-box SAM can outline large, high-contrast structures (like organs or colonies) with minimal input, but its performance drops on subtle features (faint cell boundaries, noisy micrographs).
Adapting SAM to specific domains has been a key focus in 2024–2025. Ma et al. introduced MedSAM, fine-tuning SAM for medical imaging to bridge the gap between the natural and medical domains. This year, Archit et al. (from the Pape lab) released Segment Anything for Microscopy (μSAM), which fine-tuned SAM on light and electron microscopy data. The result was significantly improved segmentation quality on cell and tissue images compared to vanilla SAM. μSAM even ships as a Napari plugin for interactive segmentation and tracking, offering biologists a unified, user-friendly tool across microscopy modalities. These efforts demonstrate that with modest domain-specific tuning, we can harness SAM's foundation-model knowledge for accurate segmentation of biological structures that were previously "invisible" to generic models.
It's worth noting that SAM is primarily an instance segmentation model – it finds object masks but doesn't label what they are. In biology, we often care about which cell type or colony a segment represents. That's where models like DINO and Grounding DINO come in (more on that below). First, let's see SAM in action on a classic microbiology problem: bacterial colony counting.
Counting and analyzing bacterial colonies on agar plates is a fundamental task in microbiology – but one that historically required either manual counting or training specialized models (e.g., U-Nets or Mask R-CNNs) for segmentation. With SAM, we now have a ready-made model that can segment colonies without any training on microbiology-specific images. Researchers have begun exploring this. Ilić et al. tested SAM on the AGAR dataset (18,000 images of Petri dishes with various bacterial species) and found that SAM could indeed detect and mask most colonies in an image zero-shot. In one example, SAM produced around 190 segmentation masks on a single Petri dish image – effectively outlining each visible colony in different colors. The model even outputs metadata (e.g., bounding boxes, mask area) for each object, which we can export as a CSV for analysis. This showcase proved that, even without training, a general model like SAM can handle dense microbial images and pull out individual colony regions.
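Here's a rough sketch of what that looks like in code, assuming the open-source segment-anything package and a downloaded ViT-H checkpoint (file names are placeholders for your own data):

```python
# Minimal sketch: zero-shot colony segmentation with SAM's automatic mask
# generator, exporting per-mask metadata to CSV for downstream analysis.
import cv2
import pandas as pd
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

image = cv2.cvtColor(cv2.imread("plate.jpg"), cv2.COLOR_BGR2RGB)

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

masks = mask_generator.generate(image)  # one dict per detected object

# Each mask dict carries its own metadata: pixel area, bounding box (XYWH),
# and quality scores, which we flatten into a table.
records = [
    {
        "area": m["area"],
        "bbox_x": m["bbox"][0], "bbox_y": m["bbox"][1],
        "bbox_w": m["bbox"][2], "bbox_h": m["bbox"][3],
        "predicted_iou": m["predicted_iou"],
        "stability_score": m["stability_score"],
    }
    for m in masks
]
pd.DataFrame(records).to_csv("colony_masks.csv", index=False)
print(f"SAM produced {len(masks)} candidate masks")
```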
That said, SAM isn't perfect on these images. It may produce some spurious or partial masks (e.g., fragmenting one colony into multiple pieces or merging neighboring colonies into one mask). An important step is filtering SAM's output to discard poor segments. In their colony analysis pipeline, Sidiropoulos et al. used a pre-trained SAM (frozen) to cut out colonies and then filtered out bad masks to avoid artifacts. What counts as a "bad" mask? These are often irregular shapes or blurry segments that don't correspond to a single colony. For instance, SAM might grab a piece of writing on the plate or a colony cluster as one mask. By filtering based on properties like mask area, circularity, and solidity, one can keep only nicely rounded, reasonably sized masks (likely single colonies). In the mentioned study, they explicitly removed SAM masks that looked erroneous before using the rest for data augmentation.
Figure: Examples of SAM's colony segmentation on an agar plate, highlighting the need for filtering. Top: Good segmentations – SAM masks cleanly capturing individual bacterial colonies (mostly circular). Bottom: Bad segmentations – SAM masks that are incomplete, merged, or otherwise inaccurate. By filtering out irregular masks (e.g., non-circular shapes or fragments), one can automatically focus on the correct colony segments.
Our own experience aligns with this: using SAM + circularity filtering on multi-well plate images, we could automatically pick out the top candidate mask for each well (usually the colony or region of interest) while ignoring debris or artifacts. This technique drastically reduces false positives in high-throughput assays. In essence, SAM provides the initial "guess" for every object, and a simple rule-based filter (domain knowledge like "colonies are round") refines those guesses. It's a powerful combo of a general model with a domain-specific heuristic.
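As a minimal sketch of such a shape filter, continuing from the SAM output above (the thresholds are illustrative, not tuned values from any cited study):

```python
# Keep only round, solid, reasonably sized SAM masks
# (domain heuristic: "colonies are round").
import numpy as np
from skimage.measure import label, regionprops

def is_colony_like(mask, min_area=200, max_area=50_000,
                   min_circularity=0.7, min_solidity=0.9):
    """mask: boolean array from a SAM output dict ('segmentation' key)."""
    props = regionprops(label(mask.astype(np.uint8)))
    if len(props) != 1:          # fragmented mask -> reject
        return False
    p = props[0]
    if not (min_area <= p.area <= max_area):
        return False
    circularity = 4 * np.pi * p.area / (p.perimeter ** 2 + 1e-6)
    return circularity >= min_circularity and p.solidity >= min_solidity

# `masks` is the list returned by SamAutomaticMaskGenerator in the sketch above.
good_masks = [m for m in masks if is_colony_like(m["segmentation"])]
```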
One remarkable case study that ties together SAM and downstream detection is a recent pipeline for colony counting. Instead of training a colony detector on limited real images, the authors used SAM to generate synthetic training data. As illustrated in the figure below, the process was: take a handful of real plate images → use SAM to segment every colony → keep only the good colony masks → copy-paste those colony cutouts onto blank agar backgrounds to make new composite images (with known colony locations) → train a YOLOv8 detector on this synthetic dataset. They created thousands of labeled synthetic images in this manner, essentially for free. The only manual step was sorting out a few SAM mistakes (e.g., SAM struggled with very blurry colonies, which required manual removal).
Figure: SAM-driven augmentation pipeline for bacterial colony detection. Real plate images (left) are processed by SAM to extract colony masks (top center). After filtering out poor masks, the high-quality cutouts are saved to a database. New synthetic images are generated by pasting these colonies into empty agar images (bottom center). A YOLOv8 detector is first trained on this synthetic data (bottom) and then fine-tuned on a small set of real images. The result (right) is an accurate colony detection model without needing a large real dataset.
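The paste step itself is conceptually simple; here is a hypothetical sketch of the copy-paste idea (not the authors' actual code), producing a composite image plus YOLO-format labels:

```python
# Paste filtered colony cutouts onto an empty agar image at random positions,
# recording YOLO-style labels (class, x_center, y_center, w, h), all normalized.
import random

def paste_colonies(background, cutouts, n=30, rng=random.Random(0)):
    """background: HxWx3 uint8 agar image; cutouts: list of (crop, mask) pairs."""
    canvas = background.copy()
    labels = []
    H, W = canvas.shape[:2]
    for _ in range(n):
        crop, mask = rng.choice(cutouts)
        h, w = crop.shape[:2]
        if h >= H or w >= W:
            continue
        y0 = rng.randint(0, H - h)
        x0 = rng.randint(0, W - w)
        region = canvas[y0:y0 + h, x0:x0 + w]
        region[mask] = crop[mask]               # paste only the colony pixels
        labels.append((0, (x0 + w / 2) / W, (y0 + h / 2) / H, w / W, h / H))
    return canvas, labels
```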
The impact of this approach was striking. With only ~1000 real images plus synthetic augmentation, the YOLOv8 model achieved almost the same accuracy as training on a 5× larger real dataset: 1000 real images plus SAM-augmented synthetic data reached a mean average precision (mAP) just a few points shy of a model trained on 5241 real images. Even with as few as 50 real images, the synthetic data boosted detection performance well above models trained on 50 real images alone. This underscores how foundation models like SAM can compress the data requirements of new tasks – enabling good results with far less labeled data by generating additional training examples. For academia and startups alike, that means faster development of vision assays (less time photographing and hand-labeling thousands of examples, more time getting results).
Interestingly, while this pipeline focused on detecting all colonies for hygiene monitoring (just finding any growth), the authors note that they didn't yet exploit morphological features, like colony shape or color, in their detection task. Those features matter in applications like species identification or assessing colony health, and the authors suggest future work could incorporate analyses of colony pigmentation or texture. That's a perfect segue into how DINO and Grounding DINO can help label or filter segments by such visual traits.
While SAM excels at drawing masks, it doesn't tell you what those masks are or how they differ. DINO (and the updated DINOv2) is a vision transformer trained in a self-supervised manner to learn rich image features. In practical terms, DINO learns to encode images (or image patches) into a feature space where similar-looking things cluster together – all without any human labels. In 2023, researchers applied DINO to microscopy images and found it had a remarkable ability to learn cellular morphology without supervision. Doron et al. showed that DINO features of single-cell images were so meaningful that simple classifiers built on them could distinguish cell types and even subtle phenotypic differences nearly as well as highly engineered, task-specific features. In their words, "DINO, a vision-transformer based self-supervised algorithm, has a remarkable ability for learning rich representations of cellular morphology". These representations were biologically faithful – different DINO attention heads even aligned with different subcellular structures, like nucleus versus cytoplasm, across thousands of cells.
For bioimage analysis, DINO opens the door to unsupervised clustering and labeling of image data. Imagine you have segmented hundreds of bacterial colonies with SAM. Some have red pigmentation, others are white; some have smooth edges, others are more irregular. You might not have labels for these traits in advance. A model like DINO can embed each colony image into a vector such that colonies with similar appearance group together in feature space. Indeed, researchers have used self-supervised features to cluster cell images by morphology or drug response, revealing meaningful groupings without explicit labels. We can leverage this by taking SAM's unlabeled masks and using DINO embeddings to filter or organize them by visual traits. For example, one could automatically separate pigmented vs. non-pigmented colonies by clustering the mask crops in DINO feature space, then label those clusters post hoc (or simply measure their color directly if the trait is as simple as hue). The key benefit is reducing manual labor: instead of inspecting each mask, the model's learned features do the heavy lifting.
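A sketch of that idea, assuming DINOv2 loaded via torch.hub and a hypothetical list colony_crop_paths of colony cutouts saved from SAM masks:

```python
# Embed SAM colony crops with DINOv2 and cluster them by appearance.
import torch
from PIL import Image
from torchvision import transforms
from sklearn.cluster import KMeans

dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
dinov2.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(crops):
    batch = torch.stack([preprocess(c) for c in crops])
    return dinov2(batch)            # one feature vector per crop (CLS token)

# colony_crop_paths is a hypothetical list of crop files exported from SAM masks.
crops = [Image.open(p).convert("RGB") for p in colony_crop_paths]
features = embed(crops).numpy()

# Cluster colonies by visual appearance (e.g., pigmented vs. non-pigmented),
# then inspect or label each cluster post hoc.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(features)
```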
In our projects, we found this helpful for tasks like colony pigmentation detection. We combined SAM with DINO to identify which segmented colonies on a plate were producing a specific colored pigment (a common screening method in synthetic biology). SAM provided all colony regions, and DINO's representation helped spot the oddballs (e.g., only a subset of colonies had shifted from the standard dark purple toward a dark red hue, and those clustered apart from the cream-colored ones in feature space). This way, we could flag pigmented colonies automatically. Recent research supports this approach: self-supervised ViTs can capture even subtle differences like fluorescent reporter expression or drug-induced morphological changes. Essentially, foundation models can segment and characterize biological samples in a two-step, label-free workflow: SAM handles "where", while DINO handles "what's different".
An even more direct way to filter or label segments by their traits is to use language. Grounding DINO is a vision-language model that extends DINO's detection abilities to work with text queries. It's a zero-shot detector that can draw bounding boxes around objects described by a prompt, like "brown colony" or "clear zone" – even if it has never seen those exact items before. Combining it with SAM (dubbed "Grounded-SAM") allows an almost sci-fi workflow: type what you want to segment, and the models will find and mask it. For example, in everyday natural images, you could ask for "cat", and Grounding DINO will localize the cat while SAM segments it precisely. In the bio lab setting, one might prompt "white bacterial colony" versus "red bacterial colony" to have the system pick out colonies of each color and mask them separately, with no manual clicking required.
While this is cutting-edge, we are starting to see it in practice. A recent demo applied Grounding DINO + SAM to agricultural images to detect plant seedlings just by describing them, and it worked without any fine-tuning on those images. In our lab, we've experimented with Grounded-SAM for tasks like identifying plate contaminants. By providing text like "fungal colony" or "bubble artifact", the model can attempt to highlight regions that match those descriptions, which SAM then segments. Of course, the accuracy depends on how well the text prompt aligns with the model's learned knowledge. Describing visual traits (color, shape) tends to work better (e.g., "circular clear areas" might help find antibiotic inhibition zones), whereas very domain-specific terms may fail unless the model has seen similar data in training. Still, this approach of open-vocabulary segmentation is extremely powerful as a concept: it means we can issue high-level instructions to images and get structured outputs, with a drastic speed-up in annotation. As Piotr Skalski noted, combining Grounding DINO's detection with SAM's mask generation can turbocharge image annotation, saving huge amounts of time when building new datasets.
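A sketch of that text-prompted workflow, assuming the Hugging Face transformers port of Grounding DINO and the original segment-anything package (model names, thresholds, and file paths are illustrative):

```python
# Text prompt -> Grounding DINO boxes -> SAM masks.
import numpy as np
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
from segment_anything import sam_model_registry, SamPredictor

image = Image.open("plate.jpg").convert("RGB")
prompt = "a white bacterial colony."   # lowercase phrase ending with a period

processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
detector = AutoModelForZeroShotObjectDetection.from_pretrained(
    "IDEA-Research/grounding-dino-tiny")

inputs = processor(images=image, text=prompt, return_tensors="pt")
with torch.no_grad():
    outputs = detector(**inputs)
detections = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids,
    box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]],   # (height, width)
)[0]

# Prompt SAM with each detected box to get a precise mask.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(image))

for box in detections["boxes"]:        # xyxy boxes in image coordinates
    masks, scores, _ = predictor.predict(box=box.numpy(), multimask_output=False)
    print(f"mask of {int(masks[0].sum())} px for box {box.round().tolist()}")
```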
For filtering SAM's many masks, one can imagine Grounding DINO acting as a selector. Given SAM's pile of segments, we ask Grounding DINO (or a similar vision-language model) to "find the brown colony among these" and only keep masks overlapping with the detections. This is a form of multimodal filtering – using language as a criterion for vision, which is especially handy when the trait of interest is easier to describe than to quantify formally. We anticipate more tools and libraries building on Grounded-SAM pipelines since the "Grounded Segment Anything" GitHub project has already assembled demos that integrate Grounding DINO, SAM, and even Stable Diffusion for various "detect and segment anything" tasks. The tech is evolving rapidly, and savvy bioimage analysts can start harnessing it for tasks like colony phenotype sorting, locating specific structures in microscopy slides via text (e.g., "mitotic figure", "necrotic region"), and so on – all without training a new model for each task.
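A minimal sketch of that selection step, assuming `masks` is SAM's full output list and `detections` holds the Grounding DINO boxes from the previous sketch:

```python
# Keep only SAM masks whose pixels fall mostly inside a detection box
# returned for the prompt of interest (e.g., "brown colony").
def fraction_inside_box(mask, box):
    """Fraction of a boolean mask's pixels lying inside an xyxy box."""
    x0, y0, x1, y1 = [int(round(float(v))) for v in box]
    inside = mask[y0:y1, x0:x1].sum()
    return inside / max(int(mask.sum()), 1)

selected = [
    m for m in masks
    if any(fraction_inside_box(m["segmentation"], box) > 0.8
           for box in detections["boxes"])
]
```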
The ultimate promise of these models is AI-driven pipelines that automatically handle routine image analysis in the lab. We're already seeing prototypes of this. In high-throughput screening, images from multi-well plates can be analyzed on the fly: a smart microscope might quickly run a YOLO detector to see if any wells have "hits" (interesting growth or fluorescence), and if so, use SAM to segment the region and measure it, deciding automatically how to proceed in real time. Müller et al. describe exactly this scenario: acquire a quick low-resolution image, use a model to decide whether anything worth keeping is present, and, if not, skip saving detailed data for that well. Such decision-making could make experiments far more efficient (huge data savings when most wells are empty or negative). As they note, while YOLO gives a rough localization (bounding box), "biological objects are typically not square", so you can feed the detection into SAM to get an accurate shape mask for precise measurement. This kind of cascade (detection, then segmentation) is compelling and impactful. We used a similar idea for analyzing phage plaque assays (where viruses create clear spots in a bacterial lawn): first, a detection model spots the clear zones on a plate, and then SAM (prompted with those locations) segments the exact cleared area so we can measure its size. In classical analysis of antibiotic or phage inhibition zones, one would do this by thresholding or edge detection on the clear region. SAM, however, gives an instant, accurate outline of even irregularly shaped zones; there is no need to hand-tune intensity thresholds since the model "sees" the absence of bacteria and delineates it.
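Here's a sketch of that detect-then-segment cascade for a plaque assay, assuming a separately trained YOLOv8 detector (hypothetical weights plaque_yolo.pt) and a known pixel-to-millimeter calibration:

```python
# Rough clear-zone boxes from YOLO, precise masks and areas from SAM.
import cv2
from ultralytics import YOLO
from segment_anything import sam_model_registry, SamPredictor

image = cv2.cvtColor(cv2.imread("plaque_plate.jpg"), cv2.COLOR_BGR2RGB)

detector = YOLO("plaque_yolo.pt")            # hypothetical plaque detector
boxes = detector(image)[0].boxes.xyxy.cpu().numpy()

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)

MM_PER_PIXEL = 0.05                          # calibration, assumed known
for box in boxes:
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    area_mm2 = masks[0].sum() * MM_PER_PIXEL ** 2
    print(f"clear zone at {box.round(1)}: {area_mm2:.1f} mm^2")
```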
Incorporating these models into user-friendly tools is a current challenge, but progress is steady. For example, IAMSAM, a web tool for spatial transcriptomics, uses SAM under the hood to let researchers segment tissue regions by morphology and then correlate those regions with gene expression data. It allows semi-automatic selection of regions of interest (say, clusters of cells in a tissue) that would have taken far longer to draw manually. On the industry side, even major microscopy software vendors are embracing foundation models: ZEISS's ZEN software recently added "super-fast image annotations with SAM" in their cloud platform. This means biologists can leverage SAM's mask generation on their images through a vendor interface, speeding up annotation for further analysis. Startups and biotech companies are also watching closely. The ability to process an entire multi-well plate of images in seconds, segment all colonies or cell clusters, and flag the interesting ones (perhaps with an AI assistant like "Plato", as one lab automation platform describes) is incredibly appealing for scaling up experiments. The technology is catching up to these needs: efficient versions of SAM (like MobileSAM and EfficientSAM) are emerging that run on everyday hardware, making it feasible to deploy these models in lab settings without requiring an AI specialist for each use case.
We stand at a crossroads where general AI vision models trained on internet-scale data meet the specialized world of biological imaging. SAM has shown that a single model can generalize to segment cells, colonies, organoids – essentially anything – with zero or minimal retraining, given the right prompts. DINO and its variants demonstrate that even without labels, AI can learn the subtle visual signatures of biological phenomena, from the morphology of a single cell to the pigmentation of a bacterial colony. With Grounding DINO, we can even speak to our images in natural language, pulling out the information we care about ("find the GFP-expressing cell clumps" or "count the clear plaques").
The future of bioimage analysis will likely be a synergy of these foundation models with domain-specific knowledge. We'll see more pipelines where a general model handles the heavy lifting (segmentation, feature extraction, detection) and a lightweight custom layer handles the specifics (filtering by shape, linking to experimental metadata, etc.). For academics, this lowers the barrier to analyzing complex datasets – you can bootstrap analysis with SAM+DINO and focus your precious annotation time only on refining the outputs. For startups, it means faster development of imaging products (no need to collect a million examples of every new assay; a foundation model plus a few-shot fine-tune might suffice). For recruiters in biotech/AI, it's clear that familiarity with these tools is becoming a sought-after skill: they enable smaller teams to achieve what only big companies with massive training datasets could do a few years ago.
In closing, "segmenting the invisible" is no longer science fiction in biology. We can now segment things we couldn't even properly label before. A Petri dish teeming with microcolonies, a multiplexed cell image with dozens of phenotypes – these can be navigated and quantified with the help of SAM, DINO, and their cousins. As we continue segmenting the invisible, we make once-inscrutable data far more visible and actionable. The hope is that this leads to quicker discoveries (spotting the one odd colony that indicates a breakthrough mutant), more efficient workflows (automating tedious image scoring), and, ultimately, a deeper understanding of biological systems through the lens of cutting-edge AI. The invisible world is becoming a bit more visible, one segment at a time.
Kirillov et al. "Segment Anything." ICCV 2023 – Introduction of SAM.
Archit et al. "Segment Anything for Microscopy (μSAM)." Nature Methods 22, 579–591 (2025) – Fine-tuning SAM for bioimages.
Ilic et al. "Analysis of Microbiological Samples using SAM." (Conf. paper, 2023) – Applying SAM to segment colonies in AGAR dataset.
Kehl et al. "SAM-based Synthetic Data Augmentation for Colony Detection." Appl. Sci. 15(3):1260 (2025) – Pipeline using SAM to generate synthetic training data for colony counting.
Doron et al. "Unbiased single-cell morphology with self-supervised vision transformers." bioRxiv 2023 – DINO captures rich cell morphology features without labels.
Grounding DINO GitHub – Open-vocabulary object detection model (2023). See also the Grounded-SAM project combining text prompts, detection, and SAM segmentation.
Skalski, "Zero-Shot Image Annotation with Grounding DINO and SAM." Roboflow Blog, Apr 2023 – Tutorial on using Grounded DINO + SAM for faster dataset labeling.
Lee et al. "IAMSAM: Image-based analysis of molecular signatures using SAM." Genome Biology 25:290 (2024) – Tool for spatial transcriptomics using SAM to segment tissue regions.
Zoccoler et al. BiAPoL Blog, Feb 2024 – Tips on combining YOLO detectors with SAM for better segmentation in microscopy; introduces micro-SAM.
Zeiss ZEN 3.11 Release Notes (2023) – Mentions integration of SAM for image annotation in microscopy software.