VLMs are Coming for Image Tagging
Robotic tagging is at hand
At RDL we consistently find that excellent product tagging is essential for ecommerce discovery: It powers search, SEO, marketing, and more. But historically, quality tagging required costly human labeling or specialized ML teams with labeled training data. This has created a scarcity mindset around tags, where only the most crucial attributes, like product category, get tagged reliably. Retailers sometimes curate custom categories like “Lakehouse weekend” or “Michigan Wolverines Football Game”, but these are often small marketing collections, because creating new labels across large inventories of items requires cost-prohibitive human labor. Our new product solves this, and we’d like to share how it’s done.
Large VLMs are eating the world
In recent years, pretrained models like CLIP, along with fine-tunes like Fashion CLIP, have lowered the barrier to extracting useful insights from vision models. But CLIP still has trouble generalizing beyond the content it was tuned on, so the best way to get reliable results in a particular category is to do machine learning the old-fashioned way: Get a labeled training set and fine-tune on it.
As AI advances, vision language models (or VLMs: large models capable of interpreting images and text simultaneously) are nearing the point where they can go beyond CLIP and enable highly accurate tagging without fine-tuning. While it might seem absurd to run every item in inventory through a prompt, this is now practical even with highly performant models such as Gemini 2.0 Flash, which costs only $0.075 per million input tokens in batch mode, or about $1 per 10k images.
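As a sanity check on that estimate, here is the back-of-envelope arithmetic. The per-image token count is an assumption we back out from the quoted price, not an official figure:

```python
# Back-of-envelope cost for batch tagging with Gemini 2.0 Flash.
# TOKENS_PER_IMAGE (~1,300: image tiles plus a short prompt) is an
# assumed figure implied by the "$1 per 10k images" estimate.
PRICE_PER_MILLION_TOKENS = 0.075  # batch-mode input pricing
TOKENS_PER_IMAGE = 1_300
n_images = 10_000

cost = n_images * TOKENS_PER_IMAGE * PRICE_PER_MILLION_TOKENS / 1_000_000
print(f"${cost:.2f} to tag {n_images:,} images")  # roughly $1
```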
Case Study: Vibes Search
Imagine you’re a fashion-forward retailer, and the TikTok trend “Dark Academia” starts coming up in user searches, and you want to tag your inventory to ensure good hits for those searches. Here’s an explainer of the trend if you’re not already familiar:
We got great results for this just by using Rubber Ducky Labs with a simple prompt: “Is this piece of clothing appropriate for Dark Academia?”, plus a generic system prompt that asks the model to rate each item on a scale from 0 to 100. Here are some of our top results:
Not only is this zero-shot learning, but we barely had to adjust the prompt: Gemini already knows what Dark Academia is. This kind of performance is almost impossible to match with models like CLIP unless they are fine-tuned for the particular tags of interest.
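A minimal sketch of how such a scoring call can be wired up, assuming the google-generativeai Python SDK; the system prompt and the score-parsing helper below are illustrative stand-ins, not our production prompt:

```python
# Illustrative setup for per-item 0-100 scoring with a VLM.
import re

SYSTEM_PROMPT = (
    "You rate products for a retail catalog. "
    "Reply with a single integer from 0 to 100."
)
USER_PROMPT = "Is this piece of clothing appropriate for Dark Academia?"

def parse_score(reply: str) -> int:
    """Pull the 0-100 integer out of the model's text reply."""
    match = re.search(r"\d{1,3}", reply)
    return min(100, int(match.group())) if match else 0

# Usage (requires an API key; shown for shape only):
# import google.generativeai as genai
# model = genai.GenerativeModel("gemini-2.0-flash",
#                               system_instruction=SYSTEM_PROMPT)
# score = parse_score(model.generate_content([image, USER_PROMPT]).text)
```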
Quantitative Assessment: Business Casual
One of our partners, Seattle-based clothing rental company Armoire, has invested considerable resources into tagging their inventory. This provides an opportunity for us to test Rubber Ducky Labs’ performance against other approaches, using the human labels as ground truth. We took several scoring methods and assessed each as a classifier for business casual, using AUC as the outcome metric. In Armoire’s dataset, 34k items had labels, of which 42% were classified business casual.
To evaluate, we compared several different approaches:
Marqo-FashionSigLIP (a top-performing CLIP variant produced by the Marqo team, using cosine similarity of tag embeddings).
Gemini Flash 2.0 and Claude Sonnet 3.5 (rating each image 0-100 based on prompts).
Claude Sonnet 3.5 running a pairwise comparison tournament in which we calculated Elo after the model directly compared thousands of items.
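For reference, the Elo bookkeeping behind the tournament method is only a few lines. The K-factor, starting rating, and sample judgments below are illustrative assumptions, not the values from our run:

```python
def expected(r_a, r_b):
    # Probability item A beats item B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a, r_b, a_wins, k=32):
    # Shift both ratings toward the observed outcome.
    e_a = expected(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    return r_a + k * (s_a - e_a), r_b + k * (e_a - s_a)

items = ["tweed blazer", "graphic hoodie", "oxford shirt"]
# (winner, loser) pairs from the VLM's "which is more business casual?" calls
judgments = [("tweed blazer", "graphic hoodie"),
             ("oxford shirt", "graphic hoodie")]

ratings = {item: 1000.0 for item in items}
for winner, loser in judgments:
    ratings[winner], ratings[loser] = update(ratings[winner],
                                             ratings[loser], True)
```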
Our results show that out of the box, zero-shot VLMs dramatically outperform Marqo-FashionSigLIP (which itself beat Fashion CLIP 2 in Marqo’s evals). In AUC terms, FashionSigLIP was close to chance, whereas Gemini and Claude produced respectable scores of .729 and .726, respectively. The rating approach also worked better than the tournament method, which is fortunate because it is much simpler.
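AUC itself is easy to compute from 0-100 scores with no ML tooling, using the Mann-Whitney formulation: the probability that a randomly chosen positive item outscores a randomly chosen negative one. The toy scores and labels below are made up for illustration:

```python
def auc(scores, labels):
    # AUC = probability a random positive outranks a random negative,
    # with ties counted as half a win.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: VLM scores vs. human "business casual" labels (1 = yes).
scores = [92, 75, 64, 40, 15]
labels = [1, 1, 0, 1, 0]
```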
With Rubber Ducky Labs, VLMs closely match Armoire’s human labelers with very little human effort.
We can summarize the results with this table:
Gemini 2.0 Flash is both effective and cheap. Really cheap. Running every image through it costs roughly the same as creating SigLIP embeddings, and in exchange you get excellent out-of-the-box zero-shot tagging.
VLMs are best for precomputed labels
Embedding models like CLIP and SigLIP still have a niche, since they can be computed for new tags on the fly. But without good generalization, the results will be weak. In many cases, it’s easier to just precompute and get a good result. If zero-shot accuracy, simplicity, and low-cost batch processing are your priorities, VLMs dominate.
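For contrast, here is what the on-the-fly embedding approach looks like: score a new tag by cosine similarity between the tag’s text embedding and each image embedding. This is a pure-Python sketch with toy 3-d vectors standing in for real CLIP/SigLIP embeddings:

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Toy embeddings; real CLIP/SigLIP vectors have hundreds of dims.
tag_embedding = [0.9, 0.1, 0.0]  # e.g. text_encoder("business casual")
image_embeddings = {
    "blazer":   [0.8, 0.2, 0.1],
    "swimsuit": [0.1, 0.1, 0.9],
}

ranked = sorted(image_embeddings,
                key=lambda k: cosine(image_embeddings[k], tag_embedding),
                reverse=True)
```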
Who We Are
We’re a small and versatile team that has quickly architected and shipped multiple ML / AI products. Alexandra specializes in engineering, UX, and infrastructure, John specializes in ML / AI data science and algorithms, and we collaborate on architecture and product work. Combined, we have over two decades of experience across fashion AI, B2B ML tooling, and high-scale data infrastructure. Passionate about democratizing AI, we built Rubber Ducky Labs to make state-of-the-art technology accessible for every team.
Talk to Our Team
We’d love to show you a demo, compare notes on the industry, or integrate this into your workflow. When you’re ready, book a meeting with our team: