
VLMs are Coming for Image Tagging
As AI advances, vision language models (or VLMs, large models that interpret images and text together) are nearing the point where they can go beyond CLIP and deliver highly accurate tagging without fine-tuning. While it might seem absurd to run every inventory item through a prompt, this is now becoming practical even with highly capable models such as Gemini 2.0 Flash, which costs only $0.075 per million input tokens when run in batch mode, or about $1 per 10k images.
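To make the idea concrete, here is a minimal sketch of prompt-based tagging with the `google-genai` Python SDK. It is an illustration under stated assumptions, not the post's own pipeline: the prompt wording, file name, and tag format are all hypothetical, and a production run would go through the batch API to get the discounted rate.

```python
# A minimal sketch of single-image tagging with Gemini 2.0 Flash via the
# google-genai SDK. The prompt, image path, and tag vocabulary below are
# illustrative assumptions, not from the original post.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

# Read a local product photo as raw bytes (hypothetical example file).
with open("product.jpg", "rb") as f:
    image_bytes = f.read()

prompt = (
    "Tag this product photo. Return a comma-separated list of tags "
    "covering category, color, material, and style."
)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        prompt,
    ],
)

# Split the free-form response into individual tags.
tags = [t.strip() for t in response.text.split(",")]
print(tags)
```

For a real catalog you would likely want the model to choose from a fixed tag vocabulary rather than emit free-form text; Gemini supports structured output (a JSON response schema) for exactly this, which makes the results easier to validate and index.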