The Generative AI Playground - Part 4: Visual search

Daphné Vermeiren

We’ve been busy experimenting in our Generative AI Playground again! After exploring use cases for multi-agent GPT systems, function calling, and report generation, we wanted to experiment with another use case: AI-enhanced (visual) search. In this article, we share an exciting demo in which we combined Google’s newest multimodal AI model, Gemini, with an Elastic database (thanks to our partner, Elk Factory). Let’s dive into part 4: visual search.

Traditional search systems

Platforms like Google Images have been around for years, allowing users to search for images based on keywords. Think of your favorite webshop, where your preferred item (hopefully) pops up when you type a certain term into the search bar. Typically, setting up such search functionality involves manually tagging each image or product with relevant keywords and metadata, which is a time-consuming and sometimes inaccurate process.

So, what if you could search for images without these predefined tags? What if you could simply describe what you’re looking for in natural language, and the system could understand and retrieve exactly that, without any human intervention?

Our approach: beyond manual tagging

Imagine: your company has released thousands of products in the past decade. The only problem: you don’t have a structured database of all product packaging and its specific information. And you need it. Traditional methods would falter here, because creating such a database by hand would simply take too much time and effort. Our approach turns this challenge into a largely automated process.

For demonstration purposes, we decided to use a database of 16,000 clothing product images (shirts, hoodies, blouses, …). For these 16,000 images, we set out to automate the tagging process with Gemini’s multimodal capabilities:

  1. Generating structured descriptions: Gemini extracts structured data like specific fields and values based on the image. Think of fields like clothing type, color, pattern, text, …
  2. Generating textual descriptions: Gemini also generates a natural language description of the image.
  3. Data merging: These two layers of data are then combined into a JSON file for each image, creating a rich, searchable dataset.
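
The merging step above can be sketched in a few lines of Python. This is only an illustration under assumptions: the field names (`clothing_type`, `color`, `pattern`, `text`), the `image_id` key, and the sample values are hypothetical, and the two inputs stand in for what Gemini would return for one image.

```python
import json


def build_document(image_id, structured, description):
    """Merge structured fields and a free-text description into one record."""
    # Drop empty fields so missing tags do not end up in the search index.
    doc = {k: v for k, v in structured.items() if v not in (None, "", [])}
    doc["image_id"] = image_id
    doc["description"] = description.strip()
    return doc


# Hypothetical model outputs for a single image (not real demo data).
structured = {
    "clothing_type": "t-shirt",
    "color": "red",
    "pattern": "striped",
    "text": None,  # no text detected on this item
}
description = "A red striped t-shirt with short sleeves. "

doc = build_document("img_00001", structured, description)
print(json.dumps(doc, indent=2))
```

Each resulting record holds both layers side by side, so the same document can serve keyword-style filtering on the fields and semantic search on the description.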

The result? A robust visual database that resides in Elastic, ready for querying, with minimal human effort.

Powerful search with Elastic and Gemini

Once the data is indexed in Elastic, the real magic begins. For this demo, we wanted to explore the capabilities of both Elastic and Gemini when it comes to search. So, we tested two approaches:

For our first approach, we use Elastic’s ELSER v2 model to create embeddings from the textual description, allowing for nuanced search capabilities. This system embeds the user’s search query to find the closest matches based on the image description.

For the second approach, we send the user query to Gemini, which transforms it into structured search fields. These fields are then used to retrieve the relevant images. In our demo, both approaches returned accurate results, and they did so quickly.
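
As a rough illustration of the two approaches, the request bodies sent to Elastic could look like the sketch below. It only builds the query dictionaries and does not contact a cluster. The field names (`description_tokens`, `color`, and so on) and the keyword mapping that the `term` filters rely on are assumptions for illustration, not the schema used in the demo. The `text_expansion` query and the `.elser_model_2` model ID are what Elasticsearch provides for ELSER v2, but whether the demo used exactly this query type is not stated.

```python
def semantic_query(user_text, field="description_tokens", model_id=".elser_model_2"):
    """Approach 1: let ELSER expand the user's text and match it against
    the tokens stored for each image description (field name is assumed)."""
    return {
        "query": {
            "text_expansion": {
                field: {"model_id": model_id, "model_text": user_text}
            }
        }
    }


def structured_query(fields):
    """Approach 2: filter on structured fields, e.g. those an LLM extracted
    from the user's request. Assumes the fields are mapped as keywords."""
    filters = [{"term": {name: value}} for name, value in fields.items()]
    return {"query": {"bool": {"filter": filters}}}


q1 = semantic_query("red striped t-shirt with large text on the front")
q2 = structured_query({"clothing_type": "t-shirt", "color": "red", "pattern": "striped"})
print(q1)
print(q2)
```

The first query ranks results by semantic closeness to the description, while the second only returns items whose fields match exactly, which is why the two can behave differently on vague requests.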

The result is a system where a user can describe what they want in natural language with a high level of detail, specifying parameters such as type, color, text size, actual text, pattern, logo, and so on. The system, in turn, accurately returns only the relevant items. Think of queries like “I am looking for a red striped t-shirt with large text on the front.”
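
To make the step from a natural-language request to structured fields more concrete, here is a minimal sketch. The prompt wording, the list of allowed fields, and the sample reply are all hypothetical. The actual call to Gemini is left out and replaced by a hard-coded reply string, so the example only shows how a request could be framed and how the answer could be parsed defensively.

```python
import json

# Hypothetical field list, based on the parameters mentioned in the article.
ALLOWED_FIELDS = {"clothing_type", "color", "pattern", "text", "text_size", "logo"}


def build_prompt(user_query):
    """Ask the model for a JSON object restricted to the known fields."""
    keys = ", ".join(sorted(ALLOWED_FIELDS))
    return (
        "Extract search fields from the request below. "
        f"Reply with a JSON object using only these keys: {keys}. "
        "Omit keys the user did not mention.\n\n"
        f"Request: {user_query}"
    )


def parse_reply(reply):
    """Pull the JSON object out of the model's reply and keep known fields only."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(reply[start:end + 1])
    return {k: v for k, v in data.items() if k in ALLOWED_FIELDS}


prompt = build_prompt("I am looking for a red striped t-shirt with large text on the front.")

# Stand-in for a model reply; a real reply may differ in shape and content.
fake_reply = (
    'Here are the fields: {"clothing_type": "t-shirt", "color": "red", '
    '"pattern": "striped", "text_size": "large", "mood": "casual"}'
)
fields = parse_reply(fake_reply)
print(fields)
```

Filtering the reply against a fixed set of keys keeps unexpected fields (such as `mood` in the stand-in reply) out of the search query.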

As a bonus, these search queries work seamlessly in a multilingual environment. For example, one could just as easily say, "Cerco una maglietta a righe rosse con un testo grande sul davanti" and get the same results!

[Image: Visual search application]

Unlimited possibilities

This experiment is not just about showcasing AI’s capability to enhance image search. It’s about showing businesses they can now unlock historical data, provide richer customer experiences, and streamline their operations, all through the power of AI. To show the possibilities, we translated this experiment into some other use cases. Of course, this is a non-exhaustive list of what we could create:

  1. Retail and e-commerce: AI-enhanced visual search can transform how customers find and purchase products. If we make this use case even more multimodal by adding image search, shoppers could upload images of items they like (from their Pinterest board, for example) and the visual search tool could find similar products available in store.
  2. Automotive industry: Visual search can streamline the process of identifying automotive parts, making it easier for mechanics and service centers to find the right parts. Users can simply take a photo of a car part, and the system can use visual search to identify it accurately and check for compatibility, availability, and specifications.
  3. Digital archives and libraries: For digital archives, museums, and libraries, visual search can enhance the way users find historical documents, artworks, and other archived materials. Researchers can upload an image of a part of a document or artwork to find similar or related items, facilitating better access to resources and enriching academic research.


Tackling visual search this way changes the way we set up search systems for e-commerce and other use cases. By using multimodal LLMs to our advantage, we can skip a lot of manual work and set up a flow that is even more powerful than it was before. As we continue exploring different technologies in the Generative AI Playground, we're excited about the endless possibilities these technologies bring to our lives. On to the next!
