Building a Search Engine

Technology choices we can make while building a conversational, multimodal search experience

TL;DR

Search has become an integral part of our daily lives. With major players like Google and Bing, as well as mission-oriented options like Ecosia and DuckDuckGo, a variety of search engines have tackled the challenge of navigating the vast expanse of the internet.

However, these search engines are not built to solve search over a smaller, customizable dataset, such as an e-commerce catalog. In such cases, we have access to product and user data, which can be used to derive personalized search results. In this article, we shall explore the technology choices we can make when building a multimodal, conversational search experience.

Traditionally, search is thought of as typing something into a text box and expecting results. Several applications have invented their own query languages to go beyond what plain full-text search can achieve. The rise of LLMs has promoted search experiences like Perplexity.AI (recently, OpenAI also released their own version), which can tolerate imprecise inputs and provide context that encourages natural exploration.

The APIs of these LLMs are now capable of Function Calling, which allows us to expose custom business APIs to the LLMs and provide a complete in-chat experience.
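For example, with the OpenAI Python SDK, exposing a catalog search to the model is a matter of describing it as a tool. This is a minimal sketch: the find function, its schema fields, and the model name are illustrative assumptions, not a prescribed setup.

from openai import OpenAI

client = OpenAI()

# Hypothetical catalog-search tool exposed to the model; the fields
# (query, category, tags) are illustrative, not a fixed contract.
tools = [{
    "type": "function",
    "function": {
        "name": "find",
        "description": "Search the product catalog for matching products",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Free-text search query"},
                "category": {"type": "string", "description": "Product category filter"},
                "tags": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",  # any tool-calling-capable model works here
    messages=[{"role": "user", "content": "Shirts to wear for a Hawaii vacation"}],
    tools=tools,
)

# If the model decided to call find, the arguments arrive as a JSON string.
tool_calls = response.choices[0].message.tool_calls

When the model decides to call find, we run the actual search on our backend and return the results in a tool message, letting the LLM compose the final in-chat response.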

Words alone are often insufficient to describe something, which is where adding an image input into the mix can vastly enhance UX and reduce frustration. We can take it a step further by using live audio and video capabilities to remove text input completely from the equation. For the sake of this discussion, we are limiting ourselves to image inputs only.

Technology Choices

Based on our requirements above, we have a few technology choices to make:

LLM Provider and APIs

We have a lot of choices for LLM providers in today’s market, such as OpenAI, Anthropic, and Google, along with a growing set of hosts for open-weight models.

A note on interoperability: While OpenAI-compatible APIs and libraries like Vercel’s AI SDK can make it seem like switching between LLMs is easy, it often isn’t. I’ve seen GPT-3.5 and GPT-4 behave vastly differently on the same prompt, likely due to differences in how the models are trained and fine-tuned. This is definitely something to watch out for.

We can choose the LLM and provider best suited for the task at hand using generic criteria such as output quality on our domain, support for multimodal inputs and function calling, latency, cost, context window, and data privacy guarantees.

Data Preparation and Storage

We need to make our product data accessible for both natural language and image search. For this task, we can leverage a multimodal embedding model such as CLIP, which places text and images in a shared vector space, to generate text and image embeddings that can then be used during the search.
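As a rough sketch of what that looks like with Hugging Face’s transformers library (the model checkpoint and inputs are illustrative):

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Embed a product title; because CLIP shares one vector space across
# modalities, an image query can be matched against text embeddings and vice versa.
text_inputs = processor(text=["linen shirt for beach vacations"], return_tensors="pt", padding=True)
text_embedding = model.get_text_features(**text_inputs)  # shape: (1, 512)

# Embed a product photo the same way.
image_inputs = processor(images=Image.open("product.jpg"), return_tensors="pt")
image_embedding = model.get_image_features(**image_inputs)  # shape: (1, 512)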

There are many vector databases like Milvus and Qdrant that tackle the problem of returning the nearest vectors given an input vector. Since we also want our search to include conventional filters (such as product category, age group, etc.), a plain vector database will not suit our use case. We are addressing the Hybrid Search problem, where vector search and conventional search are used together. Fortunately, many conventional databases now offer vector search extensions, such as pgvector for PostgreSQL and the built-in vector capabilities of Elasticsearch and Redis.
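For instance, assuming PostgreSQL with the pgvector extension, the product table and its vector index might look like the sketch below; the schema is illustrative, with columns mirroring the query we will run later.

create extension if not exists vector;

create table products (
    id bigserial primary key,
    title text,
    price numeric,
    image_url text,
    gender text,
    category text,
    tags text[],
    embedding vector(512)  -- dimension must match the embedding model (512 for CLIP ViT-B/32)
);

-- approximate nearest-neighbour index so vector search stays fast at catalog scale
create index on products using hnsw (embedding vector_cosine_ops);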

We can evaluate the database choices on a few key points, such as the ability to combine vector search with conventional filters in a single query, indexing performance at our catalog’s scale, and operational overhead.

Backend and UI Frameworks

Python is the go-to language for all ML-related activities due to the abundance of ML libraries and plug-and-play models, thanks to Hugging Face Pipelines. LangChain also provides primitives and extensions to wrap around LLMs, but it comes with a reputation for being unnecessarily complex.
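As an illustration of that plug-and-play quality, a working model is only a couple of lines away (the task and labels here are arbitrary examples):

from transformers import pipeline

# Downloads a sensible default model for the task on first use.
classifier = pipeline("zero-shot-classification")
classifier(
    "Shirts to wear for a Hawaii vacation",
    candidate_labels=["shirts", "shoes", "accessories"],
)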

On the other side, JavaScript is catching up with ML libraries like TensorFlow.js and Transformers.js, Xenova’s port of the Hugging Face pipelines. Vercel’s AI SDK (npm i ai) is also a thin abstraction over multiple LLM providers, offering both backend streaming patterns and frontend hooks for popular frameworks like React, Svelte, and Vue.

At the end of the day, technology choices should be made to provide the best user experience, with frictionless developer experience coming second.

Now that we’ve made technology choices, prepared the data, stored it, and are ready to take user input, let’s say the user is querying for “Shirts to wear for a Hawaii vacation.” The LLM, detecting the user’s intent, responds with a message indicating that the user should wait while it searches for those shirts, and invokes our find function.
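Assuming the hypothetical find schema sketched earlier, the call’s arguments might look like this (the values are illustrative):

{
  "query": "shirts for a Hawaii vacation",
  "category": "shirts",
  "tags": ["vacation"]
}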

Now, based on the context in the request (possibly from a cookie), we know the user ID, which allows us to infer attributes such as the user’s gender and their purchase history.

We can generate a text embedding and query the product database as follows:

-- distance() stands in for your database's vector-distance operator
-- (with pgvector it would be embedding <=> queryEmbedding for cosine distance,
-- and an array-valued tags column would be filtered with 'vacation' = any(tags))
select title, price, image_url, distance(embedding, queryEmbedding) as distance
from products
where gender = 'male'
  and category = 'shirts'
  and tags in ('vacation')
-- smaller distance means more similar, so order ascending to get the nearest first
order by distance asc
limit 100
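Tying this together in Python, assuming PostgreSQL with pgvector and the psycopg driver (the connection string and the embed_text helper are illustrative):

import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=catalog")
register_vector(conn)  # lets us pass vectors directly as query parameters

# embed_text is a hypothetical helper wrapping the CLIP text encoder from earlier.
query_embedding = embed_text("shirts for a Hawaii vacation")

rows = conn.execute(
    """
    select title, price, image_url, embedding <=> %s as distance
    from products
    where gender = %s and category = %s and 'vacation' = any(tags)
    order by distance asc
    limit 100
    """,
    (query_embedding, "male", "shirts"),
).fetchall()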

Let’s also assume that we have two additional scoring algorithms: one that computes the similarity between the user-provided image and each product’s images, and one that estimates the likelihood of the user purchasing a given product based on their history.

Running these functions on our DB results, we end up with three scores for each of the filtered products. We need to combine these scores and re-rank the results to surface the most appropriate products for the user. Although these scores are numerical and might fall within the same range, they cannot be combined arithmetically, since they come from different sources and have different meanings. For example, a likelihood score of 0.5 is vastly different from an embedding distance of 0.5. Even among different embedding algorithms, the same score can represent different levels of similarity.

Rank Fusion

In order to solve this problem, we need to convert these scores to a common scale. Converting scores to ranks is a common method employed in these scenarios: we order the result set by each score so that the best entry appears first, and then take the position of each entry as its rank. In our example, we will create three sets of products ordered by embedding distance, image similarity, and likelihood, respectively.

For every product in the set, we can now assign a score as follows:

\mathrm{score}(p) = \sum_{i \in \mathrm{RankSets}} \frac{w_i}{b_i + \mathrm{rank}_i(p)}

This expression gives the most weight to the highest ranks, with rapidly decreasing weight for later ranks. The constant b_i dampens the effect of the very top ranks contributing disproportionate weight, whereas the constant w_i acts as the weight for rank set i. These constants have to be set according to the business requirements. This technique is called Reciprocal Rank Fusion and is one of the basic techniques for combining ranks. There are other, more sophisticated methods that can take user feedback into consideration as well.
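A compact sketch of this fusion in Python; the weights and damping constants are arbitrary placeholders to be tuned against business requirements:

def reciprocal_rank_fusion(rank_sets, weights, dampings):
    """Combine several ranked lists of product IDs into one fused score per product.

    rank_sets: lists ordered best-first (e.g. by embedding distance, image
    similarity, and purchase likelihood). weights/dampings supply one w_i and
    b_i per rank set.
    """
    scores = {}
    for ranked_ids, w, b in zip(rank_sets, weights, dampings):
        for rank, product_id in enumerate(ranked_ids, start=1):
            scores[product_id] = scores.get(product_id, 0.0) + w / (b + rank)
    return scores

# Example: three rank sets over the same candidate products.
fused = reciprocal_rank_fusion(
    rank_sets=[["p3", "p1", "p2"], ["p1", "p3", "p2"], ["p2", "p1", "p3"]],
    weights=[1.0, 1.0, 0.5],
    dampings=[60, 60, 60],  # 60 is a commonly used default for the RRF constant
)
reranked = sorted(fused, key=fused.get, reverse=True)  # best products first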

After the rank fusion, we get a single score for each product that we can use to sort the product set and present it to the user. This product set reflects both the conventional and multimodal search inputs, with the personalization factor added in, thus enhancing the UX.

Feedback Loop and Maintenance

No system will be perfect on day one. We need to keep an eye on our conversational search experience and measure its impact, for example, using conversion metrics. We can also gather explicit user feedback by asking them to vote on LLM responses and recommendations. Additionally, we should monitor for possible abuse of the conversational experience and detect attacks or jailbreaks by randomly auditing PII-removed user interactions.
