Current state in ecommerce hybrid retrieval
In the past 2–3 years, vector search has gained significant attention in ecommerce, primarily because it promises to solve a major challenge in keyword-based search known as the language gap or, in more technical terms, vocabulary mismatch. The language gap refers to the disconnect between the words a user types to formulate a search query and the terms used in product titles and descriptions, which often leads to poor results or none at all. At the same time, vector retrieval can infer semantic similarities between queries and documents, making it an effective re-ranking layer in the later stages of search to enhance the relevance of the results.
After stripping away the inflated marketing claims from vector search vendors and so-called domain experts, which our customers and users often echo, we identified a mere two areas of genuine optimization:
- Improving recall
- Enhancing precision, primarily through re-ranking
While this may seem obvious, defining these two areas broadly allows us to challenge current solutions with alternative approaches or technologies later on. We anticipate this will spark further discussion within the community, which we welcome. Still, these are the only two aspects that are both technically and economically verifiable. In our customers’ business language, these core areas translate to:
- Finding more (still relevant) results for a given set of queries (improving recall)
- Ranking more relevant items higher in the results list (enhancing precision)
However, the feedback we received from customers who tried to productize vector retrieval essentially boils down to one striking problem: vector retrieval always returns the k closest vectors to an encoded query vector. This is the case even if those vectors are completely irrelevant.
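To make this failure mode concrete, here is a minimal sketch of brute-force nearest-neighbour search over a toy index. The three-dimensional vectors, the product names, and the `top_k` helper are invented for illustration and do not come from any customer system; real embeddings have hundreds of dimensions and are usually searched with approximate indexes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k):
    """Brute-force k-nearest-neighbour search.

    Returns k (score, doc_id) pairs whenever the index holds at least k
    items, no matter how low the scores are.
    """
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return scored[:k]


# Hypothetical product embeddings (assumption: 3 dimensions for readability).
index = {
    "red running shoes": [0.9, 0.1, 0.0],
    "blue running shoes": [0.8, 0.2, 0.1],
    "trail running socks": [0.7, 0.3, 0.2],
}

# A query vector pointing in a very different direction from every product.
unrelated_query = [0.0, 0.1, 0.9]

results = top_k(unrelated_query, index, k=2)
for score, doc_id in results:
    print(f"{score:.2f}  {doc_id}")
```

The search still hands back two "best" matches, all with low cosine similarity. Nothing in the top-k contract itself signals that these hits are irrelevant.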
So the open task is clear: Identify or, better yet, mitigate these situations to significantly increase trust in vector retrieval and enable more effective use of the strengths of both vector and keyword retrieval methods.
Hybrid search: the silver bullet?
It is crucial to acknowledge a recent consensus within the search community: vector search is not a one-size-fits-all solution. While it performs exceptionally well in some query types and retrieval scenarios, it can also fail dramatically in others. This has led to the emergence of hybrid search approaches that combine the strengths of both vector and keyword-based retrieval methods. Although this approach is appealing, our real-world observations highlight a key issue:
Many hybrid methods merely average the results of both techniques, undermining the goal of leveraging the best of both. Consequently, the overall value often falls short of the combined potential of the two strategies.
But that is far from the only questionable issue in current hybrid approaches.
Low maturity level of hybrid search approaches
For an ecommerce shop, there are essentially two possible paths to choose from:
- Switch to a new “sophisticated” vector search vendor that has bolted on rudimentary keyword search support.
- Stick with an established keyword-based vendor or solution that has recently bolted on vector search capabilities.
In either case, there’s a noticeable gap in quality and maturity between the two retrieval methods and solutions, especially when applied to the retail and ecommerce domain.
1. Lack of essential business features
While working closely with our customers, we’ve noticed that many hybrid search approaches lack critical day-to-day business features. Essential functionalities like exclude-lists, results curation, and multi-objective optimization are frequently missing, despite being vital for retailers who prioritize not just relevance or engagement, but also margins and customer lifetime value (CLTV).
For example, our customers regularly receive trademark enforcement inquiries, sometimes multiple times a week. Brand or manufacturer lawyers may prohibit retailers from displaying results for queries containing trademarked terms. A case in point is the trademarked phrase “Tour de France”; during the event, many users searched for cycling-related products, but most retailers were forbidden from showing results for these queries. Additionally, established brands typically restrict the display of competing products close to their own, and non-compliance can lead to legal repercussions.
Retailers also need the ability to manage or curate search results. For broad queries, there’s often a need to guide customers through the results to avoid overwhelming them with too many options. In branded searches, retailers can generate additional revenue through promotional placements and Retail Media Ads.
Search result pages are key revenue and margin drivers for retailers. Effectively managing this multi-objective optimization challenge—which includes stock clearance, promoting bestsellers, highlighting high-margin items, and featuring mannequin products—is crucial for their success and is commonly known as Searchandising. Unfortunately, these essential business features are frequently missing in new hybrid search solutions or are added as an afterthought, leading to less than optimal outcomes.
2. Added dependencies and complexity
Hybrid search scenarios introduce many additional dependencies to the retrieval system. In addition to indexing and searching tokens or words, there’s also a need to embed and retrieve vectors, which can significantly slow down the indexing process—often extending it from minutes to hours. This added complexity creates challenges, particularly in near real-time (NRT) search scenarios common in ecommerce and retail, where factors such as new product listings, stock levels, and fluctuating business metrics (like demand) are crucial.
Moreover, fine-tuning and optimizing the models that embed data into vector space is more challenging than it initially appears, especially over time. These systems regularly experience a notable decline in performance as time goes on, adversely affecting business KPIs. This issue is amplified when the decision to implement such a system is based on a single initial experiment or test 🤦. One contributing factor to this decline is the use of information-rich multimodal embeddings, which can lead to a phenomenon known as the Modality Gap or Contrastive Gap. In such cases, embeddings do not form a consistent and coherent space but instead form distinct subspaces, despite being trained and fine-tuned to avoid this outcome.
3. Significant increase in cost per query
Due to the added complexity and the substantial increase in computational demands of hybrid systems, many ecommerce shops experience costs 10 to 100 times higher per query. Our aggregated data indicates that keyword-based searches usually cost 0.05 cents or less per query. In hybrid search scenarios, however, we’ve observed costs ranging from 0.25 to 2 cents per query, depending on the vendor or system used. Although these amounts may seem small individually, they quickly accumulate when processing millions of queries each month, significantly squeezing already tight profit margins. This cost escalation comes at a time when economic challenges make efficiency one of the most crucial factors for market sustainability.
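To see how quickly these per-query amounts add up, the arithmetic can be sketched as follows. The per-query prices are the figures quoted above; the monthly volume of 5 million queries and the function name are assumptions chosen purely for illustration.

```python
def monthly_cost_usd(queries_per_month, cents_per_query):
    """Convert a per-query price in US cents into a monthly bill in dollars."""
    return queries_per_month * cents_per_query / 100


# Assumption: a shop handling 5 million searches per month.
QUERIES_PER_MONTH = 5_000_000

keyword_cost = monthly_cost_usd(QUERIES_PER_MONTH, 0.05)  # upper bound quoted above
hybrid_low = monthly_cost_usd(QUERIES_PER_MONTH, 0.25)    # low end of hybrid range
hybrid_high = monthly_cost_usd(QUERIES_PER_MONTH, 2.0)    # high end of hybrid range

print(f"keyword: up to ${keyword_cost:,.0f} per month")
print(f"hybrid:  ${hybrid_low:,.0f} to ${hybrid_high:,.0f} per month")
```

At the assumed volume, this works out to about $2,500 per month for keyword search versus $12,500 to $100,000 for hybrid search. Measured against the 0.05-cent ceiling, that is a factor of 5 to 40; keyword setups that cost less than the ceiling push the ratio higher.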
4. BM25, tokenization, and vocabulary size
In many hybrid systems, keyword retrieval often relies solely on BM25, which is not ideal for ecommerce contexts. Critical aspects like model numbers, price ranges, sizes, dimensions, negations, lemmatization, field weights, and bi- or trigram matching are frequently neglected.
Additionally, unlike everyday language, ecommerce data is characterized by highly specific, low-frequency vocabulary. Examples of terms that you’d be hard-pressed to find in everyday language but that appear disproportionately often in ecommerce:
- brand names
- model names
- marketing colors
- sizes
- etc.
The issue is that most embedding models used for vector retrieval struggle to capture this kind of terminology. Why is it such a challenge for vector search to include this type of mundane information? Vector search has a smaller vocabulary due to constrained memory and efficiency limitations. It’s no fault of vector search, but rather a question of using the right tool for the job. More on that later.
While we currently manage vocabulary sizes ranging from 1.5 million to 18 million unique entries, most embedding models (vector search solutions) that are performant enough for ecommerce scenarios top out at around 100,000 to 250,000 unique entries. Words or tokens not included in these vocabularies are approximated by their character fragments (n-grams) and their corresponding probabilities. As with any form of approximation, this can compromise result precision.
OpenAI Tokenizer: https://platform.openai.com/tokenizer?view=bpe
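The fragmentation effect described above can be illustrated with a toy subword tokenizer. This is a simplified greedy longest-match split over a tiny hand-written vocabulary, not the algorithm behind the OpenAI tokenizer or any specific embedding model; real tokenizers such as BPE or WordPiece learn their vocabularies from data. The vocabulary and the example terms are assumptions.

```python
def subword_tokenize(word, vocab):
    """Split a word into the longest known pieces, left to right.

    Anything not covered by the vocabulary degrades to single characters.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or one char long.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical vocabulary: everyday words are covered, model numbers are not.
vocab = {"running", "shoes", "air", "max"}

print(subword_tokenize("running", vocab))
print(subword_tokenize("airmax270", vocab))
```

An everyday word stays intact as one token, while a model name is broken into several fragments that carry little meaning on their own. This is the approximation that can cost precision on brand and model queries.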
5. Domain adaptation
Vector search, or dense retrieval, can significantly outperform traditional methods, but there’s a catch: it requires embedding models that have been fine-tuned for the target domain. When these models are used for “out-of-domain” tasks, their performance can decline dramatically.
If we have a large dataset specific to a domain, like “fashion items,” it’s possible to fine-tune an embedding model for that domain, resulting in dense vectors that offer decent to strong vector search performance. The challenge arises when there isn’t enough domain-specific data. In such cases, a pretrained embedding model may still outperform traditional keyword search, but this is highly unlikely.
This is the situation we most often encounter in production environments. B2B ecommerce shops usually have a wealth of data about their products and services but lack comprehensive domain data that could help them better understand customer queries.
6. Multilingual understanding
While many search vendors assert a significant increase in the volume of natural-language queries, our production data does not support this claim. We observed a slight upward shift, but it was neither substantial nor statistically significant. What has notably increased, however, are multilingual queries, as cross-border commerce becomes more prominent. Multilingual queries originating from foreign languages across Europe now account for about 2–7% of ecommerce search volume. Along the same lines, we see an increasingly diverse mix of queries in English, French, Spanish, Italian, Chinese, Turkish, Arabic, and Russian in our customer logs.
Later in our evaluation, we’ll show how most current embedding solutions, including advanced multilingual cross-encoders, struggle with handling and understanding low-frequency vocabulary (infrequent search terms) across several of these languages.
7. Combination of hybrid results
As it stands, there is no universally proven “best” approach for effectively combining the results of keyword and vector-based retrieval strategies in hybrid search. Existing methods like fusion and re-ranking are evolving, but still fall short of realizing the full potential of hybrid search. This is largely due to each strategy’s distinct strengths and weaknesses, which make it challenging to achieve an optimal combination.
- Fusion combines results from different search methods based on their scores and/or their order. It often involves normalizing these scores and using a formula to calculate a final score, which is then used to reorder the documents.
- Re-ranking involves reordering results from various search methods using additional processing beyond just the scores. This typically includes some more in-depth analysis with models like cross-encoders, which are used only on a subset of candidates due to their inefficiency on large datasets. More efficient models, like ColBERT, can re-rank candidates without needing to access the entire document collection.
However, these methods often struggle to dynamically adapt to different query types, resulting in inconsistent performance across various scenarios, which we will show later.
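The fusion idea described above can be sketched in a few lines. The first function is a weighted sum of min-max normalized scores; the second is reciprocal rank fusion (RRF), a commonly used rank-based formula. Which formula a given vendor uses is not stated here, so treat both as generic illustrations. The document IDs, the scores, the weight `alpha`, and the constant `k=60` are assumptions.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping into the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def score_fusion(keyword_scores, vector_scores, alpha=0.5):
    """Score-based fusion: weighted sum of normalized scores.

    A document missing from one result list contributes 0 for that list.
    """
    kw, vec = min_max(keyword_scores), min_max(vector_scores)
    fused = {
        doc: alpha * kw.get(doc, 0.0) + (1 - alpha) * vec.get(doc, 0.0)
        for doc in set(kw) | set(vec)
    }
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


def reciprocal_rank_fusion(rankings, k=60):
    """Rank-based fusion: each list adds 1 / (k + rank) per document."""
    fused = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


# Hypothetical raw scores: BM25-style on the left, cosine-style on the right.
keyword_scores = {"A": 12.0, "B": 9.0, "C": 3.0}
vector_scores = {"B": 0.92, "D": 0.90, "A": 0.60}

print(score_fusion(keyword_scores, vector_scores))
print(reciprocal_rank_fusion([["A", "B", "C"], ["B", "D", "A"]]))
```

Both variants apply one fixed formula to every query. Neither knows whether the keyword or the vector list deserves more trust for a particular query type, which is the limitation described above.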
Analyzing current approaches
This seeming divergence between these two methods led us, along with our customers, to question whether there are more efficient approaches that fully leverage the strengths of both retrieval strategies. Ideally, we want to create a scenario where 1+1 is at least equal to, and preferably greater than, 2. We began by analyzing specific customer use cases. Our goal was to deeply understand the problems our customers sought to address when they adopted or transitioned to hybrid search.
We studied how these two improvements affect hybrid search in real-world production. Since searchHub is built on an experimentation-centric approach, we can evaluate performance at global-shop, search-shop, and query-specific levels for our customers. Unlike many vendors who claim hybrid search is the best retrieval method without providing concrete data, we aim to share our findings with our customers’ permission. A big thank you to them for that. Please note that the numbers presented are based solely on hybrid search systems that are not under our direct control, whether developed in-house or sourced from external vendors.
Setting up the analytical foundation
As previously mentioned, we began by identifying and analyzing the specific problems and use cases our customers initially set out to address. We distilled these into the following types of queries they aimed to improve:
To offer some quantitative insight, here is the distribution of query demand and search frequency across these types, based on data from the customer base that participated in this analysis. In total, we identified 3,000 unique queries, representing 83% of the overall search demand from these customers. The distribution of these 3,000 queries across the different query types is shown below.
For each participating customer, we selected the top 100 queries and additionally sampled (random weighted sampling) 100 more from the mid- and long-tail for each query type and customer to evaluate competitive retrieval strategies.
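A selection step of this kind could look roughly like the sketch below. It assumes that sampling weights are the query frequencies and that sampling is done without replacement, neither of which is spelled out above. It uses the Efraimidis-Spirakis key method (`u ** (1 / weight)`) for the weighted draw, and the function name and parameters are hypothetical. In practice it would be applied once per customer and query type.

```python
import random


def select_queries(query_counts, top_n=100, sample_n=100, seed=0):
    """Pick the top_n most frequent queries plus a weighted tail sample.

    query_counts maps each query string to a positive search frequency.
    Returns (head, sampled_tail) as two lists without overlap.
    """
    rng = random.Random(seed)
    ranked = sorted(query_counts, key=lambda q: (-query_counts[q], q))
    head, tail = ranked[:top_n], ranked[top_n:]

    # Weighted sampling without replacement: draw a random key per query,
    # biased by its frequency, and keep the queries with the largest keys.
    keys = {q: rng.random() ** (1.0 / query_counts[q]) for q in tail}
    sampled_tail = sorted(tail, key=keys.get, reverse=True)[:sample_n]
    return head, sampled_tail


# Hypothetical frequency table with 500 distinct queries.
query_counts = {f"q{i}": 1000 - i for i in range(500)}
head, sampled_tail = select_queries(query_counts)
print(len(head), len(sampled_tail))
```

Frequent tail queries are more likely to be drawn, but rare ones still have a chance, which keeps the long tail represented in the evaluation set.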
For evaluation, we opted for a straightforward approach using precision@k. Each selected query was compared against the results from different retrieval strategies by counting the number of “relevant,” “strongly related,” and “irrelevant” items in the top 12 positions. For each “relevant” item in the top 12 results, we assign a score of 2, for each “strongly related” item a score of 0.5, and for each “irrelevant” item a score of 0. We then use simple addition to aggregate these scores, which we refer to as “relevance points.” The more relevance points accumulated per query type, the better the performance of the specific retrieval strategy for that type.
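The scoring scheme described above can be written down as follows. The point values (2, 0.5, 0) and the cutoff of 12 are taken from the text; the label strings, the function names, and the example judgements are illustrative choices.

```python
from collections import defaultdict

# Points per judged item, as described in the text.
POINTS = {"relevant": 2.0, "strongly related": 0.5, "irrelevant": 0.0}


def relevance_points(labels, k=12):
    """Sum the points of the first k judged results for one query."""
    return sum(POINTS[label] for label in labels[:k])


def points_by_query_type(judgements, k=12):
    """Aggregate relevance points per query type.

    judgements is an iterable of (query_type, labels) pairs, one per query.
    """
    totals = defaultdict(float)
    for query_type, labels in judgements:
        totals[query_type] += relevance_points(labels, k)
    return dict(totals)


# Hypothetical judgements for a single query and retrieval strategy.
example = ["relevant"] * 5 + ["strongly related"] * 4 + ["irrelevant"] * 3
print(relevance_points(example))
```

Running the same aggregation once per retrieval strategy yields directly comparable totals per query type, with a maximum of 24 points per query.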
We acknowledge that this evaluation method may not align with industry or research standards, but we prioritized feedback, clarity, and ease of debugging for our customers over adherence to strict, unhelpful standards. More importantly, we were able to prove that optimizing according to precision@k had a significant positive impact on the relevant business KPIs (CTR, A2B, CR, AOV, and ARPU), while other standard evaluation methods like NDCG failed to do so.
Recall@12 vs. Precision@12
Recall@12 Analysis
Precision@12 Analysis
Being able to distinguish between query types gave us much more insight into the pros and cons of hybrid search, as it revealed the true potential far more clearly than a global view.
The first table shows that the vector retrieval component of hybrid search excels at enhancing recall when using a multimodal model like CLIP (which can leverage text and image information). This is especially true for query types where recall is typically a challenge (such as type-1, type-2, type-7, and type-8), resulting in significant recall improvements. The table also shows that, even in the ecommerce domain, traditional keyword retrieval can achieve substantial gains in recall compared to the standard BM25 algorithm — from 70.5% to 81.57% with ES OCSS.
The Precision@12 analysis indicates that hybrid systems are not fully achieving their potential. While they improve precision@12 significantly for queries where recall is the main issue (about 15% of queries), they also cause substantial drops in precision@12 (in 12–28% of cases) for many scenarios where keyword-based search alone performs exceptionally well. Hybrid systems may achieve considerable improvements for certain query types, but they often rely on local rather than global optimization. In the end, even the newest technology needs to balance recall and precision. There is no one-size-fits-all solution.
Even approaches that combine high-recall vector retrieval with fine-tuned re-ranking often struggle to select or emphasize the most relevant parts when the initial set of retrieved results becomes more ambiguous.
Summary
We learned that a hybrid retrieval system alone cannot determine where or whether a specific query’s performance is improved. With this new information, it is clear that choosing the best retrieval strategy (keyword-only, vector-only, or blended hybrid) for each query should lead to big improvements in overall performance.
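As a rough illustration of per-query strategy selection, the sketch below builds a simple routing table from relevance points per query type. It is not the approach announced for part 2, and the numbers are invented for the example rather than taken from our measurements; the function name and the tie-breaking order are assumptions as well.

```python
# Order matters only for ties: the first strategy listed wins.
STRATEGIES = ("keyword", "vector", "hybrid")


def best_strategy_per_type(points_by_type):
    """Map each query type to the strategy with the most relevance points."""
    routing = {}
    for query_type, points in points_by_type.items():
        routing[query_type] = max(STRATEGIES, key=lambda s: points.get(s, 0.0))
    return routing


# Made-up relevance points, purely to show the mechanics.
illustrative_points = {
    "type-1": {"keyword": 310.0, "vector": 420.0, "hybrid": 390.0},
    "type-3": {"keyword": 880.0, "vector": 610.0, "hybrid": 700.0},
}

routing = best_strategy_per_type(illustrative_points)
print(routing)
```

A lookup like this only works if incoming queries can be assigned to a query type reliably, which is a problem in its own right.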
Despite the expected improvements in recall, we have also shown that improving precision or even maintaining it is much more challenging. Unfortunately, most marketing and sales efforts tend to focus on recall only when trying to sell you their vector search solution, while our data clearly indicates that precision correlates much more with sales and conversion than pure recall.
Taking all these insights and the other identified areas of improvement into account, we developed a radically different approach called NeuralInfusion™ to address these gaps. We hope this new approach will fully harness the potential of hybrid search while being significantly more efficient. Stay tuned for part 2, where we will explain it in more detail and compare it to the other ideas we explored.