The Promise and Limits of Pure Vector Search
Semantic search powered by text embeddings has revolutionized information retrieval. By converting queries and documents into high-dimensional vectors, embedding models capture semantic meaning in ways that traditional keyword search never could. A query for "car" can match documents about "automobiles" and "vehicles" without explicit synonyms. Contextual nuances are preserved, and conceptual similarity drives relevance.
Yet pure embedding-based search fails in subtle but critical ways, and these failures are not exclusively a product of scale. Research from Google DeepMind shows that the limitations are mathematical: there are query and document combinations that no single-vector model can handle correctly, even in small collections. Scale amplifies the problem.
The Fundamental Problem: Everything is "Somewhat Similar"
Vector embeddings operate in a continuous space where every document has some degree of similarity to every query. In small collections, this usually works well — the most relevant documents tend to cluster near the query vector, and irrelevant ones sit further away. But even small collections can have issues: as Google DeepMind research shows, there are document combinations that no embedding model can correctly rank regardless of collection size.
Scale only makes these problems worse.
How an Embedding is Created — Averaging Causes Precision Loss
To understand the limitations, it helps to understand how a text embedding is actually produced:
1. The text is split into tokens (words or sub-words)
2. All tokens are passed through a transformer model, producing a hidden-state vector for each token
3. All token vectors are averaged (mean pooling) into a single fixed-size vector representing the entire text
4. The vector is normalized to unit length so cosine similarity can be used for comparison
The averaging problem: Step 3 is where precision is lost. Every token — important keywords, stopwords, filler words — contributes to the final vector. The specific, distinguishing terms get diluted into the average of everything around them.
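The dilution effect of mean pooling can be made concrete with toy vectors. The sketch below uses random NumPy vectors in place of real transformer hidden states (no actual model involved): one distinctive "keyword" token is pooled with a growing number of generic filler tokens, and its contribution to the final document vector shrinks accordingly.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pool(token_vectors: np.ndarray) -> np.ndarray:
    """Average token vectors into one document vector, then L2-normalize it."""
    pooled = token_vectors.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

# One distinctive "keyword" token plus increasingly many generic filler tokens.
dim = 64
keyword = rng.normal(size=(1, dim))
kw_dir = keyword[0] / np.linalg.norm(keyword[0])

sims = []
for n_filler in (0, 10, 100):
    filler = rng.normal(size=(n_filler, dim))
    doc_vec = mean_pool(np.vstack([keyword, filler]))
    sims.append(float(doc_vec @ kw_dir))
    print(f"{n_filler:4d} filler tokens -> similarity to keyword: {sims[-1]:.3f}")
```

With no filler, the document vector points exactly at the keyword; as filler tokens pile up, the cosine similarity to the keyword direction steadily drops, which is the precision loss described above.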
Some models use the [CLS] token or the last token instead of averaging, but the compression problem remains: an entire document — regardless of length or complexity — is forced into a single fixed-size vector. The richer and longer the document, the more information is lost in that compression.
1. The Curse of Dimensionality
As collections grow, the distances between random pairs of points concentrate around a common value. In a million-document collection, the difference between the 100th and 10,000th most similar document may be statistically insignificant. The embedding space becomes crowded, and meaningful distinctions blur. Note that this is a separate problem from the theoretical capacity ceiling: increasing embedding dimensions helps with the capacity limit, but does not eliminate the distance concentration problem at large scale.
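Concentration is easy to observe numerically. The following sketch uses random isotropic vectors, which real embeddings are not, but the trend is the same: as dimension grows, cosine similarities between a query and unrelated documents bunch into an ever-narrower band, so rank differences carry less signal.

```python
import numpy as np

def similarity_spread(dim: int, n_docs: int = 5000, seed: int = 42) -> float:
    """Std-dev of cosine similarities between one random query and random docs."""
    rng = np.random.default_rng(seed)
    docs = rng.normal(size=(n_docs, dim))
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)
    query = rng.normal(size=dim)
    query /= np.linalg.norm(query)
    return float(np.std(docs @ query))

spreads = {dim: similarity_spread(dim) for dim in (8, 64, 512)}
for dim, s in spreads.items():
    print(f"dim={dim:4d}  similarity std-dev={s:.4f}")
```

The spread shrinks roughly like 1/√dim, meaning that in high-dimensional spaces most documents sit at nearly the same similarity to the query and small score gaps become statistically meaningless.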
2. Semantic Drift and Topic Bleed
Embeddings excel at capturing broad semantic categories but struggle with specificity. A search for "Python programming exceptions" might return documents about "error handling in Java," "debugging techniques," and "software testing best practices"—all semantically related but not what the user wanted. The model cannot distinguish between "related topics" and "exact matches."
3. The Named Entity Problem
Embeddings are notoriously poor at handling proper nouns, product names, technical identifiers, and domain-specific terminology. A query for "Apache Kafka" might match documents about "Apache Spark," "message queues," or "stream processing frameworks." The word "Kafka" gets diluted into its semantic neighborhood, losing its identity as a specific technology.
4. Multi-Field Embedding Collapse
Even when documents are indexed with embeddings from multiple fields (title, description, content), the fundamental issue persists. Averaging or concatenating field embeddings creates a "semantic soup" where precise terms from high-priority fields get averaged away by bulk content. A document with "machine learning" in the title and 10,000 words about database administration might rank equally with a focused machine learning tutorial.
Real-World Failure Modes
Consider these common scenarios where pure vector search breaks down:
Product Search in E-Commerce
A user searches for "iPhone 15 Pro Max 256GB". Pure embedding search might return:
- iPhone 14 Pro (semantically similar, wrong model)
- iPhone 15 (missing storage variant)
- Samsung Galaxy Ultra (conceptually similar high-end phone)
- iPhone accessories (contextually related)
The specific model number and storage capacity—critical to the user—are treated as semantic noise.
Legal and Compliance Search
Searching for "GDPR Article 17" in a legal database needs to return exactly that article, not "GDPR Article 16," "data protection regulations," or "European privacy laws." Embeddings see these as nearly identical.
Technical Documentation
A developer searching for "PostgreSQL connection pooling timeout configuration" needs documentation about PostgreSQL, specifically about connection pooling, specifically about timeouts. Pure semantic search might return MySQL documentation, general database tuning guides, or connection pool library comparisons—all related, none correct.
The Hybrid Solution: Semantic Understanding Meets Lexical Precision
The answer is not to abandon embeddings but to combine them with traditional search techniques in a complementary architecture. This is the approach taken by production systems like SANDI-Solr, which implements a sophisticated hybrid search strategy:
1. Entity Extraction and Keyword Filtering
Before executing the semantic search, the query is analyzed using NLP to extract:
- Named entities (product names, technologies, locations, organizations)
- Quoted phrases (exact match requirements)
- Keywords (important terms that must appear)
These extracted elements are used as hard filters on the vector search results. You can find semantic neighbors all you want, but if the document doesn't contain "PostgreSQL," it's eliminated. This solves the named entity problem.
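A minimal sketch of this hard-filter stage is shown below. The `Hit` type and the plain substring matching are illustrative simplifications; a production system such as Solr would apply analyzed term filters (with stemming and tokenization) at query time rather than scanning raw text.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    doc_id: str
    text: str
    score: float  # similarity score from the vector-search stage

def filter_by_required_terms(hits: list[Hit], required: list[str]) -> list[Hit]:
    """Drop semantic neighbors that lack the extracted must-match terms."""
    terms = [t.lower() for t in required]
    return [h for h in hits if all(t in h.text.lower() for t in terms)]

hits = [
    Hit("d1", "Tuning connection pooling timeouts in PostgreSQL", 0.82),
    Hit("d2", "MySQL connection pool sizing guide", 0.80),
    Hit("d3", "General database performance tuning", 0.78),
]
survivors = filter_by_required_terms(hits, ["PostgreSQL"])
print([h.doc_id for h in survivors])
```

The MySQL and generic-tuning documents score nearly as high as the PostgreSQL one in vector space, but the hard filter removes them because the required entity is absent.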
2. Weighted Field Combination
Rather than collapsing all embeddings into one, hybrid systems maintain separate search strategies:
- Keyword search with field-specific boosting (titles weighted higher than body text)
- Vector search with field-level embeddings (title vectors, description vectors, content vectors)
- Score fusion that combines both signals with configurable weights
This preserves the precision of keyword matches while adding semantic depth.
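Score fusion can be done in several ways; a common and simple one, sketched below, is min-max normalization of each signal followed by a weighted sum (the weights and document IDs here are illustrative). Reciprocal rank fusion is a popular alternative that avoids normalizing raw scores entirely.

```python
def fuse_scores(bm25: dict[str, float], vec: dict[str, float],
                w_lexical: float = 0.4, w_semantic: float = 0.6) -> dict[str, float]:
    """Min-max normalize each signal, then blend with configurable weights."""
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero for uniform scores
        return {doc: (s - lo) / span for doc, s in scores.items()}
    b, v = normalize(bm25), normalize(vec)
    # Union of both result sets: a doc missing from one signal contributes 0 there.
    return {doc: w_lexical * b.get(doc, 0.0) + w_semantic * v.get(doc, 0.0)
            for doc in set(b) | set(v)}

fused = fuse_scores({"a": 12.0, "b": 4.0}, {"a": 0.7, "c": 0.9})
print(sorted(fused, key=fused.get, reverse=True))
```

Normalization matters because BM25 scores are unbounded while cosine similarities live in [-1, 1]; blending raw values would let one signal silently dominate the other.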
3. Query Understanding and Expansion
Modern hybrid systems use LLMs to:
- Expand queries with synonyms and related terms (when appropriate)
- Identify the intent behind multi-word queries
- Determine which parts of the query require exact matching vs. semantic similarity
Example: A query like "fix 404 errors in nginx" might be understood as:
- Entity: nginx (must match)
- Intent: troubleshooting/configuration
- Semantic expansion: error handling, HTTP status codes, web server configuration
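The interpretation above can be carried through the pipeline as a small structured query plan. The sketch below is a hypothetical shape for that object (field names are assumptions, not SANDI-Solr's actual schema); in a real system the fields would be populated by the LLM or NLP analysis step.

```python
from dataclasses import dataclass, field

@dataclass
class QueryPlan:
    raw: str
    must_match: list[str] = field(default_factory=list)   # exact-match entities
    intent: str = ""                                      # e.g. "troubleshooting"
    expansions: list[str] = field(default_factory=list)   # semantic-only terms

plan = QueryPlan(
    raw="fix 404 errors in nginx",
    must_match=["nginx"],
    intent="troubleshooting/configuration",
    expansions=["error handling", "HTTP status codes", "web server configuration"],
)
print(plan.must_match, plan.intent)
```

Keeping the plan explicit makes the downstream behavior auditable: `must_match` feeds the hard filters, while `expansions` only influence the semantic side of retrieval.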
4. Re-Ranking with Context
Initial retrieval using hybrid search casts a wide net, then a re-ranking model (often a cross-encoder or small LLM) re-scores results by reading actual query-document pairs. This two-stage approach balances recall (finding all potentially relevant documents) with precision (ranking the best ones at the top).
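The two-stage shape can be sketched as follows. The `overlap_score` function below is only a stand-in for a real cross-encoder or LLM scorer, which would read each query-document pair jointly; the structure of the stage is the point, not the scoring function.

```python
def rerank(query: str, candidates: list[tuple[str, str]],
           score_pair, top_k: int = 5) -> list[str]:
    """Second stage: re-score wide-net candidates by reading full query-doc pairs."""
    scored = [(score_pair(query, text), doc_id) for doc_id, text in candidates]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]

# Placeholder scorer: token overlap stands in for a cross-encoder model.
def overlap_score(query: str, text: str) -> float:
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

candidates = [
    ("d1", "configure nginx to serve custom 404 error pages"),
    ("d2", "apache httpd rewrite rules"),
]
ranked = rerank("fix 404 errors in nginx", candidates, overlap_score, top_k=2)
print(ranked)
```

Because the re-ranker only sees the candidate set, the first stage can afford to be generous (high recall) while the expensive pairwise model is applied to a few hundred documents at most.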
The SANDI-Solr Approach
SANDI-Solr exemplifies this hybrid philosophy by integrating:
- Traditional Solr search with BM25 scoring and field-level boosting
- Dense vector search using state-of-the-art embedding models (Qwen3-Embedding)
- NLP-powered entity extraction from queries to identify terms requiring exact matches
- Synonym expansion for recognized entities and keywords
- Configurable score fusion allowing operators to tune the balance between semantic and lexical signals
- LLM-based re-ranking for final precision optimization
The result is a search system that combines the semantic understanding of embeddings with the precision of traditional search, delivering both recall and relevance at scale.
Conclusion: Embeddings are Necessary but Not Sufficient
Pure embedding-based semantic search is a powerful tool in the information retrieval toolkit, but it is not a complete solution for large-scale search systems. Its limitations are rooted in the single-vector architecture itself and only grow more severe at scale: loss of precision, difficulty with named entities, and semantic drift.
The hybrid search advantage lies in leveraging embeddings for semantic understanding while using traditional IR techniques, entity extraction, and lexical filtering to maintain precision. By combining the strengths of both approaches, systems can deliver the "best of both worlds" — finding documents that are semantically relevant while respecting the specific, precise requirements embedded in user queries.
For production search applications, especially those dealing with technical content, product catalogs, or large knowledge bases, the question is not whether to use embeddings, but how to integrate them intelligently with proven traditional search techniques.
Key Takeaway: Modern search platforms like SANDI-Solr demonstrate that the most effective approach combines vector search with traditional keyword matching, entity extraction, and intelligent query understanding to deliver both semantic depth and lexical precision.
Research: Theoretical Limitations of Embedding-Based Retrieval
Extract from: On the Theoretical Limitations of Embedding-Based Retrieval (arXiv:2508.21038v1)
Authors: Orion Weller, Michael Boratko, Iftekhar Naim, and Jinhyuk Lee (Google DeepMind; Johns Hopkins University)
Abstract Summary
Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, and more. These new benchmarks push embeddings to work for any query and any notion of relevance that could be given. While prior works have pointed out theoretical limitations of vector embeddings, there is a common assumption that these difficulties are exclusively due to unrealistic queries, and those that are not can be overcome with better training data and larger models.
Key Finding: This work demonstrates that we may encounter these theoretical limitations in realistic settings with extremely simple queries.
Theoretical Framework
The research connects known results in learning theory, showing that the number of top-𝑘 subsets of documents capable of being returned as the result of some query is limited by the dimension of the embedding. The study empirically shows that this holds true even if we restrict to 𝑘 = 2, and directly optimize on the test set with free parameterized embeddings.
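A toy numerical illustration of this result is sketched below. It is not the paper's construction, just random documents: with a low embedding dimension, random queries can only ever surface a small fraction of the C(10, 2) = 45 possible top-2 document sets, while higher dimensions make far more combinations reachable.

```python
import numpy as np

def distinct_top2_sets(dim: int, n_docs: int = 10, n_queries: int = 20000,
                       seed: int = 0) -> int:
    """Count distinct top-2 result sets reachable by random queries."""
    rng = np.random.default_rng(seed)
    docs = rng.normal(size=(n_docs, dim))
    queries = rng.normal(size=(n_queries, dim))
    # For each query, take the two documents with the highest inner product.
    top2 = np.argsort(-(queries @ docs.T), axis=1)[:, :2]
    return len({frozenset(row) for row in top2})

# With 10 documents there are C(10, 2) = 45 possible top-2 sets.
for dim in (2, 4, 32):
    print(f"d={dim:3d}: {distinct_top2_sets(dim)} of 45 top-2 sets reachable")
```

In two dimensions, only documents on the convex hull can ever rank first, so most pairs are geometrically unreachable no matter what the query is; this is the combinatorial ceiling the paper formalizes.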
The LIMIT Dataset
The researchers created a realistic dataset called LIMIT (50,000 documents, 1,000 queries) that stress tests models based on these theoretical results using simple natural language queries ("who likes X?"). The results are striking:
- State-of-the-art embedding models achieve less than 20% recall@100 in most cases, with even the best models topping out at roughly 65%
- BM25 achieves ~96% recall@100 on the same task — a traditional lexical method vastly outperforming modern neural embeddings
- In-domain fine-tuning provides minimal improvement — the ceiling is architectural, not a training data problem
Key Contributions:
- Introduction of the LIMIT dataset, which highlights the fundamental limitations of embedding models
- Theoretical connection showing that embedding models cannot represent all combinations of top-𝑘 documents until they have a large enough embedding dimension 𝑑
- Empirical validation through best case optimization of the vectors themselves
- Practical connection to existing state-of-the-art models by creating a simple natural language instantiation of the theory
Implications
The research shows the limits of embedding models under the existing single vector paradigm and calls for future research to develop methods that can resolve this fundamental limitation. The results imply that the community should consider how instruction-based retrieval will impact retrievers, as there will be combinations of top-𝑘 documents that current models cannot represent.
Practical Impact: This research validates the need for hybrid search approaches like SANDI-Solr, which combine embeddings with traditional IR techniques to overcome the theoretical limitations of pure vector-based retrieval.
Reference: arXiv:2508.21038v1 - "On the Theoretical Limitations of Embedding-Based Retrieval"
Full paper available at: https://arxiv.org/abs/2508.21038