
Semantic search is an advanced search technique that aims to improve the accuracy and relevance of search results by understanding the meaning and context of the search query rather than relying solely on keyword matching. Unlike traditional search methods, which focus on finding exact matches to the words in the query, semantic search seeks to understand the intent behind the query and the contextual relationships between words. 

Large Language Model (LLM) technology has significantly advanced the field of semantic search. LLMs generate embedding vectors, which are dense vector representations of words, phrases, sentences, or even entire documents. These embeddings capture semantic meaning and relationships, enabling the search engine to find relevant results based on meaning rather than just keyword matching. Unlike traditional static embeddings, LLM-based embeddings take into account the entire context in which a word or phrase appears, leading to more precise search results.

Each document is converted into a fixed-length vector in a high-dimensional space, where semantically similar texts are represented by vectors that are close to each other. During a search, the query is converted into an embedding vector. During the ‘Matching Phase’, the semantic search system compares this query vector with the embedding vectors of documents in its index. Documents whose vectors are close to the query vector in the embedding space are considered relevant. Various mathematical measures, such as cosine similarity, Euclidean distance, or dot product, are used to quantify the closeness between vectors. A higher similarity score indicates a higher relevance of the document to the query.
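To make the 'Matching Phase' concrete, here is a minimal sketch of cosine-similarity matching. The array names, shapes, and the use of NumPy are our illustrative assumptions, not the internals of any particular search system.

```python
# Minimal sketch of cosine-similarity matching (illustrative only).
import numpy as np

def top_n(query_vec: np.ndarray, doc_matrix: np.ndarray, n: int = 10) -> np.ndarray:
    """Return the indices of the n documents most similar to the query.

    query_vec:  shape (d,)      -- embedding of the query
    doc_matrix: shape (num, d)  -- one embedding per document
    """
    # Normalize so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = docs @ q                        # cosine similarity per document
    top = np.argpartition(-scores, n)[:n]    # n best documents, unordered
    return top[np.argsort(-scores[top])]     # ordered by decreasing similarity
```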

It follows that embedding vectors are a central component of modern semantic search. Vector databases (Vector DBs) are specialized databases designed to store, manage, and efficiently query these embedding vectors. It’s important to note that the high dimensionality of embedding vectors (e.g., 512, 1024, or more dimensions) creates significant storage and computational challenges for Vector DBs. So much so that finding approximate solutions to semantic search queries, which significantly reduce the storage and computational requirements, is a prominent research area for Vector DBs. Approximate solutions may be obtained through reduced vector precision and approximate nearest neighbor (ANN) search.

In this article, we focus on the memory requirements of Vector DBs. It’s typical to pre-load and lock the entire Vector DB into RAM so that the ‘Matching Phase’ executes with minimal latency. Thus, Vector DBs typically require substantial RAM, which is an expensive resource. We explore techniques to reduce the memory footprint of Vector DBs by storing only parts of the embedding vectors in RAM, and we examine the impact of this reduced RAM footprint on the accuracy of search results and on query execution time.

Matryoshka Representation Learning (MRL)

Matryoshka Representation Learning (MRL) constructs embedding vectors that encode information at multiple levels of granularity. Unlike traditional embeddings, which may lose their meaning when truncated, MRL ensures that each sub-vector retains useful information even when dimensions are reduced. MRL is inspired by the nesting concept of Russian Matryoshka dolls.

Training with MRL starts with lower (coarser) dimensional sub-vectors and progressively moves to higher (finer) dimensions. This approach ensures that each sub-vector is a meaningful representation on its own, capturing broad features initially and refining these representations with more detailed features as dimensionality increases. This method is analogous to crafting each smaller doll within a Matryoshka set with precision, not just the outermost one.
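To make the nesting idea concrete, below is a simplified sketch of a Matryoshka-style training objective, assuming a toy classification setup with one linear head per nested prefix length. The prefix lengths, heads, and equal loss weights are illustrative assumptions, not the exact recipe from the MRL paper.

```python
# Simplified sketch of a Matryoshka-style objective (illustrative, not the
# exact recipe from the MRL paper).
import torch
import torch.nn as nn

NESTED_DIMS = [64, 128, 256, 512, 1024, 1536]   # nested prefix lengths (assumed)
NUM_CLASSES = 1000                               # assumed training task

# One classification head per granularity level.
heads = nn.ModuleList(nn.Linear(d, NUM_CLASSES) for d in NESTED_DIMS)
criterion = nn.CrossEntropyLoss()

def matryoshka_loss(embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sum the loss over every nested prefix of the embedding, so that each
    prefix is trained to be a useful representation on its own."""
    total = embeddings.new_zeros(())
    for d, head in zip(NESTED_DIMS, heads):
        total = total + criterion(head(embeddings[:, :d]), labels)
    return total
```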

In this article, we evaluate the trade-offs between a query’s execution time and the accuracy of its search results when using MRL-based embedding vectors as a vehicle for a reduced RAM footprint.

Experimental Approach

Our experiments were performed on a Cooklist collection of roughly 120,000 documents. The embedding vectors for these documents were constructed using the text-embedding-3-small LLM from OpenAI, which generates embedding vectors of length 1536. The model’s documentation specifies that this is an MRL-capable model, whose generated vectors may be partitioned into 2 tiers with 512 and 1024 elements in the 1st and 2nd tiers, respectively.
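For reference, embedding generation with this model might look like the following sketch using the openai Python SDK; the function and variable names are ours, and the call assumes an OPENAI_API_KEY in the environment.

```python
# Sketch of embedding generation with text-embedding-3-small (assumes the
# openai Python SDK and an OPENAI_API_KEY in the environment).
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Return a (len(texts), 1536) matrix of embeddings."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data], dtype=np.float32)
```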

A sample set of 1,000 unique customer queries was used. These queries were also converted to embedding vectors using the same LLM. Each of the 1,000 sample queries was matched against the 120,000 documents to find the top N results, based on their cosine similarity scores. N values of 10, 20, 50, 100, and 120 were evaluated. These constitute the baseline search result sets, where the full embedding vectors were used, without any tiering-related approximations. The computational complexity per query is roughly 120,000 cosine similarity operations on vector pairs of length 1536.

Next, we evaluate the 2-tier embedding vector configuration, with T1 elements in the 1st tier and T2 in the 2nd tier (note that T1 + T2 = 1536). The computation proceeds in 2 phases. In the first phase, the computational complexity is roughly 120,000 cosine similarity operations, on vector pairs of length T1. The top-scoring results from the first tier (R1 in number) are then shortlisted for further processing. In the second phase, the computational complexity is R1 cosine similarity operations, on vector pairs of length T2.
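The sketch below shows one plausible implementation of this 2-phase computation. It assumes the full embedding vectors are unit-normalized (as OpenAI embeddings are), so the exact full-vector score can be recovered by adding the two tier-wise dot products; the precise score combination used in a production system may differ.

```python
# Sketch of the 2-phase tiered matching described above (illustrative).
# Assumes unit-normalized full vectors, so that
#   full_score(q, d) = dot(q[:T1], d[:T1]) + dot(q[T1:], d[T1:]).
import numpy as np

def tiered_top_n(q: np.ndarray, tier1: np.ndarray, tier2: np.ndarray,
                 r1: int, n: int) -> np.ndarray:
    """q: (1536,) query; tier1: (num, T1); tier2: (num, T2); returns top-n doc ids."""
    t1_dim = tier1.shape[1]

    # Phase 1: approximate scores from tier-1 elements only
    # (roughly num_docs dot products of length T1).
    partial = tier1 @ q[:t1_dim]
    shortlist = np.argpartition(-partial, r1)[:r1]       # top-R1 candidates

    # Phase 2: refine only the R1 shortlisted candidates with tier-2 elements.
    full = partial[shortlist] + tier2[shortlist] @ q[t1_dim:]
    return shortlist[np.argsort(-full)[:n]]              # top-N by full score
```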

We compute different sets of search results as we vary R1. For each of these result sets, we evaluate the Percentage Recall with reference to the baseline result set. We use R1 values of 100, 500, 1000, 5000, and 10000. The number of results retained from the 2 tiers is specified as an R line (example: R 5000 120, i.e. retain 5000 results from the 1st tier and 120 from the 2nd).
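For clarity, we interpret Percentage Recall as the fraction of the baseline top-N results that also appear in the tiered (approximate) top-N results. A minimal sketch, with our own naming:

```python
# Sketch of the Percentage Recall metric against the baseline result set.
def percentage_recall(baseline_ids: list[int], approx_ids: list[int]) -> float:
    """Percentage of the baseline top-N documents that also appear in the
    tiered (approximate) top-N."""
    baseline = set(baseline_ids)
    hits = sum(1 for doc_id in approx_ids if doc_id in baseline)
    return 100.0 * hits / len(baseline)
```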

Although the text-embedding-3-small model’s documentation states that the tiers may be split as (512, 1024), we have experimented with three different tiering splits (T1, T2):

• As specified by the documentation: (512, 1024),

• With a really thin Tier 1: (64, 1472), and

• An intermediate split: (256, 1280).

We present our findings as graph plots of the average recall percentages, as well as box plots of the recall percentages, to clearly highlight the extent of variation from the averages.

In addition to the Recall percentage, we also report the query execution times for the different experiments, so that we may evaluate the trade-offs between result quality and execution speed.

Results

A. With a Tiering split of (512, 1024)

Figure 1: Recall Characteristics with Tiering Split of (512, 1024)
Figure 2: Recall Characteristics as a Box Plot for (512, 1024)
Figure 3: Query Execution Time for Tiering split of (512, 1024)

We observe that the Recall Percentage is impressive for the case of R 5000 120 (i.e., retain the top 5000 results from the 1st tier, process only these 5000 results in the 2nd tier, and finally retain the top 120 results based on their full score computation). With this setting, the execution time is cut to roughly 50% of the Baseline.

B. With a Tiering split of (64, 1472)

In the prior experiment, the embedding vectors were split into 2 tiers at the 512-element mark, based on guidelines from OpenAI about the model configuration. We now present results from a tier configuration of (64, 1472).

Figure 4: Recall Characteristics with Tiering Split of (64, 1472)
Figure 5: Recall Characteristics as a Box Plot for (64, 1472)
Figure 6: Query Execution Time for Tiering split of (64, 1472)

It is clear that the Recall performance is extremely poor, although the runtime is significantly faster. We conclude that we do not have the flexibility to split the tiers at an arbitrary point, and must stick to the tier configuration specified by the model’s documentation.

C. With a Tiering split of (256, 1280)

We now present results from a tier configuration of (256, 1280).  

Figure 7: Recall Characteristics with Tiering Split of (256, 1280)
Figure 8: Recall Characteristics as a Box Plot for (256, 1280)
Figure 9: Query Execution Time for Tiering split of (256, 1280)

We observe that the Recall performance is better than that of the (64, 1472) split, and closer to that of the (512, 1024) tier configuration. The running time also falls between those of the two prior experiments.

D. With a Tiering split of (512, 1024), without locking Tier 2 in RAM

Now that we understand the impact of the tiering configuration on accuracy, we proceed with the 2-tier architecture of (512, 1024). We remove the 2nd tier from RAM, thus freeing up a significant amount of RAM and reducing memory pressure. The running time does increase, as expected; this is the cost of lower RAM usage.
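One way to realize this setup, consistent with the paging behavior discussed below, is to keep Tier 1 as an ordinary in-memory array while memory-mapping the Tier 2 matrix from disk, so that Tier 2 pages are faulted in only when touched. The file names and shapes in this sketch are illustrative.

```python
# Sketch: Tier 1 resident in RAM, Tier 2 memory-mapped from a flat float32 file
# so its pages are loaded on demand (file names and shapes are illustrative).
import numpy as np

NUM_DOCS, T1, T2 = 120_000, 512, 1024

tier1 = np.load("tier1.npy")                 # loaded (and kept) in RAM
tier2 = np.memmap("tier2.f32", dtype=np.float32, mode="r",
                  shape=(NUM_DOCS, T2))      # paged in lazily by the OS

# Accessing rows of `tier2` during Phase 2 triggers page faults the first time;
# afterwards those pages stay cached until memory pressure evicts them.
```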

Figure 10: Tiering split of (512,1024) with R 10000 120. Tier 2 is not in RAM
Figure 11: Non-uniform Query Execution Times when Tier2 is not in RAM. Configuration: (512, 1024), R10000 120

Further, we now observe huge variations in the query execution time. The initial few queries take very long (up to 4 seconds), and the latency slowly drops to a steady state of under 50 ms.

This behavior is not difficult to explain. As the 2nd-tier vectors are accessed by queries, their pages gradually get loaded into memory and stay there. Thus, the initial set of queries runs slower, and the runtime decreases as more pages are brought into RAM. For larger values of R1, this stabilization happens sooner, because each query brings more pages into RAM.

Note: This huge variation is not seen when the vectors are locked in RAM; there, the execution time was roughly similar for all queries.

E. With a Tiering split of (512, 1024), without locking Tier 2 in RAM, and preloading Tier 2 Pages

As a final tweak, we try to eliminate the initial spikes in query execution time. We preload the Tier 2 pages into RAM at index load time, but avoid locking them into memory. Over time, as memory pressure builds up, pages that are not frequently used will be paged out by the Operating System’s memory manager (our experimental setup did not simulate this page-out, though). We observe that the spikes are eliminated. The query execution time characteristics are similar to those of Figure 3, where both tiers were locked into memory: each query takes roughly the same amount of time, and the spikes reported in Figure 11 are not seen.
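A minimal sketch of such a preload pass, building on the memory-mapped Tier 2 layout sketched earlier: on Linux with Python 3.8+, madvise(MADV_WILLNEED) asks the kernel to read the pages ahead of time without pinning them the way mlock would.

```python
# Sketch: preload Tier 2 pages at index load time without locking them
# (Linux, Python 3.8+; builds on the memory-mapped layout sketched earlier).
import mmap

def preload_tier2(path: str = "tier2.f32") -> mmap.mmap:
    f = open(path, "rb")
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    f.close()  # the mapping stays valid after the file descriptor is closed
    # Hint the kernel to fault the whole mapping in now; unlike mlock, the
    # pages remain evictable later under memory pressure.
    mm.madvise(mmap.MADV_WILLNEED)
    return mm
```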

Figure 12: With Tier2 pages preloaded into RAM
Figure 13: Uniform Query Execution Times when Tier2 is preloaded, but not locked in RAM

Conclusion

Based on the experiments above, we observe that a 2-tier architecture retaining just 5000 results from the 1st tier can generate results that are very close to the Baseline. By locking only the first tier into RAM, we can reduce memory pressure on the system while impacting query execution time only moderately. When combined with a preload pass for the 2nd tier at index load time, the large latency spikes encountered by the initial few queries are also avoided. The 2nd tier must not be locked into RAM, though: under memory pressure, many of its infrequently accessed pages will eventually be paged out, achieving a balance between memory pressure and query execution time.

References

• PAPER: https://arxiv.org/abs/2205.13147

• OPENAI IMPLEMENTATION: https://openai.com/blog/new-embedding-models-and-api-updates

