
Understanding Embedding Similarities: From High Dimensions to Human Comprehension

Embeddings are fascinating but notoriously difficult to comprehend. When you upload data to a system like Vantage, it is transformed into giant vectors of floating-point numbers that don't mean much to humans on their own. So how do we make sense of these embeddings and understand their similarities?

In this article, we'll dive into the concept of embeddings, use a sample dataset called "data barn delights," and explore how to visualize high-dimensional data using a technique called t-SNE. And yes, we have the entire notebook and example available on GitHub for those who want to dive deeper.

Understanding Embeddings

Embeddings are like coordinates in a massively high-dimensional space. Imagine a space with thousands of dimensions where each data point is a vector of floating-point numbers. These vectors capture the semantic meaning of the items they represent. For instance, the words "king" and "queen" might be close together in this space, while "king" and "carrot" would be far apart.

While we can calculate cosine similarity scores between these vectors to measure how similar they are, understanding and visualizing these similarities is a whole different ballgame. Cosine similarity scores give us a numerical value, but how do we visualize these massive dimensional coordinates?
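To make the numeric side concrete, here's a minimal sketch of cosine similarity with NumPy. The 4-dimensional vectors are invented toy values purely for illustration; real embeddings have thousands of dimensions:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the
    # vector magnitudes. 1.0 means the vectors point the same way.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" (hypothetical values for illustration).
king = np.array([0.9, 0.8, 0.1, 0.2])
queen = np.array([0.85, 0.82, 0.15, 0.18])
carrot = np.array([0.1, 0.05, 0.9, 0.8])

print(cosine_similarity(king, queen))   # near 1.0: very similar
print(cosine_similarity(king, carrot))  # much lower: dissimilar
```

The scores behave as you'd hope, but a table of pairwise numbers still doesn't show you the overall shape of the space, which is where visualization comes in.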

The Data Barn Delights Example

Let's introduce our dataset: "data barn delights," a sample dataset containing 12 items. Each item has an ID, a color, a department, and a description. Here are a few examples:

  • Grocery: Apple (red), Asparagus (green), Blueberries (blue)
  • Garden: Fertilizer (green), Flower pot (various colors)
  • Dairy: Cheese (yellow), Yogurt (white)

When these items are turned into embeddings, as happens when they're uploaded to Vantage, they become incomprehensible vectors in high-dimensional space. So how do we make sense of their similarities?
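In code, the items are just plain records before embedding. The `embed` function below is a stand-in, not Vantage's actual API: it returns a deterministic pseudo-random vector of the right shape, whereas a real model would encode each item's description so that semantically similar items land near each other:

```python
import numpy as np

# A few of the "data barn delights" items as plain records.
items = [
    {"id": 1, "name": "Apple", "color": "red", "department": "grocery"},
    {"id": 2, "name": "Asparagus", "color": "green", "department": "grocery"},
    {"id": 3, "name": "Fertilizer", "color": "green", "department": "garden"},
    {"id": 4, "name": "Cheese", "color": "yellow", "department": "dairy"},
]

def embed(item, dim=2048, seed=0):
    # Stand-in for a real embedding model: produces a deterministic
    # pseudo-random vector per item, illustrating only the *shape* of
    # the data (one long float vector per item), not its meaning.
    rng = np.random.default_rng(seed + item["id"])
    return rng.standard_normal(dim)

vectors = [embed(item) for item in items]
print(len(vectors), vectors[0].shape)  # 4 items, each a (2048,) vector
```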

Visualizing High-Dimensional Data

Enter t-SNE (t-distributed Stochastic Neighbor Embedding), a technique that allows us to approximate high-dimensional coordinates into lower dimensions—such as 2D or 3D—that we mere mortals can understand.

The Basics of t-SNE

t-SNE works by converting the similarities between data points into joint probabilities and trying to minimize the Kullback-Leibler divergence between these joint probabilities in the high-dimensional and lower-dimensional spaces. In simpler terms, it maps high-dimensional data to two or three dimensions while preserving the relative distances between points as much as possible.

While not perfect or precise, t-SNE helps us visualize and understand how our data points relate to each other. Think of it as trying to visualize the inside of a balloon animal without popping it.

t-SNE in Action

Let's see t-SNE in action with our data barn delights sample data. We start with embeddings that have 2048 dimensions for each item and use t-SNE to reduce these to just 2 dimensions.
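With scikit-learn, that reduction step looks roughly like this. Random vectors stand in for the real embeddings here; one practical detail worth noting is that t-SNE's `perplexity` must be smaller than the number of samples, so with only 12 points we set it low:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in embeddings: 12 items x 2048 dimensions of random data.
# (Real embeddings would come from a model; random vectors just
# demonstrate the mechanics of the reduction.)
rng = np.random.default_rng(42)
embeddings = rng.standard_normal((12, 2048))

# perplexity must be < n_samples; with 12 points, 5 is a sensible choice.
tsne = TSNE(n_components=2, perplexity=5, random_state=42)
coords_2d = tsne.fit_transform(embeddings)

print(coords_2d.shape)  # (12, 2): one 2-D point per item
```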

Here's what our t-SNE visualization looks like:

In this graph, each point represents an item from our dataset, color-coded by department: grocery (yellow), garden (purple), and cleaning (red). As we can see, items in similar categories tend to cluster together.
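A plot like the one described can be drawn with matplotlib. The coordinates and department assignments below are invented placeholders standing in for the actual t-SNE output, chosen only to show the color-coded-cluster idea:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Hypothetical 2-D t-SNE coordinates: three items per department.
coords = np.array([
    [1.0, 1.2], [1.1, 0.9], [0.8, 1.0],     # grocery
    [-1.0, 0.5], [-1.2, 0.7], [-0.9, 0.4],  # garden
    [0.2, -1.1], [0.0, -1.3], [0.3, -1.0],  # cleaning
])
departments = ["grocery"] * 3 + ["garden"] * 3 + ["cleaning"] * 3
palette = {"grocery": "gold", "garden": "purple", "cleaning": "red"}

fig, ax = plt.subplots()
for dept, color in palette.items():
    idx = [i for i, d in enumerate(departments) if d == dept]
    ax.scatter(coords[idx, 0], coords[idx, 1], c=color, label=dept)
ax.set_title("t-SNE projection of data barn delights")
ax.legend()
fig.savefig("tsne_clusters.png")
```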

Example Interpretation

  • Grocery items like asparagus and frozen peas are close to each other.
  • Garden items like planting bed and watering can cluster together.
  • Cleaning items like mop and sponge are also near each other.

This clustering is both intuitively and mathematically expected. When we use semantic similarity, we rely on the distance between points in our space to reflect their similarity. t-SNE helps us visualize this concept.

Interpreting the Visualization

Now, let's delve deeper into interpreting our t-SNE visualization. The key is to look at the distances between points. Similar items should be closer together, while dissimilar items should be farther apart.
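That reading can be checked numerically: Euclidean distance in the reduced 2-D space approximates how related two items were in the original space. The coordinates below are hypothetical, matching the toy layout above:

```python
import numpy as np

# Hypothetical 2-D t-SNE coordinates for three items.
asparagus = np.array([1.1, 0.9])     # grocery
frozen_peas = np.array([0.8, 1.0])   # grocery
mop = np.array([0.2, -1.1])          # cleaning

# Nearby points in the 2-D projection were similar in high dimensions.
within_cluster = np.linalg.norm(asparagus - frozen_peas)
across_clusters = np.linalg.norm(asparagus - mop)
print(within_cluster < across_clusters)  # True: grocery items sit closer
```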

Example Analysis

  • Grocery Cluster: Items like apple, asparagus, and blueberries are close together, indicating their similarity in the grocery category.
  • Garden Cluster: Items like fertilizer and watering can are grouped together, representing their shared category.
  • Cleaning Cluster: Items like window cleaner and mop are close, showing their relatedness in the cleaning category.

By visualizing these clusters, we can better understand the relationships between our data points. This is essential when we need to explain the results of our embeddings to stakeholders or use them to drive business decisions.

Conclusion

In this article, we've explored the concept of embeddings and their complexities. We introduced the data barn delights sample data and used t-SNE to visualize the high-dimensional embeddings in a way that's comprehensible to humans. By interpreting the t-SNE visualization, we can better understand the relationships between our data points and make more informed decisions.

Remember, while t-SNE is not perfect, it's a powerful tool for visualizing and interpreting high-dimensional data. And don't forget, the entire notebook and example are available on GitHub for those who want to dive deeper into the technical details.


